[OpenAFS-devel] "Lost contact with file server" problems

Harald Barth haba@pdc.kth.se
Mon, 29 Aug 2005 12:32:23 +0200 (MEST)


> Thanks! Now I can try something ;-) I'll give it a run at my cluster, so 
> we should know in a few days if it's fixed. Does this patch modify the 
> kernel module, the afsd or both?

This rx code ends up on the client side in the kernel module and in
the (userland) fileserver. I don't think afsd does much at all once
the client has started. It looks to me like the patch changes the
behaviour in rx_NewCall(), which decides what happens if the call
(which probably got recycled in rx.c line 1104) has an error set. This
should convince the client to throw away the call (where, later?). So
we're talking about the kernel module, aren't we? I don't know whether
the error condition propagates to the other end of the rx connection,
or whether the patch is helpful in the fileserver, too.

Hm. What about doing something about call->error sooner, around the
check for RX_STATE_DALLY? Something like if (call->error)
rxi_ResetCall()? Or is that too soon?

rx.c line 1102:
    for (;;) {
        for (i = 0; i < RX_MAXCALLS; i++) {
            call = conn->call[i];
            if (call) {
                MUTEX_ENTER(&call->lock);
                if (call->state == RX_STATE_DALLY) {
                    rxi_ResetCall(call, 0);
                    (*call->callNumber)++;
                    break;
                }
                MUTEX_EXIT(&call->lock);
            } else {
                call = rxi_NewCall(conn, i);
                break;
            }
        }
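
To make the question concrete, the suggestion could be sketched roughly
as below. This is a standalone mock, not the real rx code: every mock_*
name, the simplified structs and the plain integer call number are made
up for illustration, and the locking is left out entirely. Treating a
call with an error set like a dallying one is one possible reading of
the idea, and it is exactly the part that might be too soon.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAXCALLS 4               /* hypothetical stand-in for RX_MAXCALLS */

enum mock_state { MOCK_ACTIVE, MOCK_DALLY };

struct mock_call {
    enum mock_state state;
    int error;                        /* nonzero: the call has an error set */
    unsigned int callNumber;          /* simplified; not a pointer as in rx */
};

struct mock_conn {
    struct mock_call *call[MOCK_MAXCALLS];   /* NULL means unused channel */
    struct mock_call slots[MOCK_MAXCALLS];   /* backing storage for the mock */
};

/* Mock reset: make the call reusable and drop the stale error. */
static void mock_reset_call(struct mock_call *call)
{
    call->state = MOCK_ACTIVE;
    call->error = 0;
}

/*
 * Scan the channels like the loop quoted above, but also recycle a call
 * that has an error set (the assumed change). Returns the channel index
 * that was picked, or -1 if every channel is busy.
 */
static int mock_pick_channel(struct mock_conn *conn)
{
    int i;

    assert(conn != NULL);
    for (i = 0; i < MOCK_MAXCALLS; i++) {
        struct mock_call *call = conn->call[i];

        if (call) {
            if (call->state == MOCK_DALLY || call->error) {
                mock_reset_call(call);
                call->callNumber++;
                return i;
            }
        } else {
            /* Unused channel: set up a fresh call in it. */
            call = &conn->slots[i];
            call->state = MOCK_ACTIVE;
            call->error = 0;
            call->callNumber = 1;
            conn->call[i] = call;
            return i;
        }
    }
    return -1;
}
```

With an empty connection the first pick creates a call on channel 0. If
that call then gets an error set, the next pick reuses channel 0 with
the error cleared instead of moving on to channel 1.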

I will test this patch as soon as I can, too. First day back from my
vacation today. Whee! (So NOT).

Harald.