[OpenAFS-devel] "Lost contact with file server" problems

Roland Kuhn rkuhn@e18.physik.tu-muenchen.de
Thu, 1 Sep 2005 08:06:18 +0200 (CEST)


Hi Jeff and Derrick!

On Mon, 29 Aug 2005, Jeffrey Hutzelman wrote:

> On Monday, August 29, 2005 12:32:23 +0200 Harald Barth <haba@pdc.kth.se> 
> wrote:
>
>> 
>>> Thanks! Now I can try something ;-) I'll give it a run at my cluster, so
>>> we should know in a few days if it's fixed. Does this patch modify the
>>> kernel module, the afsd or both?
>> 
>> this rx code ends up client side in the kernel module and in the
>> (userland) fileserver. I don't think there is much at all that afsd
>> does after the client has started. Looks to me that the patch changes
>> the behaviour in rx_NewCall() which decides what happens if the call
>> (which probably got recycled in rx.c line 1104) has error set. This
>> should convince the client to throw away the call (where later?). So
>> we're talking about the kernel module, aren't we? I don't know if the
>> error condition propagates to the other end of the rx-connection and
>> the patch is helpful in the fileserver, too.
>> 
>> Hm. What about doing something about call->error sooner around the
>> check for RX_STATE_DALLY? Something like if (call->error)
>> rxi_ResetCall()? Or is that too soon?
>
> No; you're missing the point.  The issue is not that we're recycling a call 
> that _does_ have an error; either rxi_NewCall or rxi_ResetCall will make sure 
> call->error is set to 0.  The problem I'm trying to correct is that when we 
> attach a new call to a _connection_ which is already in an error state, the 
> error doesn't propagate down to the new call.
>
>
> In any event, as Harald points out, this code ends up in pretty much anything 
> that uses Rx.  However, the component you're going to care about is the 
> kernel module, since it's your clients that are originating new calls on 
> connections already in error.
>
Our cluster has now been busy for three days without a single problem, 
which is a good sign that your patch fixed the connection loss problem. 
I'll keep it running with a network trace until the weekend, and then 
I'll mark this problem SOLVED.
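
To check that I understood the explanation above, here is a minimal sketch
in C of the behaviour as I read it. It is not the Rx source and not the
actual patch: the struct and function names (sketch_conn, sketch_call,
sketch_new_call_unpatched, sketch_new_call_patched) are invented for
illustration, and the real structures carry far more state. The only thing
it tries to show is that a recycled call always starts with error == 0, so
a connection that is already in error has to hand its error to the new call
explicitly.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the Rx structures. */
struct sketch_conn {
    int error;                  /* nonzero once the connection has failed */
};

struct sketch_call {
    struct sketch_conn *conn;   /* connection this call is attached to */
    int error;                  /* nonzero means the call must be abandoned */
};

/* Models recycling a call: whatever error it carried before is cleared. */
void sketch_reset_call(struct sketch_call *call)
{
    call->error = 0;
}

/* Assumed pre-patch behaviour: the new call starts clean even though
 * the connection it is attached to is already in an error state. */
void sketch_new_call_unpatched(struct sketch_conn *conn,
                               struct sketch_call *call)
{
    sketch_reset_call(call);
    call->conn = conn;
}

/* Assumed post-patch behaviour: the connection's error is copied down
 * to the new call, so the caller sees the failure right away. */
void sketch_new_call_patched(struct sketch_conn *conn,
                             struct sketch_call *call)
{
    sketch_reset_call(call);
    call->conn = conn;
    if (conn->error != 0)
        call->error = conn->error;
}
```

With a connection whose error is nonzero, the unpatched variant leaves
call->error at 0 and the patched variant leaves it equal to the
connection's error. If that picture is wrong, please correct me.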

Will this patch go into 1.4?

Ciao,
 					Roland