[OpenAFS-devel] "Lost contact with file server" problems

Roland Kuhn rkuhn@e18.physik.tu-muenchen.de
Mon, 22 Aug 2005 14:10:25 +0200 (CEST)


Hi again!

On Mon, 22 Aug 2005, Roland Kuhn wrote:

> Hi Jeffrey!
>
> On Mon, 22 Aug 2005, Jeffrey Altman wrote:
>
>> Roland Kuhn wrote:
>>> Hi folks!
>>> 
>>> On Sun, 21 Aug 2005, Derrick J Brashear wrote:
>>> 
>>>> it needs to include the first error packet, e.g. the window where it
>>>> loses contact, to be useful
>>>> 
>>> Okay, it happened again, and I have a full trace:
>>> 
>>> http://www.e18.physik.tu-muenchen.de/~rkuhn/openafs-fail-trace.cap
>>> http://www.e18.physik.tu-muenchen.de/~rkuhn/openafs-fail-trace-end.cap
>>> 
>>> The latter contains only the last 81 frames and begins a few frames
>>> before the request which fails. The former is 10MB in size. If you need
>>> more history, I also have the last 1GB of the connection available.
>>> 192.168.18.2 is the server, 192.168.18.39 the client. The access is for
>>> big files typically.
>>> 
>>> Ciao,
>>>                     Roland
>> 
>> The Abort code is RXKADEXPIRED (19270409L).   Would you verify that you
>> still have a valid token and that your system clocks are in sync?
>> 
> The clocks are perfectly synchronized and I'm pretty sure that the batch jobs 
> have valid tokens, otherwise I would see other failures as well. Also, 
> wouldn't it be very nasty to effectively disable a complete client because 
> one connection has no valid token?
>
> The other thing is: it is the _client_ which sends the first ABORT in 
> response to a challenge....
>
I've also captured the 'self-healing' of the client state, although I 
can't make sense of it myself. The full trace is at

http://www.e18.physik.tu-muenchen.de/~rkuhn/openafs.cap

It seems that 118 minutes after the failure the client makes a get-time 
call, which succeeds, and after that everything works again.
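
For anyone who wants to pick the ABORT packets out of the traces, here is 
a minimal sketch of decoding an Rx header from a UDP payload. It is only a 
sketch under assumptions: the 28-byte header layout and the packet type 
numbers are as I understand them from rx.h, the names parse_rx, RX_HEADER 
and RXKAD_ERRORS are made up for illustration, and the error table holds 
only the one code mentioned in this thread. Getting the UDP payload out of 
the capture file is left out.

```python
import struct

# Rx packet type numbers (assumed to match OpenAFS rx.h).
RX_PACKET_TYPES = {
    1: "DATA", 2: "ACK", 3: "BUSY", 4: "ABORT",
    5: "ACKALL", 6: "CHALLENGE", 7: "RESPONSE", 8: "DEBUG",
}

# Flag bit set on packets sent by the client side of a connection.
RX_CLIENT_INITIATED = 0x01

# Assumed wire layout, network byte order, 28 bytes in total:
# epoch, cid, callNumber, seq, serial (32 bit each),
# type, flags, userStatus, securityIndex (8 bit each),
# spare/checksum, serviceId (16 bit each).
RX_HEADER = struct.Struct("!IIIIIBBBBHH")

# Hypothetical lookup table; only the code quoted in this thread.
RXKAD_ERRORS = {19270409: "RXKADEXPIRED"}


def parse_rx(payload):
    """Decode the Rx header and, for ABORT packets, the abort code."""
    if len(payload) < RX_HEADER.size:
        raise ValueError("payload shorter than an Rx header")
    (epoch, cid, call, seq, serial,
     ptype, flags, _user_status, sec_index,
     _spare, service_id) = RX_HEADER.unpack_from(payload)
    info = {
        "epoch": epoch,
        "cid": cid,
        "call": call,
        "seq": seq,
        "serial": serial,
        "type": RX_PACKET_TYPES.get(ptype, "type %d" % ptype),
        "from_client": bool(flags & RX_CLIENT_INITIATED),
        "security_index": sec_index,
        "service_id": service_id,
    }
    # An ABORT carries a signed 32-bit error code right after the header.
    if ptype == 4 and len(payload) >= RX_HEADER.size + 4:
        (code,) = struct.unpack_from("!i", payload, RX_HEADER.size)
        info["abort_code"] = code
        info["abort_name"] = RXKAD_ERRORS.get(code, "unknown")
    return info


if __name__ == "__main__":
    # Synthetic packet for demonstration, not taken from the traces.
    demo = RX_HEADER.pack(1, 2, 3, 0, 4, 4, RX_CLIENT_INITIATED, 0, 2, 0, 1)
    demo += struct.pack("!i", 19270409)
    print(parse_rx(demo))
```

The from_client field is what shows which side sent the ABORT, and 
abort_code is the number that has to be looked up in the rxkad error 
table.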

Ciao,
 					Roland