[OpenAFS-devel] "Lost contact with file server" problems
Roland Kuhn
rkuhn@e18.physik.tu-muenchen.de
Mon, 22 Aug 2005 14:10:25 +0200 (CEST)
Hi again!

On Mon, 22 Aug 2005, Roland Kuhn wrote:
> Hi Jeffrey!
>
> On Mon, 22 Aug 2005, Jeffrey Altman wrote:
>
>> Roland Kuhn wrote:
>>> Hi folks!
>>>
>>> On Sun, 21 Aug 2005, Derrick J Brashear wrote:
>>>
>>>> it needs to include the first error packet, e.g. the window where it
>>>> loses contact, to be useful
>>>>
>>> Okay, it happened again, and I have a full trace:
>>>
>>> http://www.e18.physik.tu-muenchen.de/~rkuhn/openafs-fail-trace.cap
>>> http://www.e18.physik.tu-muenchen.de/~rkuhn/openafs-fail-trace-end.cap
>>>
>>> The latter contains only the last 81 frames and begins a few frames
>>> before the request which fails. The former is 10MB in size. If you need
>>> more history, I also have the last 1GB of the connection available.
>>> 192.168.18.2 is the server, 192.168.18.39 the client. The access is for
>>> big files typically.
>>>
>>> Ciao,
>>> Roland
>>
>> The Abort code is RXKADEXPIRED (19270409L). Would you verify that you
>> still have a valid token and that your system clocks are in sync?
>>
> The clocks are perfectly synchronized and I'm pretty sure that the batch jobs
> have valid tokens, otherwise I would see other failures as well. Also,
> wouldn't it be very nasty to effectively disable a complete client because
> one connection has no valid token?
>
> The other thing is: it is the _client_ which sends the first ABORT in
> response to a challenge....
>
I've also captured the 'self-healing' of the client state, although I
can't make anything of it myself. The full trace is at
http://www.e18.physik.tu-muenchen.de/~rkuhn/openafs.cap

It seems that 118 minutes after the failure the client makes a get-time
call, which succeeds, and after that everything works again.

Ciao,
Roland
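
P.S. For anyone reading the traces: the abort code quoted above
(19270409) can be mapped to its symbolic name with a few lines of
Python. This is only a minimal sketch. The base value 19270400 and the
order of the names are assumptions taken from the rxkad error table as
I remember it, and only RXKADEXPIRED = 19270409 is confirmed in this
thread, so please check the rest against the rxkad headers of your
OpenAFS tree. The helper name rxkad_error_name is made up for this
example.

```python
# Assumed base of the rxkad error table (not stated in this thread;
# inferred from RXKADEXPIRED = 19270409 being the tenth entry).
RXKAD_BASE = 19270400

# Assumed order of the rxkad error names; verify against your sources.
RXKAD_ERRORS = [
    "RXKADINCONSISTENCY",
    "RXKADPACKETSHORT",
    "RXKADLEVELFAIL",
    "RXKADTICKETLEN",
    "RXKADOUTOFSEQUENCE",
    "RXKADNOAUTH",
    "RXKADBADKEY",
    "RXKADBADTICKET",
    "RXKADUNKNOWNKEY",
    "RXKADEXPIRED",
    "RXKADSEALEDINCON",
    "RXKADDATALEN",
    "RXKADILLEGALLEVEL",
]


def rxkad_error_name(code):
    """Return the symbolic name for an rxkad abort code, or None."""
    offset = code - RXKAD_BASE
    if 0 <= offset < len(RXKAD_ERRORS):
        return RXKAD_ERRORS[offset]
    return None


if __name__ == "__main__":
    # The abort code seen in the trace discussed in this thread.
    print(rxkad_error_name(19270409))
```

Codes outside the assumed table range return None, so an abort from a
different error table is not mislabelled as an rxkad error.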