[OpenAFS-devel] "Lost contact with file server" problems

Roland Kuhn rkuhn@e18.physik.tu-muenchen.de
Mon, 22 Aug 2005 15:07:45 +0200 (CEST)


Hi Jeffrey!

On Mon, 22 Aug 2005, Jeffrey Altman wrote:

> Roland Kuhn wrote:
>
>>>> The Abort code is RXKADEXPIRED (19270409L).   Would you verify that you
>>>> still have a valid token and that your system clocks are in sync?
>>>>
>>> The clocks are perfectly synchronized and I'm pretty sure that the
>>> batch jobs have valid tokens, otherwise I would see other failures as
>>> well. Also, wouldn't it be very nasty to effectively disable a
>>> complete client because one connection has no valid token?
>>>
>>> The other thing is: it is the _client_ which sends the first ABORT in
>>> response to a challenge....
>>>
>> I've also captured the 'self-healing' of the client state, although I'm
>> not able to make anything of it myself. The full trace is at
>>
>> http://www.e18.physik.tu-muenchen.de/~rkuhn/openafs.cap
>>
>> It seems that 118 minutes after the failure the client makes a get-time
>> call which succeeds, and then everything is happy again.
>>
>> Ciao,
>>                     Roland
>
> I simply interpret that to mean that after 118 minutes the client
> finally dumps the token and starts to make unencrypted file server
> requests.
>
But how would that explain that even other users with completely unrelated 
tokens cannot access files on that fileserver from the failed client?

> What I am seeing here is that the rx library is detecting that the
> token is expired.   It sends an abort to the server which simply
> marks the client's connection in an error state.  Each subsequent
> request from the client on that connection is responded to with the
> expired token abort code.
>
> Now the question is what is the client doing with the RXKADEXPIRED
> error when it receives it from the server.   The answer appears to
> be "not much".   It looks to me as if the client is simply issuing
> a warning to the user that the tokens are expired.   It does not
> actually remove the tokens or reset the connection.
>
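To make the quoted description concrete, here is a toy model of a connection that is marked in an error state by an abort and then answers every later request with the same code. This is only a sketch of the behaviour as described above, not OpenAFS or rx code: the class ToyConnection and its methods are invented for illustration, and the only value taken from this thread is the RXKADEXPIRED code.

```python
# Toy model only, not OpenAFS code: class and method names are invented.
RXKADEXPIRED = 19270409  # abort code quoted earlier in this thread


class ToyConnection:
    """Connection that stays in an error state once it has been aborted."""

    def __init__(self):
        self.error = 0  # 0 means healthy

    def abort(self, code):
        # The abort marks the whole connection, not just one call.
        self.error = code

    def call(self, request):
        # While the error is set, every request gets the same abort code back.
        if self.error:
            return ("ABORT", self.error)
        return ("OK", request)

    def reset(self):
        # Clears the error; per the description above, nothing triggers this.
        self.error = 0


conn = ToyConnection()
print(conn.call("FetchStatus"))   # healthy connection
conn.abort(RXKADEXPIRED)
print(conn.call("FetchStatus"))   # now aborted
print(conn.call("StoreData"))     # still aborted, same code
```

In this model the connection only recovers when something calls reset(), which is the step the description says the client does not take when it merely warns about expired tokens.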
I've had syslog entries like 'kernel: afs: Tokens for user of AFS id -1 
for cell <mycell> have expired', but not from the client which actually 
failed. That one logged 'kernel: afs: failed to store file (110)', where 
110 translates into 'connection timed out', right?
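
A quick way to check that translation on the client itself, as a minimal sketch: the helper name describe_errno is made up for illustration, and errno numbering is platform-specific, so the result for 110 only means something on the machine that logged it (on Linux/x86, 110 is ETIMEDOUT).

```python
import errno
import os


def describe_errno(code):
    """Return 'NAME: message' for an errno value on the current platform."""
    # errno.errorcode maps numeric values to symbolic names for this platform.
    name = errno.errorcode.get(code, "unknown")
    return f"{name}: {os.strerror(code)}"


# Run this on the affected client to see what 110 means there.
print(describe_errno(110))
```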

Ciao,
 					Roland