[OpenAFS-devel] "Lost contact with file server" problems
Roland Kuhn
rkuhn@e18.physik.tu-muenchen.de
Mon, 22 Aug 2005 15:07:45 +0200 (CEST)
Hi Jeffrey!
On Mon, 22 Aug 2005, Jeffrey Altman wrote:
> Roland Kuhn wrote:
>
>>>> The Abort code is RXKADEXPIRED (19270409L). Would you verify that you
>>>> still have a valid token and that your system clocks are in sync?
>>>>
>>> The clocks are perfectly synchronized and I'm pretty sure that the
>>> batch jobs have valid tokens, otherwise I would see other failures as
>>> well. Also, wouldn't it be very nasty to effectively disable a
>>> complete client because one connection has no valid token?
>>>
>>> The other thing is: it is the _client_ which sends the first ABORT in
>>> response to a challenge....
>>>
>> I've also captured the 'self-healing' of the client state, although I'm
>> not able to make something of it myself. The full trace is at
>>
>> http://www.e18.physik.tu-muenchen.de/~rkuhn/openafs.cap
>>
>> It seems that 118 minutes after the failure the client makes a get-time
>> call which succeeds, and then everything is happy again.
>>
>> Ciao,
>> Roland
>
> I simply interpret that to mean that after 118 minutes the client
> finally dumps the token and starts to make unencrypted file server
> requests.
>
But how would that explain that even other users with completely unrelated
tokens cannot access files on that fileserver from the failed client?
> What I am seeing here is that the rx library is detecting that the
> token is expired. It sends an abort to the server which simply
> marks the client's connection in an error state. Each subsequent
> request from the client on that connection is responded to with the
> expired token abort code.
>
> Now the question is what is the client doing with the RXKADEXPIRED
> error when it receives it from the server. The answer appears to
> be "not much". It looks to me as if the client is simply issuing
> a warning to the user that the tokens are expired. It does not
> actually remove the tokens or reset the connection.
>
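For reference, the abort code from the trace can be split into an error-table base and an offset. This is only a rough sketch: it assumes the code follows the usual com_err layout (offset in the low 8 bits) and that 19270400 is the base of the rxkad error table. The helper name split_abort_code is made up for illustration.

```python
# Assumption: rxkad abort codes follow the com_err layout, i.e. an
# error-table base in the upper bits plus a small offset in the low 8 bits.
RXKAD_BASE = 19270400  # assumed base of the rxkad error table


def split_abort_code(code):
    """Return (table_base, offset) for a com_err-style error code."""
    return code & ~0xFF, code & 0xFF


if __name__ == "__main__":
    # 19270409 is the RXKADEXPIRED value quoted earlier in the thread.
    base, offset = split_abort_code(19270409)
    print("base matches rxkad table:", base == RXKAD_BASE)
    print("offset within table:", offset)
```

If the assumption holds, every abort code in the capture whose base matches belongs to the rxkad table, which makes it easier to spot unrelated aborts in the trace.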
I've had syslog entries like 'kernel: afs: Tokens for user of AFS id -1
for cell <mycell> have expired', but not from the client which actually
failed. That one logged 'kernel: afs: failed to store file (110)', where
110 translates into 'connection timed out', right?
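A quick way to check that mapping on the client itself, as a minimal sketch: it assumes the number in the log message is a plain errno value from the kernel, and errno numbering is platform-specific, so it has to be run on the same kind of system (Linux here). The helper name describe_errno is made up for illustration.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic_name, message) for an errno value on this platform."""
    name = errno.errorcode.get(code, "unknown")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 110 is the value from the 'failed to store file (110)' log line.
    name, message = describe_errno(110)
    print(110, name, message)
```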
Ciao,
Roland