[OpenAFS-devel] "Lost contact with file server" problems

Harald Barth haba@pdc.kth.se
Thu, 08 Sep 2005 18:09:16 +0200 (MEST)


> Examine the capture you took yesterday when things were not working. 
> Look for the AFS kvno in one of the messages from the client to the server.
> Since you are using some variant of the Kerberos 5 based tokens, the kvno
> should always be reported as 213 or 256.   If it is anything else, then the
> client is confused. 

Either my snapshot is not good enough or my ethereal driving license is not
adequate.

But I see more symptoms that indicate that we may have hunted but not
completely killed that bug:

Sep  8 17:05:30 d10n03 kxd[1204]: from fjell.pdc.kth.se(130.237.221.161): lama@NADA.KTH.SE -> lama

Sep  8 17:05:32 d10n03 kernel: afs: Lost contact with file server 130.237.232.195 in cell pdc.kth.se (all multi-homed ip addresses down for the server)
Sep  8 17:05:32 d10n03 kernel: afs: Lost contact with file server 130.237.232.195 in cell pdc.kth.se (all multi-homed ip addresses down for the server)

Sep  8 17:05:35 d10n03 kernel: afs: Tokens for user of AFS id 12020 for cell pdc.kth.se have expired

Sep  8 17:12:50 d10n03 kernel: afs: file server 130.237.232.195 in cell pdc.kth.se is back up (multi-homed address; other same-host interfaces may still be down)

1. User logs in, which in this case probably means that an expired
   ticket is used as a token.

2. Client complains that the server holding the user's $HOME
   is down on all of its addresses

3. Client discovers that the token has expired

4. Some minutes later the client recovers. The problem is: if a batch
   job tried to start between 17:06 and 17:11, it would crash because
   AFS is not available at that very moment.
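
To make the window in which a batch job would fail easier to see, here
is a minimal Python sketch that pairs "Lost contact" and "is back up"
kernel messages per file server. The regular expressions are written
only against the log lines quoted above, the helper names are made up
for illustration, and the year is assumed from the date of this mail
(syslog does not record it). Month names are parsed with %b, so an
English/C locale is assumed.

```python
import re
from datetime import datetime

# Sample input: the kernel lines quoted in this message.
LOG = """\
Sep  8 17:05:32 d10n03 kernel: afs: Lost contact with file server 130.237.232.195 in cell pdc.kth.se (all multi-homed ip addresses down for the server)
Sep  8 17:05:35 d10n03 kernel: afs: Tokens for user of AFS id 12020 for cell pdc.kth.se have expired
Sep  8 17:12:50 d10n03 kernel: afs: file server 130.237.232.195 in cell pdc.kth.se is back up (multi-homed address; other same-host interfaces may still be down)
"""

# Patterns derived from the lines above only; other releases may word them differently.
STAMP = r"(\w{3}\s+\d+ \d\d:\d\d:\d\d)"
DOWN = re.compile(STAMP + r" \S+ kernel: afs: Lost contact with file server (\S+)")
UP = re.compile(STAMP + r" \S+ kernel: afs: file server (\S+) in cell \S+ is back up")


def parse_stamp(text, year=2005):
    """Parse a syslog timestamp; the year is an assumption, syslog omits it."""
    normalized = " ".join(text.split())  # collapse the padding in "Sep  8"
    return datetime.strptime(f"{year} {normalized}", "%Y %b %d %H:%M:%S")


def outage_windows(log_text):
    """Return (server, down_time, up_time) for each down/up pair found."""
    down_since = {}
    windows = []
    for line in log_text.splitlines():
        m = DOWN.match(line)
        if m:
            # Keep the first "Lost contact" time; repeated messages are ignored.
            down_since.setdefault(m.group(2), parse_stamp(m.group(1)))
            continue
        m = UP.match(line)
        if m and m.group(2) in down_since:
            start = down_since.pop(m.group(2))
            windows.append((m.group(2), start, parse_stamp(m.group(1))))
    return windows


for server, start, end in outage_windows(LOG):
    print(server, start.time(), end.time(), end - start)
```

For the lines quoted above this reports one window for 130.237.232.195,
from 17:05:32 to 17:12:50.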

So how can I prevent the server from being flagged down because of an
expired token? It still looks like a timing issue to me: sometimes the
server is flagged down first (which causes great grief), and sometimes
the client discovers that this one connection was no big deal and
nukes the connection first.

The AFS version is 1.3.87, which has the patch
checkservers-set-back-deadtime-correctly-20050804, and I added the patch
rx-propagate-error-20050902 too.

Roland: And you don't see this any more? In this case: Lucky you.

Harald.