[OpenAFS-devel] "Lost contact with file server" problems

Roland Kuhn rkuhn@e18.physik.tu-muenchen.de
Fri, 9 Sep 2005 10:00:54 +0200

Hi Harald!

On 8 Sep 2005, at 18:09, Harald Barth wrote:

>> Examine the capture you took yesterday when things were not working.
>> Look for the AFS kvno in one of the messages from the client to the
>> server. Since you are using some variant of the Kerberos 5 based
>> tokens, the kvno should always be reported as 213 or 256. If it is
>> anything else, then the client is confused.
> Either my snapshot is not good enough or my ethereal driving license
> is not adequate.
> But I see more symptoms that indicate that we may have hunted but not
> completely killed that bug:
> Sep  8 17:05:30 d10n03 kxd[1204]: from fjell.pdc.kth.se ( lama@NADA.KTH.SE -> lama
> Sep  8 17:05:32 d10n03 kernel: afs: Lost contact with file server in cell pdc.kth.se (all multi-homed ip addresses down for the server)
> Sep  8 17:05:32 d10n03 kernel: afs: Lost contact with file server in cell pdc.kth.se (all multi-homed ip addresses down for the server)
> Sep  8 17:05:35 d10n03 kernel: afs: Tokens for user of AFS id 12020 for cell pdc.kth.se have expired
> Sep  8 17:12:50 d10n03 kernel: afs: file server in cell pdc.kth.se is back up (multi-homed address; other same-host interfaces may still be down)
> 1. User logs in, which in this case probably means that an expired
>    ticket is used as a token.
AFAICT this did not happen here; no tickets are involved. The batch job
gets its token from the password via the AFS library.

> 2. Client complains that the server which has the user's $HOME
>    is all down
Here it didn't affect /afs but only the fileserver which hosts the  
big data files.

> 3. Client discovers that the token has expired
I've never seen the 'Tokens expired' log message in connection with
the "Lost contact" one; they were mutually exclusive. The only
message logged between down and back was 'afs: failed to store file
(110)' (110 -> Connection timed out), sometimes several times.
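
For comparing such incidents between sites, here is a rough Python sketch that pairs each "Lost contact" message with the next "is back up" message and counts the 'failed to store file' messages in between. The names `LINE`, `SAMPLE` and `outage_windows` are made up for illustration. It assumes syslog lines in the format quoted above, joined onto one line each, and English month abbreviations. It keys on the client host name only because the excerpt shows no server address, and the 'failed to store file' sample line has an invented timestamp.

```python
import re
from datetime import datetime

# Matches "<Mon> <day> <HH:MM:SS> <host> kernel: afs: <message>".
LINE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) kernel: afs: (.*)$")

# Two lines taken from the log quoted above (joined onto one line each);
# the "failed to store file" line is an invented example, not real data.
SAMPLE = [
    "Sep  8 17:05:32 d10n03 kernel: afs: Lost contact with file server "
    "in cell pdc.kth.se (all multi-homed ip addresses down for the server)",
    "Sep  8 17:08:00 d10n03 kernel: afs: failed to store file (110)",
    "Sep  8 17:12:50 d10n03 kernel: afs: file server in cell pdc.kth.se "
    "is back up (multi-homed address; other same-host interfaces may "
    "still be down)",
]


def outage_windows(lines, year=2005):
    """Return (host, down_since, duration, store_errors) per outage.

    Syslog timestamps carry no year, so the caller has to supply one.
    """
    down_since = {}    # host -> time of the first "Lost contact" message
    store_errors = {}  # host -> "failed to store file" count while down
    windows = []
    for line in lines:
        m = LINE.match(line)
        if m is None:
            continue  # not an afs kernel message
        stamp, host, msg = m.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if "Lost contact with file server" in msg:
            # Repeated "Lost contact" lines do not restart the window.
            down_since.setdefault(host, when)
            store_errors.setdefault(host, 0)
        elif "failed to store file" in msg and host in down_since:
            store_errors[host] += 1
        elif "is back up" in msg and host in down_since:
            start = down_since.pop(host)
            windows.append((host, start, when - start, store_errors.pop(host)))
    return windows


if __name__ == "__main__":
    for host, start, duration, errors in outage_windows(SAMPLE):
        print(f"{host}: down at {start}, for {duration}, {errors} store error(s)")
```

With real logs the server address in the message would be a better key than the client host name, since one client can lose several file servers at once.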

> 4. Some minutes later the client recovers. Problem is: if a batch
> job tried to start between 17:06 and 17:11, it would crash because AFS
> is not available at that very moment.
Well, that sounds familiar, but here it took almost two hours in all.

> So how can I prevent the server from being flagged down because of an
> expired token? Seems to me still like a timing issue - sometimes the
> server is flagged down first (which gives great grief) and sometimes
> the client discovers that this one connection was no big deal and
> nukes the connection first.
> AFS version is 1.3.87, which has patch
> checkservers-set-back-deadtime-correctly-20050804, and I added patch
> rx-propagate-error-20050902 too.
> Roland: And you don't see this any more? In this case: Lucky you.
If only I had the time to try the multi-threaded fileserver, _then_ I  
would be lucky ;-) Using the single-threaded one with 40 clients  
reading simultaneously isn't fun at all :-(


TU Muenchen, Physik-Department E18, James-Franck-Str. 85747 Garching
Telefon 089/289-12592; Telefax 089/289-12570
A mouse is a device used to point at
the xterm you want to type in.
Kim Alm on a.s.r.
