[OpenAFS-devel] "Lost contact with file server" problems
Roland Kuhn
rkuhn@e18.physik.tu-muenchen.de
Fri, 9 Sep 2005 10:00:54 +0200
Hi Harald!
On 8 Sep 2005, at 18:09, Harald Barth wrote:
>> Examine the capture you took yesterday when things were not working.
>> Look for the AFS kvno in one of the messages from the client to
>> the server.
>> Since you are using some variant of the Kerberos 5 based tokens, the kvno
>> should always be reported as 213 or 256. If it is anything else, then the
>> client is confused.
>>
>
> Either my snapshot is not good enough or my Ethereal driving license is
> not adequate.
>
> But I see more symptoms that indicate that we may have hunted but not
> completely killed that bug:
>
> Sep 8 17:05:30 d10n03 kxd[1204]: from fjell.pdc.kth.se
> (130.237.221.161): lama@NADA.KTH.SE -> lama
>
> Sep 8 17:05:32 d10n03 kernel: afs: Lost contact with file server
> 130.237.232.195 in cell pdc.kth.se (all multi-homed ip addresses
> down for the server)
> Sep 8 17:05:32 d10n03 kernel: afs: Lost contact with file server
> 130.237.232.195 in cell pdc.kth.se (all multi-homed ip addresses
> down for the server)
>
> Sep 8 17:05:35 d10n03 kernel: afs: Tokens for user of AFS id 12020
> for cell pdc.kth.se have expired
>
> Sep 8 17:12:50 d10n03 kernel: afs: file server 130.237.232.195 in
> cell pdc.kth.se is back up (multi-homed address; other same-host
> interfaces may still be down)
>
> 1. User logs in, which in this case probably means that an expired
> ticket is used as a token.
>
AFAICT this did not happen here; no tickets are involved. The batch job
gets its token from a password via the AFS library.
> 2. Client complains that the server which has the user's $HOME
> is all down
>
Here it didn't affect /afs but only the fileserver which hosts the
big data files.
> 3. Client discovers that the token has expired
>
I've never seen the 'Tokens expired' log message in connection with
the "Lost contact" one; they were mutually exclusive. The only
message logged between down and back up was 'afs: failed to store file
(110)' (110 -> Connection timed out), sometimes several times.
> 4. Some minutes later the client recovers. Problem is: if a batch
> job tried to start between 17:06 and 17:11, it would crash because AFS
> is not available at that very moment.
>
Well, that sounds familiar, but here it took almost two hours in all
cases.
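To compare outage lengths (minutes in the log quoted above, almost two
hours here), the "Lost contact" and "is back up" messages can be paired
per file server. A minimal sketch in Python, assuming unwrapped syslog
lines worded like the ones quoted above; the function names and the
default year are only illustrative and not part of OpenAFS:

```python
import re
from datetime import datetime

# Patterns follow the kernel messages quoted in this thread; other
# OpenAFS versions may word them differently (assumption).
DOWN = re.compile(r"afs: Lost contact with file server (\S+)")
UP = re.compile(r"afs: file server (\S+) in cell \S+ is back up")
STAMP = re.compile(r"^(\w{3})\s+(\d+)\s+(\d\d:\d\d:\d\d)")


def parse_stamp(line, year=2005):
    """Return the syslog timestamp of a line, or None if it has none.

    Syslog omits the year, so it has to be supplied by the caller.
    """
    m = STAMP.match(line)
    if not m:
        return None
    month, day, clock = m.groups()
    return datetime.strptime(f"{year} {month} {day} {clock}",
                             "%Y %b %d %H:%M:%S")


def outages(lines):
    """Yield (server, down_time, up_time) for each down/up pair."""
    down_since = {}
    for line in lines:
        when = parse_stamp(line)
        if when is None:
            continue
        m = DOWN.search(line)
        if m:
            # Repeated "Lost contact" lines keep the first timestamp.
            down_since.setdefault(m.group(1), when)
            continue
        m = UP.search(line)
        if m and m.group(1) in down_since:
            yield m.group(1), down_since.pop(m.group(1)), when


if __name__ == "__main__":
    import sys
    for server, down, up in outages(sys.stdin):
        print(server, "down for", up - down)
```

Feeding it /var/log/messages on stdin prints one line per outage. A
server that never comes back up is silently dropped, which is fine for
a quick comparison but not for monitoring.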
> So how can I prevent the server from being flagged down because of an
> expired token? It still seems to me like a timing issue - sometimes the
> server is flagged down first (which gives great grief) and sometimes
> the client discovers that this one connection was no big deal and
> nukes the connection first.
>
> AFS version is 1.3.87, which has patch
> checkservers-set-back-deadtime-correctly-20050804, and I added patch
> rx-propagate-error-20050902 too.
>
> Roland: And you don't see this any more? In this case: Lucky you.
>
If only I had the time to try the multi-threaded fileserver, _then_ I
would be lucky ;-) Using the single-threaded one with 40 clients
reading simultaneously isn't fun at all :-(
Ciao,
Roland
--
TU Muenchen, Physik-Department E18, James-Franck-Str. 85747 Garching
Telefon 089/289-12592; Telefax 089/289-12570
--
A mouse is a device used to point at
the xterm you want to type in.
Kim Alm on a.s.r.
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GS/CS/M/MU d-(++) s:+ a-> C+++ UL++++ P-(+) L+++ E(+) W+ !N K- w--- M
+ !V Y+
PGP++ t+(++) 5 R+ tv-- b+ DI++ e+++>++++ h---- y+++
------END GEEK CODE BLOCK------