[OpenAFS] Still getting client hangups

Systems Administration sysadmin@contrailservices.com
Mon, 20 Sep 2004 14:32:29 -0600

I've been busy working on revenue stuff, but the client hangups on 
OpenAFS are still plaguing me.  We've tried debugging the server by 
running both the LWP and pthreads versions of the fileserver, but the 
threading model does not appear to be the source of the problem, as the 
hangup is identical with either binary.

My latest test case produced the following tcpdump output, captured on 
the fileserver:

14:19:39.152698 IP (tos 0x0, ttl  64, id 24562, offset 0, flags [DF], 
length: 93) turbine..afs3-fileserver > aileron..afs3-callback: [udp sum 
ok]  rx ack cid 0507035c call# 7 seq 0 ser 14 <slow-start> first 2 
serial 16 reason ping response ifmtu 1444 maxmtu 1444 rwind 32 
maxpackets 4 (65)
14:19:39.245325 IP (tos 0x0, ttl  64, id 25994, offset 0, flags [DF], 
length: 1468) turbine..afs3-fileserver > aileron..afs3-callback: [udp 
sum ok]  rx data cid 0507035c call# 7 seq 1 ser 15 <req-ack> fs reply 
fetch-data (1440)

Can anyone help me decode these entries?  As I read it, the client 
appears to be deaf somehow, since all the packets in the dump are from 
the fileserver only.  Is there some way to trace the client library 
operations, to compare the state the client thinks it's in with what 
the server expects?  This looks like a classic deadlock to me, where 
one side or the other reaches a state not matched by the other and no 
fallback can be performed.
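
To check the "deaf client" reading against a longer capture, something 
like the following could tally rx packets per direction.  It is only a 
rough sketch: the regular expressions are modelled on the two records 
quoted above and may not match other tcpdump versions or options, and 
the function names (split_records, parse_rx, direction_counts) are made 
up for illustration.

```python
import re
from collections import Counter

# A record is assumed to start with a timestamp such as 14:19:39.152698.
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d+\s", re.MULTILINE)

# Field layout assumed from the two records quoted in this message.
RECORD = re.compile(
    r"(?P<src>\S+)\s+>\s+(?P<dst>\S+?):\s.*?"
    r"rx\s+(?P<type>\w+)\s+cid\s+(?P<cid>[0-9a-f]+)\s+"
    r"call#\s+(?P<call>\d+)\s+seq\s+(?P<seq>\d+)\s+ser\s+(?P<ser>\d+)"
)


def split_records(dump):
    """Rejoin wrapped lines so that each packet becomes one string."""
    starts = [m.start() for m in TIMESTAMP.finditer(dump)]
    ends = starts[1:] + [len(dump)]
    return [" ".join(dump[s:e].split()) for s, e in zip(starts, ends)]


def parse_rx(dump):
    """Return one dict per rx packet that matches the assumed layout."""
    packets = []
    for record in split_records(dump):
        match = RECORD.search(record)
        if match is None:
            continue  # not an rx packet, or a layout this sketch misses
        fields = match.groupdict()
        for key in ("call", "seq", "ser"):
            fields[key] = int(fields[key])
        packets.append(fields)
    return packets


def direction_counts(packets):
    """Count packets per (source, destination) pair."""
    return Counter((p["src"], p["dst"]) for p in packets)
```

If direction_counts() on the full capture only ever shows the 
fileserver as the source, that would be consistent with the client 
never answering; it would not by itself say why.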

I have further stress tested the installation and found:
	Between the 3 fileserver hosts that are on the same Gigabit backbone, 
no hangups have been observed.
	From a client running a simple loop which downloads 5k, 50k, 500k, 5M 
and 50M files, flushes the cache, then repeats, no full hangups occur 
(yet), but plenty of short pauses occur where there is no sustained 
network activity between the server and client.
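
For anyone who wants to reproduce the client loop, here is a minimal 
sketch of that kind of test.  It is not the script I actually ran: the 
helper names (timed_read, run_pass), the 64k chunk size and the 
one-second stall threshold are all arbitrary choices for illustration, 
and the cache-flush command is left to the caller.

```python
import subprocess
import time


def timed_read(path, chunk_size=64 * 1024, stall_threshold=1.0):
    """Read a file in chunks and note any single read that stalls.

    Returns (bytes_read, total_seconds, stalls), where stalls is a list
    of (offset, seconds) for reads slower than stall_threshold.
    """
    stalls = []
    total = 0
    start = time.monotonic()
    with open(path, "rb", buffering=0) as handle:
        while True:
            before = time.monotonic()
            chunk = handle.read(chunk_size)
            elapsed = time.monotonic() - before
            if elapsed >= stall_threshold:
                stalls.append((total, elapsed))
            if not chunk:
                break
            total += len(chunk)
    return total, time.monotonic() - start, stalls


def run_pass(paths, flush_cmd=None, stall_threshold=1.0):
    """Read every file once, then optionally flush each from the cache.

    flush_cmd is a caller-supplied command prefix; the file path is
    appended to it. Passing None skips the flush step.
    """
    results = []
    for path in paths:
        size, seconds, stalls = timed_read(
            path, stall_threshold=stall_threshold
        )
        results.append((str(path), size, seconds, stalls))
    if flush_cmd:
        for path in paths:
            subprocess.run(list(flush_cmd) + [str(path)], check=False)
    return results
```

Calling run_pass() in a loop with flush_cmd=("fs", "flush") should 
approximate the test described above, assuming fs flush is the right 
way to discard the cached copies on your client.  The stalls list 
records the offsets at which the short pauses showed up.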

Additionally, I now have one client that completely refuses to connect 
to my home directory at all.  It always locks up accessing this volume, 
but other AFS volumes are OK.

Very odd.