[OpenAFS] Still getting client hangups
Systems Administration
sysadmin@contrailservices.com
Mon, 20 Sep 2004 14:32:29 -0600
I've been busy working on revenue stuff, but the client hangups on
OpenAFS are still plaguing me. We've tried debugging the server by
running both the LWP and pthreads versions of the fileserver, but the
server does not appear to be the source of the problem, as the hangup
is identical with either binary.
My latest test case produced the following tcpdump output, captured on
the fileserver:
14:19:39.152698 IP (tos 0x0, ttl 64, id 24562, offset 0, flags [DF],
length: 93) turbine..afs3-fileserver > aileron..afs3-callback: [udp sum
ok] rx ack cid 0507035c call# 7 seq 0 ser 14 <slow-start> first 2
serial 16 reason ping response ifmtu 1444 maxmtu 1444 rwind 32
maxpackets 4 (65)
14:19:39.245325 IP (tos 0x0, ttl 64, id 25994, offset 0, flags [DF],
length: 1468) turbine..afs3-fileserver > aileron..afs3-callback: [udp
sum ok] rx data cid 0507035c call# 7 seq 1 ser 15 <req-ack> fs reply
fetch-data (1440)
Can anyone help me decode these entries? As I read it, the client
appears to be deaf somehow, as all the packets in the dump are from the
fileserver only. Is there some way to trace the client library
operations, to compare the state the client thinks it's in with what
the server expects? This looks like a classic deadlock to me, where one
side or the other gets to a state not matched by the other and no
fallback can be performed.
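To check the "packets from the fileserver only" impression against a
longer capture, something like the rough Python sketch below could tally
packets per direction. It assumes the capture was saved as tcpdump text
output in the same format as the excerpt above. The names
tally_directions and DIRECTION are made up for this illustration and are
not part of any OpenAFS or tcpdump tooling.

```python
import re
from collections import Counter

# Matches the "src > dst:" part of a tcpdump record, e.g.
# "turbine..afs3-fileserver > aileron..afs3-callback:"
# Whitespace is required before ">" so flags such as "<req-ack>" are skipped.
DIRECTION = re.compile(r"(\S+)\s+>\s+(\S+?):")


def tally_directions(dump_text):
    """Count packets per (source, destination) pair in tcpdump text output."""
    # tcpdump records are wrapped across lines in the mail, so collapse
    # all whitespace before searching.
    flat = " ".join(dump_text.split())
    return Counter(DIRECTION.findall(flat))


if __name__ == "__main__":
    import sys

    counts = tally_directions(sys.stdin.read())
    for (src, dst), n in counts.most_common():
        print(f"{n:6d}  {src} -> {dst}")
```

If only one direction shows up in the tally, the capture really does
contain no client packets. Whether they were never sent or were dropped
before reaching the capture point would still need a matching dump taken
on the client.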
I have further stress-tested the installation, with these results:
Between the 3 fileserver hosts, which are all on the same Gigabit
backbone, no hangups have been observed.
From a client running a simple loop that downloads 5k, 50k, 500k, 5M
and 50M files, flushes the cache, then repeats, no full hangups occur
(yet), but there are plenty of short pauses during which there is no
sustained network activity between the server and client.
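For anyone wanting to reproduce this, a loop of that kind could look
roughly like the Python sketch below. It is not the script actually
used here. It assumes the test files are read through the AFS mount,
and the names timed_read and stress_loop and the one-second stall
threshold are made up for illustration. The flush step is passed in as a
command list, for example ["fs", "flushvolume", "-path", test_dir], where
test_dir is whatever directory holds the test files.

```python
import subprocess
import time


def timed_read(path, chunk_size=64 * 1024, stall_threshold=1.0):
    """Read a file sequentially; return (bytes_read, elapsed, stalls).

    stalls is a list of (offset, seconds) for every single read call
    that took longer than stall_threshold seconds.
    """
    stalls = []
    total = 0
    start = time.monotonic()
    # buffering=0 so each read() goes straight to the filesystem
    with open(path, "rb", buffering=0) as f:
        while True:
            t0 = time.monotonic()
            data = f.read(chunk_size)
            waited = time.monotonic() - t0
            if waited > stall_threshold:
                stalls.append((total, waited))
            if not data:
                break
            total += len(data)
    return total, time.monotonic() - start, stalls


def stress_loop(paths, flush_cmd=None, rounds=1):
    """Read every file in paths, optionally flush the cache, and repeat."""
    results = []
    for _ in range(rounds):
        for path in paths:
            results.append((path,) + timed_read(path))
        if flush_cmd:
            # Assumed flush step, supplied by the caller as an argv list
            subprocess.run(flush_cmd, check=True)
    return results
```

Each result is (path, bytes_read, elapsed, stalls), so short pauses show
up as entries in the stalls list with the file offset where they
happened, which can then be lined up against tcpdump timestamps.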
Additionally, I now have one client that completely refuses to access
my home directory at all. It always locks up accessing this volume,
while other AFS volumes are OK.
Very odd.
Ted