[OpenAFS] Linux server/client hangs and crashes
Systems Administration
sysadmin@contrailservices.com
Tue, 10 Aug 2004 18:58:08 -0600
I've been debugging this with Derrick's help over on the darwin-port
list, but this is clearly a bigger issue, as the hangs/freezes now show
up on the Linux boxen as well.
Server: Linux 2.4.26 (gentoo-hardened-sources-2.4.26-r1) -
openafs-1.2.11 - single and dual-processor PIII boxen
Clients: Linux 2.4.26 (gentoo-hardened-sources-2.4.26-r1) - openafs-1.2.11
         Linux 2.4.26 (vanilla unpatched kernel.org source) - openafs-1.2.11
         Mac OS X 10.3.4 - openafs-1.2.11-062304
Is there any chance, even a slim one, that the 1.3.X releases would
behave better?
I have tried using the lwp version of fileserver, wrapped the pthreads
version with LD_ASSUME_KERNEL=2.4.1, and done tcpdump, cmdebug,
fstrace, etc. etc. ad nauseam.
Clients can (usually) access any single file in the afs tree, and file
permissions are correct. When attempting to add or remove a group of
files (untarring an archive or '/bin/rm -rf <dirname>'), however, the
client will hang. Other machines can still access the same locations and
view files in the affected areas. The client machine is left hung with a
process that cannot be killed. Some single file accesses will also hang
the client at random intervals, but any mass add/remove will always
cause the hang.
The only way to get a timeout is to manually stop and restart the
fileserver which is hosting the volume - 'bos shutdown <host>'. This
causes the client to report a communications timeout, and everything
seems to go back to normal after 10 minutes when the old cache entries
expire and the client restarts a connection to the fileserver. Of
course this is only temporary: any attempt to retry the offending
command will repeat the hang.
When the client hangs cmdebug shows:
[lightning:~] ted% cmdebug localhost
** Cache entry @ 0x0d765c30 for 1.536871033.1053.3823
[ridgebacksystems.com]
locks: (none_waiting, upgrade_locked(pid:642 at:66))
2048 bytes DV 16 refcnt 5
callback 015c85c0 expires 1092186445
1 opens 0 writers
normal file
states (0x1), stat'd
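For anyone reading along, here is a small sketch for turning the
"expires" number in the cmdebug output into a readable time. It assumes
the value is plain Unix epoch seconds; the function name and the
utc_offset_hours parameter are made up for illustration, and the -6
offset is simply the -0600 from this mail's Date header.

```python
from datetime import datetime, timedelta, timezone

def callback_expiry(epoch_seconds, utc_offset_hours=0):
    """Convert a cmdebug 'expires' value (assumed to be Unix epoch
    seconds) into a timezone-aware datetime at the given UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch_seconds, tz=tz)

# Value copied from the cache entry above.
print(callback_expiry(1092186445).isoformat())
print(callback_expiry(1092186445, utc_offset_hours=-6).isoformat())
```

If the assumption holds, this puts the callback expiry on the evening of
10 Aug 2004 local time, which can then be compared against the
timestamps in a packet capture taken during the hang.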
tcpdump shows:
18:49:00.636941 IP (tos 0x0, ttl 64, id 38936, offset 0, flags [none],
length: 93) lightning.internal.contrailservices.com.afs3-callback >
turbine.internal.ridgebacksystems.com.afs3-fileserver: rx ack cid
f224e05c call# 1584 seq 0 ser 3492 <client-init>,<req-ack>,<slow-start>
first 1 serial 0 reason ping ifmtu 5692 (65)
18:49:00.637266 IP (tos 0x0, ttl 64, id 43465, offset 0, flags [DF],
length: 93) turbine.internal.ridgebacksystems.com.afs3-fileserver >
lightning.internal.contrailservices.com.afs3-callback: rx ack cid
f224e05c call# 1584 seq 0 ser 168 <slow-start> first 2 serial 3492
reason ping response ifmtu 5692 (65)
18:49:09.556463 IP (tos 0x0, ttl 64, id 40073, offset 0, flags [DF],
length: 93) turbine.internal.ridgebacksystems.com.afs3-fileserver >
lightning.internal.contrailservices.com.afs3-callback: rx ack cid
f224e05c call# 1584 seq 0 ser 169 <req-ack>,<slow-start> first 2 serial
0 reason ping ifmtu 5692 (65)
18:49:09.556729 IP (tos 0x0, ttl 64, id 39218, offset 0, flags [none],
length: 93) lightning.internal.contrailservices.com.afs3-callback >
turbine.internal.ridgebacksystems.com.afs3-fileserver: rx ack cid
f224e05c call# 1584 seq 0 ser 3493 <client-init>,<slow-start> first 1
serial 169 reason ping response ifmtu 5692 (65)
Rinse, lather, repeat... forever. As far as I can see, tcpdump shows
the same ping / ping response exchange repeating for as long as I have
the patience to let it run.
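To check a longer capture for the same pattern without eyeballing it,
something like the following rough sketch could tally the rx ack
packets per connection, call and reason. It assumes tcpdump's text
output looks like the excerpt above (including the line wrapping); the
regular expression and the function name are my own and are not part
of any OpenAFS or tcpdump tool.

```python
import re
from collections import Counter

# Matches the "rx ack" part of a record once line wraps are flattened.
# "ping response" is listed before "ping" so the longer reason wins.
ACK = re.compile(
    r"rx ack cid (?P<cid>[0-9a-f]+) call# (?P<call>\d+) seq (?P<seq>\d+) "
    r"ser (?P<ser>\d+).*?reason (?P<reason>ping response|ping|\w+)"
)

def summarize_rx_acks(capture_text):
    """Count rx ack packets by (cid, call number, ack reason)."""
    # Collapse all whitespace so wrapped records become one long line.
    flat = " ".join(capture_text.split())
    counts = Counter()
    for m in ACK.finditer(flat):
        counts[(m["cid"], int(m["call"]), m["reason"])] += 1
    return counts

if __name__ == "__main__":
    import sys
    for key, n in sorted(summarize_rx_acks(sys.stdin.read()).items()):
        print(key, n)
```

Fed the capture text on stdin, a call that only ever shows up with the
reasons "ping" and "ping response" would match what is described here: a
call that is being kept alive but is not moving any data.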
When the client times out it reports:
afs: Lost contact with file server 192.168.1.2 in cell
ridgebacksystems.com (all multi-homed ip addresses down for the server)
When I tried to use fstrace on the fileserver - bam, kernel panics right
and left. I'm trying to set up a serial console to capture these now.
--****** Automated management services for General Aviation *******--
Theodore F Vaida <ted@contrailservices.com>
President and CTO
3300 Airport Road, Building J Box E, Boulder CO 80301
phone: 303.225.4625 fax: 303.225.4627