[OpenAFS] Linux server/client hangs and crashes

Systems Administration sysadmin@contrailservices.com
Tue, 10 Aug 2004 18:58:08 -0600


I've been debugging this with Derrick's help over on the darwin-port 
list, but this is clearly a bigger issue, as the hangs/freezes now 
occur on the Linux boxen as well.

Server: Linux 2.4.26 (gentoo-hardened-sources-2.4.26-r1) - 
openafs-1.2.11 - single and dual-processor PIII boxen
Client: Linux 2.4.26 (gentoo-hardened-sources-2.4.26-r1)  - openafs-1.2.11
	Linux 2.4.26 (vanilla unpatched kernel.org source) - openafs-1.2.11
	Mac OS X 10.3.4 - openafs-1.2.11-062304

Is there any chance, even a slim one, that the 1.3.X releases would 
behave better?



I have tried the LWP version of the fileserver, wrapped the pthreads 
version with LD_ASSUME_KERNEL=2.4.1, and run tcpdump, cmdebug, 
fstrace, etc., ad nauseam.

Clients can (usually) access any single file in the AFS tree, and file 
permissions are correct. When attempting to add or remove a group of 
files (untarring an archive or '/bin/rm -rf <dirname>'), the client 
hangs.  Other machines can still access the same locations and view 
files in the affected areas.  The client machine is left with a hung 
process that cannot be killed.  Some single-file accesses also hang 
the client at random intervals, but any mass add/remove always causes 
the hang.

The only way to get a timeout is to manually stop and restart the 
fileserver that hosts the volume - 'bos shutdown <host>'.  This 
causes the client to report a communications timeout, and everything 
seems to go back to normal after 10 minutes, when the old cache entries 
expire and the client reconnects to the fileserver.  Of course, this 
is only temporary: any attempt to retry the offending command repeats 
the hang.


When the client hangs, cmdebug shows:
[lightning:~] ted% cmdebug localhost
** Cache entry @ 0x0d765c30 for 1.536871033.1053.3823 
[ridgebacksystems.com]
     locks: (none_waiting, upgrade_locked(pid:642 at:66))
     2048 bytes  DV 16 refcnt 5
     callback 015c85c0   expires 1092186445
     1 opens     0 writers
     normal file
     states (0x1), stat'd
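
For anyone reading along, the "expires" value on the callback line looks 
like a Unix epoch timestamp (that is my assumption, not something cmdebug 
states in this output). A minimal Python sketch to turn it into a readable 
UTC time; the function name is made up for illustration:

```python
from datetime import datetime, timezone


def callback_expiry_utc(epoch_seconds: int) -> datetime:
    """Interpret a cmdebug 'expires' value as seconds since the Unix epoch (assumed)."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


if __name__ == "__main__":
    # Value copied from the cmdebug output above.
    print(callback_expiry_utc(1092186445).isoformat())
```

Under that assumption the callback expires at 2004-08-11 01:07:25 UTC, 
which is a little under ten minutes after this message was sent.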

tcpdump shows:
18:49:00.636941 IP (tos 0x0, ttl  64, id 38936, offset 0, flags [none], 
length: 93) lightning.internal.contrailservices.com.afs3-callback > 
turbine.internal.ridgebacksystems.com.afs3-fileserver:  rx ack cid 
f224e05c call# 1584 seq 0 ser 3492 <client-init>,<req-ack>,<slow-start> 
first 1 serial 0 reason ping ifmtu 5692 (65)
18:49:00.637266 IP (tos 0x0, ttl  64, id 43465, offset 0, flags [DF], 
length: 93) turbine.internal.ridgebacksystems.com.afs3-fileserver > 
lightning.internal.contrailservices.com.afs3-callback:  rx ack cid 
f224e05c call# 1584 seq 0 ser 168 <slow-start> first 2 serial 3492 
reason ping response ifmtu 5692 (65)
18:49:09.556463 IP (tos 0x0, ttl  64, id 40073, offset 0, flags [DF], 
length: 93) turbine.internal.ridgebacksystems.com.afs3-fileserver > 
lightning.internal.contrailservices.com.afs3-callback:  rx ack cid 
f224e05c call# 1584 seq 0 ser 169 <req-ack>,<slow-start> first 2 serial 
0 reason ping ifmtu 5692 (65)
18:49:09.556729 IP (tos 0x0, ttl  64, id 39218, offset 0, flags [none], 
length: 93) lightning.internal.contrailservices.com.afs3-callback > 
turbine.internal.ridgebacksystems.com.afs3-fileserver:  rx ack cid 
f224e05c call# 1584 seq 0 ser 3493 <client-init>,<slow-start> first 1 
serial 169 reason ping response ifmtu 5692 (65)

Rinse, lather, repeat... forever. As far as I can see, tcpdump shows 
the same two lines repeating for as long as I have the patience to let 
it run.
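
To put a number on the repetition rather than eyeballing it, the capture 
can be tallied per call. This is only a sketch: the regular expression is 
inferred from the tcpdump lines above rather than from any documented 
format, and the function and variable names are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the "rx ack" records in the capture above (an assumption).
ACK_RE = re.compile(r"rx ack cid (\w+) call# (\d+).*?reason (ping response|ping)")

# Two records from the capture, abbreviated but keeping the mail line wrapping.
SAMPLE = """
18:49:00.636941 IP lightning > turbine:  rx ack cid
f224e05c call# 1584 seq 0 ser 3492 <client-init>,<req-ack>,<slow-start>
first 1 serial 0 reason ping ifmtu 5692 (65)
18:49:00.637266 IP turbine > lightning:  rx ack cid
f224e05c call# 1584 seq 0 ser 168 <slow-start> first 2 serial 3492
reason ping response ifmtu 5692 (65)
"""


def count_ping_acks(capture_text: str) -> Counter:
    """Count rx ack packets per (connection id, call number, ack reason)."""
    # Records are wrapped across lines, so collapse all whitespace first.
    flat = " ".join(capture_text.split())
    return Counter(
        (cid, int(call), reason) for cid, call, reason in ACK_RE.findall(flat)
    )


if __name__ == "__main__":
    for key, n in count_ping_acks(SAMPLE).items():
        print(key, n)
```

If the counts for a single call number keep climbing with no other packet 
types in between, that matches what I am seeing by eye: both ends keep 
pinging each other on call 1584 and nothing else happens.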


When the client times out, it reports:
afs: Lost contact with file server 192.168.1.2 in cell 
ridgebacksystems.com (all multi-homed ip addresses down for the server)


When I tried to use fstrace on the fileserver - bam, kernel panics right 
and left. I'm trying to set up a serial console to capture these now.



--****** Automated management services for General Aviation *******--
                 Theodore F Vaida <ted@contrailservices.com>
                                          President and CTO
            3300 Airport Road, Building J Box E, Boulder CO 80301
                  phone: 303.225.4625          fax: 303.225.4627