[OpenAFS-port-darwin] startup cache scan hang

Nicholas Riley njriley@uiuc.edu
Sun, 27 May 2007 15:17:19 -0500


Hi,

Has anyone seen OpenAFS hang on startup, seemingly during a cache
scan?  I experienced hangs on over 50% of our machines after both
recent security updates - even after waiting 15 minutes or more, /afs
doesn't mount.  This is OpenAFS 1.4.4 on both Intel and PowerPC
machines (the problem seems a bit more prevalent on PowerPC).  We
don't have any similar problems on Linux or Solaris.

Here's what the system log says:

May 27 14:42:24 bender kernel[0]: Starting AFS cache scan...
[...]
May 27 14:42:28 bender kernel[0]: [256] waiting for afs_osi_ctxtp
May 27 14:42:33 bender kernel[0]: [256] waiting for afs_osi_ctxtp
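
To check a number of machines for the same symptom, the repeated message can be counted in the system log. This is a minimal sketch, assuming the log lines have the same shape as the excerpt above. The function names and the threshold of two repeats are illustrative choices, not anything OpenAFS defines, and the bracketed number is simply treated as an opaque identifier.

```python
import re

# Matches kernel lines such as:
#   May 27 14:42:28 bender kernel[0]: [256] waiting for afs_osi_ctxtp
WAIT_RE = re.compile(r"kernel\[\d+\]: \[(\d+)\] waiting for afs_osi_ctxtp")


def count_ctxtp_waits(log_lines):
    """Return {bracketed_id: count} of 'waiting for afs_osi_ctxtp' messages."""
    counts = {}
    for line in log_lines:
        match = WAIT_RE.search(line)
        if match:
            ident = int(match.group(1))
            counts[ident] = counts.get(ident, 0) + 1
    return counts


def looks_hung(log_lines, threshold=2):
    """Hypothetical heuristic: some identifier repeats the wait message."""
    return any(n >= threshold for n in count_ctxtp_waits(log_lines).values())
```

Feeding it the lines of the system log (for example, the result of `open(path).readlines()`) gives a quick yes/no per machine.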

And there seem to be a bunch of zombie afsds around.  The listing below
was transcribed from the screen, since my console/SSH connections hung
entirely shortly thereafter, so there may be a few errors in it.

USER    PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED    TIME  COMMAND
root    237  0.0  0.4  27692  2028  ??  U    2:42PM  0:00.24 /usr/sbin/afsd -afsdb -stat 10000 -dcache 2500 -daemons 5 -volumes 70 -dynroot -fakestat-all
root    255  0.0  0.0      0     0  ??  Z   31Dec69  0:00.00 (afsd)
root    256  0.0  0.0  27692  2028  ??  Us   2:42PM  0:00.00 /usr/sbin/afsd -afsdb -stat 10000 -dcache 2500 -daemons 5 -volumes 70 -dynroot -fakestat-all
root    252  0.0  0.0      0     0  ??  Z   31Dec69  0:00.00 (afsd)
root    253  0.0  0.0      0     0  ??  Z   31Dec69  0:00.00 (afsd)
root    254  0.0  0.0      0     0  ??  Z   31Dec69  0:00.00 (afsd)
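
Counting the zombies by hand gets tedious across many machines. This is a rough sketch that tallies them from `ps` output, assuming the eleven-column layout shown above (USER through COMMAND). The function name is made up for illustration.

```python
def count_zombie_afsd(ps_output):
    """Count rows whose STAT starts with 'Z' and whose command mentions afsd."""
    zombies = 0
    for line in ps_output.splitlines():
        # Split into at most 11 fields so COMMAND keeps its arguments.
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] == "USER":
            continue  # skip the header and malformed rows
        stat, command = fields[7], fields[10]
        if stat.startswith("Z") and "afsd" in command:
            zombies += 1
    return zombies
```

On a live system the input could come from `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout`. For the listing above, the function counts four zombie entries.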

rxdebug says:

Free packets: 130, packet reclaims: 0, calls: 60, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
1 threads are idle
rx stats: free packets 130, allocs 130, alloc-failures(rcv 0/0,send 0/0,ack 0)
   greedy 0, bogusReads 0 (last from host 0), noPackets 0, noBuffers 0, selects 0, sendSelects 0
   packets read: data 60 ack 51 busy 0 abort 0 ackall 0 challenge 0 response 0 debug 8 params 0 unused 0 unused 0 unused 0 version 0
   other read counters: data 60, ack 51, dup 0 spurious 0 dally 0
   packets sent: data 51 ack 0 busy 0 abort 9 ackall 0 challenge 0 response 0 debug 0 params 0 unused 0 unused 0 unused 0 version 0
   other send counters: ack 0, data 102 (not resends), resends 0, pushed 0, acked&ignored 0
        (these should be small) sendFailed 0, fatalErrors 0
   3 server connections, 0 client connections, 3 peer structs, 3 call structs, 0 free call structs
Done.

and cmdebug returns none_waiting for everything.

I've been thinking of simply blowing away the cache directory before
starting AFS - would that be likely to help?  Is there any other info
that's useful in diagnosing the problem?
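
For concreteness, the cache-clearing idea could look something like the sketch below. It is only an illustration: the cache path has to come from the local configuration (the usage line assumes /var/db/openafs/cache, which may not match a given install), it would have to run before afsd starts, and whether it avoids the hang at all is exactly the open question.

```python
import os
import shutil


def clear_cache_dir(cache_dir):
    """Remove everything inside cache_dir, keeping the directory itself.

    Returns the number of top-level entries removed.
    """
    removed = 0
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)  # real subdirectory: remove recursively
        else:
            os.remove(path)      # regular file or symlink
        removed += 1
    return removed
```

Usage, with the assumed path: `clear_cache_dir("/var/db/openafs/cache")`, run as root from whatever script launches afsd.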

-- 
Nicholas Riley <njriley@uiuc.edu> | <http://www.uiuc.edu/ph/www/njriley>