[OpenAFS-devel] DB server failover on master?

Thu, 29 Dec 2011 18:33:57 -0500 (EST)

I've got a box (freebuild.mit.edu) running master from around 12 November
that does not seem to fail over when it hits an unresponsive DB server.
In particular, if I run:
cd /afs/grand.central.org
ls

The 'ls' process hangs indefinitely; rxdebug seems to indicate that I'm
only trying to talk to 130.237.48.87 (andrew.e.kth.se), which is known
not to be very responsive these days.

My main question is whether anyone else is running a recent master on some
flavor of Unix, so that I can tell whether this might be FreeBSD-specific
behavior or not.  Of course, if you want to help debug, that'd be fine,
too (some more information below).

Thanks,

Ben

The kernel stack of the 'ls' process looks like:
   2368 100179 ls
   mi_switch+0x1ea
   sleepq_switch+0x123
   sleepq_wait+0x4d
   _sleep+0x369
   rxi_ReadProc+0x3ef
   rx_ReadProc32+0xc1
   xdrrx_getint32+0x19
   afs_xdr_char+0x41
   afs_xdr_vector+0x44
   xdr_uvldbentry+0x30
   VL_GetEntryByNameU+0x7b
   afs_NewVolumeByName+0x237
   afs_GetVolumeByName+0x13c
   EvalMountData+0x316
   EvalMountPoint+0x93
   afs_EvalFakeStat_int+0x12b
   afs_EvalFakeStat+0xe
   afs_lookup+0x101

rxdebug:
freebuild# rxdebug localhost 7001
Trying 127.0.0.1 (port 7001):
Free packets: 235/243, packet reclaims: 0, calls: 178, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
1 threads are idle
0 calls have waited for a thread
Connection from host 130.237.48.87, port 7003, Cuid a3ee818f/3bb316a8
    serial 15,  natMTU 520, flags DESTROYED, security index 0, client conn
      call 0: # 2, state not initialized
      call 1: # 2, state not initialized
      call 2: # 1, state dally, mode: receiving
      call 3: # 0, state not initialized
Connection from host 130.237.48.87, port 7003, Cuid a3ee818f/3bb316ac
    serial 6,  natMTU 520, security index 0, client conn
      call 0: # 1, state active, mode: receiving, flags: reader_wait, has_output_packets
      call 1: # 1, state active, mode: receiving, flags: reader_wait, has_output_packets
      call 2: # 0, state not initialized
      call 3: # 0, state not initialized
Done.

I got a kernel core from a previous hang (a 'cp' process), and there 
wasn't anything that looked like it was going to deadlock; nobody held the 
glock, either.