[OpenAFS-devel] DB server failover on master?
Benjamin Kaduk
kaduk@MIT.EDU
Thu, 29 Dec 2011 18:33:57 -0500 (EST)
I've got a box (freebuild.mit.edu) running master from around 12 November
that does not seem to fail over when it hits an unresponsive db server.
In particular, if I run:
cd /afs/grand.central.org
ls
the 'ls' process hangs indefinitely; rxdebug seems to indicate that I'm
only trying to talk to 130.237.48.87 (andrew.e.kth.se), which is known to
be not very responsive these days.
My main question is whether anyone else is running a recent master on some
flavor of Unix, to help decide whether this is FreeBSD-specific
behavior or not. Of course, if you want to help debug, that'd be fine,
too (some more information below).
Thanks,
Ben
The kernel stack of the 'ls' process looks like:
2368 100179 ls
mi_switch+0x1ea
sleepq_switch+0x123
sleepq_wait+0x4d
_sleep+0x369
rxi_ReadProc+0x3ef
rx_ReadProc32+0xc1
xdrrx_getint32+0x19
afs_xdr_char+0x41
afs_xdr_vector+0x44
xdr_uvldbentry+0x30
VL_GetEntryByNameU+0x7b
afs_NewVolumeByName+0x237
afs_GetVolumeByName+0x13c
EvalMountData+0x316
EvalMountPoint+0x93
afs_EvalFakeStat_int+0x12b
afs_EvalFakeStat+0xe
afs_lookup+0x101
rxdebug:
freebuild# rxdebug localhost 7001
Trying 127.0.0.1 (port 7001):
Free packets: 235/243, packet reclaims: 0, calls: 178, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
1 threads are idle
0 calls have waited for a thread
Connection from host 130.237.48.87, port 7003, Cuid a3ee818f/3bb316a8
serial 15, natMTU 520, flags DESTROYED, security index 0, client conn
call 0: # 2, state not initialized
call 1: # 2, state not initialized
call 2: # 1, state dally, mode: receiving
call 3: # 0, state not initialized
Connection from host 130.237.48.87, port 7003, Cuid a3ee818f/3bb316ac
serial 6, natMTU 520, security index 0, client conn
call 0: # 1, state active, mode: receiving, flags: reader_wait, has_output_packets
call 1: # 1, state active, mode: receiving, flags: reader_wait, has_output_packets
call 2: # 0, state not initialized
call 3: # 0, state not initialized
Done.
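For anyone who wants to pull the non-idle calls out of rxdebug output like
the above, here is a minimal sketch in Python. The regular expressions are
assumptions based only on the sample output in this message, not on a full
grammar of what rxdebug can print, and summarize_rxdebug is a hypothetical
helper name, not part of OpenAFS.

```python
import re
from collections import defaultdict

# Assumed line shapes, taken from the sample output in this message only.
CONN_RE = re.compile(r"^\s*Connection from host (\S+), port (\d+), Cuid (\S+)")
CALL_RE = re.compile(r"^\s*call (\d+): # (\d+), state ([^,]+)(?:, (.*))?$")

# Excerpt of the rxdebug output quoted above.
SAMPLE = """\
Connection from host 130.237.48.87, port 7003, Cuid a3ee818f/3bb316a8
serial 15, natMTU 520, flags DESTROYED, security index 0, client conn
call 0: # 2, state not initialized
call 1: # 2, state not initialized
call 2: # 1, state dally, mode: receiving
call 3: # 0, state not initialized
Connection from host 130.237.48.87, port 7003, Cuid a3ee818f/3bb316ac
serial 6, natMTU 520, security index 0, client conn
call 0: # 1, state active, mode: receiving, flags: reader_wait, has_output_packets
call 1: # 1, state active, mode: receiving, flags: reader_wait, has_output_packets
call 2: # 0, state not initialized
call 3: # 0, state not initialized
"""


def summarize_rxdebug(text):
    """Group calls that are not in 'not initialized' state by (host, port).

    Returns {(host, port): [(cuid, call_slot, state, detail), ...]}.
    """
    summary = defaultdict(list)
    current = None  # (host, port, cuid) of the connection being read
    for line in text.splitlines():
        m = CONN_RE.match(line)
        if m:
            current = (m.group(1), int(m.group(2)), m.group(3))
            continue
        m = CALL_RE.match(line)
        if m and current is not None:
            state = m.group(3).strip()
            if state == "not initialized":
                continue  # unused call slot, nothing to report
            host, port, cuid = current
            detail = m.group(4) or ""
            summary[(host, port)].append((cuid, int(m.group(1)), state, detail))
    return dict(summary)


if __name__ == "__main__":
    for (host, port), calls in summarize_rxdebug(SAMPLE).items():
        waiting = [c for c in calls if "reader_wait" in c[3]]
        print(f"{host}:{port}: {len(calls)} non-idle calls, "
              f"{len(waiting)} in reader_wait")
        for cuid, slot, state, detail in calls:
            print(f"  {cuid} call {slot}: {state} {detail}")
```

Run against the excerpt, this groups everything under the single peer
130.237.48.87:7003, which matches the observation that no other db server
is being tried.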
I got a kernel core from a previous hang (a 'cp' process), and there
wasn't anything that looked like it was going to deadlock; nobody held the
glock, either.