[OpenAFS-devel] fileserver crash

Mattias Amnefelt mattiasa@e.kth.se
Wed, 18 Sep 2002 22:34:18 +0200


This Sunday we suffered two simultaneous fileserver crashes. Both of
them crashed while talking to the same client, and both crashed
at the same line of code.

This was using openafs-1.2.6 on Tru64 5.0a on alpha.

The crash occurred in GetClient() on line 1485 in viced/host.c:

(ladebug) where
#0  0x3ff800d08f8 in _sigprocmask(0x2, 0x20, 0x0, 0x0, 0x0, 0x2000324cea5) in /usr/shlib/libc.so
#1  0x3ff800d2b6c in __sigprocmask(0x2, 0x20, 0x0, 0x0, 0x0, 0x2000324cea5) in /usr/shlib/libc.so
#2  0x3ff801894f0 in abort(0x2, 0x20, 0x0, 0x0, 0x0, 0x2000324cea5) in /usr/shlib/libc.so
#3  0x120042334 in AssertionFailed(file=0x140004cf0="../viced/host.c", line=32) "../util/assert.c":44
>4  0x12003344c in GetClient(tcon=0x1419c0900, cp=0x2000324d858) "../viced/host.c":1485
#5  0x120029f54 in GetVolumePackage(tcon=0x1419c0900, Fid=0x2000324d988, volptr=0x2000324d860, targetptr=0x2000324d870, chkforDir=<no value>, parent=0x2000324d868, client=0x2000324d858, locktype=1, rights=0x2000324d850, anyrights=0x2000324d848) "../viced/afsfileprocs.c":4922
#6  0x12001dce4 in SAFSS_FetchStatus(tcall=0x1418d1c00, Fid=0x2000324d988, OutStatus=0x2000324d9a8, CallBack=0x2000324d978, Sync=0x2000324d960) "../viced/afsfileprocs.c":817
#7  0x12001ee84 in SRXAFS_FetchStatus(tcon=0x1419c0900, Fid=0x2000324d988, OutStatus=0x2000324d9a8, CallBack=0x2000324d978, Sync=0x2000324d960) "../viced/afsfileprocs.c":1173
#8  0x12005df18 in _RXAFS_FetchStatus(z_call=0x1418d1c00, z_xdrs=0x2000324da20) "../fsint/afsint.ss.c":174
#9  0x120063c8c in RXAFS_ExecuteRequest(z_call=0x1418d1c00) "../fsint/afsint.ss.c":1892
#10 0x12007ccb8 in rxi_ServerProc(threadID=<no value>, newcall=0x0, socketp=0x2000324dac0) "../rx/rx.c":1326
#11 0x120092c9c in rx_ServerProc() "../rx/rx_pthread.c":288
#12 0x12009246c in server_entry(argp=0x3ff805b44a0) "../rx/rx_pthread.c":94
#13 0x3ff805b5f3c in __thdBase(0x2, 0x20, 0x0, 0x0, 0x0, 0x2000324cea5) in /usr/shlib/libpthread.so
(ladebug) p client
0x0
(ladebug) p tcon->nSpecific
2
(ladebug) p rxcon_client_key
1
(ladebug) p tcon->specific[1]
0x0

The way I read the code, tcon->specific[rxcon_client_key] can only
become NULL at the same time as tcon->nSpecific is 2 if rx_SetSpecific
is called to set it to NULL. This is done in h_TossStuff_r().
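
To make that reasoning concrete, here is a toy model of a per-connection
slot array. It is only a sketch of how I read the behaviour, not the
actual rx code: struct toy_conn, toy_set_specific and toy_get_specific
are made-up names, and locking and destructors are left out.

```c
#include <stdlib.h>

/* Toy stand-in for the slot array of a connection.
 * This is NOT the real struct rx_connection. */
struct toy_conn {
    void **specific;   /* one pointer per key */
    int nSpecific;     /* number of slots allocated so far */
};

/* Store ptr in slot `key`. If the array is too small it is grown and
 * the new slots below `key` are filled with NULL.
 * Returns 0 on success, -1 on a bad key or allocation failure. */
int toy_set_specific(struct toy_conn *conn, int key, void *ptr)
{
    if (key < 0)
        return -1;
    if (key >= conn->nSpecific) {
        void **grown = realloc(conn->specific,
                               (size_t)(key + 1) * sizeof(void *));
        int i;
        if (grown == NULL)
            return -1;
        for (i = conn->nSpecific; i < key; i++)
            grown[i] = NULL;
        conn->specific = grown;
        conn->nSpecific = key + 1;
    }
    conn->specific[key] = ptr;
    return 0;
}

/* Return the pointer stored in slot `key`, or NULL if the slot was
 * never allocated. */
void *toy_get_specific(const struct toy_conn *conn, int key)
{
    if (key < 0 || key >= conn->nSpecific)
        return NULL;
    return conn->specific[key];
}
```

In this model the array never shrinks, so once a client pointer has been
stored under key 1, the state "nSpecific is 2 and slot 1 is NULL" can
only be reached by an explicit set of slot 1 to NULL.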

The log shows:

Sun Sep 15 17:40:43 2002 CB: new identity for host 130.237.49.75:26386, deleting
Sun Sep 15 17:40:59 2002 CB: new identity for host 130.237.49.75:26386, deleting
Sun Sep 15 17:40:59 2002 CB: new identity for host 130.237.49.75:26386, deleting
Sun Sep 15 17:40:59 2002 CB: new identity for host 130.237.49.75:26386, deleting
Sun Sep 15 17:40:59 2002 CB: new identity for host 130.237.49.75:26386, deleting
Sun Sep 15 17:42:20 2002 CB: new identity for host 130.237.49.75:26386, deleting

26386 is 4711 if you swap the byte order, so there seems to be a
byte-order error somewhere, but that's kinda unrelated I guess.
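
The port arithmetic can be checked with a small helper. swap16 is just
an illustrative name of mine, not something from the OpenAFS tree; it
does what a missing ntohs() would have done on a little-endian host
such as this one.

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. Printing a port that is still
 * in network byte order on a little-endian host gives this result. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}
```

4711 is 0x1267 and 26386 is 0x6712, so swap16(4711) gives 26386 and
swap16(26386) gives 4711 again.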

: datan mattiasa $ ; rxdebug 130.237.49.75 4711 -version
Trying 130.237.49.75 (port 4711):
AFS version: arla-0.35.8pre1

So, I assume that h_GetHost_r (called from the preamble) manages to
h_Release_r(host) and toss it.

What I don't understand is how it can later become reused.

/mattiasa