[OpenAFS] Re: 1.4.1-rc2 feedback

Thomas Mueller thomas.mueller@hrz.tu-chemnitz.de
Wed, 14 Dec 2005 11:26:46 +0100 (MET)


On Tue, 13 Dec 2005, Russ Allbery wrote:

> However, we're running into some problems with file server crashes on
> Solaris 8 with 1.4.0 as well.  We're currently working on figuring out
> what's going on.
> 

This morning I upgraded to OpenAFS-1.4.1-rc2 on Scientific Linux 3.0.5,
kernel 2.4.21-32.0.1.EL.XFSsmp (several machines).

We had four fileserver crashes (LWP fileserver) on one of those servers:

Wed Dec 14 05:51:28 2005: Server directory access is okay
Wed Dec 14 08:50:08 2005: fs:file exited on signal 6 (core dumped)
Wed Dec 14 08:50:08 2005: fs:vol exited on signal 15
Wed Dec 14 08:50:08 2005: fs:salv exited with code 0
Wed Dec 14 09:00:05 2005: fs:file exited on signal 6 (core dumped)
Wed Dec 14 09:00:05 2005: fs:vol exited on signal 15
Wed Dec 14 09:00:05 2005: fs:salv exited with code 0
Wed Dec 14 09:02:51 2005: fs:file exited on signal 6 (core dumped)
Wed Dec 14 09:02:51 2005: fs:vol exited on signal 15
Wed Dec 14 09:02:51 2005: fs:salv exited with code 0
Wed Dec 14 09:11:19 2005: fs:file exited on signal 6 (core dumped)
Wed Dec 14 09:11:19 2005: fs:vol exited on signal 15
Wed Dec 14 09:11:19 2005: fs:salv exited with code 0


Below are the stack backtraces from these four core dumps:

#0  0x001f2cdf in raise () from /lib/tls/libc.so.6
#1  0x001f44e5 in abort () from /lib/tls/libc.so.6
#2  0x0809208d in osi_Panic (msg=0x80ae620 "rx packet not free\n", a1=2,
    a2=0, a3=0) at rx_user.c:222
#3  0x0809a601 in AllocPacketBufs (class=4, num_pkts=3143832, q=0xb52c4e8c)
    at rx_packet.c:372
#4  0x0809a849 in rxi_AllocDataBuf (p=0x9d47ef0, nb=7076, class=4)
    at rx_packet.c:523
#5  0x0809afd3 in rxi_ReadPacket (socket=4, p=0x9d47ef0, host=0xb52c4f3c,
    port=0x0) at rx_packet.c:1362
#6  0x08092a31 in rxi_ListenerProc (rfds=0x9d1baa0, tnop=0xb52c4fdc,
    newcallp=0xb52c4fe0) at rx_lwp.c:298
#7  0x08092c58 in rx_ServerProc () at rx_lwp.c:374
#8  0x080a1674 in Create_Process_Part2 () at lwp.c:778
#9  0x080a1c69 in savecontext (ep=0, savearea=0x0, sp=0x0) at process.c:197
#10 0x00000000 in ?? ()


#0  0x00720cdf in raise () from /lib/tls/libc.so.6
#1  0x007224e5 in abort () from /lib/tls/libc.so.6
#2  0x0809208d in osi_Panic (msg=0x80ae620 "rx packet not free\n", a1=2,
    a2=0, a3=0) at rx_user.c:222
#3  0x0809a601 in AllocPacketBufs (class=4, num_pkts=8575128, q=0xb51cee8c)
    at rx_packet.c:372
#4  0x0809a849 in rxi_AllocDataBuf (p=0xa0149f0, nb=7076, class=4)
    at rx_packet.c:523
#5  0x0809afd3 in rxi_ReadPacket (socket=4, p=0xa0149f0, host=0xb51cef3c,
    port=0x0) at rx_packet.c:1362
#6  0x08092a31 in rxi_ListenerProc (rfds=0x9fe11c0, tnop=0xb51cefdc,
    newcallp=0xb51cefe0) at rx_lwp.c:298
#7  0x08092c58 in rx_ServerProc () at rx_lwp.c:374
#8  0x080a1674 in Create_Process_Part2 () at lwp.c:778
#9  0x080a1c69 in savecontext (ep=0, savearea=0x0, sp=0x0) at process.c:197
#10 0x00000000 in ?? ()


#0  0x00138cdf in raise () from /lib/tls/libc.so.6
#1  0x0013a4e5 in abort () from /lib/tls/libc.so.6
#2  0x0809208d in osi_Panic (msg=0x80ae620 "rx packet not free\n",
    a1=135091520, a2=153756472, a3=1) at rx_user.c:222
#3  0x0809ae29 in rxi_AllocPacketNoLock (class=1) at rx_packet.c:1159
#4  0x0809aea2 in rxi_AllocSendPacket (call=0x92a2338, want=2381976)
    at rx_packet.c:1266
#5  0x0809d9e2 in rxi_WritevAlloc (call=0x92a2338, iov=0xb5388bec,
    nio=0xb5388bd0, maxio=16, nbytes=21636) at rx_rdwr.c:952
#6  0x08058d64 in FetchData_RXStyle (volptr=0x91cff08, targetptr=0x906217c,
    Call=0x92a2338, Pos=0, Len=33796, Int64Mode=0,
    a_bytesToFetchP=0xb5388cfc, a_bytesFetchedP=0xb5388d04)
    at afsfileprocs.c:6817
#7  0x080505e6 in common_FetchData64 (acall=0x92a2338, Fid=0xb5388efc, Pos=0,
    Len=131072, OutStatus=0xb5388e9c, CallBack=0xb5388e8c, Sync=0xb5388e6c,
    type=0) at afsfileprocs.c:2176
#8  0x08050b0c in SRXAFS_FetchData (acall=0x92a2338, Fid=0xb5388efc, Pos=0,
    Len=0, OutStatus=0xb5388e9c, CallBack=0xb5388e8c, Sync=0xb5388e6c)
    at afsfileprocs.c:2303
#9  0x08088873 in _RXAFS_FetchData (z_call=0x92a2338, z_xdrs=0xb5388f4c)
    at afsint.ss.c:69
#10 0x0808e4d8 in RXAFS_ExecuteRequest (z_call=0x92a2338) at afsint.ss.c:1872
#11 0x08093a58 in rxi_ServerProc (threadID=0, newcall=0x0, socketp=0xb5388fe4)
    at rx.c:1407
#12 0x08092c44 in rx_ServerProc () at rx_lwp.c:371
#13 0x080a1674 in Create_Process_Part2 () at lwp.c:778
#14 0x080a1c69 in savecontext (ep=0, savearea=0x0, sp=0x0) at process.c:197
#15 0x00000000 in ?? ()

#0  0x00641cdf in raise () from /lib/tls/libc.so.6
#1  0x006434e5 in abort () from /lib/tls/libc.so.6
#2  0x080a2bbf in IOMGR_Select (fds=7661720, readfds=0x80c9aa0, writefds=0x0,
    exceptfds=0x0, timeout=0x0) at iomgr.c:928
#3  0x0807c434 in FSYNC_sync () at fssync.c:345
#4  0x080a1674 in Create_Process_Part2 () at lwp.c:778
#5  0x080a1c69 in savecontext (ep=0, savearea=0x2, sp=0x0) at process.c:197
#6  0x080a1cdf in returnto (savearea=0x84b1ddc) at process.c:231
#7  0x080a17f3 in Dispatcher () at lwp.c:949
#8  0x080a1c9e in savecontext (ep=0, savearea=0x84a8654, sp=0x0)
    at process.c:184
#9  0x080a1571 in LWP_MwaitProcess (wcount=1, evlist=0xbfffc6d0) at lwp.c:729
#10 0x080a14ae in LWP_WaitProcess (event=0x6 <Address 0x6 out of bounds>)
    at lwp.c:681
#11 0x0804d1c8 in main (argc=24, argv=0xbfffc730) at viced.c:1981


I switched to the pthread fileserver, and it has now been running for more
than two hours without a crash.

Let me know if you need access to the fileserver binary or the core dumps.

Thomas.