[OpenAFS] Re: 1.6.0-pre2 ptserver/vlserver dumping core
omalleys@msu.edu
omalleys@msu.edu
Tue, 01 Mar 2011 10:51:36 -0500
Quoting "Ryan C. Underwood" <nemesis-lists@icequake.net>:
>
> On Mon, Feb 28, 2011 at 11:33:24PM -0600, Andrew Deason wrote:
>>
>> > Very similar, here's an example of the new one. Seems like a
>> > double-free is going on at the end of rxi_CleanupConnection() if the
>> > glibc spewage in the log is to be trusted. I can try to build a full
>> > debug version if you think it would help. Is it possible to valgrind?
>>
>> With LWP? Probably not usefully.
>
> How does a smart hacker (unlike myself) usually debug OpenAFS server
> binaries? Would I have to flatten the threading by i.e. disabling LWP
> at compile time?
>
>> Could try printing the contents of the conn and call:
>>
>> frame 7
>> print *conn
>> up
>> print *call
>
> I will try to wrap these readably. It doesn't seem very useful though.
> I think the actual memory corruption has already happened somewhere
> else by this point.
>
> #8 0x0807c014 in rxi_CheckCall (call=0x9a71508) at rx.c:6150
> 6150 rxi_FreeCall(call);
> (gdb) print *call
>
> $1 = {queue_item_header = {prev = 0x9a73198, next = 0x80f1770}, tq =
> {prev = 0x9a71510, next = 0x9a71510}, rq = {prev = 0x9a71518, next =
> 0x9a71518}, iovq = { prev = 0x9a71520, next = 0x9a71520}, nLeft = 0,
> curvec = 1, curlen = 0, nFree = 0, currentPacket = 0x0, curpos =
> 0xb7702f34 "", channel = 0 '\000', state = 5 '\005', mode = 3 '\003',
> conn = 0x9b4ac50, callNumber = 0x9b4ac74, flags = 0, localStatus = 0
> '\000', remoteStatus = 0 '\000', error = 0, timeout = 0, rnext = 1,
> rprev = 0, rwind = 16, tfirst = 1, tnext = 1, tprev = 0, twind = 16,
> cwind = 1, nSoftAcked = 0, nextCwind = 0, nCwindAcks = 0, ssthresh = 32,
> nDgramPackets = 1, nAcks = 0, nNacks = 0, nSoftAcks = 0, nHardAcks = 0,
> congestSeq = 0, resendEvent = 0x0, timeoutEvent = 0x0, keepAliveEvent =
> 0x0, growMTUEvent = 0x0, delayedAckEvent = 0x0, delayedAbortEvent = 0x0,
> abortCode = 0, abortCount = 0, lastSendTime = 1298949656,
> lastReceiveTime = 1298949609, lastSendData = 1298949609, arrivalProc =
> 0, arrivalProcHandle = 0x9b44ab0, arrivalProcArg = 0, lastAcked = 0,
> startWait = 0, traceWait = {sec = 0, usec = 0}, traceStart = {sec = 0,
> usec = 0}, MTU = 1444, iovNBytes = 0, iovMax = 0, iovNext = 0, iov =
> 0x0, queueTime = {sec = 1298949609, usec = 101749}, startTime = {sec =
> 1298949609, usec = 101760}, bytesSent = {high = 0, low = 4}, bytesRcvd =
> {high = 0, low = 0}, tqWaiters = 0, xmitList = {0xb7702e48, 0x0 <repeats
> 254 times>}}
>
> #7 0x080786fd in rxi_CleanupConnection (conn=0x9b4ac50) at rx.c:991
> 991 rxi_FreeConnection(conn);
> (gdb) print *conn
>
> $2 = {next = 0x0, peer = 0x9a71330, epoch = 2828736158, cid =
> 3015650936, error = 0, call = {0x0, 0x0, 0x0, 0x0}, callNumber = {2, 0,
> 0, 0}, rwind = {16, 16, 16, 16}, twind = {16, 16, 16, 16}, lastBusy =
> {0, 0, 0, 0}, serial = 9, lastSerial = 0, maxSerial = 0, lastPacketSize
> = 4, lastPacketSizeSeq = 1, lastPingSize = 0, lastPingSizeSer = 0,
> challengeEvent = 0x0, delayedAbortEvent = 0x0, checkReachEvent = 0x0,
> abortCount = 0, service = 0x0, serviceId = 51, refCount = 0, flags = 2
> '\002', type = 0 '\000', secondsUntilPing = 8 '\b', securityIndex = 2
> '\002', securityObject = 0x9a70218, securityData = 0x0,
> securityHeaderSize = 0, securityMaxTrailerSize = 0, timeout = 0,
> lastSendTime = 1298949656, secondsUntilDead = 50, hardDeadTime = 0,
> idleDeadTime = 0, ackRate = 1 '\001', makeCallWaiters = 0 '\000',
> idleDeadErr = 0, secondsUntilNatPing = 0, natKeepAliveEvent = 0x0,
> msgsizeRetryErr = 0, nSpecific = 0, specific = 0x0}
>
> Here is the corresponding backtrace to the above:
>
> #0 0xb78e2424 in __kernel_vsyscall ()
> #1 0xb77a7751 in *__GI_raise (sig=6) at
> ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> #2 0xb77aab82 in *__GI_abort () at abort.c:92
> #3 0xb77de18d in __libc_message (do_abort=2, fmt=0xb78a2738 "***
> glibc detected *** %s: %s: 0x%s ***\n") at
> ../sysdeps/unix/sysv/linux/libc_fatal.c:189
> #4 0xb77e8281 in malloc_printerr (action=<value optimized out>,
> str=0x6 <Address 0x6 out of bounds>, ptr=0x9b4ac50) at malloc.c:6267
> #5 0xb77e9ad8 in _int_free (av=<value optimized out>, p=<value
> optimized out>) at malloc.c:4795
> #6 0xb77ecbbd in *__GI___libc_free (mem=0x9b4ac50) at malloc.c:3739
> #7 0x080786fd in rxi_CleanupConnection (conn=0x9b4ac50) at rx.c:991
> #8 0x0807c014 in rxi_CheckCall (call=0x9a71508) at rx.c:6150
> #9 0x0807c4bd in rxi_GrowMTUEvent (event=0x0, arg1=0x9a71508,
> dummy=0x0) at rx.c:6382
> #10 0x08085d1d in rxevent_RaiseEvents (next=0xb76adf6c) at rx_event.c:503
> #11 0x08075e58 in rxi_ListenerProc (rfds=<value optimized out>,
> tnop=<value optimized out>, newcallp=<value optimized out>) at
> rx_lwp.c:203
> #12 0x080761aa in rx_ListenerProc (dummy=0x0) at rx_lwp.c:335
> #13 0x08086d71 in Create_Process_Part2 () at ./lwp.c:805
> #14 0xb77b6cdb in makecontext () at
> ../sysdeps/unix/sysv/linux/i386/makecontext.S:88
> #15 0x0d696910 in ?? ()
> #16 0x080874f8 in LWP_MwaitProcess (event=0x9b43b88) at ./lwp.c:756
> #17 LWP_WaitProcess (event=0x9b43b88) at ./lwp.c:708
> #18 0x0807eec8 in rx_GetCall (tno=15, cur_service=0x9a62020,
> socketp=0xbfc544ec) at rx.c:2062
> #19 0x0807f11d in rxi_ServerProc (threadID=15, newcall=0x0,
> socketp=0xbfc544ec) at rx.c:1654
> #20 0x0807614a in rx_ServerProc (unused=0x0) at rx_lwp.c:369
> #21 0x0807f7c8 in rx_StartServer (donateMe=1) at rx.c:793
> #22 0x0804a8c6 in main (argc=1, argv=0xbfc54d84) at vlserver.c:407
>
>
>> LWP is going to confuse valgrind, so you get a lot of not-useful stuff.
>> If you can get it to crash while under valgrind it might output
>> something helpful, but as I recall, most of what you get is trash. But
>> if we can look at it, we may be able to find something.
>
> I did post the logs of a ptserver and a vlserver crashing under valgrind
> but they went to moderation.
>
> --
> Ryan C. Underwood, <nemesis@icequake.net>
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info
>
Im not sure it is relevent but there are a couple of out of range
errors in the solaris 9 nightly build (the suncc tends to be more
verbose then gcc.):
http://buildbot.openafs.org:8010/builders/solaris9-builder/builds/22/steps/compile/logs/warnings
"/home/buildbot/openafs/solaris9-sparc/build/src/afs/afs_osi_pag.c",
line 549: warning: initializer does not fit or is out of range:
0xffffffff
"./../vol/vnode.c", line 1360: warning: integer overflow detected: op "<<"
"./../viced/afsfileprocs.c", line 478: warning: initializer does not
fit or is out of range: -1
etc. Maybe it will give some hints, maybe not.