[OpenAFS] Re: 1.6.0-pre2 ptserver/vlserver dumping core

Tue, 1 Mar 2011 00:22:42 -0600

On Mon, Feb 28, 2011 at 11:33:24PM -0600, Andrew Deason wrote:
> 
> > Very similar, here's an example of the new one.  Seems like a
> > double-free is going on at the end of rxi_CleanupConnection() if the
> > glibc spewage in the log is to be trusted.  I can try to build a full
> > debug version if you think it would help.  Is it possible to valgrind?
> 
> With LWP? Probably not usefully.

How does a smart hacker (unlike myself) usually debug OpenAFS server
binaries?  Would I have to flatten the threading by i.e. disabling LWP
at compile time?

> Could try printing the contents of the conn and call:
> 
> frame 7
> print *conn
> up
> print *call

I will try to wrap these readably.  It doesn't seem very useful though.
I think the actual memory corruption has already happened somewhere
else by this point.

#8  0x0807c014 in rxi_CheckCall (call=0x9a71508) at rx.c:6150
6150                rxi_FreeCall(call);
(gdb) print *call

$1 = {queue_item_header = {prev = 0x9a73198, next = 0x80f1770}, tq =
{prev = 0x9a71510, next = 0x9a71510}, rq = {prev = 0x9a71518, next =
0x9a71518}, iovq = { prev = 0x9a71520, next = 0x9a71520}, nLeft = 0,
curvec = 1, curlen = 0, nFree = 0, currentPacket = 0x0, curpos =
0xb7702f34 "", channel = 0 '\000', state = 5 '\005', mode = 3 '\003',
conn = 0x9b4ac50, callNumber = 0x9b4ac74, flags = 0, localStatus = 0
'\000', remoteStatus = 0 '\000', error = 0, timeout = 0, rnext = 1,
rprev = 0, rwind = 16, tfirst = 1, tnext = 1, tprev = 0, twind = 16,
cwind = 1, nSoftAcked = 0, nextCwind = 0, nCwindAcks = 0, ssthresh = 32,
nDgramPackets = 1, nAcks = 0, nNacks = 0, nSoftAcks = 0, nHardAcks = 0,
congestSeq = 0, resendEvent = 0x0, timeoutEvent = 0x0, keepAliveEvent =
0x0, growMTUEvent = 0x0, delayedAckEvent = 0x0, delayedAbortEvent = 0x0,
abortCode = 0, abortCount = 0, lastSendTime = 1298949656,
lastReceiveTime = 1298949609, lastSendData = 1298949609, arrivalProc =
0, arrivalProcHandle = 0x9b44ab0, arrivalProcArg = 0, lastAcked = 0,
startWait = 0, traceWait = {sec = 0, usec = 0}, traceStart = {sec = 0,
usec = 0}, MTU = 1444, iovNBytes = 0, iovMax = 0, iovNext = 0, iov =
0x0, queueTime = {sec = 1298949609, usec = 101749}, startTime = {sec =
1298949609, usec = 101760}, bytesSent = {high = 0, low = 4}, bytesRcvd =
{high = 0, low = 0}, tqWaiters = 0, xmitList = {0xb7702e48, 0x0 <repeats
254 times>}}

#7  0x080786fd in rxi_CleanupConnection (conn=0x9b4ac50) at rx.c:991
991         rxi_FreeConnection(conn);
(gdb) print *conn

$2 = {next = 0x0, peer = 0x9a71330, epoch = 2828736158, cid =
3015650936, error = 0, call = {0x0, 0x0, 0x0, 0x0}, callNumber = {2, 0,
0, 0}, rwind = {16, 16, 16, 16}, twind = {16, 16, 16, 16}, lastBusy =
{0, 0, 0, 0}, serial = 9, lastSerial = 0, maxSerial = 0, lastPacketSize
= 4, lastPacketSizeSeq = 1, lastPingSize = 0, lastPingSizeSer = 0,
challengeEvent = 0x0, delayedAbortEvent = 0x0, checkReachEvent = 0x0,
abortCount = 0, service = 0x0, serviceId = 51, refCount = 0, flags = 2
'\002', type = 0 '\000', secondsUntilPing = 8 '\b', securityIndex = 2
'\002', securityObject = 0x9a70218, securityData = 0x0,
securityHeaderSize = 0, securityMaxTrailerSize = 0, timeout = 0,
lastSendTime = 1298949656, secondsUntilDead = 50, hardDeadTime = 0,
idleDeadTime = 0, ackRate = 1 '\001', makeCallWaiters = 0 '\000',
idleDeadErr = 0, secondsUntilNatPing = 0, natKeepAliveEvent = 0x0,
msgsizeRetryErr = 0, nSpecific = 0, specific = 0x0}

Here is the corresponding backtrace to the above:

#0  0xb78e2424 in __kernel_vsyscall ()
#1  0xb77a7751 in *__GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2  0xb77aab82 in *__GI_abort () at abort.c:92
#3  0xb77de18d in __libc_message (do_abort=2, fmt=0xb78a2738 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:189
#4  0xb77e8281 in malloc_printerr (action=<value optimized out>, str=0x6 <Address 0x6 out of bounds>, ptr=0x9b4ac50) at malloc.c:6267
#5  0xb77e9ad8 in _int_free (av=<value optimized out>, p=<value optimized out>) at malloc.c:4795
#6  0xb77ecbbd in *__GI___libc_free (mem=0x9b4ac50) at malloc.c:3739
#7  0x080786fd in rxi_CleanupConnection (conn=0x9b4ac50) at rx.c:991
#8  0x0807c014 in rxi_CheckCall (call=0x9a71508) at rx.c:6150
#9  0x0807c4bd in rxi_GrowMTUEvent (event=0x0, arg1=0x9a71508, dummy=0x0) at rx.c:6382
#10 0x08085d1d in rxevent_RaiseEvents (next=0xb76adf6c) at rx_event.c:503
#11 0x08075e58 in rxi_ListenerProc (rfds=<value optimized out>, tnop=<value optimized out>, newcallp=<value optimized out>) at rx_lwp.c:203
#12 0x080761aa in rx_ListenerProc (dummy=0x0) at rx_lwp.c:335
#13 0x08086d71 in Create_Process_Part2 () at ./lwp.c:805
#14 0xb77b6cdb in makecontext () at ../sysdeps/unix/sysv/linux/i386/makecontext.S:88
#15 0x0d696910 in ?? ()
#16 0x080874f8 in LWP_MwaitProcess (event=0x9b43b88) at ./lwp.c:756
#17 LWP_WaitProcess (event=0x9b43b88) at ./lwp.c:708
#18 0x0807eec8 in rx_GetCall (tno=15, cur_service=0x9a62020, socketp=0xbfc544ec) at rx.c:2062
#19 0x0807f11d in rxi_ServerProc (threadID=15, newcall=0x0, socketp=0xbfc544ec) at rx.c:1654
#20 0x0807614a in rx_ServerProc (unused=0x0) at rx_lwp.c:369
#21 0x0807f7c8 in rx_StartServer (donateMe=1) at rx.c:793
#22 0x0804a8c6 in main (argc=1, argv=0xbfc54d84) at vlserver.c:407

> LWP is going to confuse valgrind, so you get a lot of not-useful stuff.
> If you can get it to crash while under valgrind it might output
> something helpful, but as I recall, most of what you get is trash. But
> if we can look at it, we may be able to find something.

I did post the logs of a ptserver and a vlserver crashing under valgrind
but they went to moderation.

-- 
Ryan C. Underwood, <nemesis@icequake.net>