[OpenAFS] fileserver crashes

Marcus Watts mdw@umich.edu
Wed, 13 Oct 2004 03:43:38 -0400


I don't know if our fileserver crashes are related to what Matthew Cocker
and others are seeing, but we are indeed seeing problems here at umich.

Background information:
The machines in question are dual pentium 4 machines with
hyperthreading enabled running linux 2.4.26 (SMP) and glibc 2.3.2.  The
actual file storage is on "cheap" raid devices that use multiple IDE
drives but talk SCSI to the rest of the world.  These raids have their
own set of problems, so I would not count them as super-reliable file
storage.  We're running the "pthreads" version of the fileserver.

I think we're seeing at least 3 distinct problems with openafs 1.2.11.

The first may actually be networking.  We get these with varying
frequency in VolserLog:
Sun Oct 10 22:05:09 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
Tue Oct 12 11:38:07 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
Tue Oct 12 13:39:23 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
Tue Oct 12 15:06:46 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
Helpful message, eh?  Can't tell what volume was being dumped,
or where it was going.
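
The dump code obviously knows which volume it's working on (and where
the dump is headed) at the point it logs that, so the message could say
so.  Something along these lines is what I'd want -- a sketch only;
"volId", "toHost", and "code" stand in for whatever the real routine
has in scope, I haven't gone and checked the volser source:

	/* hypothetical, more informative version of that log line */
	Log("1 Volser: DumpVolume: Rx call failed during dump of volume %u to %s, error %d\n",
	    (unsigned int) volId, toHost, code);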

Most of these probably occur while backing things up to TSM.
If I understand it right, we run buta there and there are
apparently issues with the amount of CPU it eats.  I've
wondered what % of our backups are failing, but I haven't
heard of any negative consequences here, so far.

We have at various times seen problems with read-only replicas that
are oddly truncated.  This might or might not be a consequence of the
previous problem.

Another, probably completely different, problem concerns volumes with
really small volume IDs.  Modern AFS software creates large 10-digit
volume IDs, but we have volumes that were created long before AFS 3.1,
with small 3-digit volume IDs.  Those volumes are rapidly disappearing:
one by one, during various restarts, the fileserver and salvager
discard all the data, then the volume header.

The latest problem is of course "signal 6".  That's generally from an
assertion failure, which probably writes a nice message out to stderr
(probably out to /dev/null), and might leave an equally useful core
dump, except linux defaults to no core dumps.  Oops.  We've now managed
to harvest one core dump; the interesting data from it is below.
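
As an aside, so we don't lose the next one: the daemon could raise its
own core limit at startup instead of relying on whatever ulimit it
inherited.  A minimal sketch of what one could add (this is not
something the current fileserver does, as far as I know; "ulimit -c
unlimited" in the init script works just as well):

	#include <sys/resource.h>

	/* Raise the core-dump limit from inside the process so an
	 * assertion failure actually leaves a core behind.  Running as
	 * root, the hard limit can be raised too. */
	static void
	EnableCoreDumps(void)
	{
	    struct rlimit rl;

	    rl.rlim_cur = RLIM_INFINITY;
	    rl.rlim_max = RLIM_INFINITY;
	    (void) setrlimit(RLIMIT_CORE, &rl);
	}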

[ in the below, I've obscured the following for privacy reasons:
	@IPADDR@ is a hexadecimal umich IP address belonging to
		some random machine about which we know little (though
		a transient portable or lab machine running windows or
		MacOS is quite possible.)
	@VICEID@ is a vice ID of some random and likely innocent person
		who might logically be using @IPADDR@.
]

(gdb) where
#0  0x400a6281 in __kill () at __kill:-1
#1  0x40021811 in pthread_kill (thread=17978, signo=0) at signals.c:65
#2  0x40021b1b in __pthread_raise (sig=1073904656) at signals.c:187
#3  0x400a5ec4 in *__GI_raise (sig=17978)
    at ../linuxthreads/sysdeps/unix/sysv/linux/raise.c:34
#4  0x400a75ed in *__GI_abort () at ../sysdeps/generic/abort.c:117
#5  0x08096603 in osi_Panic (msg=0x0, a1=0, a2=0, a3=0) at ../rx/rx_user.c:199
#6  0x08096637 in osi_AssertFailU (expr=0x0, file=0x0, line=0) at ../rx/rx_user.c:208
#7  0x080a47fc in rx_SetSpecific (conn=0x4002a510, key=1, ptr=0x0) at ../rx/rx.c:6632
#8  0x0805defc in h_TossStuff_r (host=@IPADDR@) at ../viced/host.c:765
#9  0x0805cb6d in h_Release_r (host=@IPADDR@) at ../viced/host.c:280
#10 0x0805ca71 in h_Release (host=0x0) at ../viced/host.c:258
#11 0x0805e14e in h_Enumerate (proc=0x8060fa0 <CheckHost>, param=0x0)
    at ../viced/host.c:913
#12 0x080613cd in h_CheckHosts () at ../viced/host.c:2080
#13 0x0804b6c7 in HostCheckLWP () at ../viced/viced.c:731
#14 0x4001ed03 in pthread_start_thread (arg=0xb31ffbe0) at manager.c:300

In case that's not completely obvious,
0x80a47e8 <rx_SetSpecific+312>: movl   $0x80c8a00,0x4(%esp,1)
0x80a47f0 <rx_SetSpecific+320>: movl   $0x80c7480,(%esp,1)
0x80a47f7 <rx_SetSpecific+327>: call   0x8096610 <osi_AssertFailU>
(gdb) x/8x $ebp
0xb31ffa14:     0xb31ffa44      0x080a47fc      0x080c7480      0x080c8a00
0xb31ffa24:     0x000019e5      0x400f0ce6      0x40a00010      0x08f3d84c
(gdb) x/i 0x080a47fc
0x80a47fc <rx_SetSpecific+332>: jmp    0x80a46db <rx_SetSpecific+43>
(gdb) x/s 0x080c7480
0x80c7480 <rcsid+3972>:  "pthread_mutex_lock(&conn->conn_data_lock) == 0"
(gdb) x/s 0x080c8a00
0x80c8a00 <rcsid+80>:    "../rx/rx.c"
(gdb) print/d 0x000019e5
$8 = 6629
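
Gdb showed those three arguments as NULL/0 in the backtrace, but
they're sitting right there on the stack; my reading of
osi_AssertFailU's frame:

	/* 0xb31ffa14: 0xb31ffa44   saved %ebp
	 * 0xb31ffa18: 0x080a47fc   return address, back into rx_SetSpecific
	 * 0xb31ffa1c: 0x080c7480   arg 1: the assertion string
	 * 0xb31ffa20: 0x080c8a00   arg 2: "../rx/rx.c"
	 * 0xb31ffa24: 0x000019e5   arg 3: the line number, 6629
	 */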

Basically, it's dying at line 6629 in rx/rx.c, which reads thusly:
    MUTEX_ENTER(&conn->conn_data_lock);
The return code from pthread_mutex_lock would be useful, but
the assertion logic in openafs doesn't save that.
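
For context, MUTEX_ENTER in the pthreaded build just asserts on the
return value of pthread_mutex_lock(), so the code is thrown away before
osi_Panic runs.  Roughly the following (paraphrased from the assertion
string above, not copied from the real macro), plus a hypothetical
variant that would stash the code where a core dump could find it --
"rx_last_mutex_error" is a name I just made up:

	#include <pthread.h>

	void osi_AssertFailU(const char *expr, const char *file, int line);

	/* roughly what the failing MUTEX_ENTER expands to today */
	#define MUTEX_ENTER(m) \
	    ((pthread_mutex_lock(m) == 0) ? (void) 0 : \
	     osi_AssertFailU("pthread_mutex_lock(" #m ") == 0", __FILE__, __LINE__))

	/* hypothetical variant: keep the pthread_mutex_lock() result around */
	static int rx_last_mutex_error;
	#define MUTEX_ENTER_SAVE(m) \
	    (((rx_last_mutex_error = pthread_mutex_lock(m)) == 0) ? (void) 0 : \
	     osi_AssertFailU("pthread_mutex_lock(" #m ") == 0", __FILE__, __LINE__))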

Here's that data structure:
(gdb) print *conn
$7 = {next = 0x0, peer = 0x0, conn_call_lock = {__m_reserved = -1289749536, 
    __m_count = -1291841536, __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
      __spinlock = 0}}, conn_call_cv = {__c_lock = {__status = 0, __spinlock = 0}, 
    __c_waiting = 0x0}, conn_data_lock = {__m_reserved = 0, __m_count = 0, 
    __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, __spinlock = 0}}, 
  epoch = 0, cid = 0, error = 0, call = {0x0, 0x0, 0x0, 0x0}, callNumber = {0, 0, 0, 0}, 
  serial = 0, lastSerial = 0, maxSerial = 0, challengeEvent = 0x0, 
  delayedAbortEvent = 0x0, checkReachEvent = 0x0, abortCount = 0, service = 0x0, 
  serviceId = 0, refCount = 0, flags = 0 '\0', type = 0 '\0', secondsUntilPing = 0 '\0', 
  securityIndex = 0 '\0', securityObject = 0x0, securityData = 0x0, 
  securityHeaderSize = 0, securityMaxTrailerSize = 0, timeout = 0, lastSendTime = 0, 
  secondsUntilDead = 0, hardDeadTime = 0, ackRate = 0 '\0', makeCallWaiters = 0 '\0', 
  nSpecific = 0, specific = 0x0}
(gdb) 

Unfortunately, that's not really the data structure in question.
The next frame up the stack looks like this:

(gdb) x/16x $ebp
0xb31ffa44:     0xb31ffa74      0x0805defc      0x08f3d820      0x00000001
0xb31ffa54:     0x00000000      0x40022050      0x00000000      0x40027c10
0xb31ffa64:     0xb31ffa74      0x00000003      0x00000008      0x08132ba0
0xb31ffa74:     0xb31ffa94      0x0805cb6d      0x08132ba0      0x00000000

The first 3 words here are the stack frame back link, the return
address, and the first parameter, 'conn'.  Gdb thinks that's

(gdb) print/x conn
$9 = 0x4002a510
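
That doesn't match the word on the stack.  Decoding rx_SetSpecific's
frame at 0xb31ffa44 by hand instead:

	/* 0xb31ffa44: 0xb31ffa74   saved %ebp (h_TossStuff_r's frame)
	 * 0xb31ffa48: 0x0805defc   return address, into h_TossStuff_r
	 * 0xb31ffa4c: 0x08f3d820   arg 1: conn
	 * 0xb31ffa50: 0x00000001   arg 2: key
	 * 0xb31ffa54: 0x00000000   arg 3: ptr
	 */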

So, yet again, source level debuggers lose.  (Admittedly, juggling -g
and -O is a hard problem.)  This is probably the real data:

(gdb) print *((struct rx_connection *) 0x08f3d820)
$10 = {next = 0x0, peer = 0x0, conn_call_lock = {__m_reserved = 0, __m_count = 0, 
    __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, __spinlock = 0}}, 
  conn_call_cv = {__c_lock = {__status = 0, __spinlock = 105}, 
    __c_waiting = 0xffffb923}, conn_data_lock = {__m_reserved = -15030, 
    __m_count = -14533, __m_owner = 0xffffc74e, __m_kind = -13409, __m_lock = {
      __status = -9283, __spinlock = -9021}}, epoch = 4294958286, cid = 4294960122, 
  error = -7173, call = {0xffffe70b, 0xffffe75f, 0xfffff36f, 0xfffff371}, callNumber = {
    4294964109, 4294964111, 4294965251, 4294965253}, serial = 4294966684, 
  lastSerial = 4294966684, maxSerial = -612, challengeEvent = 0xffffff9a, 
  delayedAbortEvent = 0xffffff9b, checkReachEvent = 0x22e35, abortCount = 1145384520, 
  service = 0x39, serviceId = 45936, refCount = 2308, flags = 16 '\020', type = 116 't', 
  secondsUntilPing = 158 '\236', securityIndex = 8 '\b', securityObject = 0x0, 
  securityData = 0x0, securityHeaderSize = 0, securityMaxTrailerSize = 0, timeout = 0, 
  lastSendTime = 0, secondsUntilDead = 0, hardDeadTime = 0, ackRate = 0 '\0', 
  makeCallWaiters = 0 '\0', nSpecific = 0, specific = 0x0}
(gdb) 

Hmm.  Looks kinda junky too.  But that's apparently what's really
there.  The caller of rx_SetSpecific() is h_TossStuff_r, and here's
the client structure it was probably working on:

(gdb) print *host->FirstClient->next->next->next
$16 = {next = 0x4242d5e4, host = @IPADDR@, sid = -562995324, tcon = 0x8f3d820, CPS = {
    prlist_len = 7, prlist_val = 0x0}, ViceId = @VICEID@, expTime = 1097781124, 
  LastCall = 1097601371, VenusEpoch = 2334476595, refCount = 0, deleted = 1 '\001', 
  authClass = 2 '\002', prfail = 0 '\0', InSameNetwork = 0 '\0', lock = {
    wait_states = 0 '\0', excl_locked = 0 '\0', readers_reading = 0 '\0', 
    num_waiting = 0 '\0', mutex = {__m_reserved = 0, __m_count = 0, __m_owner = 0x0, 
      __m_kind = 0, __m_lock = {__status = 0, __spinlock = 0}}, read_cv = {__c_lock = {
        __status = 0, __spinlock = 0}, __c_waiting = 0x0}, write_cv = {__c_lock = {
        __status = 0, __spinlock = 0}, __c_waiting = 0x0}}}

There's the same connection structure, here called "tcon",
and it looks like the logic should be calling rx_SetSpecific() on that.
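
From memory, the cleanup path in h_TossStuff_r does roughly the
following when it reaps deleted clients -- a paraphrase of the idea,
not a verbatim copy of host.c, and "rxcon_client_key" is the key name
as I remember it:

	/* For each deleted, unreferenced client hanging off the host,
	 * clear the client pointer rx keeps on the connection before
	 * freeing the client.  If client->tcon points at a connection rx
	 * has already torn down and recycled, the MUTEX_ENTER inside
	 * rx_SetSpecific hits exactly the assertion above. */
	struct client *client;
	for (client = host->FirstClient; client; client = client->next) {
	    if (client->deleted && client->refCount == 0) {
		if (client->tcon)
		    rx_SetSpecific(client->tcon, rxcon_client_key, (void *) 0);
		/* ...unlink the client from the list and free it... */
	    }
	}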

I don't see anything further in that client chain that has a non-zero tcon.
That doesn't mean there wasn't one; I kinda lost interest in looking for it
after I got to pasting this:
print *host->FirstClient->next->next->next->next->next->next->next->ne
xt->next->next->next->next->next->next->next->next->next->next->next->
next->next->next->next->next->next->next->next->next->next->next->next
->next->next->next->next->next->next->next->next->next->next->next->ne
xt->next->next->next->next->next->next->next->next->next->next->next
	(on one line not wrapped like this)
All but the first 3 were marked "deleted", and all but 2 near the
start belonged to the same person; those 2 were for ViceId 32766,
"nobody".

Anyways, I expect we'll get another core dump in a day or so.

					-Marcus Watts
					UM ITCS Umich Systems Group