[OpenAFS-devel] OpenAFS 1.2.7 fileserver repeatedly crashes in rxi_AttachServerProc around 4:30 in the morning

Rainer Toebbicke rtb@pclella.cern.ch
Tue, 17 Dec 2002 12:27:59 +0100


Hello,

We recently had *three* SEGV crashes, one per night on three separate days, each
around 4:30 in the morning. They occurred on an OpenAFS 1.2.7 fileserver running
Solaris 2.8, in rxi_AttachServerProc() at queue_Remove(call).

There is nothing odd in the FileLog, and I am not aware of anything peculiar
happening at 4:30.

The thread holds rx_serverPool_lock all right.
The call being removed seems to be (or to have been) the only one in the queue;
however, queue_Remove is not exactly a simple macro, so I could be wrong there.


  print &rx_incomingCallQueue
&rx_incomingCallQueue = 0x1aa450

(struct rx_queue *) call = 0x8693e8

((struct rx_queue *) call)->prev = 0x1aa450

((struct rx_queue *) call)->prev->next = (nil)
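
For readers who do not have the queue macros in front of them, here is a
minimal sketch of what "the only one in the queue" means for a circular,
doubly linked queue of this kind. It assumes the two-pointer layout (prev at
offset 0, next at offset 4) that the [%i0] and [%i0 + 0x4] accesses further
down imply. The SK_ and sk_ names are hypothetical stand-ins written for this
note, modelled on _QR and queue_Remove, and are not copied from the rx
sources. Note that an element is the sole entry only if both its prev and its
next point at the queue head; the debugger output above shows only the prev
side.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: prev first, next second. */
struct rx_queue {
    struct rx_queue *prev;
    struct rx_queue *next;
};

/* Hypothetical stand-ins for _Q, _QR and queue_Remove. */
#define SK_Q(x)      ((struct rx_queue *)(x))
/* Unlink i from its neighbours without touching i itself. */
#define SK_QR(i)     ((SK_Q(i)->prev->next = SK_Q(i)->next)->prev = SK_Q(i)->prev)
/* Unlink i, then zero its next pointer to mark it as removed. */
#define SK_REMOVE(i) (SK_QR(i), SK_Q(i)->next = NULL)

/* An empty queue is a head linked to itself. */
static void sk_init(struct rx_queue *head)
{
    head->prev = head->next = head;
}

/* Link e in just before the head, i.e. at the tail of the queue. */
static void sk_append(struct rx_queue *head, struct rx_queue *e)
{
    e->prev = head->prev;
    e->next = head;
    head->prev->next = e;
    head->prev = e;
}

/* True only if BOTH links of e point back at the head. */
static int sk_is_only_element(const struct rx_queue *head,
                              const struct rx_queue *e)
{
    return e->prev == head && e->next == head;
}

static int sk_is_empty(const struct rx_queue *head)
{
    return head->next == head;
}

/* Queue one element, remove it, and report whether the head is empty
 * again and the removed element has been marked with next == NULL. */
int sk_demo_single_remove(void)
{
    struct rx_queue head, call;

    sk_init(&head);
    sk_append(&head, &call);
    assert(sk_is_only_element(&head, &call));

    SK_REMOVE(&call);
    return sk_is_empty(&head) && head.prev == &head && call.next == NULL;
}
```

Calling sk_demo_single_remove() from a small main() should return 1 in the
healthy case: the head points back at itself and the removed element is left
with next == NULL.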

The assembly for the queue_Remove(call) line is:
0x000d11f8: rxi_AttachServerProc+0x0568:        ld      [%i0 + 0x4], %l0
0x000d11fc: rxi_AttachServerProc+0x056c:        st      %l0, [%fp - 0x14]
0x000d1200: rxi_AttachServerProc+0x0570:        ld      [%fp - 0x14], %l1
0x000d1204: rxi_AttachServerProc+0x0574:        ld      [%i0], %l0
0x000d1208: rxi_AttachServerProc+0x0578:        st      %l1, [%l0 + 0x4]
0x000d120c: rxi_AttachServerProc+0x057c:        ld      [%i0], %l1
0x000d1210: rxi_AttachServerProc+0x0580:        ld      [%fp - 0x14], %l0
0x000d1214: rxi_AttachServerProc+0x0584:        st      %l1, [%l0]
0x000d1218: rxi_AttachServerProc+0x0588:        st      %g0, [%i0 + 0x4]

and the registers are:

g0-g3    0x00000000 0x000ab000 0x00000000 0x00000000
g4-g7    0x00000000 0x00000000 0x00000000 0xfd509d78
o0-o3    0x00000000 0xff0ee000 0x001a7b50 0x00000000
o4-o7    0x00000000 0x00000000 0xfd509880 0x000d11ac
l0-l3    0x00000000 0x001aa450 0x00000000 0x00000000
l4-l7    0x00000000 0x00000000 0x00000000 0x00000001
i0-i3    0x008693e8 0xffffffff 0x00000000 0x00000000
i4-i7    0x001aec50 0x0070aa98 0xfd5098f8 0x000cdb14
y        0x003a1c8a
ccr      0xfe401004
pc       0x000d1214:rxi_AttachServerProc+0x584  st      %l1, [%l0]
npc      0x000d1218:rxi_AttachServerProc+0x588  st      %g0, [%i0 + 0x4]


From this I'd conclude that it's the _QR(i) part of queue_Remove that goes
wrong because of the '...->next = (nil)' above.
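
To make that reading easier to check, here is a sketch that spells out the
disassembled sequence as plain C, one statement per load/store group. It
assumes prev sits at offset 0 and next at offset 4, as the addressing in the
listing implies; sk_checked_remove and sk_demo_double_remove are hypothetical
names used only for this illustration. Unlike the macro, the sketch tests the
pointers before storing anything, so a second remove of the same element is
reported instead of faulting. Read this way, the zero in %l0 at the faulting
store would be the saved call->next, and the (nil) in call->prev->next could
simply be left behind by the first store, which had already completed. That
is only one interpretation of the dump.

```c
#include <stddef.h>

/* Assumed layout: prev at [%i0], next at [%i0 + 0x4]. */
struct rx_queue {
    struct rx_queue *prev;
    struct rx_queue *next;
};

/* Hypothetical guarded remove, annotated with the instructions from the
 * listing that each step corresponds to. Returns 0 on success, -1 if the
 * element does not look like it is on a queue. */
static int sk_checked_remove(struct rx_queue *i)
{
    struct rx_queue *next = i->next;  /* +0x568: ld [%i0 + 0x4], spilled to [%fp - 0x14] */
    struct rx_queue *prev = i->prev;  /* +0x574: ld [%i0] */

    /* The compiled macro has no such test; it stores through both
     * pointers unconditionally. */
    if (next == NULL || prev == NULL)
        return -1;

    prev->next = next;                /* +0x578: st %l1, [%l0 + 0x4] */
    next->prev = prev;                /* +0x584: st %l1, [%l0], the faulting pc */
    i->next = NULL;                   /* +0x588: st %g0, [%i0 + 0x4] */
    return 0;
}

/* Remove the same element twice: the first call succeeds and leaves
 * next == NULL, the second is rejected and the head stays intact. */
int sk_demo_double_remove(void)
{
    struct rx_queue head, call;
    int first, second;

    head.prev = head.next = &call;    /* queue holding exactly one element */
    call.prev = call.next = &head;

    first = sk_checked_remove(&call);
    second = sk_checked_remove(&call);

    return first == 0 && second == -1 &&
           head.next == &head && head.prev == &head;
}
```

With the unguarded macro, the second remove would instead write NULL into
head.next and then store through a NULL pointer.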

Any ideas?

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke        http://cern.ch/~rtb         rtb@mail.cern.ch  O__
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland   > |
Phone: +41 22 767 8985       Fax: +41 22 767 7155                     ( )\( )