[OpenAFS-devel] Another bosserver crash on Irix 6.5.20 (1.2.10-rc4)

Martin MOKREJŠ mmokrejs@natur.cuni.cz
Tue, 29 Jul 2003 14:57:17 +0200 (CEST)


Hi,
  the following happened to me:

bash-2.05b# bos status -server nmrindy -long
Instance ptserver, (type is simple) has core file, currently running normally.
    Process last started at Sun Jul 27 04:00:04 2003 (1 proc starts)
    Command 1 is '/usr/afs/bin/ptserver'

Instance vlserver, (type is simple) currently running normally.
    Process last started at Sun Jul 27 04:00:04 2003 (1 proc starts)
    Command 1 is '/usr/afs/bin/vlserver'

Instance fs, (type is fs) currently running normally.
    Auxiliary status is: file server running.
    Process last started at Sun Jul 27 04:00:04 2003 (2 proc starts)
    Command 1 is '/usr/afs/bin/fileserver'
    Command 2 is '/usr/afs/bin/volserver'
    Command 3 is '/usr/afs/bin/salvager'

bash-2.05b# bos stop  -server nmrindy -instance fs
bash-2.05b# bos stop  -server nmrindy -instance vlserver
bash-2.05b# bos stop  -server nmrindy -instance ptserver
bos: failed to change stop instance 'ptserver' (communications failure (-1))
bash-2.05b# ls -la /usr/afs/logs
total 3032
drwxr-xr-x    2 root     sys          4096 Jul 29 13:50 .
drwxr-xr-x    7 root     sys            58 Jun 19 20:01 ..
-rw-r--r--    1 root     sys           219 Jul 29 13:50 BosLog
-rw-r--r--    1 root     sys           274 Jul 27 04:00 BosLog.old
-rw-r--r--    1 root     sys          2033 Jul 29 13:50 FileLog
-rw-r--r--    1 root     sys          2033 Jul 27 04:00 FileLog.old
-rw-r--r--    1 root     sys           625 Jul 28 14:15 PtLog
-rw-r--r--    1 root     sys            68 Jul 25 12:57 PtLog.old
-rw-r--r--    1 root     sys           596 Jul 25 12:51 SalvageLog
-rw-r--r--    1 root     sys           597 Jul 25 12:45 SalvageLog.old
-rw-r--r--    1 root     sys           741 Jul 28 14:15 VLLog
-rw-r--r--    1 root     sys           325 Jul 25 12:57 VLLog.old
-rw-r--r--    1 root     sys            77 Jul 27 04:00 VolserLog
-rw-r--r--    1 root     sys            77 Jul 25 12:57 VolserLog.old
-rw-r--r--    1 root     sys       1257472 Jul 29 13:50 core
-rw-r--r--    1 root     sys       1732608 Jul 25 12:53 coreptserver
bash-2.05b# file /usr/afs/logs/core
/usr/afs/logs/core:     IRIX N32 core dump of 'bosserver'
bash-2.05b# file /usr/afs/logs/coreptserver
/usr/afs/logs/coreptserver:     IRIX N32 core dump of 'bosserver'
bash-2.05b# dbx /usr/afs/bin/bosserver /usr/afs/logs/core
dbx version 7.3.1 68542_Oct26 MR Oct 26 2000 17:50:34
Core from signal SIGBUS: Bus error
(dbx) where
>  0 rxi_FindConnection(0x3, 0xcf4de1b8, 0x1, 0xcf4de1b8, 0xcf4de1b8, 0x8315da4a, 0x1, 0x2) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx.c":2275, 0x10022a80]
   1 rxi_ReceivePacket(0x100e1540, 0x0, 0xc3713b6f, 0xcf4de1b8, 0x0, 0x0, 0x0, 0x0) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx.c":2427, 0x10022df4]
   2 rxi_ListenerProc(0x100abe30, 0x100abdcc, 0x100abdc8, 0x0, 0xcf4de1b8, 0x0, 0x0, 0x0) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx_lwp.c":281, 0x10031994]
   3 rx_ListenerProc(0x3, 0xcf4de1b8, 0x1, 0xcf4de1b8, 0xcf4de1b8, 0x8315da4a, 0x1, 0x0) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx_lwp.c":319, 0x10031a54]
   4 Create_Process_Part2(0x3, 0x1009fca8, 0x1, 0xcf4de1b8, 0xcf4de1b8, 0x8315da4a, 0x1, 0x0) ["/scratch2/openafs-1.2.10-rc4/src/lwp/lwp.c":740, 0x10035784]
   5 savecontext(0x0, 0x0, 0x0, 0xcf4de1b8, 0xcf4de1b8, 0x8, 0x1, 0x0) ["/scratch2/openafs-1.2.10-rc4/src/lwp/process.c":199, 0x100366c0]
   6 <Unknown>() [< unknown >, 0xfcfdfeff]
(dbx)


The ptserver process is still running on the machine.

In BosLog there is:

Tue Jul 29 13:50:45 2003: fs:vol exited on signal 15
Tue Jul 29 13:50:45 2003: fs:file exited with code 0
Tue Jul 29 13:50:52 2003: vlserver exited on signal 15

I then tried to reproduce the error:

bash-2.05b# dbx -p 20523 /usr/afs/bin/bosserver
dbx version 7.3.1 68542_Oct26 MR Oct 26 2000 17:50:34
Ignoring /usr/afs/bin/bosserver  in favor of -p 20523
Process 20523 (bosserver) stopped at [__select:17 +0x8,0xfaf67d4]
         Source (of /xlv47/6.5.20f/work/irix/lib/libc/libc_n32_M3/sys/select.s) not available for Process 20523
(dbx) trace rxi_CleanupConnection
Process 20523: [3] trace rxi_CleanupConnection
(dbx) trace rxi_DestroyConnection
Process 20523: [4] trace rxi_DestroyConnection
(dbx) b rxi_DestroyConnection
no executable code found at line "/xlv47/6.5.20f/work/irix/lib/libc/libc_n32_M3/sys/select.s":268570304
(dbx) b rxi_CleanupConnection
no executable code found at line "/xlv47/6.5.20f/work/irix/lib/libc/libc_n32_M3/sys/select.s":268569808
(dbx) b Create_Process_Part2
no executable code found at line "/xlv47/6.5.20f/work/irix/lib/libc/libc_n32_M3/sys/select.s":268654384
(dbx) b rx_ListenerProc
Process 20523: Appropriate symbol not found for: rx_ListenerProc
<symbol not found>
(dbx) b rxi_ListenerProc
no executable code found at line "/xlv47/6.5.20f/work/irix/lib/libc/libc_n32_M3/sys/select.s":268637520
(dbx) b rxi_ReceivePacket
no executable code found at line "/xlv47/6.5.20f/work/irix/lib/libc/libc_n32_M3/sys/select.s":268578016
(dbx) b rxi_FindConnection
no executable code found at line "/xlv47/6.5.20f/work/irix/lib/libc/libc_n32_M3/sys/select.s":268577184
(dbx) c
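
A note on the "no executable code found at line ..." replies above: the
large decimal numbers look like the functions' own addresses, which would
mean dbx took each `b` argument as a line number in the current source
file (select.s) rather than as a function name. That is an inference from
the numbers, not something dbx reports. A minimal C sketch of the
arithmetic, comparing each number with the frame address of the same
function in the core backtrace; the helper names are made up for
illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: distance from the decimal value dbx printed for a
 * function to the pc of that function's frame in the backtrace. A small
 * positive distance is what one would expect if the decimal value is the
 * function's entry address. */
uint32_t offset_into_function(uint32_t dbx_decimal, uint32_t frame_pc)
{
    assert(frame_pc >= dbx_decimal);
    return frame_pc - dbx_decimal;
}

/* Prints the numbers from the session in hex next to the frame addresses. */
void print_breakpoint_addresses(void)
{
    static const struct {
        const char *name;
        uint32_t dbx_decimal; /* from the "no executable code" reply */
        uint32_t frame_pc;    /* from the first "where" output */
    } rows[] = {
        { "rxi_FindConnection",   268577184u, 0x10022a80u },
        { "rxi_ReceivePacket",    268578016u, 0x10022df4u },
        { "rxi_ListenerProc",     268637520u, 0x10031994u },
        { "Create_Process_Part2", 268654384u, 0x10035784u },
    };

    for (size_t i = 0; i < sizeof rows / sizeof rows[0]; i++) {
        printf("%-22s 0x%08x  frame pc 0x%08x  (+0x%x)\n",
               rows[i].name,
               (unsigned)rows[i].dbx_decimal,
               (unsigned)rows[i].frame_pc,
               (unsigned)offset_into_function(rows[i].dbx_decimal,
                                              rows[i].frame_pc));
    }
}
```

Every frame address lands a few hundred bytes or less past the matching
number, which fits the reading that no function breakpoint was actually
set in this session.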

Here I started ptserver (bos start ...) and stopped it again. I thought
bosserver would hit my breakpoints, but it did not; it just kept running.
After a few seconds, however, it crashed on its own:

Process 20523 (bosserver) stopped on signal SIGBUS: Bus error (default) at [rxi_CheckCall:5219 +0x8,0x100279f4]
5219  deadTime = (((afs_uint32)conn->secondsUntilDead << 10) +
(dbx) where
>  0 rxi_CheckCall(0x101589c8, 0x3, 0x0, 0x1006c7f0, 0x10164f70, 0x1, 0xc, 0x8) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx.c":5219, 0x100279f4]
   1 rxi_ReapConnections(0x101589c8, 0x4ae356a, 0x0, 0x9eb10, 0x10164f70, 0x1, 0xc, 0x8) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx.c":5599, 0x10028578]
   2 rxevent_RaiseEvents(0x100abd50, 0x3, 0x0, 0x1006c7f0, 0x10164f70, 0x1, 0x10139e64, 0x10139d14) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx_event.c":390, 0x10032820]
   3 rxi_ListenerProc(0x100abe30, 0x100abdcc, 0x100abdc8, 0x0, 0x10164f70, 0x1, 0x0, 0x8) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx_lwp.c":189, 0x1003166c]
   4 rx_ListenerProc(0x101589c8, 0x3, 0x0, 0x1006c7f0, 0x10164f70, 0x1, 0xc, 0x8) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx_lwp.c":319, 0x10031a54]
   5 Create_Process_Part2(0x101589c8, 0x1009fca8, 0x0, 0x1006c7f0, 0x10164f70, 0x1, 0xc, 0x8) ["/scratch2/openafs-1.2.10-rc4/src/lwp/lwp.c":740, 0x10035784]
   6 savecontext(0x0, 0x0, 0x0, 0x1006c7f0, 0x10164f70, 0x8, 0xc, 0x8) ["/scratch2/openafs-1.2.10-rc4/src/lwp/process.c":199, 0x100366c0]
   7 <Unknown>() [< unknown >, 0xfcfdfeff]
(dbx) dump
rxi_CheckCall(0x101589c8, 0x3, 0x0, 0x1006c7f0, 0x10164f70, 0x1, 0xc, 0x8) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx.c":5219, 0x100279f4]
(dbx) l
>*5219      deadTime = (((afs_uint32)conn->secondsUntilDead << 10) +
  5220                  ((afs_uint32)conn->peer->rtt >> 3) +
  5221                  ((afs_uint32)conn->peer->rtt_dev << 1) + 1023) >> 10;
  5222      now = clock_Sec();
  5223      /* These are computed to the second (+- 1 second).  But that's
  5224       * good enough for these values, which should be a significant
  5225       * number of seconds. */
  5226      if (now > (call->lastReceiveTime + deadTime)) {
  5227          if (call->state == RX_STATE_ACTIVE) {
  5228            rxi_CallError(call, RX_CALL_DEAD);
(dbx) printregs
r0/zero=0x0     r1/at=0x1006c9f8
r2/v0=0xfffffffffffffffe        r3/v1=0x0
r4/a0=0x101589c8        r5/a1=0x3
r6/a2=0x0       r7/a3=0x1006c7f0
r8/a4=0x10164f70        r9/a5=0x1
r10/a6=0xc      r11/a7=0x8
r12/t0=0x0      r13/t1=0x1
r14/t2=0x0      r15/t3=0x0
r16/s0=0x10164f70       r17/s1=0x10164f70
r18/s2=0x0      r19/s3=0x1008bbd0
r20/s4=0xfffffffffffffffe       r21/s5=0x1
r22/s6=0x1008bd20       r23/s7=0x10164f80
r24/t8=0x0      r25/t9=0x100279d0
r26/k0=0x0      r27/k1=0x3b9f5e
r28/gp=0x10073300       r29/sp=0x100abc38
r30/s8/fp=0x0   r31/ra=0x10028580

mdlo=0x55730    mdhi=0x0
cause=0x10      pc=0x100279f4
fpcsr=0x00000000        sr=0x0
badvaddr=0x0    fpeir=0x0
fcc0=0x0        fcc1=0x0
fcc2=0x0        fcc3=0x0
fcc4=0x0        fcc5=0x0
fcc6=0x0        fcc7=0x0

f0=0.0000000e+00        f1=0.0000000e+00        f2=0.0000000e+00
f3=0.0000000e+00        f4=0.0000000e+00        f5=0.0000000e+00
f6=0.0000000e+00        f7=0.0000000e+00        f8=0.0000000e+00
f9=0.0000000e+00        f10=0.0000000e+00       f11=0.0000000e+00
f12=0.0000000e+00       f13=0.0000000e+00       f14=0.0000000e+00
f15=0.0000000e+00       f16=0.0000000e+00       f17=0.0000000e+00
f18=0.0000000e+00       f19=0.0000000e+00       f20=0.0000000e+00
f21=0.0000000e+00       f22=0.0000000e+00       f23=0.0000000e+00
f24=0.0000000e+00       f25=0.0000000e+00       f26=0.0000000e+00
f27=0.0000000e+00       f28=0.0000000e+00       f29=0.0000000e+00
f30=0.0000000e+00       f31=0.0000000e+00
d0=1.600000000000000e+01        d1=0.000000000000000e+00
d2=1.000000000000000e+00        d3=0.000000000000000e+00
d4=0.000000000000000e+00        d5=0.000000000000000e+00
d6=0.000000000000000e+00        d7=0.000000000000000e+00
d8=0.000000000000000e+00        d9=0.000000000000000e+00
d10=0.000000000000000e+00       d11=0.000000000000000e+00
More (n if no)?y
d12=0.000000000000000e+00       d13=0.000000000000000e+00
d14=0.000000000000000e+00       d15=0.000000000000000e+00
d16=0.000000000000000e+00       d17=0.000000000000000e+00
d18=0.000000000000000e+00       d19=0.000000000000000e+00
d20=0.000000000000000e+00       d21=0.000000000000000e+00
d22=0.000000000000000e+00       d23=0.000000000000000e+00
d24=0.000000000000000e+00       d25=0.000000000000000e+00
d26=0.000000000000000e+00       d27=0.000000000000000e+00
d28=0.000000000000000e+00       d29=0.000000000000000e+00
d30=0.000000000000000e+00       d31=0.000000000000000e+00
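
The cause=0x10 value in the dump above can be decoded by hand. On MIPS the
exception code (ExcCode) sits in bits 2 through 6 of the Cause register.
The sketch below assumes dbx prints the raw register value; the function
names are illustrative and only a few codes are named:

```c
#include <stdint.h>

/* Extract the ExcCode field (bits 2..6) from a MIPS Cause register value. */
unsigned cause_exc_code(uint32_t cause)
{
    return (cause >> 2) & 0x1fu;
}

/* Name a few of the standard MIPS exception codes. */
const char *exc_code_name(unsigned code)
{
    switch (code) {
    case 2:  return "TLBL (TLB miss on load or instruction fetch)";
    case 3:  return "TLBS (TLB miss on store)";
    case 4:  return "AdEL (address error on load or instruction fetch)";
    case 5:  return "AdES (address error on store)";
    case 6:  return "IBE (bus error on instruction fetch)";
    case 7:  return "DBE (bus error on data access)";
    default: return "other";
    }
}
```

With cause=0x10 this gives code 4, AdEL. An address error is raised for a
misaligned access or a user-mode access to a kernel address, so it points
at a bad pointer value in a load rather than at a missing page. Which
pointer was bad is not determined by this decoding.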


(dbx) showproc
Process 20523 (bosserver) stopped on signal SIGBUS: Bus error (default)
(dbx) active
Process 20523 (bosserver) is active
(dbx)
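
For readers following the listing at rx.c:5219-5221: the statement reads
three fields through conn and conn->peer, so it is the first place in
rxi_CheckCall where a bad conn or peer pointer would fault. The arithmetic
itself scales the timeout by 1024, adds the two round-trip terms, and
rounds up to a whole unit. A minimal sketch with the three inputs passed
in directly; the function and parameter names are illustrative, and the
units of rtt and rtt_dev are not stated in the listing:

```c
#include <stdint.h>

/* Same expression as in the listing, without the pointer dereferences. */
uint32_t dead_time(uint32_t seconds_until_dead, uint32_t rtt,
                   uint32_t rtt_dev)
{
    return ((seconds_until_dead << 10) /* scale by 1024 */
            + (rtt >> 3)               /* rtt divided by 8 */
            + (rtt_dev << 1)           /* rtt_dev times 2 */
            + 1023)                    /* round up before scaling back */
           >> 10;
}
```

For example, dead_time(12, 0, 0) is 12, while any non-zero round-trip
contribution, such as dead_time(12, 8, 0), rounds up to 13.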

Could anyone help? Thanks!
-- 
Martin Mokrejs <mmokrejs@natur.cuni.cz>, <m.mokrejs@gsf.de>
PGP5.0i key is at http://www.natur.cuni.cz/~mmokrejs
MIPS / Institute for Bioinformatics <http://mips.gsf.de>
GSF - National Research Center for Environment and Health
Ingolstaedter Landstrasse 1, D-85764 Neuherberg, Germany
tel.: +49-89-3187 3683 , fax: +49-89-3187 3585