[OpenAFS-devel] Another bosserver crash on Irix 6.5.20 (1.2.10-rc4)
Martin MOKREJŠ
mmokrejs@natur.cuni.cz
Tue, 29 Jul 2003 14:57:17 +0200 (CEST)
Hi,
the following happened to me:
bash-2.05b# bos status -server nmrindy -long
Instance ptserver, (type is simple) has core file, currently running normally.
Process last started at Sun Jul 27 04:00:04 2003 (1 proc starts)
Command 1 is '/usr/afs/bin/ptserver'
Instance vlserver, (type is simple) currently running normally.
Process last started at Sun Jul 27 04:00:04 2003 (1 proc starts)
Command 1 is '/usr/afs/bin/vlserver'
Instance fs, (type is fs) currently running normally.
Auxiliary status is: file server running.
Process last started at Sun Jul 27 04:00:04 2003 (2 proc starts)
Command 1 is '/usr/afs/bin/fileserver'
Command 2 is '/usr/afs/bin/volserver'
Command 3 is '/usr/afs/bin/salvager'
bash-2.05b# bos stop -server nmrindy -instance fs
bash-2.05b# bos stop -server nmrindy -instance vlserver
bash-2.05b# bos stop -server nmrindy -instance ptserver
bos: failed to change stop instance 'ptserver' (communications failure (-1))
bash-2.05b# ls -la /usr/afs/logs
total 3032
drwxr-xr-x 2 root sys 4096 Jul 29 13:50 .
drwxr-xr-x 7 root sys 58 Jun 19 20:01 ..
-rw-r--r-- 1 root sys 219 Jul 29 13:50 BosLog
-rw-r--r-- 1 root sys 274 Jul 27 04:00 BosLog.old
-rw-r--r-- 1 root sys 2033 Jul 29 13:50 FileLog
-rw-r--r-- 1 root sys 2033 Jul 27 04:00 FileLog.old
-rw-r--r-- 1 root sys 625 Jul 28 14:15 PtLog
-rw-r--r-- 1 root sys 68 Jul 25 12:57 PtLog.old
-rw-r--r-- 1 root sys 596 Jul 25 12:51 SalvageLog
-rw-r--r-- 1 root sys 597 Jul 25 12:45 SalvageLog.old
-rw-r--r-- 1 root sys 741 Jul 28 14:15 VLLog
-rw-r--r-- 1 root sys 325 Jul 25 12:57 VLLog.old
-rw-r--r-- 1 root sys 77 Jul 27 04:00 VolserLog
-rw-r--r-- 1 root sys 77 Jul 25 12:57 VolserLog.old
-rw-r--r-- 1 root sys 1257472 Jul 29 13:50 core
-rw-r--r-- 1 root sys 1732608 Jul 25 12:53 coreptserver
bash-2.05b# file /usr/afs/logs/core
/usr/afs/logs/core: IRIX N32 core dump of 'bosserver'
bash-2.05b# file /usr/afs/logs/coreptserver
/usr/afs/logs/coreptserver: IRIX N32 core dump of 'bosserver'
bash-2.05b# dbx /usr/afs/bin/bosserver /usr/afs/logs/core
dbx version 7.3.1 68542_Oct26 MR Oct 26 2000 17:50:34
Core from signal SIGBUS: Bus error
(dbx) where
> 0 rxi_FindConnection(0x3, 0xcf4de1b8, 0x1, 0xcf4de1b8, 0xcf4de1b8, 0x8315da4a, 0x1, 0x2) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx.c":2275, 0x10022a80]
1 rxi_ReceivePacket(0x100e1540, 0x0, 0xc3713b6f, 0xcf4de1b8, 0x0, 0x0, 0x0, 0x0) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx.c":2427, 0x10022df4]
2 rxi_ListenerProc(0x100abe30, 0x100abdcc, 0x100abdc8, 0x0, 0xcf4de1b8, 0x0, 0x0, 0x0) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx_lwp.c":281, 0x10031994]
3 rx_ListenerProc(0x3, 0xcf4de1b8, 0x1, 0xcf4de1b8, 0xcf4de1b8, 0x8315da4a, 0x1, 0x0) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx_lwp.c":319, 0x10031a54]
4 Create_Process_Part2(0x3, 0x1009fca8, 0x1, 0xcf4de1b8, 0xcf4de1b8, 0x8315da4a, 0x1, 0x0) ["/scratch2/openafs-1.2.10-rc4/src/lwp/lwp.c":740, 0x10035784]
5 savecontext(0x0, 0x0, 0x0, 0xcf4de1b8, 0xcf4de1b8, 0x8, 0x1, 0x0) ["/scratch2/openafs-1.2.10-rc4/src/lwp/process.c":199, 0x100366c0]
6 <Unknown>() [< unknown >, 0xfcfdfeff]
(dbx)
The ptserver process is still running on the machine.
In BosLog there is:
Tue Jul 29 13:50:45 2003: fs:vol exited on signal 15
Tue Jul 29 13:50:45 2003: fs:file exited with code 0
Tue Jul 29 13:50:52 2003: vlserver exited on signal 15
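For reading the two kinds of BosLog lines above: signal 15 is SIGTERM on most Unix systems, so "exited on signal 15" looks like a process that was asked to terminate, while "exited with code 0" is a normal exit. Below is a minimal sketch of how a supervising parent can tell the two apart from a wait status. The helper names describe_exit and terminate_child are made up for illustration, and the message wording is copied from the log, not from bosserver's source, which may do this differently.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Format a wait status in the style of the BosLog lines.
 * Hypothetical helper; not taken from bosserver. */
static void describe_exit(const char *name, int status, char *buf, size_t len)
{
    if (WIFSIGNALED(status))
        snprintf(buf, len, "%s exited on signal %d", name, WTERMSIG(status));
    else if (WIFEXITED(status))
        snprintf(buf, len, "%s exited with code %d", name, WEXITSTATUS(status));
    else
        snprintf(buf, len, "%s changed state (status 0x%x)", name, status);
}

/* Fork a child that just waits, send it SIGTERM, and return the wait
 * status the parent collects. Returns -1 if fork or waitpid fails. */
static int terminate_child(void)
{
    int status = -1;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        signal(SIGTERM, SIG_DFL);   /* make sure the default action applies */
        for (;;)
            pause();                /* wait until the parent signals us */
    }
    kill(pid, SIGTERM);
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

Passing the value returned by terminate_child() to describe_exit() with the name "vlserver" produces a line of the same form as the last BosLog entry.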
I tried to reproduce the error:
bash-2.05b# dbx -p 20523 /usr/afs/bin/bosserver
dbx version 7.3.1 68542_Oct26 MR Oct 26 2000 17:50:34
Ignoring /usr/afs/bin/bosserver in favor of -p 20523
Process 20523 (bosserver) stopped at [__select:17 +0x8,0xfaf67d4]
Source (of /xlv47/6.5.20f/work/irix/lib/libc/libc_n32_M3/sys/select.s) not available for Process 20523
(dbx) trace rxi_CleanupConnection
Process 20523: [3] trace rxi_CleanupConnection
(dbx) trace rxi_DestroyConnection
Process 20523: [4] trace rxi_DestroyConnection
(dbx) b rxi_DestroyConnection
no executable code found at line "/xlv47/6.5.20f/work/irix/lib/libc/libc_n32_M3/sys/select.s":268570304
(dbx) b rxi_CleanupConnection
no executable code found at line "/xlv47/6.5.20f/work/irix/lib/libc/libc_n32_M3/sys/select.s":268569808
(dbx) b Create_Process_Part2
no executable code found at line "/xlv47/6.5.20f/work/irix/lib/libc/libc_n32_M3/sys/select.s":268654384
(dbx) b rx_ListenerProc
Process 20523: Appropriate symbol not found for: rx_ListenerProc
<symbol not found>
(dbx) b rxi_ListenerProc
no executable code found at line "/xlv47/6.5.20f/work/irix/lib/libc/libc_n32_M3/sys/select.s":268637520
(dbx) b rxi_ReceivePacket
no executable code found at line "/xlv47/6.5.20f/work/irix/lib/libc/libc_n32_M3/sys/select.s":268578016
(dbx) b rxi_FindConnection
no executable code found at line "/xlv47/6.5.20f/work/irix/lib/libc/libc_n32_M3/sys/select.s":268577184
(dbx) c
At this point I started ptserver (bos start ...) and stopped it again. I
thought that would hit my breakpoints in bosserver, but it did not; the
process just kept running. After a few seconds, though, it crashed on its own:
Process 20523 (bosserver) stopped on signal SIGBUS: Bus error (default) at [rxi_CheckCall:5219 +0x8,0x100279f4]
5219 deadTime = (((afs_uint32)conn->secondsUntilDead << 10) +
(dbx) where
> 0 rxi_CheckCall(0x101589c8, 0x3, 0x0, 0x1006c7f0, 0x10164f70, 0x1, 0xc, 0x8) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx.c":5219, 0x100279f4]
1 rxi_ReapConnections(0x101589c8, 0x4ae356a, 0x0, 0x9eb10, 0x10164f70, 0x1, 0xc, 0x8) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx.c":5599, 0x10028578]
2 rxevent_RaiseEvents(0x100abd50, 0x3, 0x0, 0x1006c7f0, 0x10164f70, 0x1, 0x10139e64, 0x10139d14) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx_event.c":390, 0x10032820]
3 rxi_ListenerProc(0x100abe30, 0x100abdcc, 0x100abdc8, 0x0, 0x10164f70, 0x1, 0x0, 0x8) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx_lwp.c":189, 0x1003166c]
4 rx_ListenerProc(0x101589c8, 0x3, 0x0, 0x1006c7f0, 0x10164f70, 0x1, 0xc, 0x8) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx_lwp.c":319, 0x10031a54]
5 Create_Process_Part2(0x101589c8, 0x1009fca8, 0x0, 0x1006c7f0, 0x10164f70, 0x1, 0xc, 0x8) ["/scratch2/openafs-1.2.10-rc4/src/lwp/lwp.c":740, 0x10035784]
6 savecontext(0x0, 0x0, 0x0, 0x1006c7f0, 0x10164f70, 0x8, 0xc, 0x8) ["/scratch2/openafs-1.2.10-rc4/src/lwp/process.c":199, 0x100366c0]
7 <Unknown>() [< unknown >, 0xfcfdfeff]
(dbx) dump
rxi_CheckCall(0x101589c8, 0x3, 0x0, 0x1006c7f0, 0x10164f70, 0x1, 0xc, 0x8) ["/scratch2/openafs-1.2.10-rc4/src/rx/rx.c":5219, 0x100279f4]
(dbx) l
>*5219 deadTime = (((afs_uint32)conn->secondsUntilDead << 10) +
5220 ((afs_uint32)conn->peer->rtt >> 3) +
5221 ((afs_uint32)conn->peer->rtt_dev << 1) + 1023) >> 10;
5222 now = clock_Sec();
5223 /* These are computed to the second (+- 1 second). But that's
5224 * good enough for these values, which should be a significant
5225 * number of seconds. */
5226 if (now > (call->lastReceiveTime + deadTime)) {
5227 if (call->state == RX_STATE_ACTIVE) {
5228 rxi_CallError(call, RX_CALL_DEAD);
(dbx) printregs
r0/zero=0x0 r1/at=0x1006c9f8
r2/v0=0xfffffffffffffffe r3/v1=0x0
r4/a0=0x101589c8 r5/a1=0x3
r6/a2=0x0 r7/a3=0x1006c7f0
r8/a4=0x10164f70 r9/a5=0x1
r10/a6=0xc r11/a7=0x8
r12/t0=0x0 r13/t1=0x1
r14/t2=0x0 r15/t3=0x0
r16/s0=0x10164f70 r17/s1=0x10164f70
r18/s2=0x0 r19/s3=0x1008bbd0
r20/s4=0xfffffffffffffffe r21/s5=0x1
r22/s6=0x1008bd20 r23/s7=0x10164f80
r24/t8=0x0 r25/t9=0x100279d0
r26/k0=0x0 r27/k1=0x3b9f5e
r28/gp=0x10073300 r29/sp=0x100abc38
r30/s8/fp=0x0 r31/ra=0x10028580
mdlo=0x55730 mdhi=0x0
cause=0x10 pc=0x100279f4
fpcsr=0x00000000 sr=0x0
badvaddr=0x0 fpeir=0x0
fcc0=0x0 fcc1=0x0
fcc2=0x0 fcc3=0x0
fcc4=0x0 fcc5=0x0
fcc6=0x0 fcc7=0x0
f0=0.0000000e+00 f1=0.0000000e+00 f2=0.0000000e+00
f3=0.0000000e+00 f4=0.0000000e+00 f5=0.0000000e+00
f6=0.0000000e+00 f7=0.0000000e+00 f8=0.0000000e+00
f9=0.0000000e+00 f10=0.0000000e+00 f11=0.0000000e+00
f12=0.0000000e+00 f13=0.0000000e+00 f14=0.0000000e+00
f15=0.0000000e+00 f16=0.0000000e+00 f17=0.0000000e+00
f18=0.0000000e+00 f19=0.0000000e+00 f20=0.0000000e+00
f21=0.0000000e+00 f22=0.0000000e+00 f23=0.0000000e+00
f24=0.0000000e+00 f25=0.0000000e+00 f26=0.0000000e+00
f27=0.0000000e+00 f28=0.0000000e+00 f29=0.0000000e+00
f30=0.0000000e+00 f31=0.0000000e+00
d0=1.600000000000000e+01 d1=0.000000000000000e+00
d2=1.000000000000000e+00 d3=0.000000000000000e+00
d4=0.000000000000000e+00 d5=0.000000000000000e+00
d6=0.000000000000000e+00 d7=0.000000000000000e+00
d8=0.000000000000000e+00 d9=0.000000000000000e+00
d10=0.000000000000000e+00 d11=0.000000000000000e+00
More (n if no)?y
d12=0.000000000000000e+00 d13=0.000000000000000e+00
d14=0.000000000000000e+00 d15=0.000000000000000e+00
d16=0.000000000000000e+00 d17=0.000000000000000e+00
d18=0.000000000000000e+00 d19=0.000000000000000e+00
d20=0.000000000000000e+00 d21=0.000000000000000e+00
d22=0.000000000000000e+00 d23=0.000000000000000e+00
d24=0.000000000000000e+00 d25=0.000000000000000e+00
d26=0.000000000000000e+00 d27=0.000000000000000e+00
d28=0.000000000000000e+00 d29=0.000000000000000e+00
d30=0.000000000000000e+00 d31=0.000000000000000e+00
(dbx) showproc
Process 20523 (bosserver) stopped on signal SIGBUS: Bus error (default)
(dbx) active
Process 20523 (bosserver) is active
(dbx)
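To make the faulting statement at rx.c:5219-5221 easier to reason about, here is a minimal standalone sketch of the same arithmetic. The structs peer_model and conn_model and the function dead_time are hypothetical stand-ins, not the real rx types, and the field types are assumptions. The statement reads through two pointers (conn, then conn->peer), and badvaddr=0x0 in the register dump may mean one of them was null or stale, though I have not confirmed that. The sketch guards both so the failure is visible instead of fatal.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the rx structures. Only the fields used on
 * lines 5219-5221 are modelled; the field types are assumptions. */
struct peer_model {
    int rtt;
    int rtt_dev;
};

struct conn_model {
    struct peer_model *peer;
    unsigned short secondsUntilDead;
};

/* Same expression as in the dbx listing: the seconds value is shifted
 * left by 10, the rtt terms are added, and adding 1023 before the final
 * right shift rounds the result up to the next whole unit.
 * Returns 1 and stores the result in *out, or returns 0 without touching
 * *out when conn or conn->peer is NULL. */
static int dead_time(const struct conn_model *conn, uint32_t *out)
{
    if (conn == NULL || conn->peer == NULL)
        return 0;

    *out = (((uint32_t)conn->secondsUntilDead << 10) +
            ((uint32_t)conn->peer->rtt >> 3) +
            ((uint32_t)conn->peer->rtt_dev << 1) + 1023) >> 10;
    return 1;
}
```

With secondsUntilDead = 12 and both rtt fields at zero this gives 12; any nonzero rtt contribution pushes it up to 13 because of the round-up. In dbx, the equivalent check would be to print the conn and conn->peer pointers in the rxi_CheckCall frame before the expression is evaluated.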
Could anyone help? Thanks!
--
Martin Mokrejs <mmokrejs@natur.cuni.cz>, <m.mokrejs@gsf.de>
PGP5.0i key is at http://www.natur.cuni.cz/~mmokrejs
MIPS / Institute for Bioinformatics <http://mips.gsf.de>
GSF - National Research Center for Environment and Health
Ingolstaedter Landstrasse 1, D-85764 Neuherberg, Germany
tel.: +49-89-3187 3683 , fax: +49-89-3187 3585