[OpenAFS] OpenAFS 1.4.5 on Solaris 10/x86 - spurious pid(s) in ctstat and
occasional kernel panics
Robert Sturrock
rns@unimelb.edu.au
Tue, 27 Nov 2007 17:17:34 +1100
Hi All.
We're noticing some problems on some new Solaris 10U4 (x86, kernel
120012-14) machines we've deployed as OpenAFS 1.4.5 clients. There
seem to be spurious process-ids showing up when we run "ctstat -v":
$ ctstat -vi 48
CTID ZONEID TYPE STATE HOLDER EVENTS QTIME NTIME
48 0 process owned 7 0 - -
cookie: 0x20
informative event set: none
critical event set: core signal hwerr empty
fatal event set: none
parameter set: inherit regent
member processes: 335 336 338 343 344 345 346 347 1936749667
inherited contracts: none
Pids 335-347 are afsd pids, but 1936749667 is obviously not valid!
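As a rough sanity check on that bogus value, here is a minimal Python sketch that compares it with the pid limit and looks at its raw bytes. The limit of 30000 is an assumption (the usual Solaris default for pidmax, not something checked on these machines), and the names SUSPECT_PID, ASSUMED_PID_MAX and decode_pid are made up for illustration. The four bytes turn out to be printable ASCII, which might suggest that string data is being read where a pid was expected, though that is only a guess.

```python
import struct

SUSPECT_PID = 1936749667   # value reported by "ctstat -vi 48" above
ASSUMED_PID_MAX = 30000    # assumption: default Solaris pid limit (pidmax)

def decode_pid(pid):
    """Return the pid's four bytes as laid out in little-endian (x86) memory."""
    return struct.pack("<I", pid)

print(SUSPECT_PID > ASSUMED_PID_MAX)  # True: far outside any plausible pid range
print(hex(SUSPECT_PID))               # 0x73707463
print(decode_pid(SUSPECT_PID))        # b'ctps'
```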
We run AFS via SMF. Shutting down AFS seems to result in an infinite
loop and we see these messages in the SMF log:
# tail /var/svc/log/site-openafs-client:default.log
[ Nov 27 12:06:57 Method or service exit timed out. Killing contract 47 ]
[ Nov 27 12:06:58 Method or service exit timed out. Killing contract 47 ]
[ Nov 27 12:06:59 Method or service exit timed out. Killing contract 47 ]
[ Nov 27 12:07:01 Method or service exit timed out. Killing contract 47 ]
[ Nov 27 12:07:02 Method or service exit timed out. Killing contract 47 ]
[ Nov 27 12:07:03 Method or service exit timed out. Killing contract 47 ]
[ Nov 27 12:07:04 Method or service exit timed out. Killing contract 47 ]
[ Nov 27 12:07:05 Method or service exit timed out. Killing contract 47 ]
[...]
(We can of course run AFS outside of SMF, but the underlying problem is still
there.)
When we try to reboot the machine, it does not reboot cleanly, and if
I run a "ctstat -v" while it's shutting down it frequently kernel
panics. (The kernel panics can happen even *after* I shut down AFS
manually first!)
panic[cpu0]/thread=ffffffff820cdc80: BAD TRAP: type=e (#pf Page fault) rp=fffffe80005ccb80 addr=fffffffffffff4e8
ctstat: #pf Page fault
Bad kernel fault at addr=0xfffffffffffff4e8
pid=497, pc=0xfffffffffb9d26f5, sp=0xfffffe80005ccc70, eflags=0x10282
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f0<xmme,fxsr,pge,mce,pae,pse>
cr2: fffffffffffff4e8 cr3: f7fd000 cr8: c
rdi: ffffffff874823c0 rsi: ffffffff81d6da60 rdx: 0
rcx: fffffffffffff438 r8: fffffe80005ccd30 r9: 100000
rax: fffffffffffff438 rbx: 6 rbp: fffffe80005cccf0
r10: fffffffffbc64860 r11: 0 r12: ffffffff87482200
r13: ffffffff874823b0 r14: ffffffff874822b8 r15: d
fsb: ffffffff80000000 gsb: fffffffffbc25460 ds: 43
es: 43 fs: 0 gs: 1c3
trp: e err: 0 rip: fffffffffb9d26f5
cs: 28 rfl: 10282 rsp: fffffe80005ccc70
ss: 30
fffffe80005cca90 unix:real_mode_end+7051 ()
fffffe80005ccb70 unix:trap+d86 ()
fffffe80005ccb80 unix:cmntrap+13f ()
fffffe80005cccf0 genunix:contract_process_status+155 ()
fffffe80005ccdb0 ctfs:ctfs_stat_ioctl+10e ()
fffffe80005ccde0 genunix:fop_ioctl+25 ()
fffffe80005ccec0 genunix:ioctl+ac ()
fffffe80005ccf10 unix:brand_sys_syscall32+1a3 ()
syncing file systems... done
dumping to /dev/md/dsk/d10, offset 1719074816, content: kernel
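For what it's worth, the register dump above can be cross-checked with a little arithmetic. The sketch below (the helper name to_signed64 is hypothetical) shows that the faulting address in cr2 sits 0xb0 bytes past the value in rax/rcx, and that this value is a small negative number when read as a signed 64-bit integer. That could be consistent with contract_process_status following a bogus pointer and reading a field at offset 0xb0 from it, but this is only a reading of the dump, not a confirmed diagnosis.

```python
FAULT_ADDR = 0xfffffffffffff4e8  # cr2 / "Bad kernel fault at addr" in the panic
RAX = 0xfffffffffffff438         # rax (rcx holds the same value) in the dump

def to_signed64(value):
    """Interpret an unsigned 64-bit register value as a signed integer."""
    return value - (1 << 64) if value & (1 << 63) else value

offset = FAULT_ADDR - RAX
print(hex(offset))       # 0xb0
print(to_signed64(RAX))  # -3016
```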
This problem seems similar to the one that's described here:
http://www.openafs.org/pipermail/openafs-info/2005-October/019765.html
We're running the prebuilt 1.4.5 Solaris10/x86 binaries from
openafs.org, although we put a locally built one in as a test and
it displayed the same symptoms.
Any help appreciated.
Regards,
Robert.