[OpenAFS] OpenAFS 1.4.5 on Solaris 10/x86 - spurious pid(s) in ctstat and occasional kernel panics

Tue, 27 Nov 2007 17:17:34 +1100

Hi All.

We're noticing some problems on some new Solaris 10U4 (x86, kernel
120012-14) machines we've deployed as OpenAFS 1.4.5 clients.  There
seem to be spurious process-ids showing up when we run "ctstat -v":

    $ ctstat -vi 48   
    CTID    ZONEID  TYPE    STATE   HOLDER  EVENTS  QTIME   NTIME   
    48      0       process owned   7       0       -       -       
	    cookie:                0x20
	    informative event set: none
	    critical event set:    core signal hwerr empty
	    fatal event set:       none
	    parameter set:         inherit regent
	    member processes:      335 336 338 343 344 345 346 347 1936749667
	    inherited contracts:   none

pids 335-347 are afsd pids, but 1936749667 is obviously not valid!
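
As a rough sanity check (a sketch only, not something ctstat does
itself), the member list can be filtered for values no real pid could
have, and the odd one decoded as raw bytes.  The 999999 limit is an
assumption based on the documented maximum of the Solaris pidmax
tunable (the default is 30000); the helper names are made up for
illustration.

```python
import struct

# "member processes" line copied from the ctstat output above.
member_line = "member processes:      335 336 338 343 344 345 346 347 1936749667"

# Assumption: 999999 is the largest value the Solaris pidmax tunable accepts,
# so no real pid should exceed it (the default pidmax is 30000).
PID_LIMIT = 999999

def suspicious_pids(line, limit=PID_LIMIT):
    """Return the pids on a ctstat 'member processes' line that exceed limit."""
    pids = [int(tok) for tok in line.split(":", 1)[1].split()]
    return [p for p in pids if p > limit]

def as_ascii(pid):
    """Show a 32-bit pid as hex and as little-endian bytes decoded as ASCII."""
    raw = struct.pack("<I", pid)
    return hex(pid), raw.decode("ascii", errors="replace")

for pid in suspicious_pids(member_line):
    print(pid, *as_ascii(pid))
```

On this box that prints "1936749667 0x73707463 ctps", i.e. the bogus
pid is four printable ASCII characters when read little-endian.  That
may be a coincidence, but it could also mean that string data is being
read where a pid was expected.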

We run AFS via SMF.  Shutting down AFS seems to result in an infinite
loop and we see these messages in the SMF log:

    # tail /var/svc/log/site-openafs-client:default.log
    [ Nov 27 12:06:57 Method or service exit timed out.  Killing contract 47 ]
    [ Nov 27 12:06:58 Method or service exit timed out.  Killing contract 47 ]
    [ Nov 27 12:06:59 Method or service exit timed out.  Killing contract 47 ]
    [ Nov 27 12:07:01 Method or service exit timed out.  Killing contract 47 ]
    [ Nov 27 12:07:02 Method or service exit timed out.  Killing contract 47 ]
    [ Nov 27 12:07:03 Method or service exit timed out.  Killing contract 47 ]
    [ Nov 27 12:07:04 Method or service exit timed out.  Killing contract 47 ]
    [ Nov 27 12:07:05 Method or service exit timed out.  Killing contract 47 ]

    [...]

(We can of course run AFS outside of SMF, but the underlying problem is still
there.)

When we try to reboot the machine, it does not reboot cleanly, and if
I run a "ctstat -v" while it's shutting down it frequently kernel
panics.  (The kernel panics can happen even *after* I shut down AFS
manually first!)

    panic[cpu0]/thread=ffffffff820cdc80: BAD TRAP: type=e (#pf Page fault) rp=fffffe80005ccb80 addr=fffffffffffff4e8

    ctstat: #pf Page fault
    Bad kernel fault at addr=0xfffffffffffff4e8
    pid=497, pc=0xfffffffffb9d26f5, sp=0xfffffe80005ccc70, eflags=0x10282
    cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f0<xmme,fxsr,pge,mce,pae,pse>
    cr2: fffffffffffff4e8 cr3: f7fd000 cr8: c
	    rdi: ffffffff874823c0 rsi: ffffffff81d6da60 rdx:                0
	    rcx: fffffffffffff438  r8: fffffe80005ccd30  r9:           100000
	    rax: fffffffffffff438 rbx:                6 rbp: fffffe80005cccf0
	    r10: fffffffffbc64860 r11:                0 r12: ffffffff87482200
	    r13: ffffffff874823b0 r14: ffffffff874822b8 r15:                d
	    fsb: ffffffff80000000 gsb: fffffffffbc25460  ds:               43
	     es:               43  fs:                0  gs:              1c3
	    trp:                e err:                0 rip: fffffffffb9d26f5
	     cs:               28 rfl:            10282 rsp: fffffe80005ccc70
	     ss:               30

    fffffe80005cca90 unix:real_mode_end+7051 ()
    fffffe80005ccb70 unix:trap+d86 ()
    fffffe80005ccb80 unix:cmntrap+13f ()
    fffffe80005cccf0 genunix:contract_process_status+155 ()
    fffffe80005ccdb0 ctfs:ctfs_stat_ioctl+10e ()
    fffffe80005ccde0 genunix:fop_ioctl+25 ()
    fffffe80005ccec0 genunix:ioctl+ac ()
    fffffe80005ccf10 unix:brand_sys_syscall32+1a3 ()

    syncing file systems... done
    dumping to /dev/md/dsk/d10, offset 1719074816, content: kernel
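
Reading the register dump above, the faulting address (cr2) sits a
small fixed distance past the value in rax/rcx, and rax is a small
negative number when taken as a signed 64-bit value.  The arithmetic
is sketched below.  This is only my reading of the dump: it would be
consistent with contract_process_status following a bad pointer and
reading a field at that offset, but it does not prove it.

```python
# Values copied from the panic output above.
fault_addr = 0xfffffffffffff4e8   # cr2 / "Bad kernel fault at addr"
rax = 0xfffffffffffff438          # rcx holds the same value

def as_signed64(value):
    """Interpret an unsigned 64-bit register value as two's complement."""
    return value - (1 << 64) if value & (1 << 63) else value

offset = fault_addr - rax
print(hex(offset), as_signed64(rax))
```

This gives an offset of 0xb0 from a "pointer" of -3016, which is
clearly not a valid kernel address.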

This problem seems similar to the one that's described here:

    http://www.openafs.org/pipermail/openafs-info/2005-October/019765.html

We're running the prebuilt 1.4.5 Solaris10/x86 binaries from
openafs.org, although we put a locally built one in as a test and
it displayed the same symptoms.

Any help appreciated.

Regards,

Robert.