[OpenAFS] Solaris 10 11/06 afs 1.4.2 pam module panic.

Marcus Watts mdw@umich.edu
Mon, 18 Dec 2006 23:11:59 -0500

> > Some more interesting experiments.
> > How about:
> > 	pagsh		setpag
> Just running pagsh was enough to panic the system.

Great!  pagsh does almost nothing.  It's also the common factor
for everything else that you tried, and it does something
"tricky" that is inherently slightly unsafe.  It might not be
your only problem, but it's clearly something worth fixing.


> panic[cpu0]/thread=300016fe9a0: BAD TRAP: type=34 rp=2a100a958b0 addr=33 
> mmu_fsr=0
> sudo: alignment error:
> addr=0x33
> pid=577, pc=0x10b3cb0, sp=0x2a100a95151, tstate=0x80001602, context=0x477
> g1-g7: 33, 33, 0, 198, 0, 0, 300016fe9a0
> 000002a100a955d0 unix:die+9c (34, 2a100a958b0, 33, 0, 2a100a95690, c1e00000)
>    %l0-3: 00000000c0800000 0000000000000034 0000000000000000 00000300000715f0
>    %l4-7: 0000030000071640 000000000000000d 0000000000000001 0000000001076000
> 000002a100a956b0 unix:trap+690 (2a100a958b0, 10009, 0, 80000b, 0, 300016fe9a0)
>    %l0-3: 0000000000000000 00000600006c57e0 0000000000000034 000006000514ae20
>    %l4-7: 0000000000000000 0000000000000000 000000000000f000 0000000000010200
> 000002a100a95800 unix:ktl0+48 (60003ebbda0, 0, 242, 33, 33, 3)
>    %l0-3: 0000000000000003 0000000000001400 0000000080001602 000000000101aa04
>    %l4-7: 0000060000830000 0000000000000cc0 0000000000000000 000002a100a958b0
> 000002a100a95950 genunix:getproc+11c (2a100a95ad8, 0, 600006c57e0, 60005035bc0, 
> 600006c57e0, 1837400)
>    %l0-3: 0000060003ebbda0 00000000018a5c00 0000000000000000 ffffffffffffffff
>    %l4-7: 0000060005035bd8 0000060005035fd0 0000000000000242 0000000000000000
> 000002a100a95a00 genunix:cfork+94 (0, 1, 0, 1, 600006c57e0, 0)
>    %l0-3: 0000000000000000 0000000000000000 00000000b8680000 000000000000b868
>    %l4-7: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
> panic: entering debugger (continue to save dump)
> Welcome to kmdb
> kmdb: unable to determine terminal type: assuming `vt100'
> Loaded modules: [ zfs ]
> [0]>
> Any suggestions on what to look for?

Umm.  Here's where my solaris 10 knowledge runs a bit thin.
I'm not sure what "rp=" is, and I don't see anything here that looks
like the credentials structure that setpag should be using.
Actually, it looks rather like the setpag stuff came & went, and
trashed something quite incidental to itself that broke something
random later.  If it repeats deterministically reboot after reboot
(ie, same traceback, addresses, etc.,) that should make the
approach below doable.  Painful, but doable.
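
One rough way to check "same traceback, addresses, etc." across reboots
is to save each panic message as text and compare the frame lines.  This
is only a sketch: the function names are made up for illustration, and it
assumes frame lines look like the ones quoted above (16 hex digits, then
module:function+offset, then an open parenthesis).

```python
import re

# Matches stack-frame lines such as
#   "000002a100a955d0 unix:die+9c (34, ...)"
# optionally prefixed by mail quoting ("> ").  Register lines
# ("%l0-3: ...") do not match and are skipped.
FRAME_RE = re.compile(
    r"^(?:>\s*)*([0-9a-f]{16})\s+(\w+:\w+\+[0-9a-f]+)\s+\("
)


def frames(panic_text):
    """Return a list of (frame_address, module:function+offset) tuples."""
    found = []
    for line in panic_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            found.append((match.group(1), match.group(2)))
    return found


def same_traceback(first, second):
    """True if both panic messages contain identical, non-empty frame lists."""
    a, b = frames(first), frames(second)
    return bool(a) and a == b
```

If same_traceback() holds for panics from several reboots, the failure
is probably repeatable enough to chase with breakpoints.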

So, the right approach would be to set breakpoints in the
afs kernel module at afs_HandlePioctl, afs_setpag, setpag, etc.,
and track the code down through to where things go wrong.

Failing that, well, the first thing I can think of is to dump memory
000002a100a955d0 .. 000002a100a95a00
The next thing would be to find the trap frame - that's probably the
parameters to "unix:trap+690".
Dump that if it's not in the stack dumped already.
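
For sizing that dump, and for checking whether an address already falls
inside it, the arithmetic can be sketched as below.  The helper names are
hypothetical, the 8-byte word size assumes a 64-bit sparc kernel, and the
end address is treated as exclusive (so the last frame itself would need
a few more words).

```python
WORD = 8  # bytes per word; assumes a 64-bit kernel


def dump_span(start_hex, end_hex, word=WORD):
    """Return (byte_count, word_count) for the range [start, end)."""
    start, end = int(start_hex, 16), int(end_hex, 16)
    if end < start:
        raise ValueError("end address precedes start address")
    nbytes = end - start
    return nbytes, -(-nbytes // word)  # ceiling division


def in_span(addr_hex, start_hex, end_hex):
    """True if addr lies inside the range [start, end)."""
    return int(start_hex, 16) <= int(addr_hex, 16) < int(end_hex, 16)


START, END = "000002a100a955d0", "000002a100a95a00"
nbytes, nwords = dump_span(START, END)
# First argument shown for unix:trap+690 in the quoted trace.
trap_arg_inside = in_span("2a100a958b0", START, END)
```

With the addresses from the quoted trace this comes to 1072 bytes (134
words), and the first argument shown for unix:trap+690, 2a100a958b0,
lies inside the range, so if that is the trap frame it starts within
the dumped stack.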

Next thing after that would be to start examining bits of the thread
and process structures that are in the system for sshd.  The key
thing to look for would be the creds structure, but I'm not sure
what you would tell kmdb for this.
curthread might be in %g7 - and ttoproc is probably a simple
structure offset.  In any case, everything that ttoproc(curthread)
points to is of interest.  kmdb might have its own perfectly simple
ways to look at the current per-process & per-thread variables; if
so, dump them.

kmdb may have a way to validate the slab memory allocator or
whatever solaris 10 uses in its place.  If so, that might be
of interest.

I apparently have just gotten access to a local sun-10/sparc64 box,
which, if it works, may give me more knowledge of what's going on.
I've also just downloaded opensolaris.org's "onnv", which may
contain kernel source (so far, I've found their site somewhat baffling.)
I'll send you more if I learn more.

				-Marcus Watts