[OpenAFS] Re: 1.4.12fc1 kernel panics

Simon Wilkinson sxw@inf.ed.ac.uk
Sun, 17 Jan 2010 00:28:15 +0000


On 16 Jan 2010, at 19:28, Adam Megacz wrote:

>=20
> Derrick Brashear <shadow@gmail.com> writes:
>> Instead, give it some arguments, like -i
>> /Library/Logs/DiagnosticReports/(path to kernel report).
>=20
> Ok,

Thanks.

So, as I said earlier, this is the bug that Derrick tracked down, and I =
fixed earlier today. I thought it might be of use to say how we went =
about doing this, given the information in the following ...

>=20
> Panic Date:      Interval Since Last Panic Report:  472905 sec
> Kernel Version:  Darwin Kernel Version 10.2.0: Tue Nov  3 10:37:10 PST
> 2009; root:xnu-1486.2.11~1/RELEASE_I386
> OpenAFS Version: org.openafs.filesystems.afs(1.4.12fc1)
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> add symbol table from file
> "/tmp/afsdebugLAjeJl/org.openafs.filesystems.afs.sym"? 0x21b2bd
> <panic+445>: mov    0x8011d0,%eax
> 0x2a7ac2 <kernel_trap+1530>:    jmp    0x2a7ade <kernel_trap+1558>
> 0x29d968 <lo_alltraps+712>:     mov    %edi,%esp
> 0x4607e500 <afs_GetDCache+7832>:        mov    0x64(%edx),%ebx

So, this is the point where the panic actually occurred (all of the =
stuff above is related to the trap mechanism). If we look at your =
earlier post - the raw panic log, we see:

CR0: 0x8001003b, CR2: 0x00000064, CR3: 0x00101000, CR4: 0x000006e0
EAX: 0x00100000, EBX: 0x00000000, ECX: 0x460870e2, EDX: 0x00000000
CR2: 0x00000064, EBP: 0x34cabf1c, ESI: 0x0bff4004, EDI: 0xffffffff
EFL: 0x00010297, EIP: 0x4607e500, CS:  0x00000004, DS:  0x0000000c

So, at the time of the panic, EDX was probably NULL. So, the bug is a =
NULL pointer dereference of some kind, and we're looking up something =
that's 0x64 bytes into the structure that's pointed at by EDX.

If we had a kernel with debugging symbols, then we could look directly =
at the C source which corresponds to afs_GetDCache+7832. Unfortunately, =
we didn't have one to hand, so Derrick improvised. By disassembling the =
current kernel module, we can see exactly what code occurs at that =
location...

gdb --arch=3Di386 /var/db/openafs/etc/afs.kext/Contents/MacOS/afs
(gdb) disassemble afs_GetDCache
[ ... ]
0x00011500 <afs_GetDCache+7832>:        mov    0x64(%edx),%ebx
0x00011503 <afs_GetDCache+7835>:        call   0x0 <afs_atomlist_create>
0x00011508 <afs_GetDCache+7840>:        cmp    0x0,%eax
0x0001150e <afs_GetDCache+7846>:        je     0x1152c =
<afs_GetDCache+7876>
0x00011510 <afs_GetDCache+7848>:        movl   $0x8aa,0x8(%esp)
0x00011518 <afs_GetDCache+7856>:        movl   $0x8e438,0x4(%esp)
0x00011520 <afs_GetDCache+7864>:        movl   $0x8e080,(%esp)
0x00011527 <afs_GetDCache+7871>:        call   0x642f8 <osi_AssertFailK>
0x0001152c <afs_GetDCache+7876>:        movl   $0x0,0x0
0x00011536 <afs_GetDCache+7886>:        mov    0x0,%eax
0x0001153b <afs_GetDCache+7891>:        mov    %eax,(%esp)
0x0001153e <afs_GetDCache+7894>:        call   0x0 <afs_atomlist_create>
0x00011543 <afs_GetDCache+7899>:        mov    %ebx,0x4(%esp)
0x00011547 <afs_GetDCache+7903>:        mov    -0xb8(%ebp),%ecx
0x0001154d <afs_GetDCache+7909>:        mov    %ecx,(%esp)
0x00011550 <afs_GetDCache+7912>:        call   0x4df78 <rx_EndCall>
[ ... ]

(The call 0x0 is because we're debugging a module that isn't loaded. You =
can get a dump that fills in the blanks for these by doing the =
following. When kextutil asks for the load address, give it the one that =
the panic log said the OpenAFS module was loaded at, in this case =
0x4606c000

   mkdir /tmp/symbols
   kextutil -n -s /tmp/symbols /var/db/openafs/etc/afs.kext
   cp -R /var/db/openafs/etc/afs.kext /tmp/symbols
   gdb --arch=3Di386 /mach_kernel
   (gdb) add_kext /tmp/symbols/afs.kext
   (gdb) disassemble afs_GetDCache
)

We're relatively lucky here, as near the panic location is a call to =
rx_EndCall(), the last of 2 that occur in that GetDCache. Lining this up =
with the corresponding source code, the last call to rx_EndCall occurs =
at line 2219 of afs_dcache.c, which looks something like:

    if (length > size) {
	/* The fileserver told us it is going to send more data
	 * than we requested. It shouldn't do that, and
	 * accepting that much data can make us take up more
	 * cache space than we're supposed to, so error. */
	code =3D rx_Error(tcall);
	RX_AFS_GUNLOCK();
	code1 =3D rx_EndCall(tcall, code);
	RX_AFS_GLOCK();
	tcall =3D (struct rx_call *)0;
	code =3D EIO;
    }

So, assuming that the compiler hasn't reordered our code, we've got =
rx_Error(), followed by RX_GUNLOCK. It isn't possible for the compiler =
to reorder over a lock, so it's pretty likely that the ordering is as =
shown. Both of these are macros masquerading as functions. Looking at =
the simpler, rx_Error(), we have:

    #define rx_Error(call)                  ((call)->error)

Now, if you recall, we panic'd because we were looking up something that =
has an offset of
0x64 from the base of the structure. We can verify that 'error' is =
stored 0x64 bytes into the rx_call structure by visually examining the =
structure definition or, if you've got a build tree from the same =
architecture, by running:

gdb src/libafs/MODLOAD/libafs.nonfs.o
(gdb) print &((struct rx_call*)0)->error
$2 =3D (afs_int32 *) 0x64

All of this points pretty conclusively at a NULL value for 'tcall'. =
Examining the code, there is a situation where that can occur, if we get =
a bogus length from the network, and as a result double free the rx =
call. It seems highly likely that this was your bug.

Hopefully all of that is vaguely of use to folk wanting to know how =
these things fit together - and a huge thanks to Derrick for all of the =
information about how he did the disassembly steps.

Cheers,

Simon.