[OpenAFS] Re: 1.4.12fc1 kernel panics
Simon Wilkinson
sxw@inf.ed.ac.uk
Sun, 17 Jan 2010 00:28:15 +0000
On 16 Jan 2010, at 19:28, Adam Megacz wrote:
>=20
> Derrick Brashear <shadow@gmail.com> writes:
>> Instead, give it some arguments, like -i
>> /Library/Logs/DiagnosticReports/(path to kernel report).
>=20
> Ok,
Thanks.
So, as I said earlier, this is the bug that Derrick tracked down, and I =
fixed earlier today. I thought it might be of use to say how we went =
about doing this, given the information in the following ...
>=20
> Panic Date: Interval Since Last Panic Report: 472905 sec
> Kernel Version: Darwin Kernel Version 10.2.0: Tue Nov 3 10:37:10 PST
> 2009; root:xnu-1486.2.11~1/RELEASE_I386
> OpenAFS Version: org.openafs.filesystems.afs(1.4.12fc1)
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> add symbol table from file
> "/tmp/afsdebugLAjeJl/org.openafs.filesystems.afs.sym"? 0x21b2bd
> <panic+445>: mov 0x8011d0,%eax
> 0x2a7ac2 <kernel_trap+1530>: jmp 0x2a7ade <kernel_trap+1558>
> 0x29d968 <lo_alltraps+712>: mov %edi,%esp
> 0x4607e500 <afs_GetDCache+7832>: mov 0x64(%edx),%ebx
So, this is the point where the panic actually occurred (all of the =
stuff above is related to the trap mechanism). If we look at your =
earlier post - the raw panic log, we see:
CR0: 0x8001003b, CR2: 0x00000064, CR3: 0x00101000, CR4: 0x000006e0
EAX: 0x00100000, EBX: 0x00000000, ECX: 0x460870e2, EDX: 0x00000000
CR2: 0x00000064, EBP: 0x34cabf1c, ESI: 0x0bff4004, EDI: 0xffffffff
EFL: 0x00010297, EIP: 0x4607e500, CS: 0x00000004, DS: 0x0000000c
So, at the time of the panic, EDX was probably NULL. So, the bug is a =
NULL pointer dereference of some kind, and we're looking up something =
that's 0x64 bytes into the structure that's pointed at by EDX.
If we had a kernel with debugging symbols, then we could look directly =
at the C source which corresponds to afs_GetDCache+7832. Unfortunately, =
we didn't have one to hand, so Derrick improvised. By disassembling the =
current kernel module, we can see exactly what code occurs at that =
location...
gdb --arch=3Di386 /var/db/openafs/etc/afs.kext/Contents/MacOS/afs
(gdb) disassemble afs_GetDCache
[ ... ]
0x00011500 <afs_GetDCache+7832>: mov 0x64(%edx),%ebx
0x00011503 <afs_GetDCache+7835>: call 0x0 <afs_atomlist_create>
0x00011508 <afs_GetDCache+7840>: cmp 0x0,%eax
0x0001150e <afs_GetDCache+7846>: je 0x1152c =
<afs_GetDCache+7876>
0x00011510 <afs_GetDCache+7848>: movl $0x8aa,0x8(%esp)
0x00011518 <afs_GetDCache+7856>: movl $0x8e438,0x4(%esp)
0x00011520 <afs_GetDCache+7864>: movl $0x8e080,(%esp)
0x00011527 <afs_GetDCache+7871>: call 0x642f8 <osi_AssertFailK>
0x0001152c <afs_GetDCache+7876>: movl $0x0,0x0
0x00011536 <afs_GetDCache+7886>: mov 0x0,%eax
0x0001153b <afs_GetDCache+7891>: mov %eax,(%esp)
0x0001153e <afs_GetDCache+7894>: call 0x0 <afs_atomlist_create>
0x00011543 <afs_GetDCache+7899>: mov %ebx,0x4(%esp)
0x00011547 <afs_GetDCache+7903>: mov -0xb8(%ebp),%ecx
0x0001154d <afs_GetDCache+7909>: mov %ecx,(%esp)
0x00011550 <afs_GetDCache+7912>: call 0x4df78 <rx_EndCall>
[ ... ]
(The call 0x0 is because we're debugging a module that isn't loaded. You =
can get a dump that fills in the blanks for these by doing the =
following. When kextutil asks for the load address, give it the one that =
the panic log said the OpenAFS module was loaded at, in this case =
0x4606c000
mkdir /tmp/symbols
kextutil -n -s /tmp/symbols /var/db/openafs/etc/afs.kext
cp -R /var/db/openafs/etc/afs.kext /tmp/symbols
gdb --arch=3Di386 /mach_kernel
(gdb) add_kext /tmp/symbols/afs.kext
(gdb) disassemble afs_GetDCache
)
We're relatively lucky here, as near the panic location is a call to =
rx_EndCall(), the last of 2 that occur in that GetDCache. Lining this up =
with the corresponding source code, the last call to rx_EndCall occurs =
at line 2219 of afs_dcache.c, which looks something like:
if (length > size) {
/* The fileserver told us it is going to send more data
* than we requested. It shouldn't do that, and
* accepting that much data can make us take up more
* cache space than we're supposed to, so error. */
code =3D rx_Error(tcall);
RX_AFS_GUNLOCK();
code1 =3D rx_EndCall(tcall, code);
RX_AFS_GLOCK();
tcall =3D (struct rx_call *)0;
code =3D EIO;
}
So, assuming that the compiler hasn't reordered our code, we've got =
rx_Error(), followed by RX_GUNLOCK. It isn't possible for the compiler =
to reorder over a lock, so it's pretty likely that the ordering is as =
shown. Both of these are macros masquerading as functions. Looking at =
the simpler, rx_Error(), we have:
#define rx_Error(call) ((call)->error)
Now, if you recall, we panic'd because we were looking up something that =
has an offset of
0x64 from the base of the structure. We can verify that 'error' is =
stored 0x64 bytes into the rx_call structure by visually examining the =
structure definition or, if you've got a build tree from the same =
architecture, by running:
gdb src/libafs/MODLOAD/libafs.nonfs.o
(gdb) print &((struct rx_call*)0)->error
$2 =3D (afs_int32 *) 0x64
All of this points pretty conclusively at a NULL value for 'tcall'. =
Examining the code, there is a situation where that can occur, if we get =
a bogus length from the network, and as a result double free the rx =
call. It seems highly likely that this was your bug.
Hopefully all of that is vaguely of use to folk wanting to know how =
these things fit together - and a huge thanks to Derrick for all of the =
information about how he did the disassembly steps.
Cheers,
Simon.