[OpenAFS] kernel panics in afs_GetDCache

Simon Wilkinson sxw@inf.ed.ac.uk
Mon, 15 Feb 2010 23:07:00 +0000

> Gah...  My reading skills need help, apparently.  That was
> 2.6.18-128.7.1.el5 with openafs 1.4.11.

I suspect we should probably move this into RT, but I thought recording
the steps taken so far might be of use to others.

I grabbed, and installed, the debug package for this module onto a RHEL5
system. I then grabbed the kmod itself, and extracted it using
    rpm2cpio <rpm> | cpio -i -d

Starting up gdb on the kernel module then lets you do some poking...

First, we want to find where we've stopped. To do that, we need the
base address of the afs_GetDCache function...

(gdb) info line afs_GetDCache
Line 1497 of =
-128.7.1.el5-MP/afs_dcache.c" starts at address 0xb3cf <afs_GetDCache>
   and ends at 0xb3d9 <afs_GetDCache+10>.

Next, we want to find out where the problem we hit actually is...

(gdb) info line *(0xb3cf + 0x1c0a)
Line 2159 of =
   starts at address 0xcfc8 <afs_GetDCache+7161 at =
   and ends at 0xcfe6 <afs_GetDCache+7191 at =
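
The address handed to 'info line' above is just the function's base plus
the offset into it. A minimal sketch of that arithmetic, for anyone who
wants to check it outside gdb (the helper name fault_address is made up
for illustration, and I'm assuming the 0x1c0a offset is the one reported
in the panic):

```python
def fault_address(function_base: int, offset: int) -> int:
    """Return the module-relative address for a symbol base plus offset."""
    return function_base + offset


base = 0xb3cf    # start of afs_GetDCache, from 'info line afs_GetDCache'
offset = 0x1c0a  # offset into afs_GetDCache (assumed to come from the panic)

addr = fault_address(base, offset)
print(hex(addr))  # 0xcfd9
print(offset)     # 7178, i.e. <afs_GetDCache+7178>

# The result should fall inside the range gdb reported for line 2159.
print(0xcfc8 <= addr < 0xcfe6)  # True
```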

Line 2159 of afs_dcache.c (in this version) is:
    if (code == RXGEN_OPCODE || afs_serverHasNo64Bit(tc)) {

afs_serverHasNo64Bit is a macro, which does:
    ((tc)->srvr->server->flags & SNO_64BIT)

So, we want to know whether we're plausibly in the right place. Let's
take a look at the actual machine code we were running...

(gdb) disass 0xcfc8 0xcfe6
Dump of assembler code from 0xcfc8 to 0xcfe6:
0x0000cfc8 <afs_GetDCache+7161>:	cmpl   $0xfffffe39,0x3c(%esp)
0x0000cfd0 <afs_GetDCache+7169>:	je     0xcfe6 =
<afs_GetDCache+7191 at =
0x0000cfd2 <afs_GetDCache+7171>:	mov    0x68(%esp),%ebp
0x0000cfd6 <afs_GetDCache+7175>:	mov    0xc(%ebp),%eax
0x0000cfd9 <afs_GetDCache+7178>:	mov    0x8(%eax),%eax
0x0000cfdc <afs_GetDCache+7181>:	testb  $0x2,0x3d(%eax)
0x0000cfe0 <afs_GetDCache+7185>:	je     0xd0de =
<afs_GetDCache+7439 at =
End of assembler dump.
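
The cmpl at 0xcfc8 compares against an immediate that gdb prints as an
unsigned 32-bit value. A rough sketch for reading it as a signed int,
which is how 'code' would be compared in the C source (as_signed32 is a
hypothetical helper; -455 is the value I'd expect RXGEN_OPCODE to have
in the rx headers, but that is from memory, not from the dump, so check
it against your tree):

```python
def as_signed32(value: int) -> int:
    """Interpret a 32-bit unsigned value as a two's-complement signed int."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


print(as_signed32(0xfffffe39))  # -455
```

If that matches RXGEN_OPCODE, the cmpl/je pair is the 'code ==
RXGEN_OPCODE' half of the condition, and the instructions after it are
the macro.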

We die at 0x0000cfd9. By looking at the structure offsets used by that
macro reference, we can correlate them with the above code. Given that
tc is an afs_conn, we have ...

(gdb) print &((struct afs_conn *)0)->srvr
$1 = (struct srvAddr **) 0xc
(gdb) print &((struct srvAddr *)0)->server
$2 = (struct server **) 0x8

(So srvr is 0xc bytes into the structure pointed at by 'tc', and server
is 0x8 bytes into the structure pointed at by srvr. These match the
offsets used by the mov instructions at 0xcfd6 and 0xcfd9, indicating
that we're looking in the right place.)

Given we fail at 0xcfd9, it looks like for some reason the structure
pointed to by 'tc' contains an invalid value (0x63, if you look at the
contents of EAX in the panic dump) for its srvr element.
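
To make the failure mode concrete, here is a toy model of the two loads
at 0xcfd6 and 0xcfd9. Everything in it is an assumption for
illustration: memory is a dict of mapped 32-bit words, the address
0x1000 for 'tc' is made up, and Fault, load32 and chase are hypothetical
names. Only the offsets and the 0x63 value come from the analysis above.

```python
class Fault(Exception):
    """Raised when the simulated load touches an unmapped address."""


def load32(memory: dict, address: int) -> int:
    """Read a word from the toy memory, faulting if it is not mapped."""
    if address not in memory:
        raise Fault(hex(address))
    return memory[address]


def chase(memory: dict, tc: int) -> int:
    """Follow tc->srvr->server the way the disassembly does."""
    srvr = load32(memory, tc + 0xc)      # mov 0xc(%ebp),%eax
    server = load32(memory, srvr + 0x8)  # mov 0x8(%eax),%eax
    return server


tc = 0x1000                  # made-up address for the afs_conn
memory = {tc + 0xc: 0x63}    # tc->srvr holds the bogus value seen in EAX

try:
    chase(memory, tc)
except Fault as exc:
    print(f"fault at {exc}")  # fault at 0x6b
```

In this model the first load succeeds and the second one faults on
0x63 + 0x8, which is consistent with dying at 0xcfd9 with 0x63 in EAX.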