[OpenAFS-port-darwin] Different kernel panic that's probably AFS-related

Mattias Amnefelt mattiasa@kth.se
Fri, 19 Mar 2004 17:20:46 +0100


On Tue, 16 Mar 2004 10:23:36 +0100
Sebastian Hagedorn <Hagedorn@uni-koeln.de> wrote:

> Hi,
> 
> I just had a different kernel panic (not vnreclaim) on my G4 Dual 1.25
> Ghz running 10.3.2.

Hello!

This crash occurs quite frequently for us as well. I have done some
investigation into it, and thought I'd share what I found before I
forget :)

We have seen that crashes are much more frequent when the Finder or
other Carbon applications are involved.

One way to trigger the problem is to rename files in the Finder. After
I have renamed a few files in the Finder and then cause those vnodes
to be reused (for instance by running find / -type f -xdev), I am
likely to get a crash. This is normally the vnreclaim crash. I can
also trigger a crash by doing the above and then trying to rename a
file again in the Finder; this usually causes the Data access crash.
It's not a fool-proof recipe, but it's something.

Just renaming files in a shell does not seem to cause a crash.

When Carbon applications rename a file, they first rename the file
itself, then try to rename the resource fork (which is called
._filename). In my test cases this resource fork doesn't exist, so the
syscall fails after the first call to vop_lookup. They then try to
remove the target resource fork by calling the special delete syscall.
This is basically a normal unlink, but with changed semantics: delete
doesn't unlink a file if it is busy, but returns an error instead.
This call also fails after a vop_lookup.
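
In syscall terms, the sequence looks something like this (just a
sketch: the paths are invented, and I'm assuming SYS_delete is
exported through sys/syscall.h):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int
    main(void)
    {
        /* 1: rename the data fork; this succeeds */
        if (rename("dir/foo", "dir/bar") < 0)
            perror("rename");

        /* 2: rename the resource fork; dir/._foo usually doesn't
         * exist, so this fails after one vop_lookup */
        if (rename("dir/._foo", "dir/._bar") < 0)
            perror("rename ._");

        /* 3: clear any stale target fork with the Carbon-semantics
         * delete syscall (an unlink that returns an error instead of
         * removing a busy file); this also fails after a vop_lookup */
        if (syscall(SYS_delete, "dir/._bar") < 0)
            perror("delete ._");

        return 0;
    }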

Here is a typical data access crash for us:

Kernel version:
Darwin Kernel Version 7.2.0:
Thu Dec 11 16:20:23 PST 2003; root:xnu/xnu-517.3.7.obj~1/RELEASE_PPC


panic(cpu 0): 0x300 - Data access
Latest stack backtrace for cpu 0:
      Backtrace:
         0x000833B8 0x0008389C 0x0001ED8C 0x000908C0 0x00093B8C 
Proceeding back via exception chain:
   Exception state (sv=0x1FAC7A00)
      PC=0x002239F4; MSR=0x00009030; DAR=0xDEADD1F5; DSISR=0x40000000; LR=0x000C06AC; R1=0x12203CD0; XCP=0x0000000C (0x300 - Data access)
      Backtrace:
         0x000C11F0 0x000C8DE4 0x0023DD24 0x00093D20 0x00000000 
   Exception state (sv=0x1FAA8A00)
      PC=0x9002E18C; MSR=0x0200F030; DAR=0x1C0E9008; DSISR=0x40000000; LR=0x90323F08; R1=0xBFFFBBC0; XCP=0x00000030 (0xC00 - System call)


Data access is essentially a segmentation fault in the kernel; in
other words, a faulty memory reference.

10.3.2 runs the kernel xnu-517.3.7, the source of which can be
obtained from http://www.opensource.apple.com/darwinsource/10.3.2/

By starting gdb on /mach_kernel it is possible to resolve the symbols in
the backtrace.
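
All of the x/i commands below are run from a session started like this:

    $ gdb /mach_kernel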

The first backtrace is the exception handler's own stack, which is
quite uninteresting: it's just the panic machinery.

(gdb) x/i 0x00093B8C 
0x93b8c <thandler+204>: lis     r10,512
(gdb) x/i 0x000908C0
0x908c0 <trap+1224>:    b       0x90e1c <trap+2596>
(gdb) x/i 0x0001ED8C
0x1ed8c <panic+472>:    lwz     r0,556(r29)
(gdb) x/i 0x0008389C
0x8389c <Debugger+108>: lwz     r0,552(r31)
(gdb) x/i 0x000833B8
0x833b8 <print_backtrace+176>:  bne+    cr4,0x83488 <print_backtrace+384>

The second element in the exception chain is more interesting, as it
shows that the crash occurs in vrele() called from rename().

(gdb) x/i 0x00093D20
0x93d20 <shandler+256>: li      r3,7
(gdb) x/i 0x0023DD24
0x23dd24 <unix_syscall+580>:    lwz     r0,16412(r27)
(gdb) x/i 0x000C8DE4
0xc8de4 <rename+1580>:  cmpwi   cr7,r3,0
(gdb) x/i 0x000C11F0
0xc11f0 <vrele+24>:     mr      r30,r3

The location of the instruction causing the actual crash is in the PC
register.

(gdb) x/i 0x002239F4 
0x2239f4 <ubc_issetflags+44>:   lwz     r0,8(r2)

The DAR register in the exception state holds the address whose access
caused the exception, in this case 0xDEADD1F5.

In sys/ubc.h, UBC_NOINFO is defined to be the address 0xDEADD1ED (as a
struct ubc_info *).

(gdb) p/x 0xDEADD1F5-0xDEADD1ED
$3 = 0x8

In other words, the crash is a dereference 8 bytes past the start of a
dead struct ubc_info *. This is consistent with the faulting
instruction, lwz r0,8(r2): a load from 8 bytes past the address held
in r2.
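
The offset fits the structure layout, too. As far as I remember,
struct ubc_info in bsd/sys/ubc.h begins roughly like this (on 32-bit
PowerPC the first two fields are 4-byte pointers, which puts ui_flags
at offset 8):

    struct ubc_info {
        memory_object_t         ui_pager;    /* offset 0: the pager  */
        memory_object_control_t ui_control;  /* offset 4: VM control */
        long                    ui_flags;    /* offset 8: the access
                                              * that faults          */
        /* ... */
    };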

Turning our attention to the source, we see that ubc_issetflags() (in
bsd/kern/ubc_subr.c) takes a vnode *, fetches its v_ubcinfo pointer
and dereferences it to read ui_flags from the ubc_info.
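
Schematically, it does something like this (my paraphrase, not the
verbatim xnu code):

    /* paraphrase of ubc_issetflags() from bsd/kern/ubc_subr.c */
    int
    ubc_issetflags(struct vnode *vp, int flags)
    {
        struct ubc_info *uip = vp->v_ubcinfo;

        /* nothing here guards against v_ubcinfo being UBC_NOINFO
         * (0xDEADD1ED); if it is, the load below faults at
         * 0xDEADD1F5 */
        return ((uip->ui_flags & flags) == flags);
    }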

I am a bit unsure about the semantics of the ubc structure in a vnode,
but it seems to me that we are doing something wrong :)

The comment at the beginning of afs_vop_open() in
src/afs/DARWIN/osi_vnodeops.c in the openafs sourcetree:

    /*----------------------------------------------------------------
     * osi_VM_TryReclaim() removes the ubcinfo of a vnode, but that vnode
     * can later be passed to vn_open(), which will skip the call to
     * ubc_hold(), and when the ubcinfo is later added, the ui_refcount
     * will be off.  So we compensate by calling ubc_hold() ourselves
     * when ui_refcount is less than 2.  If an error occurs in afs_open()
     * we must call ubc_rele(), which is what vn_open() would do if it
     * was able to call ubc_hold() in the first place.
     *----------------------------------------------------------------*/

feels a bit related. There's a lot of ubc magic around here which I
don't understand.
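
In code, the compensation that comment describes would look roughly
like this (a sketch of my reading of it, not the actual OpenAFS code;
the afs_open() arguments are made up):

    /* compensate for the ubc_hold() that vn_open() skipped */
    int held = 0;

    if (UBCINFOEXISTS(vp) && vp->v_ubcinfo->ui_refcount < 2) {
        ubc_hold(vp);
        held = 1;
    }
    error = afs_open(&vp, mode, cred);
    if (error && held)
        ubc_rele(vp);    /* what vn_open() would have done */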

/mattiasa