[OpenAFS-port-darwin] Kernel panic from bug 41550 reproduced

Jonas Maebe jonas.maebe@elis.ugent.be
Sat, 13 Oct 2007 18:54:43 +0200


Hello,

I've discovered a use-case with which I can fairly reliably reproduce
the kernel panic described in
<http://rt.central.org/rt/Ticket/Display.html?id=41550>

What I don't understand is how to add new comments to that bug report  
(is it only possible by sending a mail with a specially formatted  
subject line to openafs-bugs@openafs.org or so?), but I guess that's  
just me. Anyway:

I'm using a dual G5/1.8GHz with 10.4.10 and 3GB ram.

At first I had OpenAFS 1.4.2 still installed, and got this kernel panic:

*********

Sat Oct 13 17:26:35 2007
panic(cpu 1 caller 0x000E8E00): remove_fsref: no named reference
Latest stack backtrace for cpu 1:
       Backtrace:
          0x000952D8 0x000957F0 0x00026898 0x000E8E00 0x6F12D150 0x6F127C88 0x000FB660 0x000E2424
          0x000E1FB8 0x000EEC88 0x000EEEEC 0x000EEF8C 0x002AB548 0x000ABB30 0x00000000
       Kernel loadable modules in backtrace (with dependencies):
          org.openafs.filesystems.afs(1.4.2)@0x6f089000
Proceeding back via exception chain:
    Exception state (sv=0x5A239000)
       PC=0x900062AC; MSR=0x0200F930; DAR=0xE1285000; DSISR=0x42000000; LR=0x00044118; R1=0xBFFFCC50; XCP=0x00000030 (0xC00 - System call)

Kernel version:
Darwin Kernel Version 8.10.0: Wed May 23 16:50:59 PDT 2007; root:xnu-792.21.3~1/RELEASE_PPC
*********

After googling for "remove_fsref: no named reference" I found
<http://www.nabble.com/OpenAFS-1.4.2-crashing-on-Intel-Macs-(10.4)-t3137912.html>
and from there got the link to the bug report
mentioned above. The last comment to that bug report mentions a  
commit of a possible fix. I checked CVS and it seems this commit  
should be in 1.4.5-pre1, so I downloaded and installed that one. I  
still get the kernel panic though:

*********

Sat Oct 13 17:44:17 2007
panic(cpu 0 caller 0x000E8E00): remove_fsref: no named reference
Latest stack backtrace for cpu 0:
       Backtrace:
          0x000952D8 0x000957F0 0x00026898 0x000E8E00 0x7175E2D4 0x71758D64 0x000FB660 0x000E2424
          0x000E1FB8 0x000EEC88 0x000EEEEC 0x000EEF8C 0x002AB548 0x000ABB30 0x636E746C
       Kernel loadable modules in backtrace (with dependencies):
          org.openafs.filesystems.afs(1.4.5fc1)@0x716ba000
Proceeding back via exception chain:
    Exception state (sv=0x719C4C80)
       PC=0x900062AC; MSR=0x0000F930; DAR=0xE12B4000; DSISR=0x42000000; LR=0x00044118; R1=0xBFFFCC70; XCP=0x00000030 (0xC00 - System call)

Kernel version:
Darwin Kernel Version 8.10.0: Wed May 23 16:50:59 PDT 2007; root:xnu-792.21.3~1/RELEASE_PP
*********

Here's the symbolised version of the 1.4.5fc1 backtrace:

(gdb) x/i 0x000E8E00
0xe8e00 <vnode_removefsref+48>: lhz     r0,44(r31)
(gdb) x/i 0x7175E2D4
0x7175e2d4 <afs_darwin_finalizevnode+976>:      bl      0x7175e530 <afs_darwin_finalizevnode+1580>
(gdb) x/i 0x71758D64
0x71758d64 <afs_vop_lookup+844>:        mr      r0,r3
(gdb) x/i 0x000FB660
0xfb660 <VNOP_LOOKUP+144>:      mr      r30,r3
(gdb) x/i 0x000E2424
0xe2424 <lookup+500>:   mr.     r28,r3
(gdb) x/i 0x000E1FB8
0xe1fb8 <namei+588>:    mr.     r30,r3
(gdb) x/i 0x000EEC88
0xeec88 <access+300>:   mr.     r29,r3
(gdb) x/i 0x000EEEEC
0xeeeec <access+912>:   lwz     r0,488(r1)
(gdb) x/i 0x000EEF8C
0xeef8c <stat+52>:      lwz     r0,88(r1)
(gdb) x/i 0x002AB548
0x2ab548 <unix_syscall+756>:    lwz     r0,20508(r29)
(gdb) x/i 0x000ABB30
0xabb30 <shandler+272>: li      r3,7
(the last address, 0x636E746C, appears to be bogus)
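
For longer backtraces, the x/i commands above can be generated rather than typed by hand. The following is only a minimal sketch in POSIX shell: the function name backtrace_to_gdb is made up for illustration, it merely prints the commands, and it assumes gdb has already been given the kernel and kext symbols in the same way as for the session above.

```shell
# Hypothetical helper: print one gdb "x/i" command per backtrace address.
# Arguments that do not look like hex addresses (no 0x prefix) are skipped.
backtrace_to_gdb() {
    for addr in "$@"; do
        case $addr in
            0x*) printf 'x/i %s\n' "$addr" ;;
        esac
    done
}
```

Called with the addresses of a backtrace line as arguments, it prints one x/i line per address. The output can be pasted into a gdb session, or saved to a file and read with gdb's source command.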

Now, how I can reproduce the panic: by compiling the run-time library
of the Free Pascal Compiler (fpc) with make -j 2, starting with the
latest unstable version of the compiler (I haven't tried starting with
the latest stable version, but that one won't work very well with AFS
anyway because it had problems with case-sensitive file systems under
Mac OS X), with the sources located on a (remote) AFS volume.

One possibly interesting thing to note: fpc uses internal directory  
caching, i.e., the first time it looks for a file in a directory it  
immediately goes through all files and directories in that directory,  
adds their names to an internal hashtable, and uses that table from  
then on. It performs this directory caching using
opendir/readdir/closedir. So if two instances of the compiler are running
simultaneously (it's an smp machine), you can get various kinds of  
interleaving of opendir/readdir/closedir on the same directory from  
the different compiler processes.
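
To illustrate the access pattern described above without involving the compiler, here is a rough sketch in POSIX shell. It is not known to trigger the panic, and the function names scan_dir and parallel_scan are invented for this example. It assumes that the shell expands the glob by reading the directory (normally through opendir/readdir/closedir) and that the -e test performs a stat-like lookup on each name.

```shell
# Hypothetical helper: enumerate a directory once and look up every entry,
# loosely imitating fpc's "read the whole directory up front" caching.
# Prints the number of (non-hidden) entries found.
scan_dir() {
    scan_target=$1
    scan_count=0
    # Expanding the glob makes the shell read the directory.
    for scan_path in "$scan_target"/*; do
        # -e forces a lookup of the name; an unmatched glob is skipped here.
        [ -e "$scan_path" ] || continue
        scan_count=$((scan_count + 1))
    done
    echo "$scan_count"
}

# Hypothetical helper: start several background workers that each scan the
# same directory repeatedly, so their directory reads can interleave.
parallel_scan() {
    par_target=$1
    par_workers=$2
    par_rounds=$3
    par_i=0
    while [ "$par_i" -lt "$par_workers" ]; do
        (
            par_r=0
            while [ "$par_r" -lt "$par_rounds" ]; do
                scan_dir "$par_target" >/dev/null
                par_r=$((par_r + 1))
            done
        ) &
        par_i=$((par_i + 1))
    done
    # Block until all background workers have finished.
    wait
}
```

For example, "parallel_scan DIR 2 100" runs two workers that each scan DIR one hundred times, where DIR stands for a directory on an AFS volume.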

If you want to try it on your own system (note: the following  
sequence is *untested*, and requires that svn is installed), do the  
following *on an AFS volume*:

mkdir fpc
cd fpc
svn co -r 8765 http://svn.freepascal.org/svn/fpc/trunk/rtl rtl
curl -O http://www.elis.ugent.be/~jmaebe/ppcppc3.tbz
tar xjf ppcppc3.tbz
cd rtl/darwin
make FPC=`pwd`/../../ppcppc3 clean
make FPC=`pwd`/../../ppcppc3 OPT="-ap -XP" all -j 2

(don't do a single "make clean all -j 2", as the Makefile doesn't  
specify ordering for the clean and all targets)

When it panics, it does so for me fairly early on (the system unit is  
compiled using a single process as everything else depends on it, but  
from then on things start in parallel). It seems to happen more often  
the second time you do this (repeat the make clean and make all lines  
if it didn't panic), i.e., when some things have already been cached.
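
Repeating the two make lines can be scripted. The following is a minimal sketch in POSIX shell, assuming it is run from rtl/darwin inside the checkout described above; the function names build_once and repeat_rounds are invented for this example, and $(pwd) is used in place of the backticks.

```shell
# One clean followed by one parallel build, using the make lines from the
# steps above. Must be run from rtl/darwin inside the checkout.
build_once() {
    make FPC="$(pwd)/../../ppcppc3" clean &&
    make FPC="$(pwd)/../../ppcppc3" OPT="-ap -XP" all -j 2
}

# Hypothetical helper: run the given command a fixed number of times,
# printing the round number and stopping at the first failure.
repeat_rounds() {
    rounds=$1
    shift
    n=1
    while [ "$n" -le "$rounds" ]; do
        echo "round $n of $rounds"
        "$@" || return 1
        n=$((n + 1))
    done
}
```

For example, "repeat_rounds 5 build_once" does up to five clean-and-build rounds. Keeping clean and all as separate make invocations preserves the ordering that a single "make clean all -j 2" would not guarantee.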

Note that the supplied ppcppc3 is a PowerPC binary. I can also  
provide an i386 binary if required (I still don't have an Intel Mac  
myself, but I do have remote access to one).

I hope this helps in tracking down the problem.


Jonas