[OpenAFS] 1.4.4 client on EL3: panic in afs_HashOutDcache

Stephan Wiesand Stephan.Wiesand@desy.de
Thu, 12 Apr 2007 09:30:59 +0200 (CEST)


On Wed, 11 Apr 2007, Derrick J Brashear wrote:

> On Wed, 11 Apr 2007, Stephan Wiesand wrote:
>
>> One of our systems panicked two times within 2 hours yesterday, at the same 
>> location in the OpenAFS client. I attached the kernel's last words below.
>> 
>> This is an SL3 system, kernel 2.4.21-47.0.1.ELsmp, i686. The client build 
>> has two patches on top of 1.4.4: linux-task-pointer-safety-20070320 from 
>> CVS, and the one from
>> https://lists.openafs.org/pipermail/openafs-devel/2007-March/014985.html
>
> afs_HashOutDCache has
>    /* if this guy is in the hash table, pull him out */
>    if (adc->f.fid.Fid.Volume != 0) {
>        i = DCHash(&adc->f.fid, adc->f.chunk);
>        us = afs_dchashTbl[i];
>        if (us == adc->index) {
> ..
>       } else {
>            /* somewhere on the chain */
>            while (us != NULLIDX) {
>                if (afs_dcnextTbl[us] == adc->index) {
>                    /* found item pointing at the one to delete */
>                    afs_dcnextTbl[us] = afs_dcnextTbl[adc->index];
>                    break;
>                }
>                us = afs_dcnextTbl[us];
>            }
>            if (us == NULLIDX)
>                osi_Panic("dcache hc");
>
> so basically you appear to have an unhashed dcache entry. Either there's a 
> locking bug or something is becoming erroneously unhashed.
>
> How reproducible is it?

Good news: it is reproducible. The user confessed that he'd run "less than 
20" parallel rsyncs transferring data to our cell. The files are a mixture 
of data and log files, with typical sizes of 15MB and 100kB.

So I set up a dozen rsyncs to copy this data into another volume, and 
after some 9 hours got the panic you find below.

I'm going to repeat this exercise now, and will also try to make the panic 
happen earlier (more rsyncs, read data from a faster source - any other
ideas?).

Assuming it happens again, I'm wondering what the next step should be.
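
To make the failure mode concrete, here is a minimal user-space sketch of 
the chain walk quoted above. It is not the OpenAFS source: the table sizes 
and the names hash_tbl, next_tbl, init_tables, hash_in and hash_out are 
made up for illustration, the head-of-chain branch (elided as ".." in the 
quote) is my assumption, and where the real code calls osi_Panic the sketch 
just returns -1.

```c
#include <assert.h>

#define NULLIDX  (-1)
#define NBUCKETS 4      /* hypothetical sizes, chosen only for this sketch */
#define NENTRIES 8

int hash_tbl[NBUCKETS]; /* head index of each chain (role of afs_dchashTbl) */
int next_tbl[NENTRIES]; /* next index on the chain (role of afs_dcnextTbl) */

/* Mark every chain as empty and every entry as unlinked. */
void init_tables(void)
{
    int i;
    for (i = 0; i < NBUCKETS; i++)
        hash_tbl[i] = NULLIDX;
    for (i = 0; i < NENTRIES; i++)
        next_tbl[i] = NULLIDX;
}

/* Push entry `index` onto the front of chain `bucket`. */
void hash_in(int bucket, int index)
{
    assert(bucket >= 0 && bucket < NBUCKETS);
    assert(index >= 0 && index < NENTRIES);
    next_tbl[index] = hash_tbl[bucket];
    hash_tbl[bucket] = index;
}

/* Unlink entry `index` from chain `bucket`.
 * Returns 0 on success, or -1 in the case where the quoted code
 * would call osi_Panic("dcache hc"). */
int hash_out(int bucket, int index)
{
    int us = hash_tbl[bucket];

    if (us == index) {
        /* first on the chain (assumed; this branch is elided in the quote) */
        hash_tbl[bucket] = next_tbl[index];
        next_tbl[index] = NULLIDX;
        return 0;
    }
    /* somewhere on the chain */
    while (us != NULLIDX) {
        if (next_tbl[us] == index) {
            /* found item pointing at the one to delete */
            next_tbl[us] = next_tbl[index];
            next_tbl[index] = NULLIDX;
            return 0;
        }
        us = next_tbl[us];
    }
    return -1; /* walked the whole chain without finding the entry */
}
```

In this model, hashing the same index out twice, or hashing it out of a 
bucket other than the one it was hashed into, makes the second call return 
-1. That is the situation the panic reports: the entry still claims a 
volume, but it is no longer on the chain its hash selects.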

Thanks for caring,
 	Stephan

PS Here's the Oops:

dcache hc<1>Unable to handle kernel NULL pointer dereference at virtual address 00000000
  printing eip: 
f8a6da50 
*pde = 34669001 
*pte = 5b103067 
Oops: 0002 
panfs nfs lockd sunrpc openafs netconsole 3c59x mii microcode ohci1394 ieee1394 loop keybdev mousedev hid input usb-uhci usbcore ext3 jbd lvm-mod aic7xxx disk 
CPU:    2 
EIP:    0060:[<f8a6da50>]    Tainted: P 
EFLAGS: 00010282

EIP is at osi_Panic [openafs] 0x20 (2.4.21-47.0.1.ELsmp/i686) 
eax: 00000009   ebx: f8b74000   ecx: 00000046   edx: c0388e98 
esi: f8c328c0   edi: 0015fa73   ebp: 0000000d   esp: f5427e04 
ds: 0068   es: 0068   ss: 0068 
Process afs_cachetrim (pid: 987, stackpage=f5427000) 
Stack: f8a9365b 00000002 00000000 f8a46e77 f8c328c0 0015fa73 0000000d f8a2d9ef
        f8a9365b 00000002 00000000 f8a46e77 f8c328c0 d4938380 0015fa73 f8a2d6a8
        f8c328c0 00000000 00000000 0000f2da d0928990 00000000 00000000 4dd6d295 
Call Trace:   [<f8a9365b>] .rodata.str1.1 [openafs] 0x11f (0xf5427e04) 
[<f8a46e77>] shutdown_vcache [openafs] 0x357 (0xf5427e10) 
[<f8a2d9ef>] afs_HashOutDCache [openafs] 0x7f (0xf5427e20) 
[<f8a9365b>] .rodata.str1.1 [openafs] 0x11f (0xf5427e24) 
[<f8a46e77>] shutdown_vcache [openafs] 0x357 (0xf5427e30) 
[<f8a2d6a8>] afs_GetDownD [openafs] 0x528 (0xf5427e40) 
[<f8a2cd2e>] afs_CacheTruncateDaemon [openafs] 0x12e (0xf5427fa0) 
[<f8a7f9f0>] afsd_thread [openafs] 0x3e0 (0xf5427fe0) 
[<f8a7f610>] afsd_thread [openafs] 0x0 (0xf5427fe4) 
[<c01095cd>] kernel_thread_helper [kernel] 0x5 (0xf5427ff0)

Code: c6 05 00 00 00 00 00 83 c4 1c c3 90 8d 74 26 00 b8 4f 42 a9

Kernel panic: Fatal exception


-- 
Stephan Wiesand
   DESY - DV -
   Platanenallee 6
   15738 Zeuthen, Germany