[OpenAFS] 1.4.4 client on EL3: panic in afs_HashOutDcache

Stephan Wiesand wiesand@dv.ifh.de
Wed, 18 Apr 2007 18:26:49 +0200 (CEST)


On Thu, 12 Apr 2007, Derrick J Brashear wrote:

>>> On Wed, 11 Apr 2007, Stephan Wiesand wrote:
>>> 
>>>> One of our systems panicked two times within 2 hours yesterday, at the 
>>>> same location in the OpenAFS client. I attached the kernel's last words 
>>>> below.
[...]
> I'm thinking about a patch. I have something else I need to deal with but I 
> will try to work something up after. There's a 3rd possibility, namely the 
> missing object being mishashed. Presumably, instead of panicking, we can just 
> iterate over everything and dump state.
>
> I suppose the other possibility would be to get a kernel crash dump but it's 
> sort of cumbersome to move those around so unless you're comfortable with a 
> debugger on a kernel dump that's probably a non-starter.

Got one:

# crash /boot/vmlinux-2.4.21-47.0.1.ELsmp vmcore
crash 4.0-2.29
[...]
crash> bt
PID: 1002   TASK: f49f2000  CPU: 2   COMMAND: "afs_cachetrim"
  #0 [f49f3cc8] netconsole_netdump at f8a1d793
  #1 [f49f3cdc] try_crashdump at c0129033
  #2 [f49f3cec] die at c010c6f2
  #3 [f49f3d00] do_page_fault at c0120389
  #4 [f49f3dc4] error_code (via page_fault) at c02b01c0
     EAX: 00000009  EBX: f8b5a000  ECX: 00000046  EDX: c0388e98  EBP: 00000002
     DS:  0068      ESI: f8c2dfa0  ES:  0068      EDI: 0005867a
     CS:  0060      EIP: f8a6da50  ERR: ffffffff  EFLAGS: 00010282
  #5 [f49f3e00] osi_Panic at f8a6da50
  #6 [f49f3e20] afs_HashOutDCache at f8a2d9ea
  #7 [f49f3e40] afs_GetDownD at f8a2d6a3
  #8 [f49f3fa0] afs_CacheTruncateDaemon at f8a2cd29
  #9 [f49f3fe0] afsd_thread at f8a7f9eb
#10 [f49f3ff0] kernel_thread_helper at c01095cb
crash>

Alas, I'm afraid this is the point where I'll need either some guidance or 
a lot of reading and experimenting to get any further.
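
As far as I understand the "mishashed" possibility and the suggestion to iterate over everything instead of panicking, the idea would be roughly the following. This is only a toy model in Python, not the OpenAFS code: the bucket count, the hash function and all names are hypothetical and chosen purely for illustration.

```python
# Toy model only: NOT the OpenAFS data structures. All names are hypothetical.
NBUCKETS = 8


def bucket_of(key):
    """Hypothetical hash function: map an integer key to a bucket index."""
    return key % NBUCKETS


def hash_out(table, key):
    """Remove key from the chain it should be in.

    On a miss, scan every chain and report what was found instead of
    giving up, so "mishashed" can be told apart from "missing entirely".
    """
    expected = bucket_of(key)
    if key in table[expected]:
        table[expected].remove(key)
        return {"found": True, "bucket": expected}

    # Miss in the expected chain: look through all chains.
    for idx, chain in enumerate(table):
        if key in chain:
            # Present, but linked into the wrong chain (mishashed).
            return {"found": False, "expected": expected, "actual": idx}

    # Not present in any chain at all.
    return {"found": False, "expected": expected, "actual": None}
```

A report with "actual" set to some other bucket would point at a mishashed entry, while "actual" being None would mean the entry was never in the table or had already been removed.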

NB:

During my previous attempt to make this happen, I got no panic but 
lots of messages about the cache [partition] being full, and that I should 
reduce the cache. However, the dedicated ext3 filesystem was neither full 
nor out of inodes, and I think the cachesize setting (70% of what's left 
of the filesystem after subtracting 32MB for the journal) is rather 
conservative.
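
To make the arithmetic concrete, here is a minimal sketch of how such a cachesize could be derived and how block and inode usage of the cache filesystem could be checked. It assumes the size is given in 1 KB blocks; the function names and the example path in the usage note are my own and hypothetical, not anything taken from the client's configuration.

```python
import os

JOURNAL_KB = 32 * 1024  # 32MB reserved for the ext3 journal
PERCENT = 70            # share of the remaining space given to the cache


def suggested_cachesize_kb(total_kb, journal_kb=JOURNAL_KB, percent=PERCENT):
    """Return percent% of what is left after subtracting the journal, in KB."""
    return (total_kb - journal_kb) * percent // 100


def fs_usage(path):
    """Report block and inode usage of the filesystem containing path."""
    st = os.statvfs(path)
    return {
        "total_kb": st.f_blocks * st.f_frsize // 1024,
        "free_kb": st.f_bavail * st.f_frsize // 1024,
        "inodes_total": st.f_files,
        "inodes_free": st.f_ffree,
    }
```

For example, suggested_cachesize_kb(fs_usage("/usr/vice/cache")["total_kb"]) would give a candidate value, where the path is only an assumed cache location; comparing free_kb and inodes_free against zero is the check for "neither full nor out of inodes".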

When I tried to restart the client, I experienced what I've seen 
frequently with 1.4.x clients on this platform: "kernel BUG at 
slab.c:892:" when re-inserting the openafs module. This seems to happen 
quite consistently when restarting the client after it has run for some 
time (say, a week).

I have a crashdump from this incident as well. After a reboot, it took 
less than three hours to get the above panic.

I don't think it's a hardware problem, but if it helps I'd be willing to 
try and reproduce this on another system.

- Stephan

-- 
Stephan Wiesand
   DESY - DV -
   Platanenallee 6
   15738 Zeuthen, Germany