[OpenAFS] 1.4.4 client on EL3: panic in afs_HashOutDcache
Stephan Wiesand
wiesand@dv.ifh.de
Wed, 18 Apr 2007 18:26:49 +0200 (CEST)
On Thu, 12 Apr 2007, Derrick J Brashear wrote:
>>> On Wed, 11 Apr 2007, Stephan Wiesand wrote:
>>>
>>>> One of our systems panicked two times within 2 hours yesterday, at the
>>>> same location in the OpenAFS client. I attached the kernel's last words
>>>> below.
[...]
> I'm thinking about a patch. I have something else I need to deal with but I
> will try to work something up after. There's a 3rd possibility, namely the
> missing object being mishashed. We can presumably just instead of panicing
> iterate everything and dump state.
>
> I suppose the other possibility would be to get a kernel crash dump but it's
> sort of cumbersome to move those around so unless you're comfortable with a
> debugger on a kernel dump that's probably a non-starter.
Got one:
# crash /boot/vmlinux-2.4.21-47.0.1.ELsmp vmcore
crash 4.0-2.29
[...]
crash> bt
PID: 1002 TASK: f49f2000 CPU: 2 COMMAND: "afs_cachetrim"
#0 [f49f3cc8] netconsole_netdump at f8a1d793
#1 [f49f3cdc] try_crashdump at c0129033
#2 [f49f3cec] die at c010c6f2
#3 [f49f3d00] do_page_fault at c0120389
#4 [f49f3dc4] error_code (via page_fault) at c02b01c0
EAX: 00000009 EBX: f8b5a000 ECX: 00000046 EDX: c0388e98 EBP: 00000002
DS: 0068 ESI: f8c2dfa0 ES: 0068 EDI: 0005867a
CS: 0060 EIP: f8a6da50 ERR: ffffffff EFLAGS: 00010282
#5 [f49f3e00] osi_Panic at f8a6da50
#6 [f49f3e20] afs_HashOutDCache at f8a2d9ea
#7 [f49f3e40] afs_GetDownD at f8a2d6a3
#8 [f49f3fa0] afs_CacheTruncateDaemon at f8a2cd29
#9 [f49f3fe0] afsd_thread at f8a7f9eb
#10 [f49f3ff0] kernel_thread_helper at c01095cb
crash>
Alas, I'm afraid this is the point where I'll need either some guidance or
a lot of reading and experimenting to get any further.
NB:
During my previous attempt to reproduce this, I got no panic but
lots of messages saying the cache [partition] was full and that I should
reduce the cache size. However, the dedicated ext3 filesystem was neither full
nor out of inodes, and I think the cachesize setting (70% of what's left
of the filesystem after subtracting 32MB for the journal) is rather
conservative.
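For reference, the cachesize rule described above can be sketched as follows.
This is only an illustration: the helper name and the 1 GiB partition size are
made up, not the values from the affected system, and it assumes sizes are
counted in 1K blocks.

```python
# Illustrative sketch only: cache_size_kb and the example partition size
# are hypothetical, not taken from the affected system.

JOURNAL_KB = 32 * 1024  # 32 MB set aside for the ext3 journal, in 1K blocks


def cache_size_kb(fs_size_kb, percent=70, journal_kb=JOURNAL_KB):
    """Return `percent` of what is left of the filesystem after the journal.

    All sizes are in 1K blocks (an assumption of this sketch).
    """
    usable_kb = fs_size_kb - journal_kb
    if usable_kb <= 0:
        raise ValueError("filesystem is not larger than the journal")
    # Integer arithmetic avoids floating-point rounding surprises.
    return usable_kb * percent // 100


if __name__ == "__main__":
    # Assumed example: a 1 GiB dedicated cache partition.
    print(cache_size_kb(1024 * 1024))
```

With a rule like this, roughly 30% of the post-journal space stays free, which
is why I would not expect the partition itself to fill up.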
When I tried to restart the client, I experienced what I've seen
frequently with 1.4.x clients on this platform: "kernel BUG at
slab.c:892:" when re-inserting the openafs module. This seems to happen
quite consistently when restarting the client after it has run for some
time (say, a week).
I have a crashdump from this incident as well. After a reboot, it took
less than three hours to get the above panic.
I don't think it's a hardware problem, but if it helps I'd be willing to
try and reproduce this on another system.
- Stephan
--
Stephan Wiesand
DESY - DV -
Platanenallee 6
15738 Zeuthen, Germany