[OpenAFS] Linux kernel panic, OpenAFS, gconf
Miles Davis
miles@cs.stanford.edu
Mon, 26 Apr 2004 10:44:56 -0700
Hi everybody,
I've been trying to track down a kernel panic we've been experiencing for
a couple of months now, without success. I googled around and checked the
list archives, but I can't find anything that seems to match my problem.
A user has a research job that (for reasons I cannot explain) starts/stops
GNOME sessions in quick succession to do <something>. If somebody wants
more detail, I'll try to find out exactly what it does, but it seems
secondary to the problem at the moment. The main thing to think about is a
loop of login, run stuff, logout.
This works fine on a local filesystem, but eventually (within hours)
crashes the client system when the home directory is in AFS. The trouble
starts at some point with gconf, my favorite program [:)], being unable or
unwilling to release a lock. I see this in the logs:
starting (version 2.2.0), pid 26455 user 'joe'
Resolved address "xml:readonly:/etc/gconf/gconf.xml.mandatory" to a read-only config source at position 0
Resolved address "xml:readwrite:/afs/blah/u/joe/.gconf" to a writable config source at position 1
Resolved address "xml:readonly:/etc/gconf/gconf.xml.defaults" to a read-only config source at position 2
Received signal 1, shutting down cleanly
Failed to give up lock on XML dir "/afs/blah/u/joe/.gconf": Failed to remove lock directory `/afs/blah/u/liblit/.gconf/%gconf-xml-backend.lock': Directory not empty
That sequence repeats somewhere between 10 and 25 times.
I'm not sure why this happens, but I don't care much about that part; it's
the resulting kernel panic that bothers me:
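If the research job itself is hard to share, a smaller stress test might exercise the same lock handling. The Python sketch below is hypothetical and untested against AFS: it assumes gconfd locks by creating the %gconf-xml-backend.lock directory and keeping one file inside it (named "ior" here, which is an assumption), and it simply repeats mkdir, a premature rmdir that fails with "Directory not empty", unlink, and a final rmdir.

```python
import errno
import os
import sys
import tempfile

# Hypothetical names modelled on the gconfd log above; the real gconfd
# lock layout (and the file it keeps inside the lock) may differ.
LOCK_NAME = "%gconf-xml-backend.lock"
LOCK_FILE = "ior"


def lock_cycle(base):
    """Take and release a directory lock once, roughly as a gconfd
    start/shutdown might. Returns True if the first rmdir() failed
    because the directory was not empty."""
    lock_dir = os.path.join(base, LOCK_NAME)
    lock_file = os.path.join(lock_dir, LOCK_FILE)
    os.mkdir(lock_dir)                 # acquire: mkdir() is atomic
    with open(lock_file, "w") as f:
        f.write("placeholder\n")
    saw_not_empty = False
    try:
        os.rmdir(lock_dir)             # premature release, as in the log
    except OSError as e:
        # POSIX allows either errno for a non-empty directory.
        if e.errno not in (errno.ENOTEMPTY, errno.EEXIST):
            raise
        saw_not_empty = True
    os.unlink(lock_file)               # unlink() goes through a lookup first
    os.rmdir(lock_dir)                 # now the directory is empty
    return saw_not_empty


def hammer(base, iterations):
    """Repeat lock_cycle(); return how many cycles hit 'Directory not empty'."""
    return sum(1 for _ in range(iterations) if lock_cycle(base))


if __name__ == "__main__":
    # Pass a scratch directory inside AFS; defaults to a local temp dir.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.mkdtemp()
    print(hammer(target, 1000), "cycles hit 'Directory not empty'")
```

Pass the path of a scratch directory in a test volume (not a real user's .gconf) as the first argument. On a local filesystem every cycle reports the ENOTEMPTY failure and nothing else happens; whether this loop alone can reproduce the AFS panic is unknown.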
Apr 25 19:47:59 moa kernel: TT3<1>Unable to handle kernel paging request at virtual address ffffffff
Apr 25 19:47:59 moa kernel: printing eip:
Apr 25 19:47:59 moa kernel: f8b7e950
Apr 25 19:47:59 moa kernel: *pde = 00003067
Apr 25 19:47:59 moa kernel: *pte = 00000000
Apr 25 19:47:59 moa kernel: Oops: 0002
Apr 25 19:47:59 moa kernel: libafs-2.4.20-24.9-i686.mp ide-cd cdrom lp parport nfsd nfs lockd sunrpc autofs tg3 ipt_REJECT iptable_filter ip_tables keybdev mousedev hid input usb-ohci us
Apr 25 19:47:59 moa kernel: CPU: 0
Apr 25 19:47:59 moa kernel: EIP: 0060:[<f8b7e950>] Tainted: PF
Apr 25 19:47:59 moa kernel: EFLAGS: 00210286
Apr 25 19:47:59 moa kernel:
Apr 25 19:47:59 moa kernel: EIP is at osi_Panic [libafs-2.4.20-24.9-i686.mp] 0x20 (2.4.20-31.9smp)
Apr 25 19:47:59 moa kernel: eax: 00000003 ebx: f8d51240 ecx: 00200002 edx: d967bdf4
Apr 25 19:47:59 moa kernel: esi: 00000000 edi: 00000401 ebp: 00000000 esp: d967be58
Apr 25 19:47:59 moa kernel: ds: 0068 es: 0068 ss: 0068
Apr 25 19:47:59 moa kernel: Process gconfd-2 (pid: 26455, stackpage=d967b000)
Apr 25 19:47:59 moa kernel: Stack: f8ba3774 cb122c80 d967beec d967becc d967bee8 d967bef8 d967bec0 f8b5ea53
Apr 25 19:47:59 moa kernel: f8ba3774 cb122c80 d967beec d967becc 00000001 00000001 02580023 00000001
Apr 25 19:47:59 moa kernel: 00000000 00000000 00000000 00000000 f8d51240 cb122c80 00000000 00000a6c
Apr 25 19:47:59 moa kernel: Call Trace: [<f8ba3774>] .rodata.str1.1 [libafs-2.4.20-24.9-i686.mp] 0x9c4 (0xd967be58))
Apr 25 19:47:59 moa kernel: [<f8b5ea53>] afs_lookup [libafs-2.4.20-24.9-i686.mp] 0x7b3 (0xd967be74))
Apr 25 19:47:59 moa kernel: [<f8ba3774>] .rodata.str1.1 [libafs-2.4.20-24.9-i686.mp] 0x9c4 (0xd967be78))
Apr 25 19:47:59 moa kernel: [<f8b56c19>] afs_access [libafs-2.4.20-24.9-i686.mp] 0xf9 (0xd967bec4))
Apr 25 19:47:59 moa kernel: [<f8b89834>] crget [libafs-2.4.20-24.9-i686.mp] 0x54 (0xd967bf14))
Apr 25 19:47:59 moa kernel: [<f8b8ea7e>] afs_linux_lookup [libafs-2.4.20-24.9-i686.mp] 0x5e (0xd967bf34))
Apr 25 19:47:59 moa kernel: [<c0160b72>] lookup_hash [kernel] 0xc2 (0xd967bf64))
Apr 25 19:47:59 moa kernel: [<c0162498>] sys_unlink [kernel] 0xa8 (0xd967bf84))
Apr 25 19:47:59 moa kernel: [<c01098cf>] system_call [kernel] 0x33 (0xd967bfc0))
Apr 25 19:47:59 moa kernel:
Apr 25 19:47:59 moa kernel:
Apr 25 19:47:59 moa kernel: Code: c6 05 ff ff ff ff 2a 83 c4 1c c3 90 8d 74 26 00 b8 62 3c ba
Here pid 26455 is the last instance of gconfd before the panic. The
system is then left in a nice half-dead state.
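As a reading aid, the call trace in the oops above can be reduced to a candidate call chain. This is a small hypothetical helper, not part of any kernel or OpenAFS tooling; it assumes the Red Hat 2.4 trace format shown above and uses a heuristic: entries whose symbol starts with "." (such as .rodata.str1.1) are data addresses picked up by the stack scan, not return addresses, so they are dropped.

```python
import re

# One "[<address>] symbol [module] 0xoffset" entry, as printed in the
# Red Hat 2.4 oops above.
FRAME_RE = re.compile(
    r"\[<([0-9a-f]{8})>\]\s+(\S+)\s+\[([^\]]+)\]\s+(0x[0-9a-f]+)"
)


def call_chain(trace_text):
    """Return the symbols in an oops call trace, outermost caller first.

    Heuristic (an assumption): the 2.4 i386 trace is a scan of stack words,
    so symbols starting with '.' are data, not return addresses, and are
    dropped. Other entries may still be stale, so this is only a candidate.
    """
    symbols = [m.group(2) for m in FRAME_RE.finditer(trace_text)]
    code_symbols = [s for s in symbols if not s.startswith(".")]
    # The trace lists the innermost frame first; reverse it for reading.
    return list(reversed(code_symbols))


TRACE = """\
Call Trace: [<f8ba3774>] .rodata.str1.1 [libafs-2.4.20-24.9-i686.mp] 0x9c4 (0xd967be58))
[<f8b5ea53>] afs_lookup [libafs-2.4.20-24.9-i686.mp] 0x7b3 (0xd967be74))
[<f8ba3774>] .rodata.str1.1 [libafs-2.4.20-24.9-i686.mp] 0x9c4 (0xd967be78))
[<f8b56c19>] afs_access [libafs-2.4.20-24.9-i686.mp] 0xf9 (0xd967bec4))
[<f8b89834>] crget [libafs-2.4.20-24.9-i686.mp] 0x54 (0xd967bf14))
[<f8b8ea7e>] afs_linux_lookup [libafs-2.4.20-24.9-i686.mp] 0x5e (0xd967bf34))
[<c0160b72>] lookup_hash [kernel] 0xc2 (0xd967bf64))
[<c0162498>] sys_unlink [kernel] 0xa8 (0xd967bf84))
[<c01098cf>] system_call [kernel] 0x33 (0xd967bfc0))
"""

if __name__ == "__main__":
    print(" -> ".join(call_chain(TRACE)))
```

For this trace it prints system_call -> sys_unlink -> lookup_hash -> afs_linux_lookup -> crget -> afs_access -> afs_lookup, with EIP in osi_Panic as the last step. crget and afs_access may themselves be stale stack entries, so the firm part is that the panic happened during the lookup for an unlink().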
I used to think that the crash had something to do with the nightly 'vos
backupsys', since the two seemed to happen at the same time, but that has
since been disproven. I've tried a few other things to cause the crash,
like moving the volume and restarting the fileserver, but nothing on the
server side seems to be able to trigger it.
The client is Red Hat 9 with the stock kernel 2.4.20-31.9smp. The same
crash happened under 2.4.20-24.9smp and 2.4.20-30.9smp.
OpenAFS is 1.2.11-rh9.0.1, straight from openafs.org with no
customizations. Client options are currently "$XLARGE -nosettime -memcache
-fakestat", though the crash also occurred using disk cache and the
'MEDIUM' and 'SMALL' configs in /etc/sysconfig/afs.
The server is Red Hat 9, 2.4.20-30.9smp, openafs-1.2.11-rh9.0.1. I don't
see anything logged there at the time of the client crash.
If anybody has any suggestions, or a plan of attack, I would be grateful.
I can provide more information if needed.
Thanks,
--
// Miles Davis - miles@cs.stanford.edu - http://www.cs.stanford.edu/~miles
// Computer Science Department - Computer Facilities
// Stanford University