[OpenAFS-devel] [CSL #328045] kernel BUG: file locks and openafs 1.4.2
David Thompson
thomas@cs.wisc.edu
Thu, 19 Apr 2007 17:15:04 -0500
Howdy all -
I'm running into a recurring kernel BUG() on CentOS 4 (2.6.9-42.0.8.ELsmp) with
openafs 1.4.2. It looks like an FL_POSIX lock is reaching
locks_remove_flock(). Reports on the web suggest there are several ways to
trigger this.
I've set up netdump, and have several kernel dumps and logs.
If anyone with deep kernel magic would be willing to work with me on this, I
would greatly appreciate it.
Here's what my feeble knowledge of crash has come up with (please be gentle;
I'm far out of my comfort zone here...):
# crash /boot/System.map-2.6.9-42.0.8.ELsmp \
/usr/lib/debug/lib/modules/2.6.9-42.0.8.ELsmp/vmlinux vmcore
<snip>
SYSTEM MAP: /boot/System.map-2.6.9-42.0.8.ELsmp
DEBUG KERNEL: /usr/lib/debug/lib/modules/2.6.9-42.0.8.ELsmp/vmlinux
(2.6.9-42.0.8.ELsmp)
DUMPFILE: vmcore
CPUS: 1
DATE: Thu Apr 19 00:00:39 2007
UPTIME: 15:41:08
LOAD AVERAGE: 0.65, 0.50, 0.53
TASKS: 78
NODENAME: <somehost>.cs.wisc.edu
RELEASE: 2.6.9-42.0.8.ELsmp
VERSION: #1 SMP Tue Jan 30 12:33:47 EST 2007
MACHINE: i686 (930 Mhz)
MEMORY: 510.8 MB
PANIC: "kernel BUG at fs/locks.c:1798!"
PID: 20999
COMMAND: "collector"
TASK: c3ef0c70 [THREAD_INFO: d8312000]
CPU: 0
STATE: TASK_RUNNING (PANIC)
crash> bt
PID: 20999 TASK: c3ef0c70 CPU: 0 COMMAND: "collector"
#0 [d8312de8] netpoll_start_netdump at e0d79570
#1 [d8312e08] die at c0106045
#2 [d8312e3c] do_invalid_op at c0106420
#3 [d8312eec] error_code (via invalid_op) at c02d53cd
EAX: cfa9572c EBX: c864be68 ECX: 00000000 EDX: 00000001 EBP: c7da5c40
DS: 007b ESI: 00000000 ES: 007b EDI: c864bdc0
CS: 0060 EIP: c016e8ac ERR: ffffffff EFLAGS: 00010246
#4 [d8312f28] locks_remove_flock at c016e8ac
#5 [d8312f9c] __fput at c015bae1
#6 [d8312fb0] filp_close at c015a714
#7 [d8312fc0] system_call at c02d48d0
EAX: 00000006 EBX: 00000005 ECX: 00000000 EDX: 00575ff4
DS: 007b ESI: 0b2ec108 ES: 007b EDI: 00000000
SS: 007b ESP: bfed4984 EBP: bfed4990
CS: 0073 EIP: 003a77a2 ERR: 00000006 EFLAGS: 00000292
crash> bt -f
PID: 20999 TASK: c3ef0c70 CPU: 0 COMMAND: "collector"
<snip>
#4 [d8312f28] locks_remove_flock at c016e8ac
[RA: c015bae6 SP: d8312f2c FP: d8312f9c SIZE: 116]
d8312f2c: 00000000 e0a273f5 c7da5c40 c864be68
d8312f3c: d8312000 c016e763 00000246 00000000
d8312f4c: 00000401 4626f776 00000246 d68f4ac0
d8312f5c: 00005207 c700e6e4 d8312f6c c02d2e82
d8312f6c: c700e6e4 c7da5c40 c8640201 00000000
d8312f7c: 00000000 00000246 00000000 c7da5c40
d8312f8c: c7da5c40 dfe16e00 c864bdc0 d674db94
d8312f9c: c015bae6
#5 [d8312f9c] __fput at c015bae1
[RA: c015a719 SP: d8312fa0 FP: d8312fb0 SIZE: 20]
d8312fa0: c7da5c40 00000000 d68f4ac0 d8312000
d8312fb0: c015a719
<snip>
crash> struct file.f_dentry c7da5c40
f_dentry = 0xd674db94,
crash> struct dentry.d_inode 0xd674db94
d_inode = 0xc864bdc0,
crash> struct inode.i_flock 0xc864bdc0
i_flock = 0xcfa9572c,
crash> struct file_lock.fl_flags 0xcfa9572c
fl_flags = 1 '\001',
...which matches the reports I found on the web: an FL_POSIX lock (rather than
FL_FLOCK or FL_LEASE) on the inode's lock list triggers the BUG(). We're seeing
it triggered by both firefox and the cricket "collector" program. I haven't
been able to reduce it to a small test case (yet).
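For anyone who wants to follow the reasoning without a kernel tree at hand,
here is a rough user-space sketch of the check I believe locks_remove_flock()
makes. It is only a model: the struct definitions are cut-down stand-ins, the
names check_remove_flock(), lock_kind() and enum verdict are made up for
illustration, and the flag values are what I recall from include/linux/fs.h in
2.6-era kernels, so please verify them against the actual source.

```c
#include <stddef.h>

/* Assumed flag values (from memory of include/linux/fs.h, 2.6-era);
 * check them against your own kernel tree. */
#define FL_POSIX  1
#define FL_FLOCK  2
#define FL_LEASE 32

/* Hypothetical cut-down stand-ins; NOT the real kernel layouts. */
struct file { int unused; };

struct file_lock {
    struct file_lock *fl_next;   /* next lock on the inode's i_flock list */
    struct file      *fl_file;   /* open file that owns this lock */
    unsigned char     fl_flags;  /* FL_POSIX, FL_FLOCK, FL_LEASE, ... */
};

enum verdict { LIST_CLEAN, WOULD_BUG };

/* Name the lock type encoded in fl_flags, as read from the dump. */
const char *lock_kind(unsigned char fl_flags)
{
    if (fl_flags & FL_POSIX) return "FL_POSIX";
    if (fl_flags & FL_FLOCK) return "FL_FLOCK";
    if (fl_flags & FL_LEASE) return "FL_LEASE";
    return "other";
}

/* Model of the walk over inode->i_flock at final close: every lock still
 * owned by the file being closed is expected to be a flock lock or a
 * lease. Anything else is the case the kernel treats as a BUG(). */
enum verdict check_remove_flock(const struct file_lock *head,
                                const struct file *filp)
{
    const struct file_lock *fl;

    for (fl = head; fl != NULL; fl = fl->fl_next) {
        if (fl->fl_file != filp)
            continue;            /* lock belongs to some other open file */
        if (fl->fl_flags & (FL_FLOCK | FL_LEASE))
            continue;            /* the kernel would remove or unlock it here */
        return WOULD_BUG;        /* e.g. a leftover FL_POSIX lock */
    }
    return LIST_CLEAN;
}
```

With the value from the dump, lock_kind(1) names the lock as FL_POSIX, and a
list holding such a lock whose fl_file is the file being closed makes
check_remove_flock() return WOULD_BUG. The sketch says nothing about how the
POSIX lock survived until this point, which is the part I can't explain.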
If this is known, any pointers to related information would be much
appreciated. If this is new and someone with the appropriate experience is
interested in working on it, please let me know.
Cheers,
Dave Thompson
UW-Madison