[OpenAFS-devel] [CSL #328045] kernel BUG: file locks and openafs 1.4.2

David Thompson thomas@cs.wisc.edu
Thu, 19 Apr 2007 17:15:04 -0500


Howdy all -

I'm running into a recurring kernel BUG() on CentOS 4 (2.6.9-42.0.8.ELsmp) with 
openafs 1.4.2.  It looks like an FL_POSIX lock is reaching 
locks_remove_flock.  According to reports on the web, there are several ways to 
trigger this.

I've set up netdump, and have several kernel dumps and logs.

If anyone with deep kernel magic would be willing to work with me on this, I 
would greatly appreciate it.

Here's what my feeble knowledge of crash has come up with (please be gentle; 
I'm far out of my comfort zone here...):

# crash /boot/System.map-2.6.9-42.0.8.ELsmp \
/usr/lib/debug/lib/modules/2.6.9-42.0.8.ELsmp/vmlinux vmcore
<snip>
  SYSTEM MAP: /boot/System.map-2.6.9-42.0.8.ELsmp                      
DEBUG KERNEL: /usr/lib/debug/lib/modules/2.6.9-42.0.8.ELsmp/vmlinux (2.6.9-42.0.8.ELsmp)
    DUMPFILE: vmcore
        CPUS: 1
        DATE: Thu Apr 19 00:00:39 2007
      UPTIME: 15:41:08
LOAD AVERAGE: 0.65, 0.50, 0.53
       TASKS: 78
    NODENAME: <somehost>.cs.wisc.edu
     RELEASE: 2.6.9-42.0.8.ELsmp
     VERSION: #1 SMP Tue Jan 30 12:33:47 EST 2007
     MACHINE: i686  (930 Mhz)
      MEMORY: 510.8 MB
       PANIC: "kernel BUG at fs/locks.c:1798!"
         PID: 20999
     COMMAND: "collector"
        TASK: c3ef0c70  [THREAD_INFO: d8312000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 20999  TASK: c3ef0c70  CPU: 0   COMMAND: "collector"
 #0 [d8312de8] netpoll_start_netdump at e0d79570
 #1 [d8312e08] die at c0106045
 #2 [d8312e3c] do_invalid_op at c0106420
 #3 [d8312eec] error_code (via invalid_op) at c02d53cd
    EAX: cfa9572c  EBX: c864be68  ECX: 00000000  EDX: 00000001  EBP: c7da5c40 
    DS:  007b      ESI: 00000000  ES:  007b      EDI: c864bdc0 
    CS:  0060      EIP: c016e8ac  ERR: ffffffff  EFLAGS: 00010246 
 #4 [d8312f28] locks_remove_flock at c016e8ac
 #5 [d8312f9c] __fput at c015bae1
 #6 [d8312fb0] filp_close at c015a714
 #7 [d8312fc0] system_call at c02d48d0
    EAX: 00000006  EBX: 00000005  ECX: 00000000  EDX: 00575ff4 
    DS:  007b      ESI: 0b2ec108  ES:  007b      EDI: 00000000 
    SS:  007b      ESP: bfed4984  EBP: bfed4990 
    CS:  0073      EIP: 003a77a2  ERR: 00000006  EFLAGS: 00000292 

crash> bt -f
PID: 20999  TASK: c3ef0c70  CPU: 0   COMMAND: "collector"
<snip>
 #4 [d8312f28] locks_remove_flock at c016e8ac
    [RA: c015bae6  SP: d8312f2c  FP: d8312f9c  SIZE: 116]
    d8312f2c: 00000000  e0a273f5  c7da5c40  c864be68  
    d8312f3c: d8312000  c016e763  00000246  00000000  
    d8312f4c: 00000401  4626f776  00000246  d68f4ac0  
    d8312f5c: 00005207  c700e6e4  d8312f6c  c02d2e82  
    d8312f6c: c700e6e4  c7da5c40  c8640201  00000000  
    d8312f7c: 00000000  00000246  00000000  c7da5c40  
    d8312f8c: c7da5c40  dfe16e00  c864bdc0  d674db94  
    d8312f9c: c015bae6  
 #5 [d8312f9c] __fput at c015bae1
    [RA: c015a719  SP: d8312fa0  FP: d8312fb0  SIZE: 20]
    d8312fa0: c7da5c40  00000000  d68f4ac0  d8312000  
    d8312fb0: c015a719  
<snip>
crash> struct file.f_dentry c7da5c40
  f_dentry = 0xd674db94, 
crash> struct dentry.d_inode 0xd674db94
  d_inode = 0xc864bdc0, 
crash> struct inode.i_flock 0xc864bdc0
  i_flock = 0xcfa9572c, 
crash> struct file_lock.fl_flags 0xcfa9572c
  fl_flags = 1 '\001', 

...which matches the reports I found on the web: the lock is FL_POSIX 
(fl_flags = 1) rather than FL_FLOCK or FL_LEASE, and that triggers the BUG().  
We're seeing this triggered by both firefox and the cricket "collector" 
program.  I haven't been able to reduce it to a small test case (yet).
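
For reference, here is a minimal sketch of how that fl_flags value decodes and why it would land on the BUG() path. The flag values and the branch order are written from memory of 2.6-era include/linux/fs.h and fs/locks.c, so treat them as assumptions to check against the actual 2.6.9-42.0.8.EL source. The helper names (decode_fl_flags, remove_flock_action) are made up for illustration.

```python
# Lock-type bits as recalled from include/linux/fs.h (2.6-era kernels).
# Assumption: verify against the source for 2.6.9-42.0.8.EL before relying on them.
FL_POSIX = 1
FL_FLOCK = 2
FL_LEASE = 32

KNOWN_FLAGS = {FL_POSIX: "FL_POSIX", FL_FLOCK: "FL_FLOCK", FL_LEASE: "FL_LEASE"}


def decode_fl_flags(value):
    """Return names of the known bits set in value, plus leftover bits in hex."""
    names = [name for bit, name in sorted(KNOWN_FLAGS.items()) if value & bit]
    leftover = value & ~(FL_POSIX | FL_FLOCK | FL_LEASE)
    if leftover:
        names.append(hex(leftover))
    return names


def remove_flock_action(value):
    """Hypothetical model of the per-lock branch in locks_remove_flock
    for a lock that belongs to the file being closed."""
    if value & FL_FLOCK:
        return "delete flock lock"
    if value & FL_LEASE:
        return "release lease"
    # Neither a flock lock nor a lease: the kernel does not expect this here.
    return "BUG()"


if __name__ == "__main__":
    flags = 0x01  # fl_flags read from the file_lock at 0xcfa9572c
    print(decode_fl_flags(flags), "->", remove_flock_action(flags))
```

Under those assumptions, fl_flags = 1 decodes to FL_POSIX alone and falls through to the BUG() branch, which is consistent with the panic at fs/locks.c:1798.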

If this is known, any pointers to related information would be much 
appreciated.  If this is new and someone with the appropriate experience is 
interested in working on it, please let me know.

Cheers,

Dave Thompson
UW-Madison