[OpenAFS] Re: openafs hang

Tue, 04 Sep 2012 17:05:47 +0200

Hi!

I got bitten by another hang, which will hopefully provide more
information...

On 08/09/2012 05:42 PM, Andrew Deason wrote:
> On Thu, 09 Aug 2012 11:48:25 +0200
> Alexander 'Leo' Bergolth <leo@strike.wu.ac.at> wrote:
>> My box, using openafs-1.6.1 and kernel-2.6.32-131.17.1.el6.i686 on
>> Centos 6, just hung completely and had to be rebooted.  It looks like
>> the problem was caused by a locking problem of the openafs kernel
>> module, all processes that e.g. used AFS authentication got stuck
>> inside libafs. (See the kernel call-traces below.)
> 
> This would be more useful with a trace of all processes; all those show
> is that we're waiting for a lock. You can get that with 'echo t >
> /proc/sysrq-trigger'.

The output is available at:
http://leo.kloburg.at/tmp/openafs-1.6.1-hang/sysrq-show-state3.txt

> If you have the ability to run 'crash' (requires crash to be installed,
> and the running kernel debuginfo), you could also run something like
> this:
> 
> # crash
> [...]
> crash> sym afs_global_owner
> crash> print ((int*)0xADDR)[0]
> 
> where ADDR is the address printed out by 'sym'. If that prints out a
> valid pid, knowing information about that pid would be helpful. You
> could even:
> 
> crash> set <pid>
> crash> bt

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.1.1.el6.i686/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2012-09-02-13:54:06/vmcore  [PARTIAL DUMP]
        CPUS: 2
        DATE: Sun Sep  2 13:52:55 2012
      UPTIME: 19 days, 23:36:15
LOAD AVERAGE: 32.27, 20.22, 9.60
       TASKS: 371
    NODENAME: strike.wu-wien.ac.at
     RELEASE: 2.6.32-279.1.1.el6.i686
     VERSION: #1 SMP Tue Jul 10 12:30:45 UTC 2012
     MACHINE: i686  (2991 Mhz)
      MEMORY: 3.8 GB
       PANIC: "Oops: 0002 [#1] SMP " (check log for details)
         PID: 0
     COMMAND: "swapper"
        TASK: c0a425e0  (1 of 2)  [THREAD_INFO: c0a1a000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

crash> sym afs_global_owner
fa400228 (b) afs_global_owner [openafs]
crash> print ((int*)0xfa400228)[0]
$3 = 17283
crash> set 17283
    PID: 17283
COMMAND: "auth"
   TASK: eef0a550  [THREAD_INFO: f4554000]
    CPU: 1
  STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 17283  TASK: eef0a550  CPU: 1   COMMAND: "auth"
 #0 [f4555cc8] schedule at c083c5b3
 #1 [f4555d8c] __mutex_lock_slowpath at c083d943
 #2 [f4555db4] mutex_lock at c083d848
 #3 [f4555dc0] afs_dentry_iput at fa3d7208 [openafs]
 #4 [f4555ddc] dentry_iput at c0540e18
 #5 [f4555df4] d_kill at c0540f3d
 #6 [f4555e00] dput at c05422f8
 #7 [f4555e0c] afs_syscall_pioctl at fa3e4337 [openafs]
 #8 [f4555e64] afs_syscall at fa373770 [openafs]
 #9 [f4555eac] afs_unlocked_ioctl at fa387cba [openafs]
#10 [f4555edc] proc_reg_unlocked_ioctl at c057bac1
#11 [f4555f00] vfs_ioctl at c053d6e9
#12 [f4555f1c] do_vfs_ioctl at c053d8c7
#13 [f4555f90] sys_ioctl at c053de91
#14 [f4555fb0] ia32_sysenter_target at c0409a98
    EAX: 00000036  EBX: 00000003  ECX: 40044301  EDX: bfb8ee6c
    DS:  007b      ESI: 00000014  ES:  007b      EDI: 00000003
    SS:  007b      ESP: bfb8ee18  EBP: bfb8ee98  GS:  0033
    CS:  0073      EIP: 00d32424  ERR: 00000036  EFLAGS: 00200213
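
My own reading of this trace, which is only a guess since I have not
checked the openafs source: the pid stored in afs_global_owner (17283)
is itself sleeping in mutex_lock(), called from afs_dentry_iput, which
was reached via dput() from afs_syscall_pioctl. If that mutex is the
same global lock the process already holds, it would be waiting on
itself. The Python sketch below only illustrates that general pattern
in userspace. OwnedLock, release_callback and ioctl_like_path are
made-up names, not openafs code, and a timeout stands in for the hang.

```python
import threading


class OwnedLock:
    """Non-recursive lock that also records who holds it.

    Hypothetical stand-in for a global lock plus an owner variable;
    not taken from the openafs source.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None

    def acquire(self, timeout=-1):
        # timeout=-1 means "wait forever", like a plain mutex_lock().
        got = self._lock.acquire(timeout=timeout)
        if got:
            self.owner = threading.get_ident()
        return got

    def release(self):
        self.owner = None
        self._lock.release()


def release_callback(glock, timeout):
    """Assumed analogue of a callback that takes the global lock itself."""
    got = glock.acquire(timeout=timeout)
    if got:
        glock.release()
    return got


def ioctl_like_path(glock, timeout=0.1):
    """Hold the lock, then call code that tries to take it again."""
    glock.acquire()
    try:
        # The current thread is already the owner, so this cannot succeed.
        return release_callback(glock, timeout)
    finally:
        glock.release()
```

Calling ioctl_like_path(OwnedLock()) returns False: the inner acquire
never succeeds while the outer one is held. Without the timeout it
would block forever, with the owner field pointing at the blocked
thread.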

> ('exit' to exit crash). Or, you could just cause the machine to dump
> core instead of simply rebooting, via 'echo c > /proc/sysrq-trigger'
> (assuming the machine is configured to capture core on a crash, but I
> think that's the default), and provide the resulting core. Such a core
> would contain a lot of information about everything that's running on
> the box, so you would not want to make that generally publicly
> available.

Please let me know if you need further information. (The crash dump is
available.)
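
In case someone wants to go through several backtraces from the dump,
here is a rough sketch that pulls the frames out of crash 'bt' output
and reports which function called mutex_lock. The line format is
assumed from the trace above only; parse_bt and lock_caller are
hypothetical helper names.

```python
import re

# Matches crash 'bt' frame lines such as:
#   #3 [f4555dc0] afs_dentry_iput at fa3d7208 [openafs]
# The trailing "[module]" part is optional.
FRAME_RE = re.compile(
    r"^\s*#(?P<num>\d+)\s+\[(?P<sp>[0-9a-f]+)\]\s+(?P<func>\S+)"
    r"\s+at\s+(?P<addr>[0-9a-f]+)(?:\s+\[(?P<module>[^\]]+)\])?\s*$"
)


def parse_bt(text):
    """Return a list of (frame_number, function, module_or_None) tuples."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m.group("num")), m.group("func"), m.group("module")))
    return frames


def lock_caller(frames, lock_func="mutex_lock"):
    """Return the frame directly below lock_func (its caller), or None."""
    for i, (_, func, _) in enumerate(frames):
        if func == lock_func and i + 1 < len(frames):
            return frames[i + 1]
    return None
```

Header and register lines (PID:, EAX:, ...) do not match the pattern
and are skipped.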

I'd greatly appreciate it if an AFS expert could take a look at the problem!

Thanks,
--leo

P.S.: I am using the openafs-1.6.1-1.el6.i686 RPM for RHEL6.

-- 
e-mail   ::: Leo.Bergolth (at) wu.ac.at
fax      ::: +43-1-31336-906050
location ::: IT-Services | Vienna University of Economics | Austria