[OpenAFS] Re: openafs hang

Marc Dionne marc.c.dionne@gmail.com
Tue, 4 Sep 2012 11:38:09 -0400


On Tue, Sep 4, 2012 at 11:05 AM, Alexander 'Leo' Bergolth
<leo@strike.wu.ac.at> wrote:
> Hi!
>
> I got bitten by another hang, which will hopefully provide more
> information...
>
> On 08/09/2012 05:42 PM, Andrew Deason wrote:
>> On Thu, 09 Aug 2012 11:48:25 +0200
>> Alexander 'Leo' Bergolth <leo@strike.wu.ac.at> wrote:
>>> My box, using openafs-1.6.1 and kernel-2.6.32-131.17.1.el6.i686 on
>>> Centos 6, just hung completely and had to be rebooted.  It looks like
>>> the problem was caused by a locking problem of the openafs kernel
>>> module, all processes that e.g. used AFS authentication got stuck
>>> inside libafs. (See the kernel call-traces below.)
>>
>> This would be more useful with a trace of all processes; all those show
>> is that we're waiting for a lock. You can get that with 'echo t >
>> /proc/sysrq-trigger'.
>
> The output is available at:
> http://leo.kloburg.at/tmp/openafs-1.6.1-hang/sysrq-show-state3.txt
>
>> If you have the ability to run 'crash' (requires crash to be installed,
>> and the running kernel debuginfo), you could also run something like
>> this:
>>
>> # crash
>> [...]
>> crash> sym afs_global_owner
>> crash> print ((int*)0xADDR)[0]
>>
>> where ADDR is the address printed out by 'sym'. If that prints out a
>> valid pid, knowing information about that pid would be helpful. You
>> could even:
>>
>> crash> set <pid>
>> crash> bt
>
>       KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.1.1.el6.i686/vmlinux
>     DUMPFILE: /var/crash/127.0.0.1-2012-09-02-13:54:06/vmcore  [PARTIAL
> DUMP]
>         CPUS: 2
>         DATE: Sun Sep  2 13:52:55 2012
>       UPTIME: 19 days, 23:36:15
> LOAD AVERAGE: 32.27, 20.22, 9.60
>        TASKS: 371
>     NODENAME: strike.wu-wien.ac.at
>      RELEASE: 2.6.32-279.1.1.el6.i686
>      VERSION: #1 SMP Tue Jul 10 12:30:45 UTC 2012
>      MACHINE: i686  (2991 Mhz)
>       MEMORY: 3.8 GB
>        PANIC: "Oops: 0002 [#1] SMP " (check log for details)
>          PID: 0
>      COMMAND: "swapper"
>         TASK: c0a425e0  (1 of 2)  [THREAD_INFO: c0a1a000]
>          CPU: 0
>        STATE: TASK_RUNNING (PANIC)
>
> crash> sym afs_global_owner
> fa400228 (b) afs_global_owner [openafs]
> crash> print ((int*)0xfa400228)[0]
> $3 = 17283
> crash> set 17283
>     PID: 17283
> COMMAND: "auth"
>    TASK: eef0a550  [THREAD_INFO: f4554000]
>     CPU: 1
>   STATE: TASK_UNINTERRUPTIBLE
> crash> bt
> PID: 17283  TASK: eef0a550  CPU: 1   COMMAND: "auth"
>  #0 [f4555cc8] schedule at c083c5b3
>  #1 [f4555d8c] __mutex_lock_slowpath at c083d943
>  #2 [f4555db4] mutex_lock at c083d848
>  #3 [f4555dc0] afs_dentry_iput at fa3d7208 [openafs]
>  #4 [f4555ddc] dentry_iput at c0540e18
>  #5 [f4555df4] d_kill at c0540f3d
>  #6 [f4555e00] dput at c05422f8
>  #7 [f4555e0c] afs_syscall_pioctl at fa3e4337 [openafs]
>  #8 [f4555e64] afs_syscall at fa373770 [openafs]
>  #9 [f4555eac] afs_unlocked_ioctl at fa387cba [openafs]
> #10 [f4555edc] proc_reg_unlocked_ioctl at c057bac1
> #11 [f4555f00] vfs_ioctl at c053d6e9
> #12 [f4555f1c] do_vfs_ioctl at c053d8c7
> #13 [f4555f90] sys_ioctl at c053de91

That's very useful.  It looks like dput gets called with the global
lock held, and it can potentially call back into AFS code through
afs_dentry_iput.
Might have to look at the afs_syscall_pioctl in more detail (it's
heavily ifdef'ed and hard to follow), but at first glance the call to
dput() there should probably be done after dropping the lock, so
surround it with a AFS_GUNLOCK() and AFS_GLOCK() pair.

>
> Thanks,
> --leo

Marc