[OpenAFS] Re: openafs hang
Alexander 'Leo' Bergolth
leo@strike.wu.ac.at
Tue, 04 Sep 2012 17:05:47 +0200
Hi!
I got bitten by another hang, which will hopefully provide more
information...
On 08/09/2012 05:42 PM, Andrew Deason wrote:
> On Thu, 09 Aug 2012 11:48:25 +0200
> Alexander 'Leo' Bergolth <leo@strike.wu.ac.at> wrote:
>> My box, using openafs-1.6.1 and kernel-2.6.32-131.17.1.el6.i686 on
>> Centos 6, just hung completely and had to be rebooted. It looks like
>> the problem was caused by a locking problem of the openafs kernel
>> module, all processes that e.g. used AFS authentication got stuck
>> inside libafs. (See the kernel call-traces below.)
>
> This would be more useful with a trace of all processes; all those show
> is that we're waiting for a lock. You can get that with 'echo t >
> /proc/sysrq-trigger'.
The 'echo t > /proc/sysrq-trigger' output from this new hang is available at:
http://leo.kloburg.at/tmp/openafs-1.6.1-hang/sysrq-show-state3.txt
> If you have the ability to run 'crash' (requires crash to be installed,
> and the running kernel debuginfo), you could also run something like
> this:
>
> # crash
> [...]
> crash> sym afs_global_owner
> crash> print ((int*)0xADDR)[0]
>
> where ADDR is the address printed out by 'sym'. If that prints out a
> valid pid, knowing information about that pid would be helpful. You
> could even:
>
> crash> set <pid>
> crash> bt
KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.1.1.el6.i686/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2012-09-02-13:54:06/vmcore [PARTIAL DUMP]
CPUS: 2
DATE: Sun Sep 2 13:52:55 2012
UPTIME: 19 days, 23:36:15
LOAD AVERAGE: 32.27, 20.22, 9.60
TASKS: 371
NODENAME: strike.wu-wien.ac.at
RELEASE: 2.6.32-279.1.1.el6.i686
VERSION: #1 SMP Tue Jul 10 12:30:45 UTC 2012
MACHINE: i686 (2991 Mhz)
MEMORY: 3.8 GB
PANIC: "Oops: 0002 [#1] SMP " (check log for details)
PID: 0
COMMAND: "swapper"
TASK: c0a425e0 (1 of 2) [THREAD_INFO: c0a1a000]
CPU: 0
STATE: TASK_RUNNING (PANIC)
crash> sym afs_global_owner
fa400228 (b) afs_global_owner [openafs]
crash> print ((int*)0xfa400228)[0]
$3 = 17283
crash> set 17283
PID: 17283
COMMAND: "auth"
TASK: eef0a550 [THREAD_INFO: f4554000]
CPU: 1
STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 17283 TASK: eef0a550 CPU: 1 COMMAND: "auth"
#0 [f4555cc8] schedule at c083c5b3
#1 [f4555d8c] __mutex_lock_slowpath at c083d943
#2 [f4555db4] mutex_lock at c083d848
#3 [f4555dc0] afs_dentry_iput at fa3d7208 [openafs]
#4 [f4555ddc] dentry_iput at c0540e18
#5 [f4555df4] d_kill at c0540f3d
#6 [f4555e00] dput at c05422f8
#7 [f4555e0c] afs_syscall_pioctl at fa3e4337 [openafs]
#8 [f4555e64] afs_syscall at fa373770 [openafs]
#9 [f4555eac] afs_unlocked_ioctl at fa387cba [openafs]
#10 [f4555edc] proc_reg_unlocked_ioctl at c057bac1
#11 [f4555f00] vfs_ioctl at c053d6e9
#12 [f4555f1c] do_vfs_ioctl at c053d8c7
#13 [f4555f90] sys_ioctl at c053de91
#14 [f4555fb0] ia32_sysenter_target at c0409a98
EAX: 00000036 EBX: 00000003 ECX: 40044301 EDX: bfb8ee6c
DS: 007b ESI: 00000014 ES: 007b EDI: 00000003
SS: 007b ESP: bfb8ee18 EBP: bfb8ee98 GS: 0033
CS: 0073 EIP: 00d32424 ERR: 00000036 EFLAGS: 00200213
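For anyone who wants to triage a backtrace like the one above with a script, here is a minimal Python sketch. It assumes only the frame layout visible in the 'bt' output above ("#N [stack] function at address [module]"). The helper names (parse_bt, first_module_frame, blocked_on_mutex) are hypothetical and not part of crash or OpenAFS, and the mutex check is a simple heuristic, not a diagnosis.

```python
import re

# Frame layout assumed from the 'bt' output above, for example:
#   #3 [f4555dc0] afs_dentry_iput at fa3d7208 [openafs]
# The trailing "[module]" part is optional (core kernel frames lack it).
FRAME_RE = re.compile(
    r"#(?P<idx>\d+)\s+\[(?P<sp>[0-9a-f]+)\]\s+(?P<func>\S+)\s+at\s+"
    r"(?P<addr>[0-9a-f]+)(?:\s+\[(?P<module>[^\]]+)\])?"
)


def parse_bt(text):
    """Return one dict per stack frame found in crash 'bt' output."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append({
                "index": int(m.group("idx")),
                "func": m.group("func"),
                "addr": m.group("addr"),
                "module": m.group("module"),  # None for core kernel frames
            })
    return frames


def first_module_frame(frames, module):
    """Innermost frame that belongs to the given kernel module, or None."""
    for frame in frames:
        if frame["module"] == module:
            return frame
    return None


def blocked_on_mutex(frames):
    """Heuristic: the innermost frames are schedule() under a mutex_lock path."""
    funcs = [frame["func"] for frame in frames[:3]]
    in_mutex = any(f.startswith(("mutex_lock", "__mutex_lock")) for f in funcs)
    return "schedule" in funcs and in_mutex


# Innermost frames copied from the backtrace of PID 17283 above.
BT = """\
#0 [f4555cc8] schedule at c083c5b3
#1 [f4555d8c] __mutex_lock_slowpath at c083d943
#2 [f4555db4] mutex_lock at c083d848
#3 [f4555dc0] afs_dentry_iput at fa3d7208 [openafs]
#4 [f4555ddc] dentry_iput at c0540e18
#5 [f4555df4] d_kill at c0540f3d
#6 [f4555e00] dput at c05422f8
#7 [f4555e0c] afs_syscall_pioctl at fa3e4337 [openafs]
"""

if __name__ == "__main__":
    frames = parse_bt(BT)
    caller = first_module_frame(frames, "openafs")
    print("blocked on mutex:", blocked_on_mutex(frames))
    print("innermost openafs frame:", caller["func"] if caller else None)
```

Run against the frames above, this only restates what the trace already shows: the PID recorded in afs_global_owner is itself sleeping in mutex_lock, called from afs_dentry_iput underneath afs_syscall_pioctl. Whether that is the actual cause of the hang is for someone who knows the code to judge.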
> ('exit' to exit crash). Or, you could just cause the machine to dump
> core instead of simply rebooting, via 'echo c > /proc/sysrq-trigger'
> (assuming the machine is configured to capture core on a crash, but I
> think that's the default), and provide the resulting core. Such a core
> would contain a lot of information about everything that's running on
> the box, so you would not want to make that generally publicly
> available.
Please let me know if you need further information. (The crash dump is
available.)
I'd greatly appreciate it if an AFS expert could take a look at the problem!
Thanks,
--leo
P.S.: I am using the openafs-1.6.1-1.el6.i686 RPM for RHEL6.
--
e-mail ::: Leo.Bergolth (at) wu.ac.at
fax ::: +43-1-31336-906050
location ::: IT-Services | Vienna University of Economics | Austria