[OpenAFS] Re: openafs hang

Andrew Deason adeason@sinenomine.net
Thu, 9 Aug 2012 10:42:14 -0500


On Thu, 09 Aug 2012 11:48:25 +0200
Alexander 'Leo' Bergolth <leo@strike.wu.ac.at> wrote:

> My box, using openafs-1.6.1 and kernel-2.6.32-131.17.1.el6.i686 on
> Centos 6, just hung completely and had to be rebooted.  It looks like
> the problem was caused by a locking problem of the openafs kernel
> module, all processes that e.g. used AFS authentication got stuck
> inside libafs. (See the kernel call-traces below.)

This would be more useful with a trace of all processes; all those
traces show is that the processes are waiting for a lock. You can get a
full trace with 'echo t > /proc/sysrq-trigger' (as root); the output
goes to the kernel log.
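
To make the task trace easier to attach to a mail, something like the
following rough sketch could save it to a file. The function name
dump_all_tasks and its two arguments are made up for this example; the
second argument only exists so the trigger path can be swapped out for a
dry run.

```shell
#!/bin/sh
# Sketch: request a dump of all tasks via SysRq 't' and save the kernel log.
# dump_all_tasks and its arguments are hypothetical names for this example.
dump_all_tasks() {
    out="$1"                              # file that receives the kernel log
    trigger="${2:-/proc/sysrq-trigger}"   # overridable, e.g. for a dry run
    echo t > "$trigger" || return 1       # ask the kernel to dump all tasks
    sleep 1                               # give the kernel a moment to print
    dmesg > "$out"                        # traces land in the kernel ring buffer
}
# Example (as root): dump_all_tasks /tmp/afs-hang-tasks.txt
```

On a box with many processes the ring buffer may wrap before dmesg reads
it, in which case the syslog file is probably the better source.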

If you have the ability to run 'crash' (requires crash to be installed,
and the running kernel debuginfo), you could also run something like
this:

# crash
[...]
crash> sym afs_global_owner
crash> print ((int*)0xADDR)[0]

where ADDR is the address printed out by 'sym'. If that prints out a
valid pid, knowing information about that pid would be helpful. You
could even:

crash> set <pid>
crash> bt

('exit' to exit crash). Or, you could just cause the machine to dump
core instead of simply rebooting, via 'echo c > /proc/sysrq-trigger'
(assuming the machine is configured to capture core on a crash, but I
think that's the default), and provide the resulting core. Such a core
would contain a lot of information about everything that's running on
the box, so you would not want to make that generally publicly
available.
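
Since 'echo c' takes the machine down either way, it may be worth
checking first that a crash kernel is actually loaded. A minimal sketch,
assuming the kernel exposes /sys/kernel/kexec_crash_loaded (the helper
name kdump_loaded is made up for this example, and the path argument is
only there so it can be tried against a scratch file):

```shell
#!/bin/sh
# Sketch: report whether a kdump (crash) kernel is currently loaded.
# kdump_loaded is a hypothetical helper name; the path is overridable.
kdump_loaded() {
    flag="${1:-/sys/kernel/kexec_crash_loaded}"
    # The kernel reports "1" here when a crash kernel has been loaded.
    [ -r "$flag" ] && [ "$(cat "$flag")" = "1" ]
}
# Example (as root; this crashes the machine on purpose):
#   kdump_loaded && echo c > /proc/sysrq-trigger
```

If the check fails, the 'echo c' would most likely just panic the box
without leaving a core behind.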

But all of that is also only really helpful if the process holding the
relevant lock is still around. If someone for some reason just didn't
drop the lock before returning from somewhere/exiting, it's not really
easy to see where the problem comes from. I don't think there are any
known bugs like that, but there are a few that just cause
'weird'/undefined behavior, so it's hard to say.

-- 
Andrew Deason
adeason@sinenomine.net