[OpenAFS] Debugging Linux AFS client when client hangs
Simon Wilkinson
sxw@inf.ed.ac.uk
Thu, 29 Oct 2009 22:49:09 +0000
On 29 Oct 2009, at 22:34, John Perkins wrote:
> Any suggestions from the gurus out there on useful debugging
> information to narrow down the cause of this crash would be helpful.
> I do have one crash dump of a system in this state, although I
> wasn't able to glean much out of it so far.
Here's the rough way I'd approach this:
If you have processes which are hung, then getting the alt-sysrq-t
stack trace output for the system often provides a good indication of
what's going on. If you don't have console access (or this key
sequence isn't enabled), you can do the same by running, as root:

    echo t > /proc/sysrq-trigger

This will dump a short stack trace for every process running on the
system to your console log. It's a really good way of seeing if things
are blocked on kernel locks, or deadlocked.
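
As a rough sketch of the step above (the function names are my own
invention, not part of any tool): the keyboard sequence is governed by
the bitmask in /proc/sys/kernel/sysrq, where 1 allows every function
and the 8 bit allows debugging dumps of processes, whereas root can
write to /proc/sysrq-trigger regardless of that setting.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# sysrq_t_enabled MASK
# Succeeds if the keyboard alt-sysrq-t sequence is permitted for the
# given value of /proc/sys/kernel/sysrq.
sysrq_t_enabled() {
    mask="$1"
    # 1 enables all SysRq functions; otherwise bit 8 permits process dumps.
    [ "$mask" -eq 1 ] || [ $(( mask & 8 )) -ne 0 ]
}

# dump_task_stacks [TRIGGER]
# Asks the kernel to log a stack trace for every task. TRIGGER defaults
# to the real trigger file and is only a parameter so that the function
# can be tried against a scratch file.
dump_task_stacks() {
    trigger="${1:-/proc/sysrq-trigger}"
    echo t > "$trigger"
}
```

On a live machine, sysrq_t_enabled "$(cat /proc/sys/kernel/sysrq)"
tells you whether the keyboard route will work, and running
dump_task_stacks as root followed by dmesg > tasks.txt (an arbitrary
file name) is one way to save the result. On a busy system the dump
may overflow the kernel ring buffer, so the console log is still the
more reliable record.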
The second thing to do is to check whether AFS is getting stuck on its
locks. Running cmdebug localhost will show you all of the AFS locks
which are currently held. If this doesn't return, then it's likely
that you've deadlocked whilst holding the AFS global lock (GLOCK), and
the alt-sysrq-t output is the only information you'll have to go on.
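
Since a cmdebug that never returns is itself the symptom, it can help
to run it under a time limit. The following is only a sketch: it
assumes the timeout utility from GNU coreutils is installed (it exits
with status 124 when the limit expires), and check_afs_locks is a name
I've made up for the example.

```shell
#!/bin/sh
# Hypothetical helper for illustration only.

# check_afs_locks [SECONDS] [COMMAND...]
# Runs COMMAND (default: cmdebug localhost) under a time limit and
# reports when no answer arrives in time.
check_afs_locks() {
    limit="${1:-30}"
    [ "$#" -gt 0 ] && shift
    # With no command given, fall back to the one from the text above.
    [ "$#" -eq 0 ] && set -- cmdebug localhost
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "no reply within ${limit}s: possibly deadlocked holding GLOCK" >&2
    fi
    return "$status"
}
```

Calling check_afs_locks 30 with no further arguments runs cmdebug
localhost for at most 30 seconds. The optional command argument is
there only so that the wrapper can be tried out on a machine without
AFS, for example with check_afs_locks 1 sleep 5.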
It would also be good to know what kind of operations are causing the
problem. If you have a test script that does it, what operations is
that script performing? Is there anything particular about the
directories you are getting stuck on? All of that information would
help narrow down which area of the code the problem lies in.
The final thing is that Marc Dionne and I have spent a lot of time
recently improving the 1.5 Linux cache manager - in particular,
there's been a lot of work done on reducing the potential for
deadlocks. Assuming you're already using 1.4.11, it might be
interesting to try the latest 1.5 release on a test machine to see
whether it solves your problems. I'm not suggesting moving
it into production, but if 1.5 works, then it will again narrow down
where we should be looking.
Hope that helps,
Simon.