[OpenAFS] Debugging Linux AFS client when client hangs

Simon Wilkinson sxw@inf.ed.ac.uk
Thu, 29 Oct 2009 22:49:09 +0000


On 29 Oct 2009, at 22:34, John Perkins wrote:
> Any suggestions from the gurus out there on useful debugging
> information to narrow down the cause of this crash would be helpful.
> I do have one crash dump of a system in this state, although I wasn't
> able to glean much out of it so far.

Here's the rough way I'd approach this:

If you have processes which are hung, the alt-sysrq-t stack trace
output for the system often gives a good indication of what's going
on. If you don't have console access (or the key sequence isn't
enabled), you can get the same output by running, as root:
echo t > /proc/sysrq-trigger
This dumps a short stack trace for every process on the system to the
kernel log. It's a really good way of seeing whether things are
blocked on kernel locks, or deadlocked.
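
A rough sketch of how that capture might be scripted, assuming a root
shell and that dmesg is available. The function name, the
SYSRQ_TRIGGER override and the default output path are made up for
illustration:

```shell
# Sketch: trigger the SysRq 't' task dump and save the kernel log.
# dump_task_stacks, SYSRQ_TRIGGER and the default path are hypothetical.
dump_task_stacks() {
    trigger="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
    out="${1:-/tmp/sysrq-t.log}"

    # Writing 't' asks the kernel to log a stack trace for every task.
    echo t > "$trigger" || return 1

    # Give the kernel a moment to finish logging, then save the ring buffer.
    sleep 1
    dmesg > "$out"
}
```

Running "dump_task_stacks /var/tmp/hang.log" as root should leave the
traces in that file. If the kernel ring buffer is small, the earliest
tasks may already have scrolled out by the time dmesg reads it.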

The second thing to do is to check whether AFS is getting stuck on its
locks. Running "cmdebug localhost" will show you all of the AFS locks
which are currently held. If this doesn't return, then it's likely
that you've deadlocked whilst holding the AFS global lock (GLOCK), and
the alt-sysrq-t output is the only information you're going to have to
go on.
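
Since a cmdebug that never returns is itself the symptom, it can help
to run it under a time limit. A minimal sketch, assuming the coreutils
timeout command is installed; the function name, the CMDEBUG override
and the 30 second default are arbitrary choices for illustration:

```shell
# Sketch: run "cmdebug localhost" with a time limit.
# check_afs_locks and CMDEBUG are hypothetical names.
check_afs_locks() {
    secs="${1:-30}"

    # timeout(1) exits with status 124 if the command ran out of time.
    timeout "$secs" "${CMDEBUG:-cmdebug}" localhost
    status=$?

    if [ "$status" -eq 124 ]; then
        echo "cmdebug did not return within ${secs}s (possible GLOCK deadlock)" >&2
    fi
    return "$status"
}
```

If "check_afs_locks 30" prints a lock list, keep it alongside the stack
traces. If it times out, fall back to the stack traces alone.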

It would also be good to know what kind of operations are causing the
problem. If you have a test script that triggers it, what operations
is that script performing? Is there anything particular about the
directories you are getting stuck on? All of that information would
help track down which area of the code we should be looking at.

The final thing is that Marc Dionne and I have spent a lot of time  
recently improving the 1.5 Linux cache manager - in particular,  
there's been a lot of work done on reducing the potential for  
deadlocks. Assuming you're already using 1.4.11, it might be
interesting to try the latest 1.5 release on a test machine and see
whether it solves your problems. I'm not suggesting moving it into
production, but if 1.5 works, then that will again narrow down where
we should be looking.

Hope that helps,

Simon.