[OpenAFS] Debugging Linux AFS client when client hangs

Rainer Toebbicke rtb@pclella.cern.ch
Fri, 30 Oct 2009 09:54:22 +0100


John Perkins wrote:
> We're dealing with an interesting situation at our site recently: after
> rolling out RHEL 5 update 4 to our department's Linux computers, we're
> finding certain applications seem to cause AFS to no longer respond when
> fetching contents of specific directories in AFS.  Access to local
> filesystems in this state appears to work just fine.
> 
> Simon was kind enough to provide useful instructions at
> http://blob.inf.ed.ac.uk/sxw/2009/01/24/using-fstrace-to-debug-the-afs-cache-manager/
> back in January... unfortunately, the fstrace process gets stuck in device
> wait and will not return any useful information.
> If I could only get some useful debugging information, I would gladly
> submit it to RT...
> 

For "hard" AFS lock-ups, if you have a crash dump and the necessary
kernel debug packages, displaying afs_global_owner with crash will tell you
who was holding the global lock at the moment the dump was taken, and a
stack trace of the fstrace process may hint at what it is waiting for.

For problems on "certain directories" or even "certain files", 'cmdebug
localhost' is usually the best bet: it'll describe which volume the cache
manager is busy with, which process is involved, and where in the code this
occurs, which often hints at the cause.
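
Since diagnostic tools can themselves get stuck on a wedged client (as
fstrace did above), it may be worth running them under a timeout. Below is a
minimal sketch of that idea, not a tested recipe: the helper name
run_diagnostic and the 30-second default are my own assumptions, and the only
AFS command it is meant to wrap here is the 'cmdebug localhost' mentioned
above.

```python
import subprocess


def run_diagnostic(cmd, timeout_s=30):
    """Run a diagnostic command without waiting on it forever.

    Returns (status, output), where status is one of
    "ok", "failed", "timeout" or "missing".

    run_diagnostic is a hypothetical helper for illustration; it is not
    part of OpenAFS. Caveat: a child stuck in uninterruptible sleep
    ("device wait") cannot be killed, so the timeout will not help there.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # keep error messages with the output
            universal_newlines=True,
            timeout=timeout_s,
        )
    except FileNotFoundError:
        # The command is not installed or not on PATH.
        return "missing", ""
    except subprocess.TimeoutExpired as exc:
        # Keep whatever was printed before the timeout hit.
        partial = exc.output or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return "timeout", partial
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout
```

On an affected client one would call run_diagnostic(["cmdebug", "localhost"])
and attach the returned output to the bug report, noting whether the status
was "ok" or "timeout".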

BTW: we're running RHEL 5.4 and derivatives with 1.4.8 on hundreds, if not
already thousands, of machines without any notable problems.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985       Fax: +41 22 767 7155