[OpenAFS-devel] Any tips for tracking down causes of hangs?

Nathan Neulinger nneul@umr.edu
Fri, 22 Mar 2002 06:42:03 -0600


Derrick J Brashear wrote:
> 
> On Thu, 21 Mar 2002, Nickolai Zeldovich wrote:
> 
> > On Solaris, this is really easy: you find the struct proc for the
> > hung process, and look at the stack trace starting at tlist->sp. :)
> > Probably kgdb lets you do similar things on Linux.  You might try
> > using cmdebug remotely, if the in-kernel Rx server is still working;
> > this sounds like a deadlock of some sort, which should show up in
> > cmdebug output in some way.  You could also enable lock tracing, in
> > which case your fstrace output should tell you where the deadlock
> > occurred (assuming you can usefully run fstrace).
> 
> I never got lock tracing (via fstrace) to work on Linux, just fyi. I ended
> up using printks to do it instead.

Interesting... I got the following with a cmdebug after one of these
hangs:

troot-srvtst02(12)> cmdebug localhost
** Cache entry @ 0xd0e43e40 for 1.536908342.1.1
    locks: (none_waiting, write_locked(pid:7228 at:54))
    0 bytes     DV 0 refcnt 1
    callback cbb54b60   expires 0
    0 opens     0 writers
    volume root
    states (0x0)
troot-srvtst02(13)> vos e 536908342

troot-srvtst02(14)> ps -auxwww | grep 7228
gpweb     7228  0.0  0.6  5764 1592 ?        S    03:59   0:00 ./httpd -d /afs/umr.edu/software/gpweb/apache-root-websrv1
root     10707  0.0  0.2  1680  612 pts/0    S    06:29   0:00 grep 7228

And get this... I can't vos e that volume from anywhere, not even from
other clients.
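
For anyone who wants to repeat the cmdebug-then-ps steps above without
doing it by eye, here is a minimal sketch in Python. It only understands
the two line shapes visible in the output pasted above (the "** Cache
entry" header and a "write_locked(pid:N at:M)" lock holder); other lock
states that cmdebug may print are not handled. The names SAMPLE and
find_write_locks are made up for illustration, and treating the second
field of the FID as the volume ID is an assumption based on the vos e
command used above.

```python
import re

# Sample input copied from the cmdebug output in this message.
SAMPLE = """\
** Cache entry @ 0xd0e43e40 for 1.536908342.1.1
    locks: (none_waiting, write_locked(pid:7228 at:54))
    0 bytes     DV 0 refcnt 1
    callback cbb54b60   expires 0
    0 opens     0 writers
    volume root
    states (0x0)
"""

# Header line of a cache entry: address, then a four-part FID.
ENTRY_RE = re.compile(
    r"\*\* Cache entry @ (0x[0-9a-fA-F]+) for (\d+)\.(\d+)\.(\d+)\.(\d+)"
)
# Write lock holder, in the only form seen in the sample above.
LOCK_RE = re.compile(r"write_locked\(pid:(\d+) at:(\d+)\)")


def find_write_locks(cmdebug_output):
    """Return one dict per cache entry that reports a write lock holder."""
    results = []
    current = None
    for line in cmdebug_output.splitlines():
        entry = ENTRY_RE.search(line)
        if entry:
            # Start of a new cache entry; remember it for the lock line.
            current = {
                "address": entry.group(1),
                "fid": ".".join(entry.group(2, 3, 4, 5)),
                # Assumption: second FID field is the volume ID.
                "volume": int(entry.group(3)),
            }
            continue
        lock = LOCK_RE.search(line)
        if lock and current is not None:
            results.append(
                dict(current, pid=int(lock.group(1)), at=int(lock.group(2)))
            )
    return results


if __name__ == "__main__":
    for held in find_write_locks(SAMPLE):
        print("volume %(volume)d write-locked by pid %(pid)d (at:%(at)d)" % held)
```

The pid it reports is what you would then feed to ps, and the volume ID
is what you would hand to vos e.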

I've got a feeling this might not have anything to do with that client,
and may instead be something wrong with the server that volume is on,
because problems keep cropping up with specific volumes on that server.

-- Nathan

------------------------------------------------------------
Nathan Neulinger                       EMail:  nneul@umr.edu
University of Missouri - Rolla         Phone: (573) 341-4841
Computing Services                       Fax: (573) 341-4216