[OpenAFS-Doc] Process signals

Russ Allbery rra@stanford.edu
Sat, 25 Aug 2007 09:07:02 -0700


Jason Edgecombe <jason@rampaginggeek.com> writes:

> Derrick gave a very useful tidbit on the -info list. You can use kill
> -XCPU on the fileserver to print out a list of connected clients.

> This isn't in the fileserver man page. Are there any objections to putting
> it in there? What section should it go under, "description" possibly?

A separate troubleshooting section sounds right to me as well.

> What other useful signals are lurking out there and where can I find out
> what they are and what they do? I think these need to be documented.

Here's an internal Stanford document that includes a bunch of
Stanford-specific tricks but has some additional details like that worth
extracting.

  Author: Russ Allbery <rra@stanford.edu>
 Subject: Debugging AFS file server load problems
Revision: $Id: debug-fileserver,v 1.6 2005/01/21 01:25:10 eagle Exp $

The basic metric of whether an AFS file server is doing well is its
blocked connection count.  We regularly monitor this in two ways: via
Nagios, which sends pages and mail if the count goes over a fairly low
number, and via the statistics page:

    <http://www.stanford.edu/services/afs/cellinfo/clients.html>

If the blocked connection count is ever above 0, the server is having
problems replying to clients in a timely fashion.  If it gets above 10,
roughly, there will be user-noticeable slowness.  (The total number of
connections is a mostly irrelevant number that increases essentially
monotonically for as long as the server has been running and then drops
back to zero when it's restarted.)

To determine the blocked connection count by hand, run:

    /usr/afsws/etc/rxdebug <server> | grep waiting_for

Each line returned is a blocked connection.
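
If all you want is the count (for a quick Nagios-style check, say), a
rough sketch along these lines works; treating any blocked connection as
a failure is an arbitrary choice here:

    #!/bin/sh
    # Report the blocked connection count for one file server and exit
    # non-zero if any connections are blocked (rough monitoring sketch).
    server="$1"
    blocked=$(/usr/afsws/etc/rxdebug "$server" | grep -c waiting_for)
    echo "$server: $blocked blocked connections"
    # Exit 0 only when nothing is blocked.
    [ "$blocked" -eq 0 ]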

The most common cause of blocked connections rising on a server is some
process somewhere performing an abnormal number of accesses to that server
and its volumes.  If multiple servers have a non-zero blocked connection
count, the most likely explanation is that a volume replicated between
those servers is absorbing an abnormally high access rate.

To get an access count on all the volumes on a server, run:

    vos listvol <server> -long

and save the output in a file.  The results will look like a bunch of vos
examine output for each volume on the server.  Look for lines like:

    40065 accesses in the past day (i.e., vnode references)

and look for volumes with an abnormally high number of accesses.  Anything
over 10,000 is fairly high, but some of our core infrastructure volumes
like users.a, pubsw, systems, group.homepage, and the like will have that
many hits routinely.  Anything over 100,000 is generally abnormally high.
The count resets about once a day.
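
Rather than scanning the file by eye, a rough one-liner like the following
pulls out the busy volumes.  It assumes the vos listvol -long output format
shown above (one header line per volume ending in On-line, followed a few
lines later by the accesses line), so adjust as needed:

    vos listvol <server> -long |
        awk '/On-line/ { vol = $1 }
             /accesses in the past day/ && $1 > 10000 { print $1, vol }' |
        sort -rn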

Another approach that can be used to narrow the possibilities for a
replicated volume, when multiple servers are having trouble, is to find
all replicated volumes for that server.  Run:

    lvldbs <server>

where <server> is one of the servers having problems; this refreshes the
VLDB cache in /afs/ir/service/afs/data for that server.  Then run:

    shortvldb <server> <partition>

to get a list of all volumes on that server and partition, including every
other server that they're replicated to.  So, for example, if volumes are
replicated on afssvr19 /vicepa, afssvr23, and afssvr22, a command like:

    lvldbs afssvr19
    shortvldb afssvr19 a | grep '22.' | grep '23.'

will show you all of the volumes replicated across those three servers.
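
(lvldbs and shortvldb are local wrapper scripts.  At sites without them,
the stock vos listvldb command should give similar information, listing
each volume on a given server and partition together with its replication
sites:

    vos listvldb -server <server> -partition <partition>

though the output is more verbose and not as easy to grep.)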

Once the volume causing the problem has been identified, the best way to
deal with it is to move that volume to another server with a low load.
Often the volume name alone is enough to tell what's going on: scan the
cluster for scripts run by that user (if it's a user volume) or for jobs
using that program (if it's a pubsw volume).
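
The move itself is done with vos move; the generic syntax is:

    vos move <volume> <oldserver> <oldpartition> <newserver> <newpartition>

so, for the example volume discussed below, something like (the target
server and partition here are purely hypothetical):

    vos move pubsw.matlab61 afssvr5 a afssvr11 d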

If you still need additional information about who's hitting that server,
sometimes you can guess at that information from the failed callbacks in
the FileLog log in /var/log/afs on the server, or from the output of:

    /usr/afsws/etc/rxdebug <server> -rxstats

but the best way is to turn on debugging output from the file server.
(Warning:  This generates a *lot* of output into FileLog on the AFS
server.)  To do this, log on to the AFS server, find the PID of the
fileserver process, and do:

    kill -TSTP <pid>

This will raise the debugging level so that you'll start seeing what
people are actually doing on the server.  You can do this up to three more
times to get even more output if needed.  To reset the debugging level
back to normal, use:

    kill -HUP <pid>

(No, this won't terminate the file server.)  Be sure to reset debugging
back to normal when you're done, or the AFS server may well fill its disks
with debugging output.
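
Putting that together, a typical session looks something like this (the
pgrep pattern is a guess at how the file server process shows up on your
system; check with ps if in doubt):

    # On the AFS file server, as root:
    pid=$(pgrep -x fileserver)
    kill -TSTP "$pid"              # raise the debugging level (repeat for more)
    tail -f /var/log/afs/FileLog   # watch the debugging output
    kill -HUP "$pid"               # reset debugging to normal when finished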

The lines of the debugging output that I've found the most useful for
debugging load problems are:

    SAFS_FetchStatus,  Fid = 2003828163.77154.82248, Host 171.64.15.76
    SRXAFS_FetchData, Fid = 2003828163.77154.82248

(partly truncated to highlight the interesting information).  The Fid
identifies the volume and the vnode within the volume; the volume ID is
the first long number.  So, for example, this was:

    afssvr5:~> vos examine 2003828163
    pubsw.matlab61                   2003828163 RW    1040060 K  On-line
        afssvr5.Stanford.EDU /vicepa 
        RWrite 2003828163 ROnly 2003828164 Backup 2003828165 
        MaxQuota    3000000 K 
        Creation    Mon Aug  6 16:40:55 2001
        Last Update Tue Jul 30 19:00:25 2002
        86181 accesses in the past day (i.e., vnode references)

        RWrite: 2003828163    ROnly: 2003828164    Backup: 2003828165
        number of sites -> 3
           server afssvr5.Stanford.EDU partition /vicepa RW Site 
           server afssvr11.Stanford.EDU partition /vicepd RO Site 
           server afssvr5.Stanford.EDU partition /vicepa RO Site 

and from the Host information one can tell what system is accessing that
volume.
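
With debugging raised, a rough way to summarize which volumes a given host
is hammering is to count Fids per volume in FileLog; this sketch assumes
the log format shown above:

    grep 'Host 171.64.15.76' /var/log/afs/FileLog |
        sed 's/.*Fid = \([0-9]*\)\..*/\1/' | sort | uniq -c | sort -rn | head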

Note that the output of vos examine also includes the access count, so
once the problem has been identified, vos examine can be used to see if
the access count is still increasing.  Also remember that you can run vos
examine on, e.g., pubsw.matlab61.readonly to see the access counts for the
read-only replicas on all of the servers where the volume is replicated.
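
For example, to watch whether the counts on the volume above are still
climbing:

    vos examine pubsw.matlab61
    vos examine pubsw.matlab61.readonly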

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>