[OpenAFS-devel] fileserver loop

Russ Allbery rra@stanford.edu
Wed, 21 Aug 2002 12:05:38 -0700


Derrick J Brashear <shadow@dementia.org> writes:
> On Wed, 21 Aug 2002, Russ Allbery wrote:

>> Could you let me know some more about the symptoms of this problem?
>> We've been seeing some significant difficulties periodically in our
>> cell since we upgraded to OpenAFS 1.2.6 file servers, and I'm wondering
>> if this is the same thing.

> Well, what symptoms are you having, Russ? 

Sorry, was on the run to a meeting and didn't have time to go into
details.

In the following, "blocked connections" refer to connections that show up
with the command:

    rxdebug <server> -allconn -rxstats | grep waiting_for_process

This is the standard metric that we use to check the health of file
servers.  It's pretty reliable.  If there are any blocked connections, the
server is having trouble.  If the blocked connection count is over 10,
users are noticing; if it's over 50, the server is going to be nearly
unusable and people are experiencing a significant outage.

Last night we started seeing blocked connections above 50 across four of
our AFS servers; one that holds read-write volumes with one replica and
three pure replica servesr.  One other replica server and one other
read-write with replicas server were not affected, so my hypothesis is
that it's somehow related to one of the handful of volumes we have
replicated across exactly those four servers (root.afs and root.cell being
two volumes with that property).

This is a repeat of a problem that we had last Friday, but with a
different set of three servers; last Friday, we thought we'd isolated it
to the volume containing Matlab, which was getting unusually high (but not
horribly excessive) numbers of accesses.  When we moved that volume off
onto unloaded servers, everything recovered.

Last night, I couldn't find any volume that was getting an unusual number
of accesses.  I got on the file servers and turned on one level of
debugging with kill -TSTP, and watched FileLog, and was seeing a pattern
of a large number of accesses scrolling by, then the logs freezing for
anywhere between 30 seconds and two minutes, and then another flood of
accesses scrolling by, like some sort of periodic freeze was happening.
The servers were still serving files during the periods where they weren't
frozen, and were trying to recover (at several points the blocked
connection count went down to zero on one or more of the servers, only to
climb back up again).

I thought I'd figured it out, since during those pauses we saw log entries
like:

Tue Aug 20 22:15:56 2002 CB: WhoAreYou failed for 171.64.127.138:7001, error -3
Tue Aug 20 22:17:59 2002 CB: Call back connect back failed (in break delayed) for ab407f8a.7001
Tue Aug 20 22:17:59 2002 BreakDelayedCallbacks FAILED for host ab407f8a which IS UP.  Possible network or routing failure.
Tue Aug 20 22:17:59 2002 MultiProbe failed to find new address for host ab407f8a.7001

but now I think that those are just getting logged periodically and the
only reason why it looked like they were related is that that was all that
was logged during those periods of frozen access.

I didn't see anything unusual in the debugging output, just lots of the
normal sort of:

Tue Aug 20 22:15:41 2002 SAFS_FetchStatus,  Fid = 2003588573.1.1, Host 171.64.48.63, Id 32766
Tue Aug 20 22:15:41 2002 SAFS_FetchStatus,  Fid = 536870916.130.38677, Host 171.64.15.106, Id 55055
Tue Aug 20 22:15:41 2002 SAFS_FetchStatus,  Fid = 536870913.1.1, Host 171.64.96.11, Id 32766

log entries.  If this happens again, I'll crank the debugging up a lot
higher and also try to gcore the server while it's frozen.

We finally ran bos restart on each of the affected servers, and that
seemed to fix the problem at least temporarily (although we started
getting blocked connections on one of them again after another two hours
and did another bos restart).  Everything seems to be healthy again this
morning.

This problem only started happening after we upgraded to OpenAFS 1.2.6, we
think; these systems were mostly previously running Transarc AFS 3.6p2.
However, we did have one incident vaguely similar to this some time back
that we thought we'd tracked down to a user who was trying to use AFS via
a PC AFS client and had one of those horrible "personal firewall" packages
installed that was blocking all of the AFS callbacks (so it kept sending
requests, the server couldn't contact it, it sent another request, etc.).

The system load was essentially zero during the whole outage, if that
means anything (which it often doesn't).  We're running the file server
with the following flags:

Instance fs, (type is fs) currently running normally.
    Auxiliary status is: file server running.
    Process last started at Wed Aug 21 01:08:55 2002 (2 proc starts)
    Command 1 is '/usr/afs/bin/fileserver -nojumbo -L -p 16 -rxpck 400'
    Command 2 is '/usr/afs/bin/volserver -nojumbo'
    Command 3 is '/usr/afs/bin/salvager'

(-nojumbo is old cargo-cult configuration from back when all these systems
were on a FDDI ring that I just haven't removed because it doesn't seem to
be hurting anything.)

We're starting to get a bit worried about this since we've had two
incidents in four days now (and during one of these incidents, many of our
unreplicated volumes are essentially inaccessbile).

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>