[OpenAFS] [1.2.7] Strange file server meltdown

Russ Allbery rra@stanford.edu
Thu, 12 Dec 2002 16:14:43 -0800


Hello folks,

We're running OpenAFS 1.2.7 on Solaris 8, and are seeing an unusual
problem.  Two of our file servers are periodically going into an
apparently load-related meltdown around 3:30am to 4:00am at fairly
unpredictable intervals.  We're having about one instance of this a week.

Both of the machines have identical configurations (Sun Netra 20 with disk
on a SAN), and both contain the read-write and one replica site for many
of our replicated volumes (including root.cell and root.afs, but mostly
software volumes).  One also contains a lot of data volumes, while the
other contains user home directories.  We have four other systems with an
identical configuration that are not having this problem.

The symptoms are as follows:  the number of connections on the server
which rxdebug reports as waiting_for_process starts rising quickly,
hitting several hundred in the course of 15 minutes, and then begins
accelerating, quickly reaching thousands and thousands of connections in
that state.  The system load at the same time becomes quite high.
Unfortunately, we haven't yet managed to get onto the system while this is
happening, but we're working on that so that we can capture more debugging
information.
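
For anyone who wants to reproduce the check: the number we're watching
comes from pointing rxdebug at the file server port and counting the lines
it reports in that state, roughly like this (the hostname below is a
placeholder):

    rxdebug afssvr.example.edu 7000 | grep -c waiting_for_process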

There seem to be at least four separate bugs involved in this.  (I seem to
recall that OpenAFS has a bug-reporting system, but I'm not sure how it's
supposed to be used; let me know if I should break these up into four
separate reports and submit them somehow.)

(1) The problem as described above.  We're not sure what's causing it; my
    inclination is to believe that some client is suddenly throwing more
    load at the server than it can recover from.  I do have rxdebug output
    from the server while it's in trouble, and I'm not seeing a huge flood
    of connections from any single client or anything else obvious (a
    quick per-client tally recipe is sketched after this list).  I know
    that this will be hard to diagnose without more information, but I'd
    be very curious to know whether anyone else is seeing the same thing.
    When we've had load problems before, the onset has been a lot more
    gradual and variable, not a sudden transition from fine to thousands
    of blocked connections.

(2) From the rxdebug output, I can see that there are several clients that
    have multiple connections open to the server.  Over time, as the
    server continues to have problems, they all continue opening more
    connections (about 20 per system over the course of ten minutes).  All
    of the clients that have this property appear to be Windows machines;
    none of the Unix clients seem to be doing this.

    This sounds like a bug in the Windows client.  I remember previous
    list traffic mentioning something about Windows having nasty timeouts,
    and this seems to confirm that.  Are there any thoughts on ways to
    deal with this?  (Is this a tunable parameter somewhere, for example?)

(3) Once the server goes into this failure mode, it appears to be
    impossible to restart with bos restart.  The status of the service
    changes in bos status (it goes to temporarily disabled), but the file
    server never shuts down.  bos restart works if you catch the server
    early enough, but by the time that it has a thousand blocked
    connections, it no longer seems to be listening.

    This seems like a bug in the interface between bosserver and the
    fileserver, since bos restart is often used to restart a file server
    that's in trouble.  Is there some sort of force flag that I'm
    missing?

(4) Once the server goes into this failure mode, I would have expected
    clients accessing replicated volumes on that server to fail over to
    other replica sites, but they don't.  From the client perspective, the
    server connection stays in waiting_for_process essentially forever.
    Some client processes just wait indefinitely for it; others seem to
    time out, but that timeout apparently doesn't turn into a recognition
    that the file server is down, and the next time the same volume is
    accessed, the client goes back to waiting on that file server again.
    (There's a related note on prodding the client after this list.)

    When a file access on a replicated volume times out like that, I think
    the client should switch to another AFS server for that volume right
    away.
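
A couple of concrete recipes related to the items above.  For (1), one
quick way to tally connections per client out of the rxdebug output is
something like the following; the hostname is a placeholder, and the awk
field assumes rxdebug's per-connection lines begin with "Connection from
host <address>,":

    rxdebug afssvr.example.edu 7000 | grep 'Connection from host' | \
        awk '{print $4}' | sort | uniq -c | sort -rn | head

For (4), the only client-side knob I know of for forcing the Cache Manager
to re-check which file servers are up is fs checkservers (optionally with
-interval to shorten the probe interval); I haven't verified whether it
helps when the server is wedged like this rather than cleanly down:

    fs checkservers
    fs checkservers -interval 60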

I have rxdebug output from a couple of points during one of these
failures, if anyone wants to see it (it's quite long).  We've now written
a script that monitors for a waiting_for_process count of over 100 and
automatically does a bos restart, and I'll look at modifying that script
to try to capture additional debugging information.  Is there anything in
particular that people think would be useful?  Getting a core dump is
easy; bumping up the debugging level is harder, since if we wait too long
to restart the file server, we no longer can.
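
For reference, here's a minimal sketch of the sort of monitoring script
described above.  The hostname, port, and threshold are placeholders
rather than our actual values, and "fs" is just the conventional bnode
name for the file server instance:

    #!/bin/sh
    # Count the calls rxdebug reports in the waiting_for_process state
    # and restart the file server instance if the count crosses a
    # threshold.  Hostname, port, instance name, and threshold are
    # placeholders.
    server=afssvr.example.edu
    port=7000
    threshold=100

    count=`rxdebug $server $port | grep -c waiting_for_process`

    if [ "$count" -gt "$threshold" ]; then
        # Save the full rxdebug output for later analysis before the
        # restart loses the evidence.
        rxdebug $server $port > /var/tmp/rxdebug.`date +%Y%m%d%H%M` 2>&1
        bos restart $server fs -localauth
    fi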

Thanks for any help or suggestions anyone can provide.

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>