[OpenAFS-devel] idle dead timeout processing in clients

Russ Allbery rra@stanford.edu
Wed, 07 Dec 2011 17:57:51 -0800


We're currently seeing a cluster of serious issues with our OpenAFS file
servers that we believe may be related to this issue, the vnode locking
problem, and a few other related problems.  Here is some additional
information to help judge whether this is related.

We started seeing problems when moving from Debian lenny running 1.4.11
file servers to Debian squeeze running 1.4.14 file servers.  This happened
at roughly the same time as we reduced the number of file servers by half
(from about 13 to about 6), which of course concentrates any locking
problems.

The problems are primarily affecting the www.stanford.edu servers, which
are currently running OpenAFS 1.4.12.1 built 2011-05-31.

The symptoms (not all of which may be related) are:

1. The OpenAFS file servers are generally running at a much higher system
   load (as reported by uptime) than they were previously, although the
   higher load average is not consistent.

2. Running vos listvol -long -extended against a server causes the load
   average to shoot up to over 15 for as long as vos listvol is running.
   It's not clear whether this is correlated with the client problems or
   with the other symptoms below.  Sometimes it seems to be, and sometimes
   the problems happen at different times.

3. The AFS file servers report periodic surges in client connections
   waiting for a thread.  Previously, this was extremely rare and
   indicated a file server meltdown that was probably unrecoverable.  Now,
   we're occasionally seeing spikes to 20, 50, even 80 clients waiting for
   a thread that persist for more than 30 seconds but then recover on
   their own.  The count has also frequently been going over 100, at which
   point monitoring that we put in place during previous file server
   problems does a forced restart of the file server, which of course
   takes quite a long time.  (A rough sketch of that sort of check appears
   after this list.)

4. The www.stanford.edu servers periodically block on AFS access and have
   their load shoot up to over 200.  Normally they recover on their own
   after a few minutes.

5. When a file server has been forcibly restarted, sometimes the AFS
   clients on the www.stanford.edu servers will never recover.  They go
   into an endless cycle of kernel errors and have to be forcibly rebooted
   in order to recover.  (Unfortunately, I don't have one of those kernel
   errors handy, since it doesn't seem to be logged to syslog.)

6. We're seeing increasing numbers of kernel errors from other servers,
   particularly our filedrawers servers, reporting that processes
   attempting to access AFS are blocked (unable to make forward progress
   for more than X seconds).

7. When looking at an rxdebug -allconn snapshot of the file server during
   one of these periods of large numbers of blocked connections, the only
   hosts that have more than four connections to the file server are the
   www.stanford.edu hosts, which frequently have up to 75.  (The sketch
   after this list shows the sort of per-host tally we're looking at.)
   Note that our web infrastructure generates very large numbers of
   separate PAGs, since we use complete AFS and Kerberos isolation via
   suexec for most user CGI processes and therefore spawn a new PAG and
   AFS token for each incoming client request.

8. We are getting large numbers of the following error reported by our
   file servers:

Wed Dec  7 17:14:45 2011 CallPreamble: Couldn't get CPS. Too many lockers

   By large, I mean that one server has seen 156 of those errors so far
   today.
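
For anyone who wants to poke at the same numbers, here's a minimal sketch
of the sort of check mentioned in items 3 and 7: it polls the file server
with rxdebug, reports the number of calls waiting for a thread, and
tallies connections per client host.  This is an illustration rather than
our production monitoring; the server name is a placeholder, and the
rxdebug output strings it matches ("calls waiting for a thread",
"Connection from host ...") are assumptions that may vary between OpenAFS
versions.

#!/usr/bin/env python
"""Summarize fileserver thread and connection load via rxdebug (sketch)."""

import re
import subprocess
import sys
from collections import Counter

SERVER = sys.argv[1] if len(sys.argv) > 1 else "afs-server.example.edu"
PORT = "7000"  # fileserver service port

def rxdebug(*args):
    # universal_newlines=True gives text output on both Python 2.7 and 3.
    return subprocess.check_output(["rxdebug", SERVER, PORT] + list(args),
                                   universal_newlines=True)

# The -noconns summary normally contains a line like
# "0 calls waiting for a thread"; pull the count out of it.
summary = rxdebug("-noconns")
match = re.search(r"(\d+) calls? waiting for a thread", summary)
waiting = int(match.group(1)) if match else 0
print("calls waiting for a thread: %d" % waiting)

# -allconnections prints one "Connection from host <addr>, port <port>, ..."
# line per connection; tally those lines per client host to spot hosts
# holding dozens of connections.
connections = rxdebug("-allconnections")
hosts = Counter(re.findall(r"Connection from host ([^,]+),", connections))
for host, count in hosts.most_common(10):
    print("%5d  %s" % (count, host))

Wiring a threshold on the waiting-calls count (say, over 100 for more
than a minute) to a bos restart of the fs instance gives roughly the
forced-restart behavior described in item 3.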

It's probably also worth noting that we continue to have the issue with
AFS file servers, which we've had for years, that restarting a file server
completely destroys AFS clients while the file server is attaching
volumes.  Between the point where the file server starts attaching volumes
and the point where it finishes, any client that attempts to access those
volumes ends up swamped with processes stuck in disk wait and usually
becomes essentially inaccessible.  We therefore block all access to the
file server using iptables when restarting it and keep access blocked
until all volumes are attached, so that we can at least access data that's
stored on other servers (roughly the procedure sketched below).
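
For completeness, here's a rough sketch of that restart procedure.  It is
not our actual script: it assumes it runs as root on the file server
itself, that clients reach the fileserver on UDP port 7000, and that
FileLog (path varies by install; /var/log/openafs/FileLog on Debian)
records a "File Server started" line once volume attachment finishes.
Verify that log message against your fileserver version, and consider
exempting your database servers from the block rather than dropping all
UDP traffic to that port.

#!/usr/bin/env python
"""Restart a fileserver behind an iptables block until volumes reattach."""

import subprocess
import time

PORT = "7000"                          # clients reach the fileserver on UDP 7000
FILELOG = "/var/log/openafs/FileLog"   # /usr/afs/logs/FileLog on Transarc paths
BLOCK = ["-p", "udp", "--dport", PORT, "-j", "DROP"]

def run(cmd):
    subprocess.check_call(cmd)

# 1. Drop incoming client traffic so clients fail over to other servers
#    instead of piling up in disk wait against half-attached volumes.
run(["iptables", "-I", "INPUT"] + BLOCK)

try:
    # 2. Restart the fs instance; the old FileLog is rotated to FileLog.old,
    #    so give the new fileserver a moment to start writing a fresh log.
    run(["bos", "restart", "localhost", "fs", "-localauth"])
    time.sleep(30)

    # 3. Wait for the message the fileserver logs once attachment finishes
    #    (assumed wording; check what your version writes to FileLog).
    while True:
        with open(FILELOG) as log:
            if "File Server started" in log.read():
                break
        time.sleep(10)
finally:
    # 4. Re-open the server to clients, even if something above failed.
    run(["iptables", "-D", "INPUT"] + BLOCK)

A shell script would obviously do the same job; the point is just the
ordering: block client access, restart the fs instance, wait until all
volumes are attached, then unblock.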

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>