[OpenAFS-devel] Re: idle dead timeout processing in clients

Andrew Deason adeason@sinenomine.net
Thu, 8 Dec 2011 11:30:01 -0600


On Wed, 07 Dec 2011 17:57:51 -0800
Russ Allbery <rra@stanford.edu> wrote:

> 5. When a file server has been forcibly restarted, sometimes the AFS
>    clients on the www.stanford.edu servers will never recover.  They
>    go into an endless cycle of kernel errors and have to be forcibly
>    rebooted in order to recover.  (Unfortunately, I don't have one of
>    those kernel errors handy, since it doesn't seem to be logged to
>    syslog.)

Even if you don't have the exact messages, do you recall roughly what
they were? "blocked for more than X seconds", or something else?

> 8. We are getting large numbers of the following error reported by our
>    file servers:
> 
> Wed Dec  7 17:14:45 2011 CallPreamble: Couldn't get CPS. Too many lockers
> 
>    By large, I mean that one server has seen 156 of those errors so far
>    today.

Yeah, I've been wondering lately if this is just from the larger number
of new connections you see on a particular server, due to the
consolidation and the large number of PAGs/settokens you have. Every
other time I've seen this, it came from either a bug or client
connection issues, but if you have enough completely new connections
coming in, it seems possible for it to happen during normal connection
negotiation.

It would be easy to make the host lock quota configurable (to adjust the
number, or turn it off entirely), and you could see if that makes
anything better.
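
As a rough illustration of what "configurable" could mean here, the
following is a small Python model, not fileserver code. The class name,
the `limit` field and its default value are all made up for the example;
the only idea taken from the above is a quota that can be adjusted or
turned off entirely.

```python
from dataclasses import dataclass


@dataclass
class HostLockQuota:
    """Hypothetical model of a limit on threads waiting for one host's lock."""

    # Illustrative default only; a value of 0 (or less) disables the quota.
    limit: int = 3

    def over_quota(self, waiting: int) -> bool:
        """Return True if one more waiter should be refused."""
        if self.limit <= 0:
            # Quota turned off entirely: never refuse a waiter.
            return False
        return waiting > self.limit


# Example: the same load is refused, allowed with a higher limit,
# and allowed with the quota disabled.
strict = HostLockQuota(limit=3)
relaxed = HostLockQuota(limit=10)
disabled = HostLockQuota(limit=0)
```

With a knob like this, you could raise the limit or disable it on one
server and see whether the "Too many lockers" messages go away.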

> It's probably also worth noting that we continue to have the issue
> with AFS file servers, which we've had for years, that restarting a
> file server completely destroys AFS clients during the time period
> while the file server is attaching volumes.  Between the point where
> the file server starts attaching volumes and finishes attaching
> volumes, any client that attempts to access those volumes ends up
> being swamped in processes in disk wait and usually essentially
> becomes inaccessible.  We therefore block all access to the file
> server using iptables when restarting it and keep access blocked until
> all volumes are attached so that we can at least access data that's
> stored on other servers.

The current behavior is deliberate, and so is easy to change. The client
currently waits for a VRESTARTING error to clear up; it's a simple
matter of adding a client option to instead make it error out
immediately, if that's what you want. That makes server restarts very
visible to processes, though. (We had discussed server-side
solutions/workarounds to this before, but I don't really think that's
the right way.)
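
To make the two behaviors concrete, here is a minimal Python sketch of
the choice such a client option would control. It is a model, not client
code: `ServerRestarting` stands in for the server answering VRESTARTING,
and `call_volume_op`, `fail_fast` and `retry_delay` are hypothetical
names invented for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServerRestarting(Exception):
    """Stand-in for a file server answering VRESTARTING."""


def call_volume_op(
    op: Callable[[], T],
    fail_fast: bool = False,
    retry_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op(), either waiting out a restart or failing immediately.

    fail_fast=False models the current behavior: keep retrying until the
    restarting condition clears. fail_fast=True models the proposed
    option: hand the error straight back to the calling process.
    """
    while True:
        try:
            return op()
        except ServerRestarting:
            if fail_fast:
                raise
            # Wait before asking the server again.
            sleep(retry_delay)
```

With `fail_fast=False` the caller blocks for as long as the server keeps
reporting that it is restarting; with `fail_fast=True` the caller sees
the error on the first attempt, which is why restarts become visible to
processes.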

-- 
Andrew Deason
adeason@sinenomine.net