[OpenAFS] Re: Investigating 'calls waiting' from rxdebug

Fri, 16 Aug 2013 13:25:59 -0700

drosih@rpi.edu writes:

> Dan's message looks very useful, and also makes me feel good because it
> implies that I was making some good guesses as I tried to pin down this
> problem.  I did try to turn up logging at one point, and here are all
> the log entries which came up in FileLog:

>     Thu Aug 15 02:34:58 2013 Set Debug On level = 1
>     Thu Aug 15 02:35:08 2013 [0] Set Debug On level = 5
>     Thu Aug 15 02:35:18 2013 [0] Reset Debug levels to 0

> That's it.  3 entries.

The specific pathology that we've seen in the past is that a client holds
a callback on some file or directory (usually a directory) that a bunch of
other clients want to access.  Another client tries to do something that
requires a callback break.  The client holding the callback can't be
contacted for some reason.  Therefore, the threads trying to do something
with that object all start blocking on the thread trying to break the
callback (or callbacks).  If there is enough volume of activity on that
particular object, you can end up in a condition where every available
file server thread is waiting on the thread that is trying, without
forward progress, to break the callback.
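
As a rough illustration of that failure mode, here is a toy model in Python.
It is not file server code: the pool size, the function names, and the use of
an event to stand in for the stuck callback break are all assumptions made
for the sketch. Once every worker is waiting on the one contended object,
even a request for an unrelated object cannot get a thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

POOL_SIZE = 4  # hypothetical thread count, chosen small for the demo


def simulate(pool_size=POOL_SIZE, probe_timeout=0.2):
    """Return (starved, recovered) for a pool stuck behind one callback break."""
    # Stands in for the callback break: it stays unset while the client
    # holding the callback cannot be contacted.
    break_done = threading.Event()

    def hot_request():
        # Any request touching the contended object waits for the break.
        break_done.wait()
        return "hot"

    def other_request():
        # A request for an unrelated object; it needs no callback break.
        return "other"

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Enough requests on the hot object to occupy every worker.
        hot = [pool.submit(hot_request) for _ in range(pool_size)]
        other = pool.submit(other_request)
        try:
            other.result(timeout=probe_timeout)
            starved = False
        except FutureTimeout:
            # No worker was free to run the unrelated request.
            starved = True
        # The break finally times out, releasing the blocked workers.
        break_done.set()
        recovered = other.result(timeout=5) == "other"
        for f in hot:
            f.result(timeout=5)
    return starved, recovered


if __name__ == "__main__":
    print(simulate())
```

In this model a larger pool_size only raises the number of hot requests needed
to reach the same state; it does not remove the blocking itself.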

If you have large numbers of writers to the same directory (such as in
some distributed computing scenarios), it's more likely that you'll
trigger this situation.

Depending on the client, you can then get into a pathological situation
where the clients all start retrying, which consumes even more file server
threads, until you have the poor server so tangled in knots that, even
when it finally times out the callback, it can't really recover, because
all the threads are tied up dealing with this one object.

Various fixes have been put into the file server and the clients to try to
reduce this problem over the years, so running the latest of everything
should, in general, make it better and less likely to happen, although I
think some folks are still seeing the problem.  Also, if you haven't
already, raising the number of file server threads well above the
default (which is quite low) helps by providing more of the resource
that this problem exhausts.

Complete lack of useful debug logging is consistent with our prior
experience with this problem.  There's nothing in the debug logs because
the file server isn't doing anything; all file server threads are blocked
waiting for this callback break to finish.
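
Since the logs stay quiet, the rxdebug counters are the more useful signal.
A minimal sketch of checking them from a script follows. It assumes the
rxdebug summary contains lines of the form "N calls waiting for a thread" and
"N threads are idle"; check that against the output of your own version. The
SAMPLE text is made up for illustration, not captured from a real server, and
the function names are hypothetical.

```python
import re

# Assumed line formats in the rxdebug summary; verify against your version.
WAITING_RE = re.compile(r"^(\d+) calls waiting for a thread", re.MULTILINE)
IDLE_RE = re.compile(r"^(\d+) threads are idle", re.MULTILINE)


def parse_rxdebug(text):
    """Extract the waiting-calls and idle-threads counters, or None if absent."""
    waiting = WAITING_RE.search(text)
    idle = IDLE_RE.search(text)
    return {
        "calls_waiting": int(waiting.group(1)) if waiting else None,
        "idle_threads": int(idle.group(1)) if idle else None,
    }


def looks_starved(stats):
    """True when calls are queued and no thread is idle to take them."""
    return bool(stats["calls_waiting"]) and stats["idle_threads"] == 0


# Made-up sample in the assumed format, for illustration only.
SAMPLE = """\
37 calls waiting for a thread
0 threads are idle
"""

if __name__ == "__main__":
    stats = parse_rxdebug(SAMPLE)
    print(stats, looks_starved(stats))
```

Feeding this the captured output of rxdebug at regular intervals would show
whether the waiting count keeps climbing while idle threads stay at zero.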

> I'll also say that at one point I thought the problem might have been
> that we had too many AFS volumes on one of the partitions on the
> "calls-waiting" server, so I started doing 'vos move's to move AFS
> volumes to a different server.  None of those vos moves ran into any
> lags at all, even while the calls-waiting counter was very high.

If the above is correct, there was probably one specific volume causing
the contention; the other volumes, as long as their callbacks could be
broken, would move easily.

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>