[OpenAFS] Re: Investigating 'calls waiting' from rxdebug

Andrew Deason adeason@sinenomine.net
Fri, 16 Aug 2013 17:57:47 -0500

On Fri, 16 Aug 2013 13:25:59 -0700
Russ Allbery <rra@stanford.edu> wrote:

> The specific pathology that we've seen in the past is that a client
> holds a callback on some file or directory (usually a directory) that
> a bunch of other clients want to access.  Another client tries to do
> something that requires a callback break.  The client holding the
> callback can't be contacted for some reason.  Therefore, the threads
> trying to do something with that object all start blocking on the
> thread trying to break the callback (or callbacks).  If there is
> enough volume of activity on that particular object, you can end up in
> a condition where every available file server thread is waiting on the
> thread that is trying, without forward progress, to break the
> callback.

Note that in this scenario, you should see messages logged at some
point about failing to contact the relevant address. Then again, you
might see a lot of those messages almost constantly, as some people
do. (I'm also trying not to go into the various other ways this can
happen, since there are a zillion of them.)

> Various fixes have been put into the file server and the clients to
> try to reduce this problem over the years, so running the latest of
> everything should, in general, make it better and less likely to
> happen, although I think some folks are still seeing the problem.

Yeah, it'll probably always be possible in some form until there are
waiting-thread quotas per-vnode or per-volume. But the access patterns
required to trigger it have become more restricted.
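
To illustrate what I mean by a per-vnode waiting-thread quota, here is a
rough sketch in Python. This is not OpenAFS code and nothing like it
exists in the fileserver as far as this message is concerned; the names
(VnodeWaitQuota, max_waiters, try_enter, leave) are made up for the
example. The idea is only that a thread checks a per-object counter
before blocking, and fails fast if too many threads are already waiting
on that object, so one stuck callback break cannot absorb every thread.

```python
import threading
from collections import defaultdict


class VnodeWaitQuota:
    """Caps how many threads may wait on one vnode at a time (hypothetical)."""

    def __init__(self, max_waiters):
        self.max_waiters = max_waiters
        self._lock = threading.Lock()
        # vnode id -> number of threads currently waiting on it
        self._waiters = defaultdict(int)

    def try_enter(self, vnode):
        """Return True if the caller may wait on vnode, False if the quota is full."""
        with self._lock:
            if self._waiters[vnode] >= self.max_waiters:
                # Quota exhausted: the caller should give up instead of blocking.
                return False
            self._waiters[vnode] += 1
            return True

    def leave(self, vnode):
        """Release a slot previously taken by a successful try_enter."""
        with self._lock:
            self._waiters[vnode] -= 1
            if self._waiters[vnode] == 0:
                del self._waiters[vnode]


# Four requests hit the same directory; only two are allowed to wait.
quota = VnodeWaitQuota(max_waiters=2)
results = [quota.try_enter("dir-A") for _ in range(4)]
# A request for a different object is unaffected by the busy one.
other = quota.try_enter("dir-B")
```

With a cap like this, the threads that are refused stay free to serve
requests for other objects. What a refused request should actually
return to the client is a separate question that the sketch does not
address.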

On Fri, 16 Aug 2013 15:36:17 -0400
drosih@rpi.edu wrote:

>> I did also try doing some tcpdumps and summarizing that traffic, but
>> nothing remarkable showed up.  However earlier today I learned that
>> the way I did that might have generated misleading results (for
>> reasons I won't bore you with right now).  But based on those
>> tcpdumps I doubt we were getting hammered with AFS traffic,
>> especially not for such a long stretch of time in the middle of the
>> summer.

Did you see _anything_ AFS-related in the captured traffic? Even small
packets like our rx ACKs or ABORTs could give an indication as to what
is happening, or just information like which hosts the packets are going

Andrew Deason