[OpenAFS] Re: Investigating 'calls waiting' from rxdebug

Fri, 16 Aug 2013 22:08:53 -0400

Aside: we did have another one of these calls-waiting events at
   RPI today, on a different file server.  This one lasted about
   30 minutes, and it cleared up before we could pin down any
   specific culprit.

On Fri, 16 Aug 2013, Andrew Deason wrote:
>
> On Fri, 16 Aug 2013, Russ Allbery <rra@stanford.edu> wrote:
> 
> > The specific pathology that we've seen in the past is that a client
> > holds a callback on some file or directory (usually a directory)
> > that a bunch of other clients want to access.  Another client tries
> > to do something that requires a callback break.  The client holding
> > the callback can't be contacted for some reason.  Therefore, the
> > threads trying to do something with that object all start blocking
> > on the thread trying to break the callback (or callbacks). [...]
> 
> Note that in this scenario, you should likely see issues logged about
> failing to contact the relevant address at some point. Then again you
> might see a lot of those messages almost constantly, like some people
> do.

In the FileLog we do see messages such as:

   ProbeUuid for 128.113.71.61:7262 failed -01
   WhoAreYou failed for host ac423c08 (128.113.14.38:7001), error -01
   RCallBackConnectBack (host.c) failed for host 128.113.26.87:7001

Looking farther back in the logs we see these from time-to-time on
all the file servers.  I haven't added up specific numbers, but it
looks like we do see a few more of these during the calls-waiting
period, but not dramatically more.  So I didn't know how much to
make of those messages.

And with the debugging-level up during today's event we saw the much
more damning:

   RCallBackConnectBack failed for host ac5d9f08 (128.113.102.29:7001)
   Call back connect back failed (in break delayed) for Host
128.113.102.29:7001
   BreakDelayedCallbacks FAILED for host 128.113.102.29:7001 which IS UP.
      Connection from 128.113.102.29:7001.  Possible network or routing
failure.

The networking guy was trying to investigate everything about the
network and switches between 128.113.102.29 and the file server, but
then his machines locked up due to the AFS problems.  By the time he
was able to do stuff again, the calls-waiting problem had cleared up.

> On Fri, 16 Aug 2013, drosih@rpi.edu wrote:
> 
> >> I did also try doing some tcpdumps and summarizing that traffic, but
> >> nothing remarkable showed up.  However earlier today I learned that
> >> the way I did that might have generated misleading results (for
> >> reasons I won't bore you with right now).  But based on those
> >> tcpdumps I doubt we were getting hammered with AFS traffic,
> >> especially not for such a long stretch of time in the middle of the
> >> summer.
> 
> Did you see _anything_ AFS-related in captured traffic? Even small
> packets like our rx ACKs or ABORTs could give an indication as to
> what is happening, or just information like which hosts the packets
> are going to/from.

I had tcpdump capturing *only* the packets for ports 7001 to 7005, so
all the traffic I captured was AFS traffic.  But I didn't look too
closely at the packets.  I was doing quick traffic summaries, just so
I could see if some specific AFS client was hammering the server.

Russ Allbery also suggested
   "Also, if you haven't already, increasing the number of file
    server threads to way more than the default number (which is
    quite low) can obviously help by creating more of the resource
    that is exhausted by this problem."

FWIW we are running with '/usr/afs/bin/fileserver -L', but I do hope
to get a chance to restart the file servers and increase the number
of file-server threads during this weekend.

-- 
Garance Alistair Drosehn
Senior Systems Programmer
Rensselaer Polytechnic Institute;  Troy NY