[OpenAFS] Investigating 'calls waiting' from rxdebug

Thu, 15 Aug 2013 22:33:13 -0400

Hi.

In the past week we have had two frustrating periods of significant
performance problems in our AFS cell.  The first one lasted for maybe
two hours, at which point it seemed the culprit was something
odd-looking on two of our remote-access Linux servers.  I rebooted
those servers, and the performance problems disappeared.  That sounds
good, but I was so busy investigating various red herrings that the
performance problems might have stopped 15-20 minutes earlier, and I
just didn't notice until after I had done that reboot.  This incident,
by itself, is not too worrisome.

On Wednesday the significant (but intermittent) performance problems
returned, and there was nothing particularly odd-looking on any
machines I could see.  Based on some Google searches, we zeroed in on
the fact that one of our file servers was reporting rather high values
for 'calls waiting for a thread' in the output of
'rxdebug $fileserver -rxstats'.  The other file servers almost always
reported zero calls waiting, but on this one file server the value
tended to range between 5 and 50.  Occasionally it got over 100.  And
the higher the value, the more likely we would see performance
problems on a wide variety of AFS clients.
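
For anyone who wants to watch the same value over time, here is a
minimal sketch of pulling the 'calls waiting for a thread' number out
of the rxdebug output.  It assumes the output contains a line of the
form "<N> calls waiting for a thread"; the function names are
hypothetical, and the rxdebug invocation is simply the one quoted
above.

```python
import re
import subprocess

# Assumption: rxdebug prints a line like "37 calls waiting for a thread".
_WAITING_RE = re.compile(r"^\s*(\d+)\s+calls waiting for a thread", re.MULTILINE)


def parse_calls_waiting(rxdebug_output):
    """Return the 'calls waiting for a thread' count, or None if not found."""
    match = _WAITING_RE.search(rxdebug_output)
    return int(match.group(1)) if match else None


def poll_calls_waiting(fileserver):
    """Run rxdebug once against one file server and return the parsed count."""
    result = subprocess.run(
        ["rxdebug", fileserver, "-rxstats"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_calls_waiting(result.stdout)
```

Calling poll_calls_waiting() in a loop with a timestamp would give a
simple record of when the value climbs, which could then be lined up
against other logs.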

Googling some more showed that many people had reported that this
value was indeed a good indicator of performance problems.  And
looking in log files on the file servers we saw a few (but not many)
messages which pointed us to problems in our network.  Most of those
looked like minor problems; one or two were more significant and were
magnified by some heavy network traffic which happened to be going on
at the time.  We fixed all of those, and actually shut down the
process which was (legitimately) doing a lot of network I/O.  These
were all good things to do, and none of them made a bit of difference
to the values we saw for 'calls waiting' on that file server, or to
the very frustrating hangs we were seeing on AFS clients.

And then at 7:07am this morning, the problem disappeared.  Completely.
The 'calls waiting' value on that server has not gone above zero for
the entire rest of the day.  So, the immediate crisis is over.
Everything is working fine.

But my question is:  If this returns, how can I track down what is
*causing* the calls-waiting value to climb?  We had over 100
workstations using AFS at the time, scattered all around campus.  I
did a variety of things to try and pinpoint the culprit, but didn't
have much luck.

So, given a streak of high values for 'calls waiting', how can I track
that down to a specific client (or clients), or maybe a specific AFS
volume?

-- 
Garance Alistair Drosehn
Senior Systems Programmer
RPI; Troy NY