[OpenAFS] Investigating 'calls waiting' from rxdebug
drosih@rpi.edu
Thu, 15 Aug 2013 22:33:13 -0400
Hi.

In the past week we have had two frustrating periods of significant
performance problems in our AFS cell. The first one lasted for maybe
two hours, at which point it seemed the culprit was something
odd-looking on two of our remote-access Linux servers. I rebooted those
servers, and the performance problems disappeared. That sounds good,
but I was so busy investigating various red herrings that the
performance problems might have stopped 15-20 minutes earlier, and I
just didn't notice until after I had done that reboot. This incident,
by itself, is not too worrisome.

On Wednesday the significant (but intermittent) performance problems
returned, and there was nothing particularly odd-looking on any
machines I could see. Based on some Google searches, we zeroed in on
the fact that one of our file servers was reporting rather high values
for 'calls waiting for a thread' in the output of
'rxdebug $fileserver -rxstats'. The other file servers almost always
reported zero calls waiting, but on this one file server the value
tended to range between 5 and 50. Occasionally it got over 100. And the
higher the value, the more likely we would see performance problems on
a wide variety of AFS clients.
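
For anyone who wants to log this value over time rather than eyeball
it, here is a minimal Python sketch of pulling the number out of the
rxdebug text. The function names, the sample text, and the exact line
shape matched by the regex ("<N> calls waiting for a thread") are
assumptions for illustration, so check them against what your rxdebug
actually prints.

```python
import re
import subprocess

# Assumed line shape in the rxdebug output: "<N> calls waiting for a thread".
WAITING_RE = re.compile(r"^\s*(\d+) calls waiting for a thread", re.MULTILINE)


def calls_waiting(rxdebug_text):
    """Return the calls-waiting count found in rxdebug text, or None."""
    match = WAITING_RE.search(rxdebug_text)
    return int(match.group(1)) if match else None


def run_rxdebug(fileserver):
    """Run the command quoted above and return its text output."""
    result = subprocess.run(
        ["rxdebug", fileserver, "-rxstats"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Made-up sample line for illustration, not captured from a real server.
SAMPLE = "37 calls waiting for a thread\n"

if __name__ == "__main__":
    print(calls_waiting(SAMPLE))
```

Calling calls_waiting(run_rxdebug(host)) from a loop and writing the
result out with a timestamp would give a record to line up against the
times when clients were hanging.
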

Googling some more showed that many people had reported that this value
was indeed a good indicator of performance problems. And looking in log
files on the file servers we saw a few (but not many) messages which
pointed us to problems in our network. Most of those looked like minor
problems; one or two were more significant and were magnified by some
heavy network traffic which happened to be going on at the time. We
fixed all of those, and actually shut down the process which was
(legitimately) doing a lot of network I/O. These were all good things
to do, and none of them made a bit of difference to the values we saw
for 'calls waiting' on that file server, or to the very frustrating
hangs we were seeing on AFS clients.

And then at 7:07am this morning, the problem disappeared. Completely.
The 'calls waiting' value on that server has not gone above zero for
the entire rest of the day. So, the immediate crisis is over.
Everything is working fine.

But my question is: If this returns, how can I track down what is
*causing* the calls-waiting value to climb? We had over 100
workstations using AFS at the time, scattered all around campus. I did
a variety of things to try and pinpoint the culprit, but didn't have
much luck.

So, given a streak of high values for 'calls waiting', how can I track
that down to a specific client (or clients), or maybe a specific AFS
volume?

--
Garance Alistair Drosehn
Senior Systems Programmer
RPI; Troy NY