[OpenAFS] Investigating 'calls waiting' from rxdebug

Dan Van Der Ster daniel.vanderster@cern.ch
Fri, 16 Aug 2013 08:15:54 +0000

Whenever we get waiting calls it is ~always caused by one or two users hammering a fileserver from batch jobs.
To find the culprit(s) you could try debugging the fileserver by sending the TSTP signal:
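For example (a rough sketch, not our exact procedure: `pidof` and the Transarc-style FileLog path are assumptions you'd adjust for your installation), a short debug window looks something like:

```shell
#!/bin/sh
# Briefly raise the fileserver's debug logging, then reset it.
# Each TSTP signal raises the LogLevel; a HUP resets it back to 0.
FILESERVER_PID=$(pidof fileserver || true)

if [ -n "$FILESERVER_PID" ]; then
    kill -TSTP "$FILESERVER_PID"       # turn debug logging on
    sleep 3                            # let a few seconds of calls get logged
    kill -HUP "$FILESERVER_PID"        # turn it back off
    tail -n 200 /usr/afs/logs/FileLog  # recent client activity lands here
else
    echo "no fileserver process found" >&2
fi
```

Keep the window short: at higher log levels a busy fileserver writes to FileLog very quickly.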

We have a script that enables debugging for 3 seconds then parses the output to make a nice summary. It has some dependencies on our local perl mgmt api but perhaps you can adapt it to work for you. I copied it here: http://p
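In case the link dies, a rough stand-in for the summary step (the `host a.b.c.d` pattern is an assumption about what a debug-level FileLog line contains; tweak the awk match for your fileserver version) could be:

```shell
#!/bin/sh
# top_afs_clients FILE -- count debug-log lines per client host and
# print the busiest hosts first.
top_afs_clients() {
    awk '
        # assumes lines mention the client as "host a.b.c.d"
        match($0, /host [0-9.]+/) {
            ip = substr($0, RSTART + 5, RLENGTH - 5)
            count[ip]++
        }
        END { for (ip in count) printf "%7d  %s\n", count[ip], ip }
    ' "$1" | sort -rn | head
}
```

Run it against FileLog right after a debug window, e.g. `top_afs_clients /usr/afs/logs/FileLog`; the top one or two hosts are usually the batch nodes doing the hammering.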

Cheers, Dan

On Aug 16, 2013, at 4:33 AM, drosih@rpi.edu wrote:

> Hi.
> In the past week we have had two frustrating periods of significant
> performance problems in our AFS cell.  The first one lasted for maybe
> two hours, at which point it seemed the culprit was something
> odd-looking on two of our remote-access Linux servers.  I rebooted
> those servers, and the performance problems disappeared.  That sounds
> good, but I was so busy investigating various red herrings that the
> performance problems might have stopped 15-20 minutes earlier, and I
> just didn't notice until after I had done that reboot.  This incident,
> by itself, is not too worrisome.
> Wednesday the significant (but intermittent) performance problems
> returned, and there was nothing particularly odd-looking on any
> machines I could see.  Based on some Google searches, we zeroed in on
> the fact that one of our file servers was reporting rather high values
> for 'calls waiting for a thread' in the output of
> 'rxdebug $fileserver -rxstats'.  The other file servers almost always
> reported zero calls waiting, but on this one file server the value
> tended to range between 5 and 50.  Occasionally it got over 100.  And
> the higher the value, the more likely we would see performance
> problems on a wide variety of AFS clients.
> Googling some more showed that many people had reported that this
> value was indeed a good indicator of performance problems.  And
> looking in log files on the file servers we saw a few (but not many)
> messages which pointed us to problems in our network.  Most of those
> looked like minor problems; one or two were more significant and were
> magnified by some heavy network traffic which happened to be going on
> at the time.  We fixed all of those, and actually shut down the
> process which was (legitimately) doing a lot of network I/O.  These
> were all good things to do, and none of them made a bit of difference
> to the values we saw for 'calls waiting' on that file server, or to
> the very frustrating hangs we were seeing on AFS clients.
> And then at 7:07am this morning, the problem disappeared.  Completely.
> The 'calls waiting' value on that server has not gone above zero for
> the entire rest of the day.  So, the immediate crisis is over.
> Everything is working fine.
> But my question is:  If this returns, how can I track down what is
> *causing* the calls-waiting value to climb?  We had over 100
> workstations using AFS at the time, scattered all around campus.  I
> did a variety of things to try and pinpoint the culprit, but didn't
> have much luck.
> So, given a streak of high values for 'calls waiting', how can I
> track that down to a specific client (or clients), or maybe a
> specific AFS volume?
> --
> Garance Alistair Drosehn
> Senior Systems Programmer
> RPI; Troy NY
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info