[OpenAFS] Investigating 'calls waiting' from rxdebug
Anne Salemme
Anne Salemme <anne@salemme.net>
Fri, 16 Aug 2013 12:12:20 -0700 (PDT)
Nice script, Dan! I was going to suggest running tcpdump to see if one client is accounting for most of the traffic. Some misconfiguration or a hardware problem out at the client end can definitely cause a headache for a server. (I dimly recall finding some client system that appeared to have two different AFS clients installed and running, or trying to run, at the same time, causing a nasty load on a server.)
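For instance, something along these lines might be a starting point (untested; eth0 and the output path are just placeholders, and 7000/udp is the standard fileserver port):

  # grab a short sample of inbound fileserver traffic
  tcpdump -n -i eth0 -c 20000 -w /tmp/afs-sample.pcap udp dst port 7000

  # then count which client addresses sent the most packets
  tcpdump -n -r /tmp/afs-sample.pcap 2>/dev/null \
      | awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head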
Best regards,
Anne


________________________________
From: Dan Van Der Ster <daniel.vanderster@cern.ch>
To: "<drosih@rpi.edu>" <drosih@rpi.edu>
Cc: "<openafs-info@openafs.org>" <openafs-info@openafs.org>
Sent: Friday, August 16, 2013 4:15 AM
Subject: Re: [OpenAFS] Investigating 'calls waiting' from rxdebug

Hi,
Whenever we get waiting calls it is ~always caused by one or two users hammering a fileserver from batch jobs.

To find the culprit(s) you could try debugging the fileserver by sending the TSTP signal:
  http://rzdocs.uni-hohenheim.de/afs_3.6/debug/fs/fileserver.html
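(Roughly, the signal interface described on that page looks like the following. This is only a sketch from the documentation, and it assumes a Transarc-style layout; on FHS-style installs the log is usually /var/log/openafs/FileLog instead.)

  pid=$(pgrep -x fileserver)   # adjust if your fileserver binary is named differently
  kill -TSTP "$pid"            # each TSTP raises the debugging level; output goes to FileLog
  kill -HUP "$pid"             # HUP drops the debugging level back to 0
  less /usr/afs/logs/FileLog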
We have a script that enables debugging for 3 seconds then parses the output to make a nice summary. It has some dependencies on our local perl mgmt api but perhaps you can adapt it to work for you. I copied it here: http://pastebin.com/B6De4idS
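Our script isn't standalone, but the rough shape of it is something like the sketch below: raise the debug level briefly, reset it, then summarize the new FileLog lines. The per-host count at the end is only a guess at a useful summary; what actually shows up in the output depends on the debug level you reach.

  log=/usr/afs/logs/FileLog            # adjust for your layout
  pid=$(pgrep -x fileserver)

  before=$(wc -l < "$log")
  kill -TSTP "$pid"                    # raise the debug level
  sleep 3
  kill -HUP "$pid"                     # back to 0

  # crude summary: count dotted-quad addresses in the lines logged
  # during the 3-second window
  tail -n +"$((before + 1))" "$log" \
      | grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' \
      | sort | uniq -c | sort -rn | head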
Cheers, Dan

On Aug 16, 2013, at 4:33 AM, drosih@rpi.edu wrote:

> Hi.
>
> In the past week we have had two frustrating periods of significant
> performance problems in our AFS cell.  The first one lasted for maybe
> two hours, at which point it seemed the culprit was something
> odd-looking on two of our remote-access Linux servers.  I rebooted
> those servers, and the performance problems disappeared.  That sounds
> good, but I was so busy investigating various red herrings that the
> performance problems might have stopped 15-20 minutes earlier, and I
> just didn't notice until after I had done that reboot.  This incident,
> by itself, is not too worrisome.
>
> Wednesday the significant (but intermittent) performance problems
> returned, and there was nothing particularly odd-looking on any
> machines I could see.  Based on some Google searches, we zeroed in on
> the fact that one of our file servers was reporting rather high values
> for 'calls waiting for a thread' in the output of
> 'rxdebug $fileserver -rxstats'.  The other file servers almost always
> reported zero calls waiting, but on this one file server the value
> tended to range between 5 and 50.  Occasionally it got over 100.  And
> the higher the value, the more likely we would see performance
> problems on a wide variety of AFS clients.
>
> Googling some more showed that many people had reported that this
> value was indeed a good indicator of performance problems.  And
> looking in log files on the file servers we saw a few (but not many)
> messages which pointed us to problems in our network.  Most of those
> looked like minor problems; one or two were more significant and were
> magnified by some heavy network traffic which happened to be going on
> at the time.  We fixed all of those, and actually shut down the
> process which was (legitimately) doing a lot of network I/O.  These
> were all good things to do, and none of them made a bit of difference
> to the values we saw for 'calls waiting' on that file server, or to
> the very frustrating hangs we were seeing on AFS clients.
>
> And then at 7:07am this morning, the problem disappeared.  Completely.
> The 'calls waiting' value on that server has not gone above zero for
> the entire rest of the day.  So, the immediate crisis is over.
> Everything is working fine.
>
> But my question is:  If this returns, how can I track down what is
> *causing* the calls-waiting value to climb?  We had over 100
> workstations using AFS at the time, scattered all around campus.  I
> did a variety of things to try and pinpoint the culprit, but didn't
> have much luck.
>
> So, given a streak of high values for 'calls waiting', how can I track
> that down to a specific client (or clients), or maybe a specific AFS
> volume?
>
> --
> Garance Alistair Drosehn
> Senior Systems Programmer
> RPI; Troy NY
>
>
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info

_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info