[OpenAFS] Investigating 'calls waiting' from rxdebug
Anne Salemme
Anne Salemme <anne@salemme.net>
Fri, 16 Aug 2013 12:12:20 -0700 (PDT)
Nice script, Dan! I was going to suggest running tcpdump to see if one client is accounting for most of the traffic. Some misconfiguration or a hardware problem out at the client end can definitely cause a headache for a server. (I dimly recall finding some client system that appeared to have two different AFS clients installed and running, or trying to run, at the same time, causing a nasty load on a server.)
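For instance, something along these lines might be a starting point (untested; eth0 and the output path are just placeholders, and 7000/udp is the standard fileserver port):

  # grab a short sample of inbound fileserver traffic
  tcpdump -n -i eth0 -c 20000 -w /tmp/afs-sample.pcap udp dst port 7000

  # then count which client addresses sent the most packets
  tcpdump -n -r /tmp/afs-sample.pcap 2>/dev/null \
      | awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head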
Best regards,
Anne


________________________________
From: Dan Van Der Ster <daniel.vanderster@cern.ch>
To: "<drosih@rpi.edu>" <drosih@rpi.edu>
Cc: "<openafs-info@openafs.org>" <openafs-info@openafs.org>
Sent: Friday, August 16, 2013 4:15 AM
Subject: Re: [OpenAFS] Investigating 'calls waiting' from rxdebug

Hi,
Whenever we get waiting calls it is ~always caused by one or two users hammering a fileserver from batch jobs.

To find the culprit(s) you could try debugging the fileserver by sending the TSTP signal:
  http://rzdocs.uni-hohenheim.de/afs_3.6/debug/fs/fileserver.html
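(Roughly, the signal interface described on that page looks like the following. This is only a sketch from the documentation, and it assumes a Transarc-style layout; on FHS-style installs the log is usually /var/log/openafs/FileLog instead.)

  pid=$(pgrep -x fileserver)   # adjust if your fileserver binary is named differently
  kill -TSTP "$pid"            # each TSTP raises the debugging level; output goes to FileLog
  kill -HUP "$pid"             # HUP drops the debugging level back to 0
  less /usr/afs/logs/FileLog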
We have a script that enables debugging for 3 seconds then parses the output to make a nice summary. It has some dependencies on our local perl mgmt api but perhaps you can adapt it to work for you. I copied it here: http://pastebin.com/B6De4idS
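Our script isn't standalone, but the rough shape of it is something like the sketch below: raise the debug level briefly, reset it, then summarize the new FileLog lines. The per-host count at the end is only a guess at a useful summary; what actually shows up in the output depends on the debug level you reach.

  log=/usr/afs/logs/FileLog            # adjust for your layout
  pid=$(pgrep -x fileserver)

  before=$(wc -l < "$log")
  kill -TSTP "$pid"                    # raise the debug level
  sleep 3
  kill -HUP "$pid"                     # back to 0

  # crude summary: count dotted-quad addresses in the lines logged
  # during the 3-second window
  tail -n +"$((before + 1))" "$log" \
      | grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' \
      | sort | uniq -c | sort -rn | head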
Cheers, Dan

On Aug 16, 2013, at 4:33 AM, drosih@rpi.edu wrote:

> Hi.
>
> In the past week we have had two frustrating periods of significant
> performance problems in our AFS cell.  The first one lasted for maybe
> two hours, at which point it seemed the culprit was something
> odd-looking on two of our remote-access Linux servers.  I rebooted
> those servers, and the performance problems disappeared.  That sounds
> good, but I was so busy investigating various red herrings that the
> performance problems might have stopped 15-20 minutes earlier, and I
> just didn't notice until after I had done that reboot.  This incident,
> by itself, is not too worrisome.
>
> Wednesday the significant (but intermittent) performance problems
> returned, and there was nothing particularly odd-looking on any
> machines I could see.  Based on some Google searches, we zeroed in on
> the fact that one of our file servers was reporting rather high values
> for 'calls waiting for a thread' in the output of
> 'rxdebug $fileserver -rxstats'.  The other file servers almost always
> reported zero calls waiting, but on this one file server the value
> tended to range between 5 and 50.  Occasionally it got over 100.  And
> the higher the value, the more likely we would see performance
> problems on a wide variety of AFS clients.
>
> Googling some more showed that many people had reported that this
> value was indeed a good indicator of performance problems.  And
> looking in log files on the file servers we saw a few (but not many)
> messages which pointed us to problems in our network.  Most of those
> looked like minor problems; one or two were more significant and were
> magnified by some heavy network traffic which happened to be going on
> at the time.  We fixed all of those, and actually shut down the
> process which was (legitimately) doing a lot of network I/O.  These
> were all good things to do, and none of them made a bit of difference
> to the values we saw for 'calls waiting' on that file server, or to
> the very frustrating hangs we were seeing on AFS clients.
>
> And then at 7:07am this morning, the problem disappeared.  Completely.
> The 'calls waiting' value on that server has not gone above zero for
> the entire rest of the day.  So, the immediate crisis is over.
> Everything is working fine.
>
> But my question is:  If this returns, how can I track down what is
> *causing* the calls-waiting value to climb?  We had over 100
> workstations using AFS at the time, scattered all around campus.  I
> did a variety of things to try and pinpoint the culprit, but didn't
> have much luck.
>
> So, given a streak of high values for 'calls waiting', how can I track
> that down to a specific client (or clients), or maybe a specific AFS
> volume?
>
> --
> Garance Alistair Drosehn
> Senior Systems Programmer
> RPI; Troy NY
>
>
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info

_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info