[OpenAFS] Client running out of thread with high connection number

Ximeng Guan xmgu@royole.com
Wed, 5 Sep 2018 21:50:03 +0000


--_000_ca9aad9dc3a0423abb0295f41a92ffefroyolecom_
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Hello,

We have a heavily-used OpenAFS client (a design server) that has repeatedly=
 shown a message of "Connection timed out" during peak hours in the past we=
ek.

When probed with rxdebug at port 7001, we find a high number of calls that =
had waited for a thread. (90 calls have waited for a thread, noBuffers 11, =
2892 client connections)

According to the users, the loss of connection happens randomly and can las=
t up to 10min. When the loss of connection happens, the server seems to be =
unreachable from outside (ping or ssh). But existing sessions still seem to=
 respond to keyboard strokes from remote users.

The client is running CentOS 6.6, kernel 2.6.32-696.3.2.el6.x86_64 and Open=
AFS 1.6.20.

I am suspecting that the problem may not be related to the client, but to t=
he network service in general. But the returned message by rxdebug seems od=
d too.

Can someone help interpret this message for us?

Trying 172.16.101.82 (port 7001):
Free packets: 424/531, packet reclaims: 0, calls: 184990, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
1 threads are idle
90 calls have waited for a thread
rx stats: free packets 424, allocs 2067108999, alloc-failures(rcv 0/0,send =
0/0,ack 0)
   greedy 0, bogusReads 0 (last from host 0), noPackets 0, noBuffers 11, se=
lects 0, sendSelects 0
   packets read: data 341104352 ack 2316391759 busy 216 abort 736892 ackall=
 0 challenge 49571 response 0 debug 3432 params 0 unused 0 unused 0 unused =
0 version 0
   other read counters: data 313908875, ack 2281732446, dup 7217124 spuriou=
s 61698033 dally 161427
   packets sent: data 1985111690 ack 187255106 busy 0 abort 173 ackall 0 ch=
allenge 0 response 49567 debug 0 params 0 unused 0 unused 0 unused 0 versio=
n 0
   other send counters: ack 187255106, data 1580369647 (not resends), resen=
ds 404742043, pushed 0, acked&ignored 2852065309
            (these should be small) sendFailed 1470595, fatalErrors 153
   Average rtt is 0.000, with 1163176554 samples
   Minimum rtt is 0.000, maximum is 2.396
   3 server connections, 2892 client connections, 77 peer structs, 1127 cal=
l structs, 39 free call structs

We looked at the file servers at port 7000 and didn't see the same congesti=
on. However, we do see a log pattern that seems to be related to the high n=
umber of connections from that same client:

Wed Sep  5 13:02:45 2018 FindClient: stillborn client 00007FE9680116E0(1ced=
d31c); conn 00007FE77C011060 (host 172.16.101.82:7001) had client 00007FE96=
8013AA0(1cedd31c)
Wed Sep  5 13:12:53 2018 FindClient: stillborn client 00007FE97001E400(1ced=
d370); conn 00007FE97401BB70 (host 172.16.101.82:7001) had client 00007FE97=
001E0C0(1cedd370)
Wed Sep  5 13:18:56 2018 FindClient: stillborn client 00007FE98C02A7C0(1ced=
d394); conn 00007FE974032180 (host 172.16.101.82:7001) had client 00007FE96=
8013010(1cedd394)
Wed Sep  5 13:18:57 2018 FindClient: stillborn client 00007FE76401D840(1ced=
d39c); conn 0000000000C0A590 (host 172.16.101.82:7001) had client 00007FE98=
C02A7C0(1cedd39c)
Wed Sep  5 13:26:36 2018 FindClient: stillborn client 00007FE970020F10(1ced=
d3e4); conn 0000000000C07800 (host 172.16.101.82:7001) had client 00007FE96=
8012240(1cedd3e4)
Wed Sep  5 13:37:09 2018 FindClient: stillborn client 00007FE98C029FA0(1ced=
d4c0); conn 00007FE77C0121B0 (host 172.16.101.82:7001) had client 00007FE96=
8013690(1cedd4c0)


Thank you!

Best regards,
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
Ximeng (Simon) Guan, Ph.D.
Associate Principal Engineer
Royole Corporation
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D


--_000_ca9aad9dc3a0423abb0295f41a92ffefroyolecom_
Content-Type: text/html; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

<html xmlns:v=3D"urn:schemas-microsoft-com:vml" xmlns:o=3D"urn:schemas-micr=
osoft-com:office:office" xmlns:w=3D"urn:schemas-microsoft-com:office:word" =
xmlns:m=3D"http://schemas.microsoft.com/office/2004/12/omml" xmlns=3D"http:=
//www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dus-ascii"=
>
<meta name=3D"Generator" content=3D"Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
	{font-family:"Cambria Math";
	panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
	{font-family:DengXian;
	panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
	{font-family:Calibri;
	panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
	{font-family:"\@DengXian";
	panose-1:2 1 6 0 3 1 1 1 1 1;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
	{margin:0in;
	margin-bottom:.0001pt;
	font-size:11.0pt;
	font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
	{mso-style-priority:99;
	color:#0563C1;
	text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
	{mso-style-priority:99;
	color:#954F72;
	text-decoration:underline;}
span.EmailStyle17
	{mso-style-type:personal-compose;
	font-family:"Calibri",sans-serif;
	color:windowtext;}
.MsoChpDefault
	{mso-style-type:export-only;
	font-family:"Calibri",sans-serif;}
@page WordSection1
	{size:8.5in 11.0in;
	margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
	{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext=3D"edit" spidmax=3D"1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext=3D"edit">
<o:idmap v:ext=3D"edit" data=3D"1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang=3D"EN-US" link=3D"#0563C1" vlink=3D"#954F72">
<div class=3D"WordSection1">
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">Hello,<o:p></o:p></=
span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt"><o:p>&nbsp;</o:p></=
span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">We have a heavily-u=
sed OpenAFS client (a design server) that has repeatedly shown a message of=
 &#8220;Connection timed out&#8221; during peak hours in the past week.
<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt"><o:p>&nbsp;</o:p></=
span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">When probed with rx=
debug at port 7001, we find a high number of calls that had waited for a th=
read. (90 calls have waited for a thread, noBuffers 11, 2892 client connect=
ions)<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt"><o:p>&nbsp;</o:p></=
span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">According to the us=
ers, the loss of connection happens randomly and can last up to 10min. When=
 the loss of connection happens, the server seems to be unreachable from ou=
tside (ping or ssh). But existing sessions
 still seem to respond to keyboard strokes from remote users. <o:p></o:p></=
span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt"><o:p>&nbsp;</o:p></=
span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">The client is runni=
ng CentOS 6.6, kernel 2.6.32-696.3.2.el6.x86_64 and OpenAFS 1.6.20.
<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt"><o:p>&nbsp;</o:p></=
span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">I am suspecting tha=
t the problem may not be related to the client, but to the network service =
in general. But the returned message by rxdebug seems odd too.
<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt"><o:p>&nbsp;</o:p></=
span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">Can someone help in=
terpret this message for us?
<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt"><o:p>&nbsp;</o:p></=
span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">Trying 172.16.101.8=
2 (port 7001):<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">Free packets: 424/5=
31, packet reclaims: 0, calls: 184990, used FDs: 64<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">not waiting for pac=
kets.<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">0 calls waiting for=
 a thread<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">1 threads are idle<=
o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">90 calls have waite=
d for a thread<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">rx stats: free pack=
ets 424, allocs 2067108999, alloc-failures(rcv 0/0,send 0/0,ack 0)<o:p></o:=
p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">&nbsp;&nbsp; greedy=
 0, bogusReads 0 (last from host 0), noPackets 0, noBuffers 11, selects 0, =
sendSelects 0<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">&nbsp;&nbsp; packet=
s read: data 341104352 ack 2316391759 busy 216 abort 736892 ackall 0 challe=
nge 49571 response 0 debug 3432 params 0 unused 0 unused 0 unused 0 version=
 0
<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">&nbsp;&nbsp;&nbsp;o=
ther read counters: data 313908875, ack 2281732446, dup 7217124 spurious 61=
698033 dally 161427<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">&nbsp;&nbsp; packet=
s sent: data 1985111690 ack 187255106 busy 0 abort 173 ackall 0 challenge 0=
 response 49567 debug 0 params 0 unused 0 unused 0 unused 0 version 0
<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">&nbsp;&nbsp;&nbsp;o=
ther send counters: ack 187255106, data 1580369647 (not resends), resends 4=
04742043, pushed 0, acked&amp;ignored 2852065309<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">&nbsp;&nbsp; &nbsp;=
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (these should be small) sendFail=
ed 1470595, fatalErrors 153<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">&nbsp;&nbsp; Averag=
e rtt is 0.000, with 1163176554 samples<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">&nbsp;&nbsp; Minimu=
m rtt is 0.000, maximum is 2.396<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">&nbsp;&nbsp; 3 serv=
er connections, 2892 client connections, 77 peer structs, 1127 call structs=
, 39 free call structs<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt"><o:p>&nbsp;</o:p></=
span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">We looked at the fi=
le servers at port 7000 and didn&#8217;t see the same congestion. However, =
we do see a log pattern that seems to be related to the high number of conn=
ections from that same client:<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt"><o:p>&nbsp;</o:p></=
span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">Wed Sep&nbsp; 5 13:=
02:45 2018 FindClient: stillborn client 00007FE9680116E0(1cedd31c); conn 00=
007FE77C011060 (host 172.16.101.82:7001) had client 00007FE968013AA0(1cedd3=
1c)<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">Wed Sep&nbsp; 5 13:=
12:53 2018 FindClient: stillborn client 00007FE97001E400(1cedd370); conn 00=
007FE97401BB70 (host 172.16.101.82:7001) had client 00007FE97001E0C0(1cedd3=
70)<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">Wed Sep&nbsp; 5 13:=
18:56 2018 FindClient: stillborn client 00007FE98C02A7C0(1cedd394); conn 00=
007FE974032180 (host 172.16.101.82:7001) had client 00007FE968013010(1cedd3=
94)<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">Wed Sep&nbsp; 5 13:=
18:57 2018 FindClient: stillborn client 00007FE76401D840(1cedd39c); conn 00=
00000000C0A590 (host 172.16.101.82:7001) had client 00007FE98C02A7C0(1cedd3=
9c)<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">Wed Sep&nbsp; 5 13:=
26:36 2018 FindClient: stillborn client 00007FE970020F10(1cedd3e4); conn 00=
00000000C07800 (host 172.16.101.82:7001) had client 00007FE968012240(1cedd3=
e4)<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">Wed Sep&nbsp; 5 13:=
37:09 2018 FindClient: stillborn client 00007FE98C029FA0(1cedd4c0); conn 00=
007FE77C0121B0 (host 172.16.101.82:7001) had client 00007FE968013690(1cedd4=
c0)<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt"><o:p>&nbsp;</o:p></=
span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt"><o:p>&nbsp;</o:p></=
span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt">Thank you!<o:p></o:=
p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt"><o:p>&nbsp;</o:p></=
span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt;font-family:&quot;Ti=
mes New Roman&quot;,serif">Best regards,<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt;font-family:&quot;Ti=
mes New Roman&quot;,serif">=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D<o:=
p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt;font-family:&quot;Ti=
mes New Roman&quot;,serif">Ximeng (Simon) Guan, Ph.D.<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt;font-family:&quot;Ti=
mes New Roman&quot;,serif">Associate Principal Engineer<o:p></o:p></span></=
p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt;font-family:&quot;Ti=
mes New Roman&quot;,serif">Royole Corporation<o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"font-size:12.0pt;font-family:&quot;Ti=
mes New Roman&quot;,serif">=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D<o:=
p></o:p></span></p>
<p class=3D"MsoNormal"><o:p>&nbsp;</o:p></p>
</div>
</body>
</html>

--_000_ca9aad9dc3a0423abb0295f41a92ffefroyolecom_--