[OpenAFS] Client connection failure: bos failed to contact host's bosserver (communication failure (-1))

Ximeng (Simon) Guan xmgu@royole.com
Mon, 7 Jan 2019 19:40:36 +0000



Hello,

After a power outage on Christmas Eve, which forced two database servers
and all the network switches in one of our offices to reboot, our laptop
clients in that office can no longer connect to one of the AFS servers
hosted in the same office.

I am leaning towards the possibility that it is a network problem rather
than an OpenAFS service problem because:

  1.  Remote offices can access the full AFS space, including the volumes
      hosted on the rebooted servers.
  2.  Between the servers there is no access problem. Nothing is wrong with
      the output of "bos status", "rxdebug" or "udebug". "fs checkservers"
      shows that all servers are running.
  3.  On the problematic laptops, "fs checkservers" shows that "All servers
      are running".
  4.  On the problematic laptops, "bos status afssrv1" returns a message:

"bos: failed to contact host's bosserver (communications failure (-1))."

But on the servers, both in that office and in the remote offices, the same
command shows that all services are up:

"Instance ptserver, currently running normally.
Instance vlserver, currently running normally.
Instance buserver, currently running normally.
Instance upserver, currently running normally.
Instance backupusers, currently running normally.
    Auxiliary status is: run next at Tue Jan  8 04:00:00 2019.
Instance dafs, currently running normally.
    Auxiliary status is: file server running."

  5.  On the problematic laptops, "rxdebug afssrv1 -port 7000" returns
      *normal* output, for example:

"Trying 10.12.8.33 (port 7000):
Free packets: 2073/6357, packet reclaims: 3, calls: 81, used FDs: 36
not waiting for packets.
0 calls waiting for a thread
125 threads are idle
1 calls have waited for a thread
Connection from host 10.9.119.50, port 7001, Cuid ae06e5b3/70fe0104
  serial 12,  natMTU 1344, security index 0, client conn
    call 0: # 4, state dally, mode: receiving, flags: receive_done
    call 1: # 0, state not initialized
    call 2: # 0, state not initialized
    call 3: # 0, state not initialized
Connection from host 10.12.4.74, port 7001, Cuid ae06e5b3/70fe0114
  serial 21,  natMTU 1344, security index 0, client conn
    call 0: # 7, state dally, mode: receiving, flags: receive_done
    call 1: # 0, state not initialized
    call 2: # 0, state not initialized
    call 3: # 0, state not initialized
Done."

I do not administer the network. Can I have some advice on how to further
debug the connection problem? Which UDP port does the command "bos status"
use?
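
A note on the port question above: the rxdebug test was pointed at port 7000,
which is the fileserver, so its normal output only shows that this one port is
reachable from the laptops. In the conventional OpenAFS port assignments the
bosserver listens on UDP port 7007, which is what "bos status" contacts. The
Python sketch below only builds rxdebug command lines, one per service port, in
the same form used above; it does not run them. The port table is an assumption
based on the usual defaults (a site may have changed them), and the names
AFS_UDP_PORTS and rxdebug_commands are made up for this illustration.

```python
# Conventional OpenAFS service ports (UDP). These are the usual defaults and
# are an assumption here; check the local configuration before relying on them.
AFS_UDP_PORTS = {
    "fileserver": 7000,
    "cache manager callback": 7001,
    "ptserver": 7002,
    "vlserver": 7003,
    "volserver": 7005,
    "bosserver": 7007,
    "upserver": 7008,
    "buserver": 7021,
}


def rxdebug_commands(host, ports=AFS_UDP_PORTS):
    """Build one 'rxdebug <host> -port <port>' line per service (not executed)."""
    return [f"rxdebug {host} -port {port}" for port in ports.values()]


if __name__ == "__main__":
    # Print the commands next to the service each one probes.
    for name, cmd in zip(AFS_UDP_PORTS, rxdebug_commands("afssrv1")):
        print(f"{name:24s} {cmd}")
```

If the command for port 7007 times out from a laptop while the one for port
7000 answers, that would point at something on the network path filtering that
port rather than at the bosserver itself.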

Thank you!

Best regards,
========================================
Ximeng (Simon) Guan, Ph.D.
Associate Principal Engineer
Royole Corporation
48025 Fremont Blvd, Fremont, CA 94538
========================================

