[OpenAFS] Client connection failure: bos failed to contact host's bosserver (communication failure (-1))

Ximeng (Simon) Guan xmgu@royole.com
Wed, 9 Jan 2019 02:26:19 +0000


Thanks. Yes, the bos invocation did hang for a minute or two before reporting that failure.

We just figured out the reason for the failure. It is still MTU-related:

1. Between offices we use IPsec for VPN and that limits the path MTU to be 1400.
2. To accommodate the reduced MTU we did the following:
    2.1 Apply -rxmaxmtu 1400 in BosConfig
    2.2 Adjust the ifcfg-xxx config in the host machine of the failed database server to be 1400.

It turns out that it is 2.2 that caused the problem. The database machine is hosted as a KVM VM. When we adjusted the MTU of the ifcfg in the host to 1400 and the power outage caused the server to reboot, the server started to drop incoming 1500-byte UDP packets.

The server and office laptops are connected through an L2 switch that does not handle fragmentation. All remote traffic goes through an L3 router which does, and re-packs the packets to 1400. That's why all the local clients had problems accessing AFS but the remote servers and clients did not...
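
A minimal sketch of that reasoning, assuming plain IPv4 without options (20-byte header) and an 8-byte UDP header. The function names are hypothetical, and the model looks at packet size only:

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options (assumption)
UDP_HEADER = 8    # bytes, fixed UDP header size


def max_udp_payload(mtu: int) -> int:
    """Largest UDP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - UDP_HEADER


def delivered(packet_size: int, receiver_mtu: int, hop_fragments: bool) -> bool:
    """Size-only model: a packet larger than the receiver's MTU arrives
    only if some hop on the path splits it before it gets there."""
    return packet_size <= receiver_mtu or hop_fragments


# Local laptops: L2 switch only, so full-size 1500-byte packets reach
# the 1400-MTU interface untouched.
local_ok = delivered(1500, receiver_mtu=1400, hop_fragments=False)

# Remote offices: the L3 router re-packs to 1400 before the server sees it.
remote_ok = delivered(1500, receiver_mtu=1400, hop_fragments=True)

print(max_udp_payload(1500), max_udp_payload(1400))  # 1472 1372
print(local_ok, remote_ok)                           # False True
```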

Thank you!

Simon

-----Original Message-----
From: Benjamin Kaduk <kaduk@mit.edu>
Sent: Tuesday, January 8, 2019 6:13 PM
To: Ximeng (Simon) Guan <xmgu@royole.com>
Cc: OpenAFS-info@openafs.org
Subject: Re: [OpenAFS] Client connection failure: bos failed to contact host's bosserver (communication failure (-1))

On Mon, Jan 07, 2019 at 08:00:27PM +0000, Ximeng (Simon) Guan wrote:
> We do have NetInfo properly set up to include the only one IP that is used.

Good to know, thanks.

I couldn't rule out MTU issues offhand, but don't have time to dig in further right now.

Do the problematic bos invocations hang for a minute or two before reporting the "communications failure"?

The bosserver listens on port 7007, if you hadn't found that already -- a packet capture would help show what's going on, if you have the ability to get one of those.
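
Alongside a capture, one rough way to test the MTU theory from an affected client is to search for the largest ping that gets through with fragmentation prohibited. The sketch below is an illustration only, not something from this thread. It assumes Linux iputils ping (the -M do, -c, -W and -s options), assumes the server answers ICMP echo, and uses hypothetical helper names:

```python
import subprocess
from typing import Callable

ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header


def largest_passing(probe: Callable[[int], bool], lo: int, hi: int) -> int:
    """Binary search for the largest size in [lo, hi] where probe(size) is True.

    Assumes probe passes up to some size and fails above it.
    Returns lo - 1 if no size passes.
    """
    best = lo - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if probe(mid):
            best = mid
            lo = mid + 1
        else:
            hi = mid - 1
    return best


def ping_probe(host: str) -> Callable[[int], bool]:
    """Build a probe that sends one ping with the don't-fragment bit set."""
    def probe(payload: int) -> bool:
        result = subprocess.run(
            ["ping", "-M", "do", "-c", "1", "-W", "1", "-s", str(payload), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0  # 0 means a reply was received
    return probe


# Example (not run here, needs network access to the server):
#   payload = largest_passing(ping_probe("afssrv1"), 0, 1472)
#   print("usable packet size:", payload + ICMP_OVERHEAD)
```

A result well below 1500 on a path that should carry full-size packets would point at an MTU mismatch. A result of -1 only says that no ping came back, which can also mean ICMP is filtered.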

-Ben

> Can the connection failure somehow come from the non-default MTU settings we are using? That thing constantly bit us in the past in different places. We have "-rxmaxmtu 1344" used across the board for all ptserver, vlserver, davolserver and dafileserver instances. I was told by the network folks that they could not manage the default MTU of 1500 but have to use 1400 because of the IPsec requirement...
>
> Thank you!
> Simon
>
> -----Original Message-----
> From: openafs-info-admin@openafs.org <openafs-info-admin@openafs.org> On Behalf Of Benjamin Kaduk
> Sent: Monday, January 7, 2019 11:44 AM
> To: Ximeng (Simon) Guan <xmgu@royole.com>
> Cc: OpenAFS-info@openafs.org
> Subject: Re: [OpenAFS] Client connection failure: bos failed to contact host's bosserver (communication failure (-1))
>
> On Mon, Jan 07, 2019 at 07:40:36PM +0000, Ximeng (Simon) Guan wrote:
> > Hello,
> >
> > After a power outage on Christmas Eve which forced two database servers and all the network switches in one of our offices to re-boot, our laptop clients in that office can no longer connect to one of the AFS servers hosted in the same office.
> >
> > I am leaning towards the possibility that it is a network problem instead of an OpenAFS service problem because:
> >
> >   1.  Remote offices can access the full AFS space, including those volumes hosted on the re-booted servers.
> >   2.  Between the servers there is no access problem. Nothing wrong with the result of "bos status", "rxdebug" or "udebug". "fs checkservers" shows that all servers are running.
> >   3.  On the problematic laptops "fs checkservers" shows that "All servers are running".
> >   4.  On the problematic laptops "bos status afssrv1" returns a message:
> >
> > "bos: failed to contact host's bosserver (communications failure (-1))."
> >
> > But on the servers both in that office and in the remote offices, the same command shows that all services are up:
> >
> > "Instance ptserver, currently running normally.
> > Instance vlserver, currently running normally.
> > Instance buserver, currently running normally.
> > Instance upserver, currently running normally.
> > Instance backupusers, currently running normally.
> >     Auxiliary status is: run next at Tue Jan  8 04:00:00 2019.
> > Instance dafs, currently running normally.
> >     Auxiliary status is: file server running."
> >
> >   5.  On the problematic laptops "rxdebug afssrv1 -port 7000" returns *normal* output, for example:
> >
> > "Trying 10.12.8.33 (port 7000):
> > Free packets: 2073/6357, packet reclaims: 3, calls: 81, used FDs: 36
> > not waiting for packets.
> > 0 calls waiting for a thread
> > 125 threads are idle
> > 1 calls have waited for a thread
> > Connection from host 10.9.119.50, port 7001, Cuid ae06e5b3/70fe0104
> >   serial 12,  natMTU 1344, security index 0, client conn
> >     call 0: # 4, state dally, mode: receiving, flags: receive_done
> >     call 1: # 0, state not initialized
> >     call 2: # 0, state not initialized
> >     call 3: # 0, state not initialized
> > Connection from host 10.12.4.74, port 7001, Cuid ae06e5b3/70fe0114
> >   serial 21,  natMTU 1344, security index 0, client conn
> >     call 0: # 7, state dally, mode: receiving, flags: receive_done
> >     call 1: # 0, state not initialized
> >     call 2: # 0, state not initialized
> >     call 3: # 0, state not initialized
> > Done."
> >
> > I do not administer the network. Can I have some advice on how to further debug the connection problem? Which UDP port does the command "bos status" use?
>
> My instinct would be that there is some multihoming going on and that http://docs.openafs.org/Reference/5/NetRestrict.html and/or http://docs.openafs.org/Reference/5/NetInfo.html are not properly configured.
>
> -Ben
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info