[OpenAFS] Client connection failure: bos failed to contact host's bosserver (communication failure (-1))

Benjamin Kaduk kaduk@mit.edu
Tue, 8 Jan 2019 20:41:26 -0600


Glad to hear you got things figured out!

-Ben

On Wed, Jan 09, 2019 at 02:26:19AM +0000, Ximeng (Simon) Guan wrote:
> Thanks. Yes, the bos invocation did hang for a minute or two before reporting that failure.
> 
> We just figured out the reason for the failure. It is still MTU-related:
> 
> 1. Between offices we use IPsec for the VPN, and that limits the path MTU to 1400.
> 2. To accommodate the reduced MTU we did the following:
>     2.1 Apply -rxmaxmtu 1400 in BosConfig
>     2.2 Set the MTU to 1400 in the ifcfg-xxx config on the host machine of the failed database server.
> 
> It turns out that 2.2 caused the problem. The database machine is hosted as a KVM VM. We had set the MTU in the host's ifcfg to 1400, and once the power outage forced the server to reboot, the server started to drop incoming 1500-byte UDP packets.
> 
> The server and office laptops are connected through an L2 switch that does not handle fragmentation. All remote traffic goes through an L3 router, which does and repacks the packets to 1400. That's why all the local clients had problems accessing AFS but the remote servers and clients did not...
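
The difference between the two paths can be sketched in Python. This is a simplified model of the explanation above, not a capture from the affected network. It assumes that "1500 UDP packets" means 1500-byte IPv4 packets with a 20-byte header, that the Don't Fragment bit is not set, and that the receiver simply discards anything larger than its interface MTU. The names fragment_sizes and deliver are made up for illustration.

```python
# Hypothetical model of the local (L2 switch) and remote (L3 router) paths.
IPV4_HEADER = 20  # bytes, minimum IPv4 header without options (assumption)


def fragment_sizes(packet_size, mtu):
    """Return the sizes of the IPv4 packets produced when one packet
    of packet_size bytes crosses a link with the given MTU."""
    if packet_size <= mtu:
        return [packet_size]
    payload = packet_size - IPV4_HEADER
    # The payload of every non-final fragment must be a multiple of 8 bytes.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    sizes = []
    while payload > 0:
        chunk = min(per_fragment, payload)
        sizes.append(chunk + IPV4_HEADER)
        payload -= chunk
    return sizes


def deliver(packet_size, receiver_mtu, router_mtu=None):
    """Return the packet sizes that the receiver accepts.

    router_mtu=None models an L2 switch, which forwards frames unchanged.
    Otherwise an L3 router fragments the packet down to router_mtu first.
    """
    if router_mtu is None:
        on_wire = [packet_size]
    else:
        on_wire = fragment_sizes(packet_size, router_mtu)
    # Assumption: the receiver drops anything above its interface MTU.
    return [size for size in on_wire if size <= receiver_mtu]


if __name__ == "__main__":
    print("local, via L2 switch:", deliver(1500, receiver_mtu=1400))
    print("remote, via L3 router:", deliver(1500, receiver_mtu=1400, router_mtu=1400))
```

In this model the local path delivers nothing for a 1500-byte packet, while the routed path delivers two fragments that both fit within 1400 bytes, which matches the symptom that only the local clients were affected.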
> 
> Thank you!
> 
> Simon
> 
> -----Original Message-----
> From: Benjamin Kaduk <kaduk@mit.edu> 
> Sent: Tuesday, January 8, 2019 6:13 PM
> To: Ximeng (Simon) Guan <xmgu@royole.com>
> Cc: OpenAFS-info@openafs.org
> Subject: Re: [OpenAFS] Client connection failure: bos failed to contact host's bosserver (communication failure (-1))
> 
> On Mon, Jan 07, 2019 at 08:00:27PM +0000, Ximeng (Simon) Guan wrote:
> > We do have NetInfo properly set up to include the only IP address that is used.
> 
> Good to know, thanks.
> 
> I couldn't rule out MTU issues offhand, but don't have time to dig in further right now.  
> 
> Do the problematic bos invocations hang for a minute or two before reporting the "communications failure"?
> 
> The bosserver listens on port 7007, if you hadn't found that already -- a packet capture would help show what's going on, if you have the ability to get one of those.
> 
> -Ben
> 
> > Can the connection failure somehow come from the non-default MTU settings we are using? That has bitten us repeatedly in the past in different places. We use "-rxmaxmtu 1344" across the board for all ptserver, vlserver, davolserver and dafileserver instances. I was told by the network folks that they cannot use the default MTU of 1500 and have to use 1400 because of the IPsec requirement...
> > 
> > Thank you!
> > Simon
> > 
> > -----Original Message-----
> > From: openafs-info-admin@openafs.org <openafs-info-admin@openafs.org> 
> > On Behalf Of Benjamin Kaduk
> > Sent: Monday, January 7, 2019 11:44 AM
> > To: Ximeng (Simon) Guan <xmgu@royole.com>
> > Cc: OpenAFS-info@openafs.org
> > Subject: Re: [OpenAFS] Client connection failure: bos failed to 
> > contact host's bosserver (communication failure (-1))
> > 
> > On Mon, Jan 07, 2019 at 07:40:36PM +0000, Ximeng (Simon) Guan wrote:
> > > Hello,
> > > 
> > > > After a power outage on Christmas Eve, which forced two database servers and all the network switches in one of our offices to reboot, our laptop clients in that office can no longer connect to one of the AFS servers hosted in the same office.
> > > 
> > > I am leaning towards the possibility that it is a network problem instead of an OpenAFS service problem because:
> > > 
> > >   1.  Remote offices can access the full AFS space, including those volumes hosted on the re-booted servers.
> > > >   2.  Between the servers there is no access problem. Nothing is wrong with the results of "bos status", "rxdebug" or "udebug". "fs checkservers" shows that all servers are running.
> > > >   3.  On the problematic laptops "fs checkservers" shows that "All servers are running".
> > >   4.  On the problematic laptops "bos status afssrv1" returns a message:
> > > 
> > > "bos: failed to contact host's bosserver (communications failure (-1))."
> > > 
> > > But on the servers both in that office and in the remote offices, the same command shows that all services are up:
> > > 
> > > "Instance ptserver, currently running normally.
> > > 
> > > Instance vlserver, currently running normally.
> > > 
> > > Instance buserver, currently running normally.
> > > 
> > > Instance upserver, currently running normally.
> > > 
> > > Instance backupusers, currently running normally.
> > > 
> > >     Auxiliary status is: run next at Tue Jan  8 04:00:00 2019.
> > > 
> > > Instance dafs, currently running normally.
> > > 
> > > Auxiliary status is: file server running."
> > > 
> > > >   5.  On the problematic laptops "rxdebug afssrv1 -port 7000" returns *normal* output, for example:
> > > 
> > > "Trying 10.12.8.33 (port 7000):
> > > 
> > > Free packets: 2073/6357, packet reclaims: 3, calls: 81, used FDs: 36
> > > 
> > > not waiting for packets.
> > > 
> > > 0 calls waiting for a thread
> > > 
> > > 125 threads are idle
> > > 
> > > 1 calls have waited for a thread
> > > 
> > > Connection from host 10.9.119.50, port 7001, Cuid ae06e5b3/70fe0104
> > > 
> > >   serial 12,  natMTU 1344, security index 0, client conn
> > > 
> > >     call 0: # 4, state dally, mode: receiving, flags: receive_done
> > > 
> > >     call 1: # 0, state not initialized
> > > 
> > >     call 2: # 0, state not initialized
> > > 
> > >     call 3: # 0, state not initialized
> > > 
> > > Connection from host 10.12.4.74, port 7001, Cuid ae06e5b3/70fe0114
> > > 
> > >   serial 21,  natMTU 1344, security index 0, client conn
> > > 
> > >     call 0: # 7, state dally, mode: receiving, flags: receive_done
> > > 
> > >     call 1: # 0, state not initialized
> > > 
> > >     call 2: # 0, state not initialized
> > > 
> > >     call 3: # 0, state not initialized
> > > 
> > > Done."
> > > 
> > > > I do not administer the network. Can I have some advice on how to further debug the connection problem? Which UDP port does the command "bos status" use?
> > 
> > My instinct would be that there is some multihoming going on and that http://docs.openafs.org/Reference/5/NetRestrict.html and/or http://docs.openafs.org/Reference/5/NetInfo.html are not properly configured.
> > 
> > -Ben
> > _______________________________________________
> > OpenAFS-info mailing list
> > OpenAFS-info@openafs.org
> > https://lists.openafs.org/mailman/listinfo/openafs-info