[OpenAFS] Client connection failure: bos failed to contact host's bosserver (communication failure (-1))

Benjamin Kaduk kaduk@mit.edu
Tue, 8 Jan 2019 20:13:09 -0600


On Mon, Jan 07, 2019 at 08:00:27PM +0000, Ximeng (Simon) Guan wrote:
> We do have NetInfo properly set up to include the only IP address that is in use.

Good to know, thanks.

I couldn't rule out MTU issues offhand, but don't have time to dig in
further right now.  
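
For reference, the arithmetic behind the MTU question can be sketched as below. This is only an illustration under stated assumptions: that the "-rxmaxmtu" value (and the natMTU shown by rxdebug) counts the Rx packet without the Rx, UDP and IPv4 headers, that the Rx header is 28 bytes, and that the IPv4 header carries no options. IPsec overhead is not modelled, and the helper names are made up for this example.

```python
# Header sizes in bytes (assumptions, see the text above).
IPV4_HEADER = 20  # IPv4 header without options
UDP_HEADER = 8
RX_HEADER = 28    # Rx wire header


def ip_packet_size(rx_mtu: int) -> int:
    """Size of the IPv4 packet carrying one full-sized Rx packet."""
    return rx_mtu + RX_HEADER + UDP_HEADER + IPV4_HEADER


def fits(rx_mtu: int, link_mtu: int) -> bool:
    """True if a full-sized Rx packet needs no IP fragmentation."""
    return ip_packet_size(rx_mtu) <= link_mtu


if __name__ == "__main__":
    # 1344 is the value from this thread; 1444 is only a comparison value.
    for rx_mtu in (1344, 1444):
        print(rx_mtu, ip_packet_size(rx_mtu), fits(rx_mtu, 1400))
```

Under those assumptions, 1344 plus 56 bytes of headers comes to exactly 1400, so the configured value is at least consistent with the 1400-byte link MTU. It says nothing about whether every hop on the path actually honours that MTU.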

Do the problematic bos invocations hang for a minute or two before
reporting the "communications failure"?
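
One way to answer that without a stopwatch is to time the command. The following is a rough sketch, not anything shipped with OpenAFS: `time_command` is a hypothetical helper, and the 300-second timeout is an arbitrary choice.

```python
import subprocess
import time


def time_command(argv, timeout=300.0):
    """Run argv and return (elapsed_seconds, exit_code, combined_output)."""
    start = time.monotonic()
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into stdout
        text=True,
        timeout=timeout,
    )
    return time.monotonic() - start, proc.returncode, proc.stdout


# Example use on an affected laptop (requires the bos binary in PATH):
#   elapsed, code, output = time_command(["bos", "status", "afssrv1"])
#   print(f"exit={code} after {elapsed:.1f}s")
#   print(output)
```

A failure after a long wait would point towards packets being silently dropped, while an immediate failure would suggest something is actively rejecting the traffic.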

The bosserver listens on port 7007, if you hadn't found that already -- a
packet capture would help show what's going on, if you have the ability to
get one of those.
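
As a starting point for such a capture, here is a small sketch that builds a tcpdump command line limited to bosserver traffic. The assumptions: tcpdump is available on the capturing host, the "any" pseudo-interface exists (it is Linux-specific), and 10.12.8.33 is afssrv1, as the rxdebug output quoted below suggests. `capture_command` and the output file name are made up for this example.

```python
import shlex


def capture_command(server_ip, port=7007, iface="any", outfile="bos.pcap"):
    """Build a tcpdump argv that records UDP traffic to or from the bosserver."""
    bpf = f"udp and host {server_ip} and port {port}"
    # -n: no name resolution, -i: interface, -w: write raw packets to a file
    return ["tcpdump", "-n", "-i", iface, "-w", outfile, bpf]


if __name__ == "__main__":
    # Print a shell-safe command line; run it as root while retrying bos.
    print(" ".join(shlex.quote(arg) for arg in capture_command("10.12.8.33")))
```

Capturing on both the laptop and the server at the same time would show whether the requests arrive at all and whether the replies make it back.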

-Ben

> Can the connection failure somehow come from the non-default MTU settings we are using? That has bitten us repeatedly in the past in different places. We use "-rxmaxmtu 1344" across the board for all ptserver, vlserver, davolserver and dafileserver instances. I was told by the network folks that they cannot use the default MTU of 1500 and have to use 1400 because of the IPsec requirement...
> 
> Thank you!
> Simon
> 
> -----Original Message-----
> From: openafs-info-admin@openafs.org <openafs-info-admin@openafs.org> On Behalf Of Benjamin Kaduk
> Sent: Monday, January 7, 2019 11:44 AM
> To: Ximeng (Simon) Guan <xmgu@royole.com>
> Cc: OpenAFS-info@openafs.org
> Subject: Re: [OpenAFS] Client connection failure: bos failed to contact host's bosserver (communication failure (-1))
> 
> On Mon, Jan 07, 2019 at 07:40:36PM +0000, Ximeng (Simon) Guan wrote:
> > Hello,
> > 
> > After a power outage on Christmas Eve which forced two database servers and all the network switches in one of our offices to re-boot, our laptop clients in that office can no longer connect to one of the AFS servers hosted in the same office.
> > 
> > I am leaning towards the possibility that it is a network problem instead of an OpenAFS service problem because:
> > 
> >   1.  Remote offices can access the full AFS space, including those volumes hosted on the re-booted servers.
> >   2.  Between the servers there is no access problem. Nothing wrong with the result of "bos status", "rxdebug" or "udebug". "fs checkservers" shows that all servers are running.
> >   3.  On the problematic laptops "fs checkservers" shows that "All servers are running".
> >   4.  On the problematic laptops "bos status afssrv1" returns a message:
> > 
> > "bos: failed to contact host's bosserver (communications failure (-1))."
> > 
> > But on the servers both in that office and in the remote offices, the same command shows that all services are up:
> > 
> > "Instance ptserver, currently running normally.
> > Instance vlserver, currently running normally.
> > Instance buserver, currently running normally.
> > Instance upserver, currently running normally.
> > Instance backupusers, currently running normally.
> >     Auxiliary status is: run next at Tue Jan  8 04:00:00 2019.
> > Instance dafs, currently running normally.
> >     Auxiliary status is: file server running."
> > 
> >   5.  On the problematic laptops "rxdebug afssrv1 -port 7000" returns *normal* output, for example:
> > 
> > "Trying 10.12.8.33 (port 7000):
> > Free packets: 2073/6357, packet reclaims: 3, calls: 81, used FDs: 36
> > not waiting for packets.
> > 0 calls waiting for a thread
> > 125 threads are idle
> > 1 calls have waited for a thread
> > Connection from host 10.9.119.50, port 7001, Cuid ae06e5b3/70fe0104
> >   serial 12,  natMTU 1344, security index 0, client conn
> >     call 0: # 4, state dally, mode: receiving, flags: receive_done
> >     call 1: # 0, state not initialized
> >     call 2: # 0, state not initialized
> >     call 3: # 0, state not initialized
> > Connection from host 10.12.4.74, port 7001, Cuid ae06e5b3/70fe0114
> >   serial 21,  natMTU 1344, security index 0, client conn
> >     call 0: # 7, state dally, mode: receiving, flags: receive_done
> >     call 1: # 0, state not initialized
> >     call 2: # 0, state not initialized
> >     call 3: # 0, state not initialized
> > Done."
> > 
> > I do not administer the network. Can I have some advice on how to further debug the connection problem? Which UDP port does the command "bos status" use?
> 
> My instinct would be that there is some multihoming going on and that http://docs.openafs.org/Reference/5/NetRestrict.html and/or http://docs.openafs.org/Reference/5/NetInfo.html are not properly configured.
> 
> -Ben
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info