[OpenAFS] Client connection failure: bos failed to contact host's bosserver (communication failure (-1))

Benjamin Kaduk kaduk@mit.edu
Mon, 7 Jan 2019 13:44:20 -0600


On Mon, Jan 07, 2019 at 07:40:36PM +0000, Ximeng (Simon) Guan wrote:
> Hello,
> 
> After a power outage on Christmas Eve forced two database servers and all the network switches in one of our offices to reboot, the laptop clients in that office can no longer connect to one of the AFS servers hosted in the same office.
> 
> I suspect a network problem rather than an OpenAFS service problem, because:
> 
>   1.  Remote offices can access the full AFS space, including the volumes hosted on the rebooted servers.
>   2.  Between the servers there is no access problem. Nothing is wrong with the output of "bos status", "rxdebug" or "udebug", and "fs checkservers" shows that all servers are running.
>   3.  On the problematic laptops "fs checkservers" reports "All servers are running".
>   4.  On the problematic laptops "bos status afssrv1" returns a message:
> 
> "bos: failed to contact host's bosserver (communications failure (-1))."
> 
> But on the servers both in that office and in the remote offices, the same command shows that all services are up:
> 
> "Instance ptserver, currently running normally.
> Instance vlserver, currently running normally.
> Instance buserver, currently running normally.
> Instance upserver, currently running normally.
> Instance backupusers, currently running normally.
>     Auxiliary status is: run next at Tue Jan  8 04:00:00 2019.
> Instance dafs, currently running normally.
>     Auxiliary status is: file server running."
> 
>   5.  On the problematic laptops "rxdebug afssrv1 -port 7000" returns *normal* output, for example:
> 
> "Trying 10.12.8.33 (port 7000):
> Free packets: 2073/6357, packet reclaims: 3, calls: 81, used FDs: 36
> not waiting for packets.
> 0 calls waiting for a thread
> 125 threads are idle
> 1 calls have waited for a thread
> Connection from host 10.9.119.50, port 7001, Cuid ae06e5b3/70fe0104
>   serial 12,  natMTU 1344, security index 0, client conn
>     call 0: # 4, state dally, mode: receiving, flags: receive_done
>     call 1: # 0, state not initialized
>     call 2: # 0, state not initialized
>     call 3: # 0, state not initialized
> Connection from host 10.12.4.74, port 7001, Cuid ae06e5b3/70fe0114
>   serial 21,  natMTU 1344, security index 0, client conn
>     call 0: # 7, state dally, mode: receiving, flags: receive_done
>     call 1: # 0, state not initialized
>     call 2: # 0, state not initialized
>     call 3: # 0, state not initialized
> Done."
> 
> I do not administer the network. Can I have some advice on how to further debug the connection problem? Which UDP port does the command "bos status" use?

My instinct would be that there is some multihoming going on and that
http://docs.openafs.org/Reference/5/NetRestrict.html and/or
http://docs.openafs.org/Reference/5/NetInfo.html are not properly
configured.
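
To make the idea concrete, here is a rough sketch (Python, not anything
shipped with OpenAFS) of how the two files interact on a multihomed
server. It assumes a simplified model: NetInfo, when present, limits
the server to the listed addresses that are actually configured
locally, and NetRestrict then removes addresses from that set. It
ignores the finer points of the real file formats. The function names
and the second address (192.168.50.1) are hypothetical; 10.12.8.33 is
the address from the rxdebug output above.

```python
import ipaddress


def parse_addresses(text):
    """Parse one dotted-decimal address per line, skipping blank lines."""
    return [
        ipaddress.ip_address(line.strip())
        for line in text.splitlines()
        if line.strip()
    ]


def advertised_addresses(interfaces, netinfo=None, netrestrict=None):
    """Simplified model of which addresses a server would advertise.

    interfaces  -- text listing the addresses configured on the host
    netinfo     -- NetInfo contents, or None if the file does not exist
    netrestrict -- NetRestrict contents, or None if the file does not exist
    """
    local = parse_addresses(interfaces)
    if netinfo is None:
        # Assumption: without NetInfo, every local address is a candidate.
        candidates = local
    else:
        # Assumption: NetInfo entries only count if configured locally.
        wanted = parse_addresses(netinfo)
        candidates = [addr for addr in wanted if addr in local]
    excluded = set(parse_addresses(netrestrict or ""))
    return [str(addr) for addr in candidates if addr not in excluded]


if __name__ == "__main__":
    # Hypothetical multihomed server: the second address is made up.
    interfaces = "10.12.8.33\n192.168.50.1\n"
    print(advertised_addresses(interfaces))
    print(advertised_addresses(interfaces, netrestrict="192.168.50.1\n"))
```

If the first call returns an address that the laptops cannot route to,
that would be consistent with the symptoms described, and excluding it
(the second call) is the kind of fix those two man pages describe.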

-Ben