[OpenAFS] aklog and AFS DB server timeouts

A. Lewenberg deb251@lewenberg.com
Fri, 29 Jan 2021 10:32:38 -0800


On our buster servers the OpenAFS client (1.8.2) is very often slow to 
obtain an AFS token.

$ aklog (this can take 30 seconds or more)
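
To put numbers on the delay, a small timing wrapper like the one below 
could be used. This is only a sketch: time_command is a hypothetical 
helper name (not part of OpenAFS), and it simply measures wall-clock 
time for whatever command it is handed.

```python
import subprocess
import time


def time_command(argv):
    """Run argv and return (exit_code, elapsed_seconds) in wall-clock time."""
    start = time.monotonic()
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode, time.monotonic() - start


# Example use, assuming aklog is on PATH and a Kerberos ticket already exists:
#
#     for attempt in range(5):
#         code, elapsed = time_command(["aklog"])
#         print(f"attempt {attempt + 1}: exit={code} elapsed={elapsed:.1f}s")
```

Repeating the run several times matters here because the delay depends 
on which DB server the client happens to try first.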

After some investigation it looks like aklog is trying the AFS DB 
servers listed in /etc/openafs/CellServDB and timing out on some of 
them. Here are the relevant contents of that file:

 >example.com           # My Company
192.168.1.102                    #afsdb1.example.com
192.168.1.104                    #afsdb2.example.com
192.168.1.106                    #afsdb3.example.com
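
For scripting tests against each listed server, the entries above can be 
read with a few lines of Python. This is a minimal sketch that assumes 
only the simple layout shown here (a ">cell" header line followed by 
"address #hostname" lines); parse_cellservdb is a hypothetical helper 
name, and the sketch does not handle any other syntax the file may allow.

```python
def parse_cellservdb(text):
    """Return {cell_name: [(ip, hostname), ...]} from CellServDB-style text."""
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Cell header: ">cellname   # description"
            current = line[1:].split("#", 1)[0].strip()
            cells[current] = []
        elif current is not None:
            # Server line: "address   #hostname"
            ip, _, host = line.partition("#")
            cells[current].append((ip.strip(), host.strip()))
    return cells


SAMPLE = """\
>example.com           # My Company
192.168.1.102                    #afsdb1.example.com
192.168.1.104                    #afsdb2.example.com
192.168.1.106                    #afsdb3.example.com
"""

servers = parse_cellservdb(SAMPLE)["example.com"]
for ip, host in servers:
    print(ip, host)
```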

Running aklog and sniffing the network, I see that the client attempts 
to contact one of the three afsdb servers. If the one it chooses to 
contact first is afsdb2 or afsdb3, the connection does not succeed, and 
the client eventually gives up and tries another one. If the second one 
it tries is also afsdb2 or afsdb3, it gives up again and tries the only 
remaining one: afsdb1. In other words:

afsdb3 (fail), afsdb2 (fail), afsdb1 (succeeds)
afsdb2 (fail), afsdb3 (fail), afsdb1 (succeeds)
afsdb3 (fail), afsdb1 (succeeds)
afsdb2 (fail), afsdb1 (succeeds)
afsdb1 (succeeds)
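
As a sanity check on that reading of the traces, the five sequences are 
exactly what falls out of a simple model: the client tries the servers 
in some order and stops at the first one that answers, and only afsdb1 
answers. The sketch below is just that model, not the actual client 
logic; SERVERS, WORKING and attempts are names made up for the 
illustration.

```python
from itertools import permutations

SERVERS = ["afsdb1", "afsdb2", "afsdb3"]
WORKING = {"afsdb1"}  # assumption taken from the traces above


def attempts(order):
    """Return the servers tried, in order, up to and including the first success."""
    tried = []
    for server in order:
        tried.append(server)
        if server in WORKING:
            break
    return tuple(tried)


# Collapse the six possible orderings into the distinct attempt sequences.
observed = sorted(
    {attempts(p) for p in permutations(SERVERS)},
    key=lambda seq: (len(seq), seq),
)
for seq in observed:
    print(", ".join(
        f"{s} ({'succeeds' if s in WORKING else 'fail'})" for s in seq
    ))
```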

This sounds like both afsdb2 and afsdb3 are simply not working. However...

If I remove afsdb1 and afsdb2 from the CellServDB, leaving only afsdb3, 
it works instantly every time! That is, the following CellServDB works 
without delay:

 >example.com           # My Company
192.168.1.106                    #afsdb3.example.com

Similarly, if afsdb2 is the only entry in CellServDB, running aklog 
works without delay. So it cannot be that afsdb2 and afsdb3 are 
completely broken.

The AFS DB servers are running OpenAFS version 1.6.9.

What the heck is going on?