[OpenAFS] aklog and AFS DB server timeouts
A. Lewenberg
deb251@lewenberg.com
Fri, 29 Jan 2021 10:32:38 -0800
On our buster servers the OpenAFS client (1.8.2) has an issue with
provisioning an AFS token. When I attempt to get an AFS token it very
often takes a long time.
$ aklog (this can take 30 seconds or more)
After some investigation it looks like aklog is trying the AFS DB
servers listed in /etc/openafs/CellServDB and timing out on some of
them. Here are the relevant contents of that file:
>example.com # My Company
192.168.1.102 #afsdb1.example.com
192.168.1.104 #afsdb2.example.com
192.168.1.106 #afsdb3.example.com
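For anyone not familiar with the file layout: a line starting with ">"
names a cell, and each following line gives one DB server as an IP
address followed by "#hostname". Below is a minimal Python sketch that
reads that layout. The function name parse_cellservdb and the SAMPLE
string are made up for illustration; this is not how the OpenAFS client
itself parses the file.

```python
def parse_cellservdb(text):
    """Map each cell name to its list of (ip, hostname) DB server entries."""
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Cell line: ">cellname # optional comment"
            current = line[1:].split("#", 1)[0].strip()
            cells[current] = []
        elif current is not None:
            # Server line: "ip-address #hostname"
            ip, _, host = line.partition("#")
            cells[current].append((ip.strip(), host.strip()))
    return cells


# Hypothetical sample mirroring the entries quoted above.
SAMPLE = """\
>example.com # My Company
192.168.1.102 #afsdb1.example.com
192.168.1.104 #afsdb2.example.com
192.168.1.106 #afsdb3.example.com
"""

if __name__ == "__main__":
    for cell, servers in parse_cellservdb(SAMPLE).items():
        print(cell)
        for ip, host in servers:
            print("  ", ip, host)
```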
Running aklog and sniffing the network I see that the client attempts to
contact one of the three afsdb servers. If the one it chooses to contact
first is afsdb2 or afsdb3, the attempt does not succeed, and the client
eventually gives up and tries another one. If the second one it tries is
also afsdb2 or afsdb3, it gives up again and tries the only remaining
one: afsdb1.
In other words:
afsdb3 (fail), afsdb2 (fail), afsdb1 (succeeds)
afsdb2 (fail), afsdb3 (fail), afsdb1 (succeeds)
afsdb3 (fail), afsdb1 (succeeds)
afsdb2 (fail), afsdb1 (succeeds)
afsdb1 (succeeds)
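The five cases above are exactly what you get if the client may try the
three servers in any order and stops at the first one that answers. A
small Python sketch of that enumeration follows. It assumes the try
order is arbitrary and that afsdb2 and afsdb3 always time out when other
servers are listed; both are assumptions drawn from the trace, not from
the aklog source, and the function name observed_sequences is made up.

```python
from itertools import permutations


def observed_sequences(servers, failing):
    """Return each distinct try-order, cut off at the first server that answers."""
    seen = []
    for order in permutations(servers):
        attempt = []
        for name in order:
            if name in failing:
                attempt.append(f"{name} (fail)")
            else:
                attempt.append(f"{name} (succeeds)")
                break  # the client stops once a server responds
        if attempt not in seen:
            seen.append(attempt)
    return seen


if __name__ == "__main__":
    servers = ["afsdb1", "afsdb2", "afsdb3"]
    failing = {"afsdb2", "afsdb3"}  # assumption: these two always time out
    for attempt in observed_sequences(servers, failing):
        print(", ".join(attempt))
```

Each failed attempt adds one timeout to the total wait, so the two-failure
orderings would be the slowest runs.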
This sounds like both afsdb2 and afsdb3 are simply not working. However...
If I remove afsdb1 and afsdb2 from the CellServDB, leaving only afsdb3,
it works instantly every time! That is, the following CellServDB works
without delay:
>ir.example.com # My Company
192.168.1.106 #afsdb3.example.com
Similarly, if afsdb2 is the only entry in CellServDB, running aklog
works without delay. So it cannot be that afsdb2 and afsdb3 are
completely broken.
The AFS DB servers are running OpenAFS version 1.6.9.
What the heck is going on?