[OpenAFS] fileserver mistakenly considered unavailable by a single client
tegularius@mail2tor.com
Mon, 10 Feb 2025 14:31:47 -0500
Dear OpenAFS team,
I have been running into a problem. I have three OpenAFS fileservers in
my cell, which happen to also be the VLDB servers.
Occasionally, due to circumstances unrelated to OpenAFS, one of the
fileservers becomes unreachable on the network for a brief period of time,
say 30 minutes. During this time, clients cannot access files hosted at
this fileserver, as I would expect. On one of the other two fileservers,
I see the message 'afs: Lost contact with file server...' in the logs of
its client, as I would expect. On the third fileserver, I see no such
entry, which I assume simply means that it did not have a client that was
trying to access any of the files on the server that is temporarily
unavailable due to network reasons (note: not because of server downtime,
in case this is important).
But when the network outage ends, recovery is only partial. The OpenAFS
client on the fileserver that did not notice the outage continues to work
just fine. But the OpenAFS client on the fileserver with the 'Lost
contact' message never prints an 'is back up' message, and when I run 'fs
checks' on the fileserver that noticed the outage, the following is
printed: 'These servers unavailable due to network or server problems:'
followed by the name of the server with the outage.
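For anyone who wants to script this check, here is a minimal sketch of how the 'fs checkservers' output could be turned into a list of server names. It assumes the unavailable servers are printed on the same line as the message, separated by spaces and ending with a period; the function name down_servers is only a placeholder, and the parsing may need adjusting if your version prints things differently.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenAFS): reads 'fs checkservers'
# output on stdin and prints one unavailable server name per line.
# Assumes the names follow the message on the same line, separated by
# spaces, with a trailing period. Prints nothing if no server is listed.
down_servers() {
    sed -n 's/^These servers unavailable due to network or server problems:[[:space:]]*//p' \
        | sed 's/\.[[:space:]]*$//' \
        | tr -s ' ' '\n' \
        | sed '/^$/d'
}

# Usage on the affected client (commented out so the file can be sourced):
#   fs checkservers | down_servers
```

Keeping the parsing in a function that reads stdin means it can be tried against saved output without touching the client.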
Every time this happens, I try restarting the fileserver that was
unreachable, and I even try restarting all of the fileservers in the whole
cell. I'm running these servers on Debian machines with systemd, so I try
shutting them down and bringing them back up with systemd, and I also try
shutting them down and bringing them back up with 'bos', e.g. 'bos
shutdown'. I try shutting them down one by one, and all at once, and all
at once with long lags of two minutes before restarting them all. I try
'fs flush -all' on the client and the server, but to no avail. I try
removing the IP address of the unreachable fileserver with vos remaddrs
and then putting it back with vos setaddrs, but to no avail. Nothing
seems to convince the client that noticed the outage that the fileserver
that was briefly unavailable is reachable again.
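To make the all-at-once restart attempt described above concrete, here is a rough dry-run sketch. It only prints the bos commands and does not execute them. The server names and the pause length are placeholders, authentication options are left out on purpose, and restart_plan is a made-up helper name rather than anything that ships with OpenAFS.

```shell
#!/bin/sh
# Dry-run sketch: prints (does not run) the bos commands for
# "shut all fileservers down, wait, then start them all again".
# First argument: pause in seconds; remaining arguments: server names.
# All values passed in are placeholders chosen by the caller.
restart_plan() {
    pause=$1
    shift
    # Stop the server processes on every machine first.
    for s in "$@"; do
        echo "bos shutdown -server $s -wait"
    done
    # Leave everything down for a while before starting again.
    echo "sleep $pause"
    # Bring the processes back up on every machine.
    for s in "$@"; do
        echo "bos startup -server $s"
    done
}

# Example: restart_plan 120 fs1.example.org fs2.example.org fs3.example.org
```

Printing the plan first makes it easy to review the exact order of operations before running anything against the cell.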
The only thing that ever works is rebooting the machine with the affected
client. I hate rebooting, and as far as I am aware, it is not possible to
shut down an OpenAFS client otherwise.
I have read suggestions that this could be an issue on the fileserver
side, but 'vos status' shows no transactions.
Is there a way to force the client and the fileserver to rediscover each
other?
thanks again --
FT