[OpenAFS] Help: OpenAFS suddenly completely stopped working

Valtteri Vuorikoski vuori@notcom.org
Thu, 14 Jan 2021 14:45:17 +0200

I have a small OpenAFS 1.8.6 setup using the Debian and Ubuntu packages.
Last night everything was working fine, but this morning machines were
timing out trying to talk to volume servers. Database replication was
also stuck.

There is a single backup database and file server, but the databases
and volumes live primarily on one server. I logged in to that server
("afs1"), made it the only machine in the cell by editing the client
and server CellServDB files, and set out to restore things.

afs1 is running Debian bullseye. Kernels 5.8 (running when things
broke) and 5.10 both result in an equally non-functional system. There
are no iptables rules on the system.

OpenAFS is almost 100% dead for no apparent reason:

- "pts listentries" and "vos listvldb localhost" work. udebug shows both
  servers in recovery state 1f, site is sync site and there are no
  replicas (as expected at this point).

- After restarting services, vos status -localauth -server localhost
  prints the following:

Could not access status information about the server
Possible communication failure
Error in vos status command.
Possible communication failure

- After a while, vos status no longer prints anything and just hangs.
  All AFS client access times out.

- There is almost nothing in the logs. Starting
  vlserver/ptserver/dafileserver with -d 125 doesn't lead to any extra
  output. Nothing out of the ordinary (except AFS client errors) appears
  in dmesg or journalctl -b. After starting dafileserver -L, the
  following log appears:

Thu Jan 14 11:59:54 2021 File server starting (/usr/lib/openafs/dafileserver -L)
Thu Jan 14 11:59:54 2021 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)
Thu Jan 14 12:01:04 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
Thu Jan 14 12:02:09 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
 [the last message keeps repeating]

- dasalvager appears to run successfully. I'm currently running a
  voldump to recover data and it's running fine so far. There is plenty
  of disk space.

- Kerberos appears to be working. kinit works, aklog works, and pts/vos
  commands without -localauth work when a superuser token is present.
  The KDC (Samba) doesn't show any problems related to the afs
  principal. Clocks are accurate.

- Rebooting the whole system (a qemu VM) makes no difference.
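
Since the fileserver log above is repetitive, here is a minimal sketch
of how the repeating messages and their code= values could be
summarized. It assumes the timestamp layout shown in the excerpt; the
names SAMPLE_LOG and summarize are made up for illustration and are not
part of OpenAFS. The numeric codes would still have to be looked up
separately (for example with OpenAFS's translate_et, if it is
installed).

```python
import re
from collections import Counter

# Sample lines copied from the fileserver log excerpt above.
SAMPLE_LOG = """\
Thu Jan 14 11:59:54 2021 File server starting (/usr/lib/openafs/dafileserver -L)
Thu Jan 14 11:59:54 2021 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)
Thu Jan 14 12:01:04 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
Thu Jan 14 12:02:09 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
"""

# Assumed timestamp prefix: weekday, month, day, time, year, then a space.
LINE_RE = re.compile(r"^(\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}) (.*)$")
# Matches "code=5376" as well as negative values such as "code=-1".
CODE_RE = re.compile(r"\bcode=(-?\d+)")


def summarize(log_text):
    """Count messages (timestamps stripped) and the code= values they carry."""
    messages = Counter()
    codes = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip blank lines and anything without a timestamp
        message = match.group(2)
        messages[message] += 1
        for code in CODE_RE.findall(message):
            codes[int(code)] += 1
    return messages, codes


if __name__ == "__main__":
    messages, codes = summarize(SAMPLE_LOG)
    for message, count in messages.most_common():
        print(f"{count:3d}x {message}")
    print("codes:", dict(codes))
```

Run against a full log file instead of SAMPLE_LOG, this would show at a
glance which messages dominate and which distinct error codes appear.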

After four hours of debugging, I'm at my wits' end. Even
temporarily removing all databases, restarting ptserver and vlserver and
touching NoAuth won't make fileserver/volserver happy. It seems like RX
communication is failing somehow, but I have no idea why.

Any ideas what's going on here?