[OpenAFS] Non-functional fileserver
Stephan Wonczak
a0033@rrz.uni-koeln.de
Thu, 11 Jul 2024 15:50:06 +0200 (CEST)
Dear all,
Sorry for the longish post, but I wanted to provide a bit of
background.
Today we had a strange problem with two of our AFS test servers. Apart
from our normal cell we created two additional cells, each one consisting
of a single server that serves as both DB-Server and Fileserver. These
servers were created about two years back, and were working fine then.
Yesterday we needed to test something new and we revisited the servers.
"bos status" came back fine with "all servers running".
However, "vos listvol -server xxx" resulted in "possible communication
failure". Digging a bit, we found numerous log entries in VolSerLog:
"SYNC_connect: temporary failure on circuit 'FSSYNC' (will retry)". This
pointed to the fact that the fssync.sock socket file was missing. Indeed,
/var/log/messages showed that the fileserver-process had dumped core
during startup. Interestingly, though, a fileserver process -was- running,
just not really functioning.
Several unsuccessful hours of debugging, tracing and googling later, I
was ready to give up and trash the test cell and create a new one from
scratch. During the process of purging the files I thought "OK,
/usr/afs/etc/CellServDB for this cell stays the same, so I can keep that."
On a hunch, I actually looked at what was inside: Lo and behold! The
configured DB-server address for the cell had the wrong IP.
This is when I remembered that both problematic machines were moved to a
different network segment. We had corrected the -client- CellServDB during
that move, but forgot about the server CellServDB.
Now, the whole point of this story:
The logs were spectacularly unhelpful in pinpointing this
misconfiguration. Indeed, I would have expected the fileserver to refuse
to run at all rather than dump core. At the very least there should be
a log entry that no DB-Server could be reached (and CellServDB should be
checked).
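Until such a log entry exists, a small sanity check run after moving a
server could catch this kind of stale entry. The following is only a rough
sketch in Python, not anything that ships with OpenAFS: it parses text in
the usual CellServDB layout (a ">cellname" header line followed by one
"IP  #hostname" line per DB-Server) and lists the configured addresses
that are not among the addresses you expect. The function names, the cell
name and the IPs in the sample are all made up for illustration.

```python
import ipaddress


def parse_cellservdb(text):
    """Return {cell_name: [ip_string, ...]} from CellServDB-formatted text."""
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Cell header: ">cellname   #optional comment"
            current = line[1:].split("#", 1)[0].strip()
            cells[current] = []
        elif current is not None:
            # Server line: "a.b.c.d   #hostname". A server CellServDB may
            # put the address in square brackets; strip them if present.
            addr = line.split("#", 1)[0].strip().strip("[]")
            if addr:
                # ip_address() raises ValueError on a malformed address.
                cells[current].append(str(ipaddress.ip_address(addr)))
    return cells


def stale_db_servers(text, cell, expected_addrs):
    """List addresses configured for `cell` that are not in expected_addrs."""
    expected = {str(ipaddress.ip_address(a)) for a in expected_addrs}
    configured = parse_cellservdb(text).get(cell, [])
    return [ip for ip in configured if ip not in expected]


# Hypothetical sample: the file still lists the address from before the move.
SAMPLE = """>test.example.org    #hypothetical test cell
192.0.2.10    #afstest1.example.org
"""

if __name__ == "__main__":
    # The server now lives at 198.51.100.10 (again, a made-up address).
    print(stale_db_servers(SAMPLE, "test.example.org", ["198.51.100.10"]))
```

In practice one would read /usr/afs/etc/CellServDB instead of the sample
string and pass in the addresses the machine really has; how to obtain
those is left out here on purpose.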
Recreating this behaviour is easy:
Take a working single-server cell, and change the IP in
/usr/afs/etc/CellServDB. Restart the fileserver and watch things go south.
Thanks for reading my long ramble :-)
Dipl. Chem. Dr. Stephan Wonczak
Regionales Rechenzentrum der Universitaet zu Koeln (RRZK)
Universitaet zu Koeln, Weyertal 121, 50931 Koeln
Tel: +49/(0)221/470-89583, Fax: +49/(0)221/470-89625