[OpenAFS] Non-functional fileserver
MS Vitale
mvitale@sinenomine.net
Thu, 11 Jul 2024 11:57:06 -0400
Dr. Wonczak,
Thank you for your report. Please see my interleaved replies below:
> On Jul 11, 2024, at 9:50 AM, Stephan Wonczak <a0033@rrz.uni-koeln.de> wrote:
>
> Today we had a strange problem with two of our test-AFS-Servers. Apart from our normal cell we created two additional cells, each one consisting of a single server that serves as both DB-Server and Fileserver. These servers were created about two years back, and were working fine then. Yesterday we had need to test something new and we revisited the servers.
> "bos status" came back fine with "all servers running".
'bos status <host> -long' is useful in this situation, and may report that a core file is present.
> However, "vos listvol -server xxx" resulted in "possible communication failure". Digging a bit, we had numerous log entries in VolSerLog "SYNC_connect: temporary failure on circuit 'FSSYNC' (will retry)". This pointed to the fact that the fssync.sock socket file was missing. Indeed, /var/log/messages showed that the fileserver process had dumped core during startup. Interestingly, though, a fileserver process -was- running, just not really functioning.
> Several unsuccessful hours of debugging, tracing and googling later, I was ready to give up and trash the test cell and create a new one from scratch. During the process of purging the files I thought "OK, /usr/afs/etc/CellServDB for this cell stays the same, so I can keep that." On a hunch, I actually looked at what was inside: Lo and behold! The configured DB-server address for the cell had the wrong IP.
> This is when I remembered that both problematic machines were moved to a different network segment. We had corrected the -client- CellServDB during that move, but forgot about the server CellServDB.
> Now, the whole point of this story:
> The logs were spectacularly unhelpful in pinpointing this misconfiguration. Indeed, I would not have expected the fileserver to dump core instead of refusing to run at all. At the very least there should be a log entry that no DB-Server could be reached (and CellServDB should be checked).
> Recreating this behaviour is easy:
> Take a working single-server cell, and change the IP in /usr/afs/etc/CellServDB. Restart the fileserver and watch things go south.
I tried this (running master) and was able to reproduce some of your symptoms,
as expected - but not all of them.
In this case, when the CSDB has the wrong IP address, the fileserver
will never be fully functional even though it is "running".
When a fileserver is in this state, the fileserver FSSYNC channel is indeed blocked
until the fileserver is able to complete registration with the vlserver. As you
observed, this in turn affects any volserver operation that requires the FSSYNC channel.
The fileserver will also be unable to obtain required authorization information
from the ptserver.
However, I did NOT experience a fileserver crash.
And I also see these expected messages in FileLog:
...
Thu Jul 11 11:34:57 2024 VL_RegisterAddrs rpc failed; will retry periodically (code=-1, err=0)
Thu Jul 11 11:36:07 2024 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
Thu Jul 11 11:37:12 2024 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
...
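For anyone who wants to spot this state quickly, here is a minimal Python sketch (not part of OpenAFS) that counts the two retry messages shown above in a FileLog. The match strings are copied from this excerpt only, so other OpenAFS versions may word them differently. The names `RETRY_PATTERNS` and `count_retry_messages` are made up for illustration.

```python
import re

# Assumption: match strings taken verbatim from the FileLog excerpt above.
RETRY_PATTERNS = {
    "vlserver registration": re.compile(r"VL_RegisterAddrs rpc failed"),
    "ptserver CPS lookup": re.compile(r"Couldn't get CPS for AnyUser"),
}


def count_retry_messages(lines):
    """Count how often each known retry message appears in the given lines."""
    counts = {name: 0 for name in RETRY_PATTERNS}
    for line in lines:
        for name, pattern in RETRY_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Repeated hits for both messages suggest that the fileserver cannot reach the vlserver or the ptserver, which makes the server-side CellServDB a good first thing to check.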
Admittedly, these messages are not as helpful as they could be; they should mention
which IP addresses the fileserver is trying to reach.
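Until the log messages name the addresses, the server CellServDB can be checked directly. The following is a rough Python sketch, not an OpenAFS tool. It assumes the usual CellServDB layout: a ">cellname" line followed by one "IP #hostname" line per DB server. The function names, the cell name and the addresses in the usage note are hypothetical.

```python
import ipaddress


def parse_cellservdb(text):
    """Map each cell name to the list of DB server IPs listed under it."""
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop the trailing "#comment"
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else None
            if current is not None:
                cells.setdefault(current, [])
        elif current is not None:
            # Assumption: an address may be wrapped in [brackets]; strip them.
            addr = line.split()[0].strip("[]")
            # ip_address() raises ValueError if the entry is not a valid IP.
            cells[current].append(str(ipaddress.ip_address(addr)))
    return cells


def unexpected_db_servers(text, cell, expected_addrs):
    """Return DB server IPs listed for `cell` that are not in expected_addrs."""
    expected = {str(ipaddress.ip_address(a)) for a in expected_addrs}
    return [ip for ip in parse_cellservdb(text).get(cell, []) if ip not in expected]
```

For a single-server cell, passing the contents of /usr/afs/etc/CellServDB together with the machine's current address, e.g. `unexpected_db_servers(text, "test.example.org", ["198.51.100.7"])`, returns any leftover address from before a network move.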
What version of OpenAFS are you running?
Regards,
--
Mark Vitale
Sine Nomine Associates