[OpenAFS] Non-functional fileserver

MS Vitale mvitale@sinenomine.net
Thu, 11 Jul 2024 11:57:06 -0400


Dr. Wonczak,

Thank you for your report.  Please see my interleaved replies below:

> On Jul 11, 2024, at 9:50 AM, Stephan Wonczak <a0033@rrz.uni-koeln.de> wrote:
>
>  Today we had a strange problem with two of our test-AFS-Servers. Apart
> from our normal cell we created two additional cells, each one consisting
> of a single server that serves as both DB-Server and Fileserver. These
> servers were created about two years back, and were working fine then.
> Yesterday we had need to test something new and we revisited the servers.
>  "bos status" came back fine with "all servers running".

'bos status <host> -long' is useful in this situation, and may report that
a core file is present.

>  However, "vos listvol -server xxx" resulted in "possible communication
> failure". Digging a bit, we had numerous log entries in VolSerLog
> "SYNC_connect: temporary failure on circuit 'FSSYNC' (will retry)". This
> pointed to the fact that the fssync.sock socket file was missing. Indeed,
> /var/log/messages showed that the fileserver-process had dumped core
> during startup. Interestingly, though, a fileserver process -was- running,
> just not really functioning.
>  Several unsuccessful hours of debugging, tracing and googling later, I
> was ready to give up and trash the test cell and create a new one from
> scratch. During the process of purging the files I thought "OK,
> /usr/afs/etc/CellServDB for this cell stays the same, so I can keep that."
> On a hunch, I actually looked at what was inside: Lo and behold! The
> configured DB-server address for the cell had the wrong IP.
>  This is when I remembered that both problematic machines were moved to a
> different network segment. We had corrected the -client- CellServDB during
> that move, but forgot about the server CellServDB.
>  Now, the whole point of this story:
>  The logs were spectacularly unhelpful in pinpointing this
> misconfiguration. Indeed, I would not have expected the fileserver to dump
> core instead of refusing to run at all. At the very least there should be
> a log entry that no DB-Server could be reached (and CellServDB should be
> checked).
>  Recreating this behaviour is easy:
>  Take a working single-server cell, and change the IP in
> /usr/afs/etc/CellServDB. Restart the fileserver and watch things go south.

I tried this (running master) and was able to reproduce some of your
symptoms, as expected - but not all of them.

In this case, when the CSDB has the wrong IP address, the fileserver
will never be fully functional even though it is "running".
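
One way to catch a stale server CellServDB after a network move is to
compare each listed address with what the hostname in the trailing comment
currently resolves to. The following is only a rough sketch in Python, not
anything shipped with OpenAFS: the function names are hypothetical, it
assumes the usual ">cellname #description" / "address #hostname" layout,
and it assumes the comment really holds a resolvable hostname.

```python
import socket


def parse_cellservdb(text):
    """Return {cell: [(ip, hostname), ...]} from CellServDB-style text."""
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Cell header line: ">cellname   #optional description"
            current = line[1:].split("#", 1)[0].strip()
            cells[current] = []
        elif current is not None:
            # Server line: "a.b.c.d   #hostname"; tolerate surrounding brackets
            addr, _, host = line.partition("#")
            cells[current].append((addr.strip().strip("[]"), host.strip()))
    return cells


def find_stale_entries(text, resolve=socket.gethostbyname_ex):
    """List (cell, listed_ip, hostname, resolved_ips) where DNS disagrees."""
    stale = []
    for cell, servers in parse_cellservdb(text).items():
        for ip, host in servers:
            if not host:
                continue  # nothing to compare against
            try:
                # gethostbyname_ex returns (name, aliases, ip_list)
                resolved = resolve(host)[2]
            except OSError:
                resolved = []  # unresolvable names are reported as stale
            if ip not in resolved:
                stale.append((cell, ip, host, resolved))
    return stale
```

Feeding it the contents of /usr/afs/etc/CellServDB would list any server
whose recorded address no longer matches DNS. The resolver is a parameter
so the check can be exercised without network access.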

When a fileserver is in this state, the fileserver FSSYNC channel is indeed
blocked until the fileserver is able to complete registration with the
vlserver.  As you observed, this in turn affects any volserver operation
that requires the FSSYNC channel.

The fileserver will also be unable to obtain required authorization
information from the ptserver.

However, I did NOT experience a fileserver crash.
And I also see these expected messages in FileLog:
  ...
  Thu Jul 11 11:34:57 2024 VL_RegisterAddrs rpc failed; will retry periodically (code=-1, err=0)
  Thu Jul 11 11:36:07 2024 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
  Thu Jul 11 11:37:12 2024 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
  ...
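
Until the messages improve, a small script can at least flag this pattern
in FileLog. This is a minimal sketch in Python, assuming the message text
matches the lines quoted above; the function names and labels are my own
and hypothetical, and a match is only a hint to check CellServDB, not
proof of a wrong address.

```python
import re

# Substrings taken from the FileLog lines quoted above.
PATTERNS = {
    "vlserver registration failed": re.compile(r"VL_RegisterAddrs rpc failed"),
    "ptserver CPS lookup failed": re.compile(r"Couldn't get CPS for AnyUser"),
}


def scan_filelog(lines):
    """Count how many log lines match each known failure message."""
    counts = {label: 0 for label in PATTERNS}
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


def looks_like_unreachable_dbserver(lines):
    """True if any of the known failure messages appears at least once."""
    return any(n > 0 for n in scan_filelog(lines).values())
```

Passing the lines of FileLog to looks_like_unreachable_dbserver() gives a
quick yes/no; scan_filelog() shows which of the two lookups is failing.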

Admittedly, these messages are not as helpful as they could be; they should
mention which IP addresses the fileserver is trying to reach.

What version of OpenAFS are you running?

Regards,
--
Mark Vitale
Sine Nomine Associates