[OpenAFS] Non-functional fileserver
Stephan Wonczak
a0033@rrz.uni-koeln.de
Thu, 18 Jul 2024 12:56:34 +0200 (CEST)
Hi Mark,
Comments inline.
On Thu, 11 Jul 2024, MS Vitale wrote:
> Dr. Wonczak,
>
> Thank you for your report. Please see my interleaved replies below:
>
>> On Jul 11, 2024, at 9:50 AM, Stephan Wonczak <a0033@rrz.uni-koeln.de> wrote:
>>
>> Today we had a strange problem with two of our test-AFS-Servers. Apart
>> from our normal cell we created two additional cells, each one
>> consisting of a single server that serves as both DB-Server and
>> Fileserver. These servers were created about two years back, and were
>> working fine then. Yesterday we had need to test something new and we
>> revisited the servers.
>> "bos status" came back fine with "all servers running".
>
> 'bos status <host> -long' is useful in this situation, and may report
> that a core file is present.
Yes, probably. I indeed neglected to use the "long" option. However, the
info that a core file is present is not really helpful in itself.
>
>> However, "vos listvol -server xxx" resulted in "possible communication
>> failure". Digging a bit, we had numerous log entries in VolSerLog:
>> "SYNC_connect: temporary failure on circuit 'FSSYNC' (will retry)".
>> This pointed to the fact that the fssync.sock socket file was
>> missing. Indeed, /var/log/messages showed that the fileserver-process
>> had dumped core during startup. Interestingly, though, a fileserver
>> process -was- running, just not really functioning.
>> Several unsuccessful hours of debugging, tracing and googling later, I
>> was ready to give up and trash the test cell and create a new one from
>> scratch. During the process of purging the files I thought "OK,
>> /usr/afs/etc/CellServDB for this cell stays the same, so I can keep
>> that." On a hunch, I actually looked what was inside: Lo and behold!
>> The configured DB-server address for the cell had the wrong IP.
>> This is when I remembered that both problematic machines were moved to
>> a different network segment. We had corrected the -client- CellServDB
>> during that move, but forgot about the server CellServDB.
>> Now, the whole point of this story:
>> The logs were spectacularly unhelpful in pinpointing this
>> misconfiguration. Indeed, I would not have expected the fileserver to
>> dump core instead of refusing to run at all. At the very least there
>> should be a log entry that no DB-Server could be reached (and
>> CellServDB should be checked).
>> Recreating this behaviour is easy:
>> Take a working single-server cell, and change the IP in
>> /usr/afs/etc/CellServDB. Restart the fileserver and watch things go
>> south.
>
> I tried this (running master) and was able to reproduce some of your
> symptoms, as expected - but not all of them.
>
> In this case, when the CSDB has the wrong IP address, the fileserver
> will never be fully functional even though it is "running".
Yes, of course. Failure in this case is expected and correct.
> When a fileserver is in this state, the fileserver FSSYNC channel is
> indeed blocked until the fileserver is able to complete registration
> with the vlserver. As you observed, this in turn affects any volserver
> operation that requires the FSSYNC channel.
Also expected :-)
> The fileserver will also be unable to obtain required authorization
> information from the ptserver.
>
> However, I did NOT experience a fileserver crash.
I tried several times, and each time I had a crash/coredump during
startup. This was even in the logs (BosLog):
Thu Jul 11 14:57:29 2024: fs started pid 65412: /usr/afs/bin/salvager
Thu Jul 11 14:57:29 2024: Listening on 0.0.0.0:7007
Thu Jul 11 14:57:29 2024: fs:salv exited with code 0
Thu Jul 11 14:57:29 2024: fs started pid 65423: /usr/afs/bin/fileserver
Thu Jul 11 14:57:29 2024: fs started pid 65424: /usr/afs/bin/volserver
Thu Jul 11 14:58:05 2024: fs:vol exited on signal 15
Thu Jul 11 14:58:05 2024: fs:file exited on signal 3 (core dumped)
> And I also see these expected messages in FileLog:
> ...
> Thu Jul 11 11:34:57 2024 VL_RegisterAddrs rpc failed; will retry periodically (code=-1, err=0)
> Thu Jul 11 11:36:07 2024 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
> Thu Jul 11 11:37:12 2024 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
> ...
>
> Admittedly, these messages are not as helpful as they could be; they
> should mention which IP addrs it is trying to reach.
Some hint to "check CellServDB" would be -really- useful here, too.
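Until the logs give such a hint, a small check along these lines would
have caught our mistake. This is only a rough sketch, not anything that
ships with OpenAFS: it assumes the usual CellServDB layout (a ">cellname"
line followed by one "IP #hostname" line per DB-server), and the function
names parse_cellservdb and compare_cell are made up for illustration.

```python
import ipaddress


def parse_cellservdb(text):
    """Return {cell name: set of DB-server IP strings} from CellServDB text.

    Assumes the usual layout: a '>cellname  #comment' line followed by
    one 'IP  #hostname' line per DB-server.
    """
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else None
            if current is not None:
                cells.setdefault(current, set())
        elif current is not None:
            # Server-side files may put an address in [brackets].
            addr = line.split()[0].strip("[]")
            try:
                ipaddress.ip_address(addr)
            except ValueError:
                continue  # not an address line; skip it
            cells[current].add(addr)
    return cells


def compare_cell(server_text, client_text, cell):
    """Return (only_in_server, only_in_client) address lists for one cell."""
    server = parse_cellservdb(server_text).get(cell, set())
    client = parse_cellservdb(client_text).get(cell, set())
    return sorted(server - client), sorted(client - server)
```

Feeding it the contents of /usr/afs/etc/CellServDB and of the client
CellServDB (commonly /usr/vice/etc/CellServDB, but that path is an
assumption about the installation) should return two empty lists for a
consistent cell. Anything else means the two files disagree.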
> What version of OpenAFS are you running?
openafs-1.8.11
I just noticed: There still seems to be something not working correctly.
Although everything is working correctly (at least -I- did not find
anything amiss), I still get these messages in FileLog every five minutes:
Thu Jul 18 12:36:59 2024 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)
Thu Jul 18 12:41:59 2024 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)
Thu Jul 18 12:46:59 2024 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)
Any ideas as to that?
Dipl. Chem. Dr. Stephan Wonczak
Regionales Rechenzentrum der Universitaet zu Koeln (RRZK)
Universitaet zu Koeln, Weyertal 121, 50931 Koeln
Tel: +49/(0)221/470-89583, Fax: +49/(0)221/470-89625