[OpenAFS] Non-functional fileserver

Stephan Wonczak a0033@rrz.uni-koeln.de
Thu, 18 Jul 2024 12:56:34 +0200 (CEST)


   Hi Mark,
   Comments inline.

On Thu, 11 Jul 2024, MS Vitale wrote:

> Dr. Wonczak,
>
> Thank you for your report.  Please see my interleaved replies below:
>
>> On Jul 11, 2024, at 9:50 AM, Stephan Wonczak <a0033@rrz.uni-koeln.de> wrote:
>>
>>  Today we had a strange problem with two of our test-AFS-Servers. Apart 
>> from our normal cell we created two additional cells, each one 
>> consisting of a single server that serves as both DB-Server and 
>> Fileserver. These servers were created about two years back, and were 
>> working fine then. Yesterday we had need to test something new and we 
>> revisited the servers.
>>  "bos status" came back fine with "all servers running".
>
> 'bos status <host> -long' is useful in this situation, and may report 
> that a core file is present.

   Yes, probably. I indeed neglected to use the "long" option. However, the 
info that a core file is present is not really helpful in itself.

>
>>  However, "vos listvol -server xxx" resulted in "possible communication 
>> failure". Digging a bit, we had numerous log entries in VolSerLog 
>> "SYNC_connect: temporary failure on circuit 'FSSYNC' (will retry)". 
>> This pointed to the fact that the fssync.sock socket file was 
>> missing. Indeed, /var/log/messages showed that the fileserver-process 
>> had dumped core during startup. Interestingly, though, a fileserver 
>> process -was- running, just not really functioning.
>>  Several unsuccessful hours of debugging, tracing and googling later, I 
>> was ready to give up and trash the test cell and create a new one from 
>> scratch. During the process of purging the files I thought "OK, 
>> /usr/afs/etc/CellServDB for this cell stays the same, so I can keep 
>> that." On a hunch, I actually looked at what was inside: Lo and behold! 
>> The configured DB-server address for the cell had the wrong IP.
>>  This is when I remembered that both problematic machines were moved to 
>> a different network segment. We had corrected the -client- CellServDB 
>> during that move, but forgot about the server CellServDB.
>>  Now, the whole point of this story:
>>  The logs were spectacularly unhelpful in pinpointing this 
>> misconfiguration. Indeed, I would not have expected the fileserver to 
>> dump core instead of refusing to run at all. At the very least there 
>> should be a log entry that no DB-Server could be reached (and 
>> CellServDB should be checked).
>>  Recreating this behaviour is easy:
>>  Take a working single-server cell, and change the IP in
>>  /usr/afs/etc/CellServDB. Restart the fileserver and watch things go
>>  south.
>
>  I tried this (running master) and was able to reproduce some of your 
> symptoms, as expected - but not all of them.
>
> In this case, when the CSDB has the wrong IP address, the fileserver
> will never be fully functional even though it is "running".

   Yes, of course. Failure in this case is expected and correct.

> When a fileserver is in this state, the fileserver FSSYNC channel is 
> indeed blocked until the fileserver is able to complete registration 
> with the vlserver.  As you observed, this in turn affects any volserver 
> operation that requires the FSSYNC channel.

   Also expected :-)

> The fileserver will also be unable to obtain required authorization 
> information from the ptserver.
>
> However, I did NOT experience a fileserver crash.

   I tried several times, and each time I had a crash/coredump during 
startup. This was even in the logs (BosLog):

Thu Jul 11 14:57:29 2024: fs started pid 65412: /usr/afs/bin/salvager
Thu Jul 11 14:57:29 2024: Listening on 0.0.0.0:7007
Thu Jul 11 14:57:29 2024: fs:salv exited with code 0
Thu Jul 11 14:57:29 2024: fs started pid 65423: /usr/afs/bin/fileserver
Thu Jul 11 14:57:29 2024: fs started pid 65424: /usr/afs/bin/volserver
Thu Jul 11 14:58:05 2024: fs:vol exited on signal 15
Thu Jul 11 14:58:05 2024: fs:file exited on signal 3 (core dumped)
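   For anyone who wants to pull such exits out of BosLog automatically, 
here is a minimal Python sketch. It assumes the line layout shown in the 
excerpt above ("<timestamp>: <instance> exited on signal N", optionally 
followed by "(core dumped)"); the function name signal_exits is made up for 
illustration and is not part of OpenAFS. It only names the signal, it does 
not explain why the process received it.

```python
import re
import signal

# Assumed BosLog layout, taken from the excerpt above, e.g.
#   Thu Jul 11 14:58:05 2024: fs:file exited on signal 3 (core dumped)
EXIT_RE = re.compile(
    r"^(?P<when>.+?): (?P<proc>\S+) exited on signal (?P<sig>\d+)"
    r"(?P<core> \(core dumped\))?\s*$"
)


def signal_exits(lines):
    """Return (timestamp, instance, signal name, core_dumped) per signal exit."""
    found = []
    for line in lines:
        m = EXIT_RE.match(line)
        if not m:
            continue  # e.g. "exited with code 0" lines are ignored
        number = int(m.group("sig"))
        try:
            name = signal.Signals(number).name
        except ValueError:
            # Number unknown on this platform; keep it as-is.
            name = "signal %d" % number
        found.append((m.group("when"), m.group("proc"), name,
                      m.group("core") is not None))
    return found


if __name__ == "__main__":
    with open("/usr/afs/logs/BosLog") as log:  # path is an assumption
        for entry in signal_exits(log):
            print(entry)
```

   The signal names come from the platform the script runs on, so it 
should be run on the same kind of system that wrote the log.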

> And I also see these expected messages in FileLog:
>  ...
>  Thu Jul 11 11:34:57 2024 VL_RegisterAddrs rpc failed; will retry periodically (code=-1, err=0)
>  Thu Jul 11 11:36:07 2024 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
>  Thu Jul 11 11:37:12 2024 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
>  ...
>
> Admittedly, these messages are not as helpful as they could be; they 
> should mention which IP addrs it is trying to reach.

   Some hint to "check CellServDB" would be -really- useful here, too.
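   Until the logs give such a hint, the check can be done by hand or with 
a small script. The following is only a sketch in Python: it assumes the 
usual CellServDB layout (a ">cellname #comment" line followed by 
"address #hostname" lines), and the function names, the cell name and the 
addresses in the usage note are made up for illustration. It compares the 
configured DB-server addresses against the addresses you know the 
DB-servers really have.

```python
import ipaddress


def parse_cellservdb(text):
    """Map each cell name to a list of (address, hostname) tuples."""
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # ">cellname   #comment" opens a new cell entry.
            current = line[1:].split("#", 1)[0].strip()
            cells[current] = []
        elif current is not None:
            addr, _, host = line.partition("#")
            # Normalise and validate; raises ValueError on a malformed address.
            addr = str(ipaddress.ip_address(addr.strip()))
            cells[current].append((addr, host.strip()))
    return cells


def stale_db_servers(cells, cell, actual_addrs):
    """Return configured addresses for `cell` that are not in `actual_addrs`."""
    return [addr for addr, _host in cells.get(cell, [])
            if addr not in actual_addrs]
```

   Usage would be along the lines of reading /usr/afs/etc/CellServDB, 
calling parse_cellservdb() on its contents and then 
stale_db_servers(cells, "test.example.org", {"198.51.100.7"}); any address 
that comes back is one the server CellServDB still lists but the DB-server 
no longer has.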

> What version of OpenAFS are you running?

openafs-1.8.11

   I just noticed: There still seems to be something not working correctly. 
Although everything is working correctly (at least -I- did not find 
anything amiss), I still get these messages in FileLog every five minutes:

Thu Jul 18 12:36:59 2024 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)
Thu Jul 18 12:41:59 2024 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)
Thu Jul 18 12:46:59 2024 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)

   Any ideas as to that?
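
   In case it helps to see whether the code changes over time (it was -1 
with the wrong CellServDB and is 5376 now), here is a minimal Python sketch 
that tallies the codes from these FileLog messages. It assumes the message 
wording shown above; the function name register_failures is made up for 
illustration. It tolerates the line wrapping my mail client introduces.

```python
import re
from collections import Counter

# Assumed message wording, taken from the FileLog excerpts in this thread.
# \s+ lets the pattern match even when the line was wrapped after "retry".
REG_RE = re.compile(
    r"VL_RegisterAddrs rpc failed; will retry\s+periodically "
    r"\(code=(-?\d+), err=(-?\d+)\)"
)


def register_failures(log_text):
    """Count VL_RegisterAddrs failures per numeric error code."""
    return Counter(int(m.group(1)) for m in REG_RE.finditer(log_text))


if __name__ == "__main__":
    with open("/usr/afs/logs/FileLog") as log:  # path is an assumption
        for code, count in register_failures(log.read()).items():
            print(code, count)
```

   This only counts the codes; it does not interpret them.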

 	Dipl. Chem. Dr. Stephan Wonczak

         Regionales Rechenzentrum der Universitaet zu Koeln (RRZK)
         Universitaet zu Koeln, Weyertal 121, 50931 Koeln
         Tel: +49/(0)221/470-89583, Fax: +49/(0)221/470-89625