[OpenAFS-devel] Sysid problems when upgrading FreeBSD server to 1.3.80

Tom Keiser <tkeiser@gmail.com>
Sat, 26 Mar 2005 02:11:23 -0500


On Sat, 26 Mar 2005 01:35:41 +0100 (MET), Harald Barth <haba@pdc.kth.se> wrote:
> 
> Our (Stacken) server was not of newest vintage and looped over and
> over sending the same probe rx packet to one client in the world which
> probably didn't deserve it.
> 
> $ /usr/arla/bin/rxdebug kvikklunsj -version
> Trying 130.237.234.46 (port 7000):
> AFS version:  OpenAFS devel built  2003-04-10
> 
> So I decided to upgrade. Download 1.3.80,
> 
> # ./configure  --prefix=/usr/afs --enable-full-vos-listvol-switch --enable-largefile-fileserver --enable-debug --enable-bitmap-later --enable-fast-restart --disable-kernel-module --with-afs-sysname=i386_fbsd_47
> 
> and make install and here we go. I thought. In spite of not changing
> anything, the server was very unhappy with the UUID, but the error
> message is totally worthless:
> 
> >> The ethernet address exist on a different server; repair it
> 
> Nope, "ethernet" is wrong. Maybe "IP" or "network".
> And could we please see the offending evidence?

The problem is that in a large cell we don't want to send that much
information back to a fileserver, so the full debug trace has to be
computed (and logged) on the vlserver.

> 
> >>  VL_RegisterAddrs rpc failed; See VLLog for details
> 
> Nope, nothing in any VLLog on any vlserver.
> 

Going all the way back to the Transarc code, the vlserver has dumped a
big debug trace to stdout when failing with VL_MULTIPADDR.  The line in
FSLog referring to VLLog is correct, and I'm pretty sure I've found
helpful information there in the past.  Perhaps you're not seeing
anything because stdout is a buffered descriptor, and
SVL_RegisterAddrs is the only RPC that uses printf?  I suppose it's
also possible that older versions of vlserver closed stdout on startup;
all I know is that recent versions of vlserver do have VLLog open on
fd=1.  Usage of printf in the vlserver code is pretty specific to
SVL_RegisterAddrs; everything else uses the VLog macro.  Perhaps
these calls should be switched over to VLog for consistency's sake?
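
To illustrate the buffering theory: when a stdio stream points at a
regular file rather than a terminal, it is normally fully buffered, so
a short message can sit in the process for a long time before a reader
of the file sees it.  The sketch below is not vlserver code; the helper
names visible_size and log_msg are made up for the illustration.

```c
#include <stdio.h>

/* Hypothetical helper: size of the file as a separate reader sees it. */
long visible_size(const char *path)
{
    FILE *rd = fopen(path, "rb");
    long size;

    if (rd == NULL)
        return -1;
    if (fseek(rd, 0L, SEEK_END) != 0) {
        fclose(rd);
        return -1;
    }
    size = ftell(rd);
    fclose(rd);
    return size;
}

/*
 * Hypothetical helper: write one message to fp.  With flush_now == 0
 * the text stays in the stdio buffer; with flush_now != 0 it is pushed
 * to the file immediately, as a logging macro would typically do.
 */
int log_msg(FILE *fp, const char *msg, int flush_now)
{
    if (fputs(msg, fp) == EOF)
        return -1;
    if (flush_now && fflush(fp) == EOF)
        return -1;
    return 0;
}
```

Writing a short line with flush_now set to 0 leaves the file empty as
far as other readers are concerned; the text only shows up after a
flush, after the buffer fills, or at exit.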

> The not so funny thing was the salvage-loop:
> 
>  +->   fs crashes with message above
>  |     need salvage
>  |     run salvage for long time
>  +--   bos tries to restart fs
> 
> The first step in the right direction is the patch below
> which hopefully prints out some useful message. Another
> good thing would be to detect the looping. The final
> thing would be to find out why the sysid was suddenly
> not accepted any more. Unfortunately, I discovered
> the check_sysid program today, not yesterday when I was
> debugging the looping server. With some more hacking,
> the check_sysid program could actually check with the
> vlserver, too.
> 
> $ /usr/arla/bin/rxdebug kvikklunsj -version
> Trying 130.237.234.46 (port 7000):
> AFS version:  OpenAFS 1.3.80 built  2005-03-24
> 
> As you can see, I managed to start the fileserver anyway.
> So what did I do? First I unmounted /vicep* to get
> shorter looptime to be able to debug. Then I made
> a NetInfo file containing the one and only IP addr
> that is on the server's interface. And lo and behold,
> it got happy with the UUID-IP combination and started.
> Then a quick shutdown, mount /vicep*, restart got
> me going again. So what is the difference between
> with and without NetInfo file?
> 

Does vos lista show any 127.0.0.0/8 addresses?  We might be looking at
a case where rx_getAllAddr isn't working properly on FreeBSD.  All the
vlserver needs to find are two server indices (srvidx) with IPs that
match ones from your bulkaddrs vector, and the game is over due to
ambiguity.
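
For anyone who wants to check an address list by hand, a loopback test
is a one-liner on a host-byte-order IPv4 address.  This is only a
sketch; is_loopback_hbo and count_loopback are hypothetical names, not
functions from rx or the vlserver.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if addr_hbo (IPv4, host byte order) lies in 127.0.0.0/8. */
int is_loopback_hbo(uint32_t addr_hbo)
{
    return (addr_hbo >> 24) == 127;
}

/* Count the loopback entries in a list of host-byte-order addresses. */
size_t count_loopback(const uint32_t *addrs_hbo, size_t n)
{
    size_t i, hits = 0;

    for (i = 0; i < n; i++) {
        if (is_loopback_hbo(addrs_hbo[i]))
            hits++;
    }
    return hits;
}
```

A nonzero count for the list a fileserver is about to register would
be a hint that interface enumeration picked up lo0.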

> With NetInfo file, FS_HostAddrs is filled in by:
> 
>  code = parseNetFiles(FS_HostAddrs, NULL, NULL,
>                                        ADDRSPERSITE, reason,
>                                        AFSDIR_SERVER_NETINFO_FILEPATH,
>                                        AFSDIR_SERVER_NETRESTRICT_FILEPATH);
> 
> Without it is filled in by:
> 
>  FS_HostAddr_cnt = rx_getAllAddr(FS_HostAddrs, ADDRSPERSITE);
> 
> The strange thing is that I have been running rx_getAllAddr()
> standalone and it seems to do the "right thing": buffer[0] is
> 0x2eeaed82 and FS_HostAddr_cnt = 1. As much as I'd like to go
> deeper with this, I will not run a lot of experiments on Stacken's
> main AFS servers. I might get another chance when I upgrade
> the other one.
> 
> So for the record: Created a NetInfo file and got lucky. Don't
> know exactly what made the difference. This is on FreeBSD 4.8.
> 
> Harald.
> 
> --- src/viced/viced.c.~1.59.~   2004-09-08 23:35:54.000000000 +0200
> +++ src/viced/viced.c   2005-03-26 00:26:01.000000000 +0100
> @@ -1462,10 +1462,22 @@
>      code = ubik_Call(VL_RegisterAddrs, cstruct, 0, &FS_HostUUID, 0, &addrs);
>      if (code) {
>         if (code == VL_MULTIPADDR) {
> +           char uuid[1024];
> +           int n;
> +
> +           afsUUID_to_string(FS_HostUUID, uuid, sizeof(uuid));
>             ViceLog(0,
> -                   ("VL_RegisterAddrs rpc failed; The ethernet address exist on a different server; repair it\n"));
> +                   ("VL_RegisterAddrs rpc failed: The IP address(es) conflicted with the registered UUID\n"));

This is definitely an improvement over the old error, but it only
describes one of the failure modes.  The two failure modes I'm aware
of are:
(1) the UUID is registered, but at least one address in FS_HostAddrs
is registered to another server
(2) the UUID is not registered, and the addrs in FS_HostAddrs are
registered to at least two servers
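
The two cases can be sketched as a small check.  This is a simplified
model of my description above, not the actual SVL_RegisterAddrs logic:
struct server_entry, MAX_ADDRS, the integer uuid_id and
registration_conflicts are all hypothetical stand-ins.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ADDRS 16           /* illustrative limit, not ADDRSPERSITE */

struct server_entry {          /* hypothetical stand-in for a VLDB server record */
    int      uuid_id;          /* small integer standing in for an afsUUID */
    size_t   naddrs;
    uint32_t addrs[MAX_ADDRS];
};

static int entry_has_addr(const struct server_entry *e, uint32_t a)
{
    size_t i;

    for (i = 0; i < e->naddrs; i++) {
        if (e->addrs[i] == a)
            return 1;
    }
    return 0;
}

/* Returns 1 if the registration is ambiguous (either case), else 0. */
int registration_conflicts(const struct server_entry *tbl, size_t ntbl,
                           int uuid_id, const uint32_t *addrs, size_t naddrs)
{
    int uuid_known = 0;
    size_t others = 0;         /* entries with a different UUID sharing an address */
    size_t i, j;

    for (i = 0; i < ntbl; i++) {
        if (tbl[i].uuid_id == uuid_id) {
            uuid_known = 1;
            continue;
        }
        for (j = 0; j < naddrs; j++) {
            if (entry_has_addr(&tbl[i], addrs[j])) {
                others++;
                break;
            }
        }
    }
    /* Case (1): known UUID, any address owned by another server. */
    if (uuid_known)
        return others >= 1;
    /* Case (2): unknown UUID, addresses spread over two or more servers. */
    return others >= 2;
}
```

In this model an unknown UUID whose addresses all match a single
existing entry is not treated as a conflict.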


>             ViceLog(0,
> -                   ("VL_RegisterAddrs rpc failed; See VLLog for details\n"));

Since there is extensive logging in SVL_RegisterAddrs, I'd prefer to
see this line remain.

> +                   ("UUID: %s\n",uuid));
> +           for (n = 0; n < FS_HostAddr_cnt; n++) {
> +               Vicelog(0,
> +                       ("IP %d: %d.%d.%d.%d\n", n+1,
> +                        (addr) & 0xff,
> +                        (addr >> 8) & 0xff,
> +                        (addr >> 16) & 0xff,
> +                        (addr >> 24) & 0xff));

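I think the previous four lines should be replaced by the following.
addr is not set anywhere in the hunk above, and with
FS_HostAddrs_HBO[] (host byte order) the high byte comes first:

(FS_HostAddrs_HBO[n] >> 24) & 0xff,
(FS_HostAddrs_HBO[n] >> 16) & 0xff,
(FS_HostAddrs_HBO[n] >> 8) & 0xff,
(FS_HostAddrs_HBO[n]) & 0xff));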
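
As a standalone illustration of the byte-order point (a sketch only;
format_ipv4_hbo is a hypothetical helper, not something in viced.c):

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Format an IPv4 address given in host byte order as dotted quad.
 * Returns the snprintf result (characters needed, excluding the NUL).
 */
int format_ipv4_hbo(uint32_t addr_hbo, char *buf, size_t len)
{
    return snprintf(buf, len, "%u.%u.%u.%u",
                    (unsigned)((addr_hbo >> 24) & 0xff),
                    (unsigned)((addr_hbo >> 16) & 0xff),
                    (unsigned)((addr_hbo >> 8) & 0xff),
                    (unsigned)(addr_hbo & 0xff));
}
```

The value 0x2eeaed82 you saw in buffer[0] is 130.237.234.46 in network
byte order as read on a little-endian host; its host-byte-order form is
0x82edea2e, which the helper above prints as 130.237.234.46.  Pulling
the low byte out first happens to give the same answer for a
network-byte-order value on i386, but not on a big-endian machine.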


Regards,

--
Tom Keiser
tkeiser@gmail.com