[OpenAFS-devel] Sysid problems when upgrading FreeBSD server to 1.3.80

Harald Barth haba@pdc.kth.se
Sat, 26 Mar 2005 01:35:41 +0100 (MET)


Our (Stacken) server was not of newest vintage and looped over and
over sending the same probe rx packet to one client in the world which
probably didn't deserve it.

$ /usr/arla/bin/rxdebug kvikklunsj -version
Trying 130.237.234.46 (port 7000):
AFS version:  OpenAFS devel built  2003-04-10 

So I decided to upgrade. Downloaded 1.3.80,

# ./configure  --prefix=/usr/afs --enable-full-vos-listvol-switch --enable-largefile-fileserver --enable-debug --enable-bitmap-later --enable-fast-restart --disable-kernel-module --with-afs-sysname=i386_fbsd_47

and make install and here we go. I thought. In spite of not changing
anything, the server was very unhappy with the UUID, and the error
message is totally worthless:

>> The ethernet address exist on a different server; repair it

Nope, "ethernet" is wrong. Maybe "IP" or "network".
And could we please see the offending evidence?

>>  VL_RegisterAddrs rpc failed; See VLLog for details

Nope, nothing in any VLLog on any vlserver.

The not so funny thing was the salvage-loop:

 +->   fs crashes with message above
 |     need salvage
 |     run salvage for long time
 +--   bos tries to restart fs

The first step in the right direction is the patch below
which hopefully prints out some useful message. Another
good thing would be to detect the looping. The final
thing would be to find out why the sysid was suddenly
not accepted any more. Unfortunately, I discovered
the check_sysid program today, not yesterday when I was
debugging the looping server. With some more hacking,
the check_sysid program could actually check with the
vlserver, too.

$ /usr/arla/bin/rxdebug kvikklunsj -version
Trying 130.237.234.46 (port 7000):
AFS version:  OpenAFS 1.3.80 built  2005-03-24 

As you can see, I managed to start the fileserver anyway.
So what did I do? First I unmounted /vicep* to get a
shorter loop time to be able to debug. Then I made
a NetInfo file containing the one and only IP addr
that is on the server's interface. And lo and behold,
it got happy with the UUID-IP combination and started.
Then a quick shutdown, mount /vicep*, restart got
me going again. So what is the difference between
with and without NetInfo file?

With NetInfo file, FS_HostAddrs is filled in by:

 code = parseNetFiles(FS_HostAddrs, NULL, NULL,
				       ADDRSPERSITE, reason,
				       AFSDIR_SERVER_NETINFO_FILEPATH,
				       AFSDIR_SERVER_NETRESTRICT_FILEPATH);

Without it, it is filled in by:

 FS_HostAddr_cnt = rx_getAllAddr(FS_HostAddrs, ADDRSPERSITE);

The strange thing is that I have been running rx_getAllAddr()
standalone and it seems to do the "right thing": buffer[0] is
0x2eeaed82 and FS_HostAddr_cnt = 1. As much as I'd like to go
deeper with this, I will not run a lot of experiments on Stacken's
main AFS servers. I might get another chance when I upgrade
the other one.
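
For anyone wanting to check the value above by hand: a minimal sketch,
assuming a little-endian host (such as i386) holding the address in
network byte order, of reading such a word one byte at a time, lowest
byte first. The helper name addr_to_dotted is made up for illustration
and is not an OpenAFS function.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of OpenAFS.
 * Formats a 32-bit address word as a dotted quad, taking the
 * lowest-order byte first. That matches how a network-byte-order
 * address looks when read as an integer on a little-endian host.
 */
static void
addr_to_dotted(uint32_t addr, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%u.%u.%u.%u",
             (unsigned)(addr & 0xff),
             (unsigned)((addr >> 8) & 0xff),
             (unsigned)((addr >> 16) & 0xff),
             (unsigned)((addr >> 24) & 0xff));
}
```

Calling addr_to_dotted(0x2eeaed82, buf, sizeof(buf)) with a 16-byte
buffer gives 130.237.234.46, the same address rxdebug reports for
kvikklunsj. On a big-endian host the bytes would have to be taken in
the opposite order.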

So for the record: Created a NetInfo file and got lucky. Don't
know exactly what made the difference. This is on FreeBSD 4.8.

Harald.

--- src/viced/viced.c.~1.59.~   2004-09-08 23:35:54.000000000 +0200
+++ src/viced/viced.c   2005-03-26 00:26:01.000000000 +0100
@@ -1462,10 +1462,22 @@
     code = ubik_Call(VL_RegisterAddrs, cstruct, 0, &FS_HostUUID, 0, &addrs);
     if (code) {
        if (code == VL_MULTIPADDR) {
+           char uuid[1024];
+           int n;
+
+           afsUUID_to_string(&FS_HostUUID, uuid, sizeof(uuid));
            ViceLog(0,
-                   ("VL_RegisterAddrs rpc failed; The ethernet address exist on a different server; repair it\n"));
+                   ("VL_RegisterAddrs rpc failed: The IP address(es) conflicted with the registered UUID\n"));
            ViceLog(0,
-                   ("VL_RegisterAddrs rpc failed; See VLLog for details\n"));
+                   ("UUID: %s\n",uuid));
+           for (n = 0; n < FS_HostAddr_cnt; n++) {
+               ViceLog(0,
+                       ("IP %d: %d.%d.%d.%d\n", n+1,
+                        (int)((ntohl(FS_HostAddrs[n]) >> 24) & 0xff),
+                        (int)((ntohl(FS_HostAddrs[n]) >> 16) & 0xff),
+                        (int)((ntohl(FS_HostAddrs[n]) >> 8) & 0xff),
+                        (int)(ntohl(FS_HostAddrs[n]) & 0xff)));
+           }
            return code;
        } else if (code == RXGEN_OPCODE) {
            ViceLog(0,