[OpenAFS-devel] Sysid problems when upgrading FreeBSD server to 1.3.80

Sandesh V Chopdekar sandesh_vc@in.ibm.com
Fri, 25 Mar 2005 22:00:44 -0500



> The not so funny thing was the salvage-loop:
> 
>  +->   fs crashes with message above
>  |     need salvage
>  |     run salvage for long time
>  +--   bos tries to restart fs

I ran into a similar loop today (IBM AFS on Solaris 9).

The machine in question has two network interfaces. Someone had been
bringing them up and down, and today the FileServer would not come up,
failing with the same error message.

In the VLDB, all of the volumes were reported under the offending (new)
IP, except one, which was under the old IP. I created a NetRestrict
file, added the newer interface's address to it, stopped and restarted
the FileServer, and it came up fine. By "old" I mean the IP this server
had in the VLDB until a few days ago, when I last checked.

I am trying to understand why it happened.

Thanks and regards,
Sandesh Chopdekar
Email: sandesh_vc@in.ibm.com




Harald Barth <haba@pdc.kth.se>
Sent by: openafs-devel-admin@openafs.org
03/25/2005 07:35 PM

To: openafs-devel@openafs.org
cc: openafs-bugs@openafs.org
Subject: [OpenAFS-devel] Sysid problems when upgrading FreeBSD server to 1.3.80

Our (Stacken) server was not of newest vintage and looped over and
over sending the same probe rx packet to one client in the world which
probably didn't deserve it.

$ /usr/arla/bin/rxdebug kvikklunsj -version
Trying 130.237.234.46 (port 7000):
AFS version:  OpenAFS devel built  2003-04-10 

So I decided to upgrade. I downloaded 1.3.80, ran

# ./configure --prefix=/usr/afs --enable-full-vos-listvol-switch \
    --enable-largefile-fileserver --enable-debug --enable-bitmap-later \
    --enable-fast-restart --disable-kernel-module \
    --with-afs-sysname=i386_fbsd_47

and make install, and here we go, I thought. Despite my not changing
anything, the server was very unhappy with the UUID, and the error
message is totally worthless:

>> The ethernet address exist on a different server; repair it

Nope, "ethernet" is wrong. Maybe "IP" or "network".
And could we please see the offending evidence?

>>  VL_RegisterAddrs rpc failed; See VLLog for details

Nope, nothing in any VLLog on any vlserver.

The not so funny thing was the salvage-loop:

 +->   fs crashes with message above
 |     need salvage
 |     run salvage for long time
 +--   bos tries to restart fs

The first step in the right direction is the patch below,
which hopefully prints a more useful message. Another
good thing would be to detect the looping. The final
thing would be to find out why the sysid was suddenly
not accepted any more. Unfortunately, I discovered
the check_sysid program today, not yesterday when I was
debugging the looping server. With some more hacking,
the check_sysid program could actually check with the
vlserver, too.
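
To illustrate what "detect the looping" could mean, here is a minimal sketch that counts restarts inside a sliding time window. It is not bosserver code: `record_restart`, `struct restart_log`, and both thresholds are hypothetical names and values chosen only for this example.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper, not part of OpenAFS: remembers recent restart
 * times and reports when too many fall inside a sliding window. */
#define MAX_RESTARTS 3      /* assumed threshold */
#define WINDOW_SECS  3600   /* assumed window: one hour */

struct restart_log {
    time_t when[MAX_RESTARTS];  /* ring buffer of the latest restart times */
    int count;                  /* total restarts recorded so far */
};

/* Record a restart at time `now`. Returns 1 when the last MAX_RESTARTS
 * restarts (including this one) all happened within WINDOW_SECS. */
static int
record_restart(struct restart_log *log, time_t now)
{
    time_t oldest;

    assert(log != NULL);
    log->when[log->count % MAX_RESTARTS] = now;
    log->count++;
    if (log->count < MAX_RESTARTS)
        return 0;               /* not enough history yet */

    /* After the increment, this slot holds the oldest of the last
     * MAX_RESTARTS entries in the ring buffer. */
    oldest = log->when[log->count % MAX_RESTARTS];
    return difftime(now, oldest) <= WINDOW_SECS;
}
```

A supervisor using something like this could stop restarting the fileserver, and skip the long salvage, once `record_restart()` returns 1, and log the condition instead.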

$ /usr/arla/bin/rxdebug kvikklunsj -version
Trying 130.237.234.46 (port 7000):
AFS version:  OpenAFS 1.3.80 built  2005-03-24 

As you can see, I managed to start the fileserver anyway.
So what did I do? First I unmounted /vicep* to get a
shorter loop time for debugging. Then I made
a NetInfo file containing the one and only IP address
that is on the server's interface. And lo and behold,
it got happy with the UUID-IP combination and started.
Then a quick shutdown, mount /vicep*, and restart got
me going again. So what is the difference between
running with and without a NetInfo file?

With a NetInfo file, FS_HostAddrs is filled in by:

 code = parseNetFiles(FS_HostAddrs, NULL, NULL,
                      ADDRSPERSITE, reason,
                      AFSDIR_SERVER_NETINFO_FILEPATH,
                      AFSDIR_SERVER_NETRESTRICT_FILEPATH);

Without one, it is filled in by:

 FS_HostAddr_cnt = rx_getAllAddr(FS_HostAddrs, ADDRSPERSITE);
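
As a rough mental model of the two paths, the sketch below selects the addresses to register from the interface list, an optional NetInfo list, and an optional NetRestrict list. It is a simplification under stated assumptions, not the parseNetFiles() implementation: `select_addrs` and `contains` are hypothetical names, and the rule that a NetInfo entry must also exist on a local interface is assumed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if `addr` occurs in list[0..n). */
static int
contains(const uint32_t *list, size_t n, uint32_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (list[i] == addr)
            return 1;
    return 0;
}

/* Hypothetical model of address selection (not OpenAFS code).
 *   ifaces     - addresses found on the local interfaces
 *   info       - NetInfo entries, or NULL when there is no NetInfo file
 *   restricted - NetRestrict entries (may be NULL when nrestricted is 0)
 *   out        - receives at most `max` selected addresses
 * Returns the number of addresses written to `out`. */
static size_t
select_addrs(const uint32_t *ifaces, size_t nifaces,
             const uint32_t *info, size_t ninfo,
             const uint32_t *restricted, size_t nrestricted,
             uint32_t *out, size_t max)
{
    const uint32_t *cand = info ? info : ifaces;
    size_t ncand = info ? ninfo : nifaces;
    size_t n = 0;

    assert(ifaces != NULL && out != NULL);
    for (size_t i = 0; i < ncand && n < max; i++) {
        if (info && !contains(ifaces, nifaces, cand[i]))
            continue;   /* listed in NetInfo but not configured locally */
        if (contains(restricted, nrestricted, cand[i]))
            continue;   /* excluded by NetRestrict */
        out[n++] = cand[i];
    }
    return n;
}
```

In this model, a NetInfo file narrows the set to what the administrator listed, while a NetRestrict file removes individual addresses from whatever set would otherwise be used.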

The strange thing is that I have been running rx_getAllAddr()
standalone and it seems to do the "right thing": buffer[0] is
0x2eeaed82 and FS_HostAddr_cnt = 1. As much as I'd like to dig
deeper into this, I will not run a lot of experiments on Stacken's
main AFS servers. I might get another chance when I upgrade
the other one.
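
For reference, 0x2eeaed82 is what a little-endian host such as i386 prints for 130.237.234.46 when the address is stored in network byte order: 0x82 = 130, 0xed = 237, 0xea = 234, 0x2e = 46. The sketch below formats such an address by reading its bytes in memory order, which gives the same result on either endianness. `format_ipv4` and `make_ipv4` are hypothetical helpers written for this example, and it assumes the value is held in network byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format an IPv4 address held in network byte order as a dotted quad.
 * The bytes are read in memory order, so the result does not depend
 * on the endianness of the host. */
static void
format_ipv4(uint32_t net_addr, char *buf, size_t len)
{
    unsigned char b[4];

    assert(buf != NULL);
    memcpy(b, &net_addr, sizeof(b));
    snprintf(buf, len, "%u.%u.%u.%u",
             (unsigned)b[0], (unsigned)b[1], (unsigned)b[2], (unsigned)b[3]);
}

/* Build a network-byte-order address from its four octets. */
static uint32_t
make_ipv4(unsigned char a, unsigned char b, unsigned char c, unsigned char d)
{
    unsigned char bytes[4] = { a, b, c, d };
    uint32_t net_addr;

    memcpy(&net_addr, bytes, sizeof(net_addr));
    return net_addr;
}
```

Shifting and masking the integer directly, as in `(addr >> 8) & 0xff`, yields the octets in the right order only on a little-endian host when the value is in network byte order.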

So for the record: Created a NetInfo file and got lucky. Don't
know exactly what made the difference. This is on FreeBSD 4.8.

Harald.

--- src/viced/viced.c.~1.59.~   2004-09-08 23:35:54.000000000 +0200
+++ src/viced/viced.c   2005-03-26 00:26:01.000000000 +0100
@@ -1462,10 +1462,22 @@
     code = ubik_Call(VL_RegisterAddrs, cstruct, 0, &FS_HostUUID, 0, &addrs);
     if (code) {
        if (code == VL_MULTIPADDR) {
+           char uuid[1024];
+           int n;
+
+           afsUUID_to_string(FS_HostUUID, uuid, sizeof(uuid));
            ViceLog(0,
-                   ("VL_RegisterAddrs rpc failed; The ethernet address exist on a different server; repair it\n"));
+                   ("VL_RegisterAddrs rpc failed: The IP address(es) conflicted with the registered UUID\n"));
            ViceLog(0,
-                   ("VL_RegisterAddrs rpc failed; See VLLog for details\n"));
+                   ("UUID: %s\n",uuid));
+           for (n = 0; n < FS_HostAddr_cnt; n++) {
+               ViceLog(0,
+                       ("IP %d: %d.%d.%d.%d\n", n+1,
+                        (FS_HostAddrs[n]) & 0xff,
+                        (FS_HostAddrs[n] >> 8) & 0xff,
+                        (FS_HostAddrs[n] >> 16) & 0xff,
+                        (FS_HostAddrs[n] >> 24) & 0xff));
+           }
            return code;
        } else if (code == RXGEN_OPCODE) {
            ViceLog(0,