[OpenAFS] Re: Ubik trouble

Harald Barth haba@kth.se
Mon, 13 Jan 2014 15:00:41 +0100 (CET)


> This turned out to be a subtle network problem.

This reminds me of another Ubik problem that I had which was because
of (1) a config error and (2) something about Ubik that I still do not
understand.

(1) I had an old NetInfo file with a wrong IP addr lying around. This
id _not_ prevent the server to start nor to prevent sync completely.
The protection server synced fine and the volume location server
refused.

(2) I have a machine where the database server is known as X.Y.Z.43
but the machine's primary IP is X.Y.Z.46. This seems to work well
until something somewhere checks the source address of the traffic
when sync is tried. Result: The protection server synced fine and the
volume location server refused. 


After I corrected the NetInfo on machine (1) and added -rxbind to the
vlserver on machine (2) everything synced fine. I am still puzzled why
when -rxbind is _not_ given then ubik says to me it first worked fine
for 13 hours (866491 secs ago, see below) but not after that.

$ udebug  130.237.234.3 7002
Host's addresses are: 130.237.234.3 
Host's 130.237.234.3 time is Mon Jan 13 14:39:58 2014
Local time is Mon Jan 13 14:39:58 2014 (time differential 0 secs)
Last yes vote for 130.237.234.3 was 8 secs ago (sync site); 
Last vote started 8 secs ago (at Mon Jan 13 14:39:50 2014)
Local db version is 1369329952.4
I am sync site until 52 secs from now (at Mon Jan 13 14:40:50 2014) (3 servers)
Recovery state 1f
The last trans I handled was 0.360296
Sync site's db version is 1369329952.4
0 locked pages, 0 of them for write

Server (130.237.234.101): (db 1369329952.4)
    last vote rcvd 8 secs ago (at Mon Jan 13 14:39:50 2014),
    last beacon sent 8 secs ago (at Mon Jan 13 14:39:50 2014), last vote was yes
    dbcurrent=1, up=1 beaconSince=1

Server (130.237.234.43): (db 1369329952.4)
    last vote rcvd 866491 secs ago (at Fri Jan  3 13:58:27 2014),
    last beacon sent 866474 secs ago (at Fri Jan  3 13:58:44 2014), last vote was yes
    dbcurrent=1, up=0 beaconSince=0


On 130.237.234.43:

# ps augxww | grep ptserv
root     22308  0.0  0.3   9216  7264 ?        S    Jan03   0:23 /usr/lib/openafs/ptserver

Log looks fine:

# cat /var/log/openafs/PtLog
Fri Jan  3 13:58:28 2014 ubik: primary address 130.237.234.46 does not exist
Fri Jan  3 13:58:28 2014 Using 130.237.234.43 as my primary address
Fri Jan  3 13:58:28 2014 Starting AFS ptserver 1.1 (/usr/lib/openafs/ptserver)

# cat  /var/lib/openafs/local/NetInfo
130.237.234.43

So how should I know that this would cease to work after 13 hours? Or some other odd number of hours?

Now restarting everything on  130.237.234.43

After that: This seems not to be self-healing either. On the sync site:

Fri Jan  3 13:58:44 2014 assuming distant vote time 19270408 from 130.237.234.43 is an error; marking host down
Mon Jan 13 14:48:42 2014 ubik: A Remote Server has addresses:

Looks like I have to restart the server on the syncsite as well (so it
forgets the bad vote time). And I'm not sure what 19270408 actually
means. 223 days ago?

Well, after that restart:

$ udebug  130.237.234.3 7002
Host's addresses are: 130.237.234.3 
Host's 130.237.234.3 time is Mon Jan 13 14:59:03 2014
Local time is Mon Jan 13 14:59:07 2014 (time differential 4 secs)
Last yes vote for 130.237.234.3 was 5 secs ago (sync site); 
Last vote started 5 secs ago (at Mon Jan 13 14:59:02 2014)
Local db version is 1369329952.4
I am sync site until 51 secs from now (at Mon Jan 13 14:59:58 2014) (3 servers)
Recovery state 1f
The last trans I handled was 0.0
Sync site's db version is 1369329952.4
0 locked pages, 0 of them for write

Server (130.237.234.101): (db 1369329952.4)
    last vote rcvd 9 secs ago (at Mon Jan 13 14:58:58 2014),
    last beacon sent 5 secs ago (at Mon Jan 13 14:59:02 2014), last vote was yes
    dbcurrent=1, up=1 beaconSince=1

Server (130.237.234.43): (db 1369329952.4)
    last vote rcvd 8 secs ago (at Mon Jan 13 14:58:59 2014),
    last beacon sent 5 secs ago (at Mon Jan 13 14:59:02 2014), last vote was yes
    dbcurrent=1, up=1 beaconSince=1


But was a lot of hazzle to get there...

Harald.