[OpenAFS] Odd ubik (?) synchronization problem

Mon, 26 Apr 2004 14:44:12 -0700

Ken Hornstein <kenh@cmf.nrl.navy.mil> writes:

> This kinda makes me think "network failure" given the time differences
> here from the vote/beacon between these servers.

This is now happening very regularly; it's making it almost impossible to
do mass volume moves.  Sometimes up is actually going to zero:

Host's addresses are: 171.64.7.222 
Host's 171.64.7.222 time is Mon Apr 26 14:38:46 2004
Local time is Mon Apr 26 14:38:46 2004 (time differential 0 secs)
Last yes vote for 171.64.7.222 was 11 secs ago (sync site); 
Last vote started 11 secs ago (at Mon Apr 26 14:38:35 2004)
Local db version is 1083015420.15
I am sync site until 49 secs from now (at Mon Apr 26 14:39:35 2004) (3 servers)
Recovery state f
I am currently managing write trans 1083015420.-1992384204
Sync site's db version is 1083015420.15
0 locked pages, 0 of them for write
There are write locks held
Last time a new db version was labelled was:
         106 secs ago (at Mon Apr 26 14:37:00 2004)

Server (171.64.7.246): (db 1083015420.14)
    last vote rcvd 71 secs ago (at Mon Apr 26 14:37:35 2004),
    last beacon sent 71 secs ago (at Mon Apr 26 14:37:35 2004), last vote was yes
    dbcurrent=0, up=0 beaconSince=0

Server (171.64.7.234): (db 1083015420.15)
    last vote rcvd 11 secs ago (at Mon Apr 26 14:38:35 2004),
    last beacon sent 11 secs ago (at Mon Apr 26 14:38:35 2004), last vote was yes
    dbcurrent=1, up=1 beaconSince=1

Host's addresses are: 171.64.7.246 
Host's 171.64.7.246 time is Mon Apr 26 14:39:07 2004
Local time is Mon Apr 26 14:39:07 2004 (time differential 0 secs)
Last yes vote for 171.64.7.246 was 8 secs ago (not sync site); 
Last vote started 8 secs ago (at Mon Apr 26 14:38:59 2004)
Local db version is 1083015420.14
I am not sync site
Lowest host 171.64.7.222 was set 2 secs ago
Sync host 171.64.7.222 was set 2 secs ago
Sync site's db version is 1083015420.14
0 locked pages, 0 of them for write

It always then recovers and they end up agreeing on a sync site again, but
the outages are often long enough to kill vos move.  Note that I can
always ping these systems even when up is zero, and bos status doesn't
report anything wrong with the running processes.

The Ubik negotiation is all happening over UDP, though, right?  Maybe
we're getting some sort of packet loss.

> From what I remember ... once you vote "yes" for a particular server,
> you can't vote yes for someone else for at least BIGTIME (75) seconds.
> And if a slave hasn't heard from the master (or someone better) in
> BIGTIME seconds, it's going to start running the voting algorithm ...
> which would mean it would probably end up voting for itself (depending
> on who started first).  Of course, why the heck "up" didn't go to zero
> is mystifying ...  but maybe by the time you noticed the problem, the
> supposed network glitch had fixed itself and the voting algorithm hadn't
> settled yet?

That seems possible.

The outages seem to last for a little over a minute when they happen.

> One thing that occurs to me is trying to crank up the vlserver logging
> ...  I think if you use "5", you'll get all of the ubik debugging info.
> Also, running tcpdump or an equivalant thing on traffic between the
> vlservers shouldn't be _too_ much information, and might help you see if
> some of those RPCs were getting lost.

Good point.  Thanks.  I'll look at doing this (and at the debugging tool
recommended separately).

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>