[OpenAFS] Odd ubik (?) synchronization problem

Ken Hornstein kenh@cmf.nrl.navy.mil
Mon, 26 Apr 2004 15:54:44 -0400


>Server (171.64.7.246): (db 1083007252.55)
>    last vote rcvd 83 secs ago (at Mon Apr 26 12:21:46 2004),
>    last beacon sent 83 secs ago (at Mon Apr 26 12:21:46 2004), last vote was yes
>    dbcurrent=0, up=1 beaconSince=0
>
>Server (171.64.7.234): (db 1083007252.56)
>    last vote rcvd 8 secs ago (at Mon Apr 26 12:23:01 2004),
>    last beacon sent 8 secs ago (at Mon Apr 26 12:23:01 2004), last vote was yes
>    dbcurrent=0, up=1 beaconSince=0

This kinda makes me think "network failure", given how far apart the
last vote/beacon times are between these two servers (83 seconds ago
versus 8 seconds ago).
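
If you want to watch for this over time, here's a rough Python sketch that
scans udebug output like the text quoted above and flags any server whose
last vote is older than some limit. The function name and the regular
expressions are my own invention, based only on the lines you pasted, and
the 75-second default is the BIGTIME value I mention further down; treat
all of it as an assumption and not as anything that ships with OpenAFS.

```python
import re

BIGTIME = 75  # seconds; assumed default limit, taken from the discussion below

# Patterns guessed from the pasted udebug output (hypothetical, not official).
SERVER_RE = re.compile(r"Server \(([\d.]+)\)")
VOTE_RE = re.compile(r"last vote rcvd (\d+) secs ago")


def stale_votes(udebug_text, limit=BIGTIME):
    """Return (server, seconds) pairs whose last vote is older than limit."""
    stale = []
    current = None
    for line in udebug_text.splitlines():
        m = SERVER_RE.search(line)
        if m:
            # Remember which server the following lines belong to.
            current = m.group(1)
            continue
        m = VOTE_RE.search(line)
        if m and current is not None:
            age = int(m.group(1))
            if age > limit:
                stale.append((current, age))
    return stale
```

Feeding it the two server entries quoted above should single out
171.64.7.246, since 83 seconds is past the limit and 8 seconds is not.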

>Now, it's getting even worse; all of the servers see the same lowest site,
>but they're all voting for themselves in the Ubik election and can't agree
>on a sync site.

From what I remember ... once you vote "yes" for a particular server,
you can't vote yes for someone else for at least BIGTIME (75) seconds.
And if a slave hasn't heard from the master (or someone better) in
BIGTIME seconds, it's going to start running the voting algorithm ...
which would mean it would probably end up voting for itself (depending
on who started first).  Of course, why the heck "up" didn't go to zero
is mystifying ...  but maybe by the time you noticed the problem, the
supposed network glitch had fixed itself and the voting algorithm
hadn't settled yet?
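
To make the "can't vote for someone else" rule concrete, here's a toy
Python model of it as I described it above. This is NOT the real ubik
code; the class and method names are made up, and it only captures the
one rule (a "yes" for one server blocks a "yes" for a different server
until BIGTIME seconds have passed).

```python
BIGTIME = 75  # seconds, as quoted above


class Voter:
    """Toy model of one server's voting state (hypothetical, not ubik)."""

    def __init__(self):
        self.last_yes_host = None  # who we last voted "yes" for
        self.last_yes_time = None  # when we cast that vote (seconds)

    def vote(self, candidate, now):
        """Return True for a "yes" vote, False for a "no" vote."""
        locked = (
            self.last_yes_host is not None
            and self.last_yes_host != candidate
            and now - self.last_yes_time < BIGTIME
        )
        if locked:
            # Still committed to someone else; refuse this candidate.
            return False
        # Either no earlier vote, the same candidate, or the lock expired.
        self.last_yes_host = candidate
        self.last_yes_time = now
        return True
```

If every server casts its first "yes" for itself at about the same time,
each one stays locked on itself under this rule, which is roughly the
"everybody votes for themselves" picture you're describing.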

One thing that occurs to me is trying to crank up the vlserver logging ...
I think if you use "5", you'll get all of the ubik debugging info.  Also,
running tcpdump or an equivalent tool on the traffic between the vlservers
shouldn't produce _too_ much output, and might help you see whether some
of those RPCs were getting lost.

--Ken