[OpenAFS] Odd ubik (?) synchronization problem

Russ Allbery rra@stanford.edu
Mon, 26 Apr 2004 12:35:48 -0700


Starting this morning, we've been getting periodic vos move failures
complaining that no quorum has been elected.  It looks like one of our
VLDB servers is periodically falling out of contact with the others.  The
udebug output on the master when this happens looks like:

Host's 171.64.7.222 time is Mon Apr 26 12:23:08 2004
Local time is Mon Apr 26 12:23:09 2004 (time differential 1 secs)
Last yes vote for 171.64.7.222 was 8 secs ago (sync site); 
Last vote started 8 secs ago (at Mon Apr 26 12:23:01 2004)
Local db version is 1083007252.57
I am sync site until 52 secs from now (at Mon Apr 26 12:24:01 2004) (3 servers)
Recovery state f
I am currently managing write trans 1083007252.-1998134370
Sync site's db version is 1083007252.56
0 locked pages, 0 of them for write
There are write locks held
Last time a new db version was labelled was:
         136 secs ago (at Mon Apr 26 12:20:53 2004)

Server (171.64.7.246): (db 1083007252.55)
    last vote rcvd 83 secs ago (at Mon Apr 26 12:21:46 2004),
    last beacon sent 83 secs ago (at Mon Apr 26 12:21:46 2004), last vote was yes
    dbcurrent=0, up=1 beaconSince=0

Server (171.64.7.234): (db 1083007252.56)
    last vote rcvd 8 secs ago (at Mon Apr 26 12:23:01 2004),
    last beacon sent 8 secs ago (at Mon Apr 26 12:23:01 2004), last vote was yes
    dbcurrent=0, up=1 beaconSince=0

Note the recovery state of "f", and that dbcurrent and beaconSince are not
set on that second server.  A little bit after that (about 100 seconds
later), it goes to:

Server (171.64.7.246): (db 1083007252.55)
    last vote rcvd 0 secs ago (at Mon Apr 26 12:23:15 2004),
    last beacon sent 0 secs ago (at Mon Apr 26 12:23:15 2004), last vote was no
    dbcurrent=0, up=1 beaconSince=1

Now, it's getting even worse; all of the servers see the same lowest site,
but they're all voting for themselves in the Ubik election and can't agree
on a sync site.
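
For anyone who wants to watch for this state automatically, here is a
minimal sketch in Python that pulls the per-server blocks out of udebug
output and flags the ones that look unhealthy.  The regular expression is
written only against the text quoted in this message, so treat the format
as an assumption; other OpenAFS versions may print it differently.  The
names parse_servers and suspect_servers and the 30-second threshold are
made up for illustration and are not anything Ubik itself defines.

```python
import re

# One "Server (...)" block, as it appears in the udebug output quoted above.
# Assumption: the layout matches this message exactly.
SERVER_RE = re.compile(
    r"Server \((?P<host>[\d.]+)\): \(db (?P<db>\d+\.\d+)\)\s+"
    r"last vote rcvd (?P<vote_age>-?\d+) secs ago.*?\s+"
    r"last beacon sent (?P<beacon_age>-?\d+) secs ago.*?"
    r"last vote was (?P<vote>yes|no)\s+"
    r"dbcurrent=(?P<dbcurrent>\d+), up=(?P<up>\d+) "
    r"beaconSince=(?P<beacon_since>\d+)"
)


def parse_db_version(text):
    """Split 'epoch.counter' into a tuple of ints so versions sort in order."""
    epoch, counter = text.split(".")
    return int(epoch), int(counter)


def parse_servers(udebug_text):
    """Return one dict per remote server block found in the text."""
    servers = []
    for m in SERVER_RE.finditer(udebug_text):
        servers.append({
            "host": m.group("host"),
            "db": parse_db_version(m.group("db")),
            "vote_age": int(m.group("vote_age")),
            "beacon_age": int(m.group("beacon_age")),
            "last_vote_yes": m.group("vote") == "yes",
            "dbcurrent": m.group("dbcurrent") != "0",
            "up": m.group("up") != "0",
            "beacon_since": m.group("beacon_since") != "0",
        })
    return servers


def suspect_servers(servers, max_vote_age=30):
    """Hosts whose last vote was 'no' or is older than max_vote_age seconds.

    max_vote_age is an arbitrary illustrative threshold, not a Ubik constant.
    """
    return [s["host"] for s in servers
            if not s["last_vote_yes"] or s["vote_age"] > max_vote_age]
```

Run against the first dump above, suspect_servers(parse_servers(text))
would report 171.64.7.246, whose last vote was 83 seconds old, and not
171.64.7.234.  Comparing each server's "db" tuple against
parse_db_version("1083007252.57") shows which sites are behind the local
database version.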

None of the services have restarted on any of the servers, so I'm rather
mystified as to what would be causing this problem.  My only guess is some
sort of network trouble, but I can ping all of the systems from each
other, and the up flag in the above never went away.

Okay, while I was writing this they finally agreed on a sync site, and
it's back to the correct one.  But I'm guessing this is
going to happen again.  The first time I saw this happen was over the
weekend, late Saturday night.

Any ideas would be appreciated....

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>