[OpenAFS] Odd ubik (?) synchronization problem

Marcus Watts mdw@umich.edu
Mon, 26 Apr 2004 18:03:49 -0400

Russ Allbery <rra@stanford.edu> writes:
> Starting this morning, we've been getting periodic vos move failures
> complaining that no quorum has been elected.  It looks like one of our
> VLDB servers is periodically falling out of contact with the others.  The

Sounds like a ubik problem alright.  We (umich.edu) had the same
problem a number of years ago, and ended up with ubik mods to fix this.
We started with afs 3.4; Transarc has since made some changes similar to
what we did, but not others.

The basic problem is that if you have a series of writes happening
at the same time as you have a heavy load of reads, the writes and
resulting replication interfere with the voting logic.  There
are various timing windows and timeouts that are issues at
various points in the logic.  There were a lot of these in afs 3.4;
I don't know how many remain in openafs today, but it sure
sounds like you ran into one of them.

The udebug output you posted has a db version of .56, after being
labelled 136 seconds ago.  That's roughly one change every 2.4 seconds,
so you definitely have the series of writes.  You didn't include
any rx statistics, but I'd guess you'd see a lot of calls.
If you look through the rx connections, you may also be
able to find the ubik connections - those may be hard to spot,
but could be interesting.  It would also be worth figuring out
what is causing the writes to your vldb, if you can identify the
connection and process.
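
For what it's worth, the write-rate estimate above is just the age of
the current db label divided by the version counter.  A minimal sketch
of that arithmetic follows; the function name is made up for
illustration, and it assumes the counter (.56) goes up by one per write
since the label, which is my reading of the udebug numbers rather than
anything the tool reports directly.

```python
def seconds_per_write(counter: int, age_seconds: float) -> float:
    """Average seconds between writes since the db version was labelled.

    Assumption: the version counter increases by one for each write
    transaction after the label, so age / counter approximates the
    mean interval between writes.
    """
    if counter <= 0:
        raise ValueError("counter must be positive")
    return age_seconds / counter


# Numbers taken from the udebug output discussed above:
# db version counter .56, labelled 136 seconds ago.
interval = seconds_per_write(56, 136)
print(f"{interval:.2f} seconds per write")
```

Running udebug a few times and comparing the numbers would show
whether the write rate is steady or comes in bursts.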

A couple of questions:

What version of the software are you running?
How big is your vldb?
How fast are your machines?
Have you run out of disk bandwidth or memory?
Are there any interesting messages in stderr from vlserver?
Is there anything happening right now that has resulted in
	a temporary increase in your load or decrease in server speed?

				-Marcus Watts
				UM ITCS Umich Systems Group