[OpenAFS] Weird Quorum Issues

Fri, 07 Nov 2003 05:33:55 -0500

On Friday, November 07, 2003 07:40:49 +0000 Dr A V Le Blanc 
<LeBlanc@mcc.ac.uk> wrote:

> On Wed, 05 Nov 2003 at 21:50:09 -0500, Aaron Stanley
> <astanley@strozllc.com>:
>> I was away from my cluster for two days and when I got back I noticed...
>>
>> u: No Quorum Elected
> ...
>> I ran a udebug on all three of my vl servers for both ports 7002 and
>> 7004. On the primary (largest) fileserver, the output would waffle
>> between the normal "I am sync site" and the not normal "I am not sync
>> site".  I thought at first that it might be a network issue, but pings
>> between servers were fine and bandwidth was not an issue.
>
> We currently have three DB servers, two machines running IRIX 6.5 and
> openafs 1.2.10, and one Linux box running 2.4.22 and openafs 1.2.10.
> They are all on the same subnet and have only the one IP address
> each, but often enough there are quorum problems.  For example,
> yesterday I checked and found no quorum for the protection server
> (7002), while both the volume location server (7003) and kaserver
> (7004) had sync sites, though for some reason on different machines.  All
> three servers had been up and running without restarts for more than
> two months.  I stopped the ptserver processes and restarted them
> one by one: a quorum was elected in about 5 minutes.
>
> We have seen this kind of problem occasionally, though infrequently,
> for several years.  I've always assumed it just happens because of
> bugs in ubik; doesn't everyone have this kind of problem?  The main
> nuisance occurs when the jobs to create new users fail because they
> can't make entries in the ka or pt database, and I end up with a
> partially created user.

We certainly don't have this problem.  The only time I see lack of quorum 
or a coordinator change is when there is a network problem or a server that 
really is down or restarting.  Of course, once elected, a server remains 
coordinator until it goes down or there are not enough votes to sustain it. 
So if your normal "lowest" machine restarts for some reason, the next one 
will become and stay coordinator.  We see that behaviour on a regular 
basis, because each of our dbservers restarts on a different day of the 
week.
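
The "sticky" coordinator behaviour described above can be illustrated with a
toy model.  This is only a sketch, not the Ubik implementation: it assumes a
simple majority of the database servers is needed for quorum, and it uses
string ordering of made-up host names (db1, db2, db3) as a stand-in for
whatever ordering makes a machine the "lowest".  The function name
next_coordinator is hypothetical.

```python
# Toy model only; the real Ubik voting protocol is more involved.

def next_coordinator(current, up, total):
    """Return the coordinator after one election round.

    current: name of the sitting coordinator, or None if there is none
    up:      set of server names that are currently reachable
    total:   number of database servers in the cell
    """
    if len(up) <= total / 2:
        return None          # not enough votes: no quorum
    if current in up:
        return current       # a sitting coordinator is sustained
    return min(up)           # otherwise the "lowest" reachable server wins


servers = {"db1", "db2", "db3"}          # hypothetical host names
coord = next_coordinator(None, servers, 3)           # initial election
coord = next_coordinator(coord, {"db2", "db3"}, 3)   # db1 restarts
coord = next_coordinator(coord, servers, 3)          # db1 is back
print(coord)
```

In this model the coordinator role moves to db2 while db1 is restarting and
stays there after db1 returns, which matches the regular coordinator changes
we see with staggered weekly restarts.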

It's worth noting that the Ubik election algorithm is very sensitive to 
proper time synchronization.  The maximum permitted clock skew between any 
two servers is 10 seconds; more than this, and the election algorithm will 
break down.  Thus, running NTP is critical for database servers.
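
As a rough illustration of the skew limit, the sketch below flags every pair
of servers whose clocks differ by more than 10 seconds.  It assumes you have
already measured each server's clock offset (in seconds) against a common
reference by some means; the host names, the offset values, and the function
name skew_violations are all made up for the example.

```python
from itertools import combinations

MAX_SKEW_SECONDS = 10.0  # limit quoted in this message


def skew_violations(offsets, limit=MAX_SKEW_SECONDS):
    """Return (server_a, server_b, skew) for each pair exceeding the limit.

    offsets maps a server name to its clock offset in seconds relative
    to a common reference clock.
    """
    bad = []
    for a, b in combinations(sorted(offsets), 2):
        skew = abs(offsets[a] - offsets[b])
        if skew > limit:
            bad.append((a, b, skew))
    return bad


# Hypothetical measurements: db3 has drifted well away from the others.
offsets = {"db1": 0.2, "db2": -0.4, "db3": 12.5}
for a, b, skew in skew_violations(offsets):
    print(f"{a} and {b} differ by {skew:.1f}s (limit {MAX_SKEW_SECONDS:.0f}s)")
```

The check is pairwise rather than against the reference because the limit
applies between any two servers, not between each server and true time.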

-- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
   Sr. Research Systems Programmer
   School of Computer Science - Research Computing Facility
   Carnegie Mellon University - Pittsburgh, PA