[OpenAFS] Weird Quorum Issues

Dr A V Le Blanc <LeBlanc@mcc.ac.uk>
Fri, 7 Nov 2003 07:40:49 +0000


On Wed, 05 Nov 2003 at 21:50:09 -0500, Aaron Stanley <astanley@strozllc.com>:
> I was away from my cluster for two days and when I got back I noticed...
> 
> u: No Quorum Elected
...
> I ran a udebug on all three of my vl servers for both ports 7002 and 7004.
> On the primary (largest) fileserver, the output would waffle between the
> normal "I am sync site" and the not normal "I am not sync site".  I thought
> at first that it might be a network issue, but pings between servers were
> great and bandwidth was not an issue.

We currently have three DB servers: two machines running IRIX 6.5 and
OpenAFS 1.2.10, and one Linux box running kernel 2.4.22 and OpenAFS
1.2.10.  They are all on the same subnet and have only one IP address
each, but often enough there are quorum problems.  For example,
yesterday I checked and found no quorum for the protection server
(7002), although both the volume location server (7003) and the
kaserver (7004) had sync sites, for some reason on different
machines.  All three servers had been up and running without restarts
for more than two months.  I stopped the ptserver processes and
restarted them one by one; a quorum was elected in about 5 minutes.
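
The per-port check described above can be scripted.  What follows is
only a rough sketch, not something we run in production: it assumes
udebug is on the PATH and that its output contains the line
"I am sync site" on the sync site and "I am not sync site" elsewhere.
The host names and the helper names (claims_sync_site, run_udebug,
sync_sites) are made up for illustration.

```python
import subprocess

# Ubik ports mentioned in this message.
PORTS = {7002: "ptserver", 7003: "vlserver", 7004: "kaserver"}


def claims_sync_site(udebug_output):
    """Return True if the udebug output says this host is the sync site."""
    # "I am not sync site" does not contain this substring, so a plain
    # substring test is enough to tell the two cases apart.
    return "I am sync site" in udebug_output


def run_udebug(host, port):
    """Run udebug against one host and port and return its standard output."""
    result = subprocess.run(
        ["udebug", "-server", host, "-port", str(port)],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout


def sync_sites(hosts, port, fetch=run_udebug):
    """List the hosts that currently claim to be sync site for one port."""
    return [h for h in hosts if claims_sync_site(fetch(h, port))]


if __name__ == "__main__":
    # Hypothetical host names; substitute the real DB servers.
    hosts = ["db1.example.org", "db2.example.org", "db3.example.org"]
    for port, name in PORTS.items():
        sites = sync_sites(hosts, port)
        print(f"{name} ({port}): {', '.join(sites) or 'NO SYNC SITE'}")
```

An empty list for a port means no server claims to be sync site for
that database, which is the condition I found on 7002.  A script could
run such a check before any operation that writes to the databases.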

We have seen this kind of problem from time to time for several
years.  I've always assumed it just happens because of
bugs in Ubik; doesn't everyone have this kind of problem?  The main
nuisance occurs when the jobs to create new users fail because they
can't make entries in the ka or pt database, and I end up with a
partially created user.

     -- Owen
     LeBlanc@mcc.ac.uk