[OpenAFS] Weird Quorum Issues

Aaron Stanley astanley@strozllc.com
Tue, 11 Nov 2003 12:40:21 -0500


Thanks Jeffrey,

Looks like what happened was a time issue, though it wasn't easy for me to
diagnose.  One of the servers was off by about 30 seconds and the rest were
in sync.
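
For anyone who hits the same thing, here is a rough Python sketch of the
check that would have caught it sooner.  It assumes you have already
measured each server's clock offset against a common reference (for example
with your NTP tooling).  The host names, the offsets, and the skewed_pairs
helper are made up for illustration, and the 10-second limit is simply the
figure Jeffrey gives in the quoted message below, not something I have
verified in the Ubik source.

```python
from itertools import combinations

# Limit quoted in this thread; treated here as an assumption.
MAX_SKEW_SECONDS = 10.0


def skewed_pairs(offsets, max_skew=MAX_SKEW_SECONDS):
    """Return (host_a, host_b, skew) for every pair of hosts whose
    clocks differ by more than max_skew seconds."""
    bad = []
    for (a, off_a), (b, off_b) in combinations(sorted(offsets.items()), 2):
        skew = abs(off_a - off_b)
        if skew > max_skew:
            bad.append((a, b, skew))
    return bad


# Hypothetical offsets in seconds relative to a common reference clock.
offsets = {"db1": 0.2, "db2": -0.1, "db3": 30.0}

for a, b, skew in skewed_pairs(offsets):
    print(f"{a} <-> {b}: {skew:.1f}s apart")
```

With offsets like these, every pair that includes db3 is reported, which
matches what I saw: one machine out of step and the others in agreement.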

Thanks for all the help from everyone.

 - AB


-- 
Aaron Stanley
Director, Information Technology
Stroz Friedberg, LLC
15 Maiden Lane, 12th Floor
New York, NY  10038
212/981.6534[o] | 917/859.1503[c] | 815/642.0223[f]


> From: Jeffrey Hutzelman <jhutz@cmu.edu>
> Date: Fri, 07 Nov 2003 05:33:55 -0500
> To: Dr A V Le Blanc <LeBlanc@mcc.ac.uk>, openafs-info@openafs.org
> Subject: Re: [OpenAFS] Weird Quorum Issues
> 
> 
> 
> On Friday, November 07, 2003 07:40:49 +0000 Dr A V Le Blanc
> <LeBlanc@mcc.ac.uk> wrote:
> 
>> On Wed, 05 Nov 2003 at 21:50:09 -0500, Aaron Stanley
>> <astanley@strozllc.com>:
>>> I was away from my cluster for two days and when I got back I noticed...
>>> 
>>> u: No Quorum Elected
>> ...
>>> I ran a udebug on all three of my vl servers for both ports 7002 and
>>> 7004. On the primary (largest) fileserver, the output would waffle
>>> between the normal "I am sync site" and the not normal "I am not sync
>>> site".  I thought at first that it might be a network issue, but pings
>>> between servers were great and bandwidth was not an issue.
>> 
>> We currently have three DB servers, two machines running IRIX 6.5 and
>> openafs 1.2.10, and one Linux box running 2.4.22 and openafs 1.2.10.
>> They are all on the same subnet and have only the one IP address
>> each, but often enough there are quorum problems.  For example,
>> yesterday I checked and found no quorum for the protection server
>> (7002), though both the volume location server (7003) and kaserver (7004) had
>> sync sites, though for some reason on different machines.  All
>> three servers had been up and running without restarts for more than
>> two months.  I stopped the ptserver processes and restarted them
>> one by one: a quorum was elected in about 5 minutes.
>> 
>> We have seen this kind of problem occasionally, though infrequently,
>> for several years.  I've always assumed it just happens because of
>> bugs in ubik; doesn't everyone have this kind of problem?  The main
>> nuisance occurs when the jobs to create new users fail because they
>> can't make entries in the ka or pt database, and I end up with a
>> partially created user.
> 
> We certainly don't have this problem.  The only time I see lack of quorum
> or a coordinator change is when there is a network problem or a server that
> really is down or restarting.  Of course, once elected, a server remains
> coordinator until it goes down or there are not enough votes to sustain it.
> So if your normal "lowest" machine restarts for some reason, the next one
> will become and stay coordinator.  We see that behaviour on a regular
> basis, because each of our dbservers restarts on a different day of the
> week.
> 
> 
> It's worth noting that the Ubik election algorithm is very sensitive to
> proper time synchronization.  The maximum permitted clock skew between any
> two servers is 10 seconds; more than this, and the election algorithm will
> break down.  Thus, running NTP is critical for database servers.
> 
> -- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
>  Sr. Research Systems Programmer
>  School of Computer Science - Research Computing Facility
>  Carnegie Mellon University - Pittsburgh, PA