[OpenAFS-devel] diagnosing my problem with ubik elections... bug in ubik

Neulinger, Nathan nneul@umr.edu
Tue, 3 Apr 2001 12:19:51 -0500


Well, turns out, that change for lowestHost DID fix my entire problem...

Unfortunately, when I added debugging I changed a 

if (x)
	return 0;

to this, without adding braces, so the return 0 ran unconditionally:

if (x)
	debug
	return 0;

(I hate the fact that the OpenAFS code is unmaintainable with tabstop=4.
That's what tripped me up there.)
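
To make the mistake above concrete, here is a minimal sketch of the same pattern in compilable C. The names check_buggy, check_fixed, and debug are hypothetical stand-ins, not identifiers from the ubik source, and the fall-through return value of 1 is an assumption, since the original snippet does not show what follows the if.

```c
#include <stdio.h>

/* Hypothetical stand-in for the debugging output added to the ubik code. */
static void debug(const char *msg)
{
    fprintf(stderr, "debug: %s\n", msg);
}

/* Buggy version: without braces, only debug() is guarded by the if.
 * The return 0 below it always runs, whatever x is. */
static int check_buggy(int x)
{
    if (x)
        debug("x is set");
        return 0;   /* indentation is misleading: this is NOT inside the if */

    return 1;       /* assumed fall-through value; unreachable here */
}

/* Fixed version: braces keep both statements under the condition. */
static int check_fixed(int x)
{
    if (x) {
        debug("x is set");
        return 0;
    }
    return 1;       /* assumed fall-through value */
}
```

With these definitions, check_buggy(0) returns 0 even though x is false, while check_fixed(0) falls through as the original code did. Recent GCC versions can flag this pattern with -Wmisleading-indentation.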

As soon as I fixed that, it started working after the 75-second (BIGTIME)
delay.
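
As a rough illustration of that startup delay, the rule described in the quoted reply below (no "yes" vote until a fixed number of seconds after startup) can be sketched as follows. This is only a sketch under stated assumptions: may_vote_yes is a hypothetical helper and not a function from the ubik source, the value 75 is the one mentioned in this message, and whether the real comparison is inclusive at exactly 75 seconds is an assumption.

```c
#include <assert.h>
#include <time.h>

#define BIGTIME 75  /* seconds; the value mentioned in this message */

/* Hypothetical helper, not an actual ubik function: returns nonzero once
 * at least BIGTIME seconds have passed since the server started.
 * The inclusive comparison is an assumption. */
static int may_vote_yes(time_t now, time_t startup)
{
    assert(now >= startup);          /* the clock must not run backwards */
    return (now - startup) >= BIGTIME;
}
```

Under this sketch, a server restarted more often than every 75 seconds would never be able to cast a "yes" vote.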

What I'd like to know is: how come no one else has been impacted by that
lowestHost thing? What is different about my setup that makes it affected?
Is no one else running 3 database servers on Linux boxes, perhaps?

I'm going to back out all my debugging changes and come up with a minimal
set of changes to verify that this indeed corrects the problem. 

-- Nathan

> -----Original Message-----
> From: Neulinger, Nathan 
> Sent: Tuesday, April 03, 2001 12:08 PM
> To: 'Ken Hornstein'
> Cc: 'openafs-devel@openafs.org'
> Subject: RE: [OpenAFS-devel] diagnosing my problem with ubik
> elections... bug in ubik 
> 
> 
> Yeah, I've been waiting long enough... learned that much 
> about the protocol already, head about to explode from it too...
> 
> I've let it sit overnight in a couple cases, it's just 
> looping forever. I've about got it tracked down, has taken me 
> a while to get enough debugging added to ubik stuff to where 
> I can understand exactly how it works.
> 
> -- Nathan
> 
> > -----Original Message-----
> > From: Ken Hornstein [mailto:kenh@cmf.nrl.navy.mil]
> > Sent: Tuesday, April 03, 2001 12:03 PM
> > To: Neulinger, Nathan
> > Cc: 'openafs-devel@openafs.org'
> > Subject: Re: [OpenAFS-devel] diagnosing my problem with ubik
> > elections... bug in ubik 
> > 
> > 
> > >Once I changed that, the lowestHost calculation is looking much better.
> > >Still not syncing up cause no one is ever sending a yes vote, but I'm still
> > >looking at that.
> > 
> > Just FYI: as part of the protocol, no one can send a "yes" vote for BIG
> > seconds after startup (I think "BIG" is something like 90, but I don't
> > remember).  If you're restarting it before that timer elapses, then
> > that might be part of the problem.
> > 
> > I have a document which describes the basic Ubik protocol which IMHO
> > is essential for debugging these sorts of things; Derrick, maybe it should
> > be added to the base distribution?  (If it isn't already).
> > 
> > --Ken
> > 
>