[OpenAFS] Re: Ubik trouble

Andrew Deason adeason@sinenomine.net
Mon, 13 Jan 2014 23:34:53 -0600


On Mon, 13 Jan 2014 15:00:41 +0100 (CET)
Harald Barth <haba@kth.se> wrote:

> After that: This seems not to be self-healing either. On the sync site:
> 
> Fri Jan  3 13:58:44 2014 assuming distant vote time 19270408 from 130.237.234.43 is an error; marking host down
> Mon Jan 13 14:48:42 2014 ubik: A Remote Server has addresses:
> 
> Looks like I have to restart the server on the syncsite as well (so it
> forgets the bad vote time). And I'm not sure what 19270408 actually
> means. 223 days ago?

Sorry to further hijack Timothy's thread, but I guess he's not using it
anyway :)

19270408 is an error code, as I had intended that log message to
indicate. The error is:

$ translate_et 19270408
19270408 (rxk).8 = ticket contained unknown key version number
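
For reference, the ".8" in that output is just the low byte of the code. A minimal sketch, assuming the usual com_err layout in which the low 8 bits index into an error table and the remaining bits identify the table (the function name is made up for illustration):

```python
def split_com_err_code(code):
    """Split a com_err-style code into (table_base, offset)."""
    offset = code & 0xFF        # low 8 bits: index within the table
    table_base = code - offset  # remaining bits: identify the table
    return table_base, offset


base, offset = split_com_err_code(19270408)
print(base, offset)  # 19270400 8
```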

In that particular part of the ubik protocol, error codes are
indistinguishable from vote timestamps. A fairly recent change
(v1.6.3) added a heuristic that checks whether a value "looks like" a
timestamp or an error code, and treats it accordingly. 19270408
certainly does look like an error.
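
To illustrate the idea only (this is not the actual check in the ubik code): a rough sketch that accepts a value as a vote time only if it is close to the local clock. The tolerance, the function names, and the sample "now" (an arbitrary time near the log entry above) are all assumptions for the example.

```python
from datetime import datetime, timezone

# Hypothetical tolerance; not the value used by OpenAFS.
MAX_SKEW_SECONDS = 24 * 60 * 60


def looks_like_vote_time(value, now):
    """Return True if value is plausibly a recent Unix timestamp."""
    return abs(now - value) <= MAX_SKEW_SECONDS


def as_utc(value):
    """Render a Unix timestamp as an ISO 8601 UTC string."""
    return datetime.fromtimestamp(value, tz=timezone.utc).isoformat()


# Arbitrary "current time" chosen near the log entry, for illustration.
now = int(datetime(2014, 1, 3, 12, 58, 44, tzinfo=timezone.utc).timestamp())
print(looks_like_vote_time(19270408, now))  # False
print(looks_like_vote_time(now - 5, now))   # True
print(as_utc(19270408))                     # 1970-08-12T00:53:28+00:00
```

Read as a timestamp, 19270408 is about 223 days after the Unix epoch, which is why it is so easy to tell apart from a real vote time.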

Before that change was introduced, the behavior in ubik was indeed
rather puzzling: we would appear not to elect a sync site (since the
quorum immediately expires), even though all of the hosts are up and
reachable. That probably adds to the confusion, if such a version is
involved at all.

The obvious question is whether you are changing your keys or something
while the processes are running, but I assume you think you're not doing
that. Still, you could monitor each machine locally to see if the
KeyFile/rxkad.keytab is changing. If we had better logging, you could
see what kvnos are actually in play.
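
One way to do that kind of local monitoring, as a rough sketch: record the mtime, size, and a hash of each key file, and compare snapshots over time. The paths below are the Transarc-style defaults and are an assumption (packaged installs often put these files elsewhere); the function names are made up for the example.

```python
import hashlib
import os

# Assumed Transarc-style default locations; adjust for your packaging.
KEY_FILES = ["/usr/afs/etc/KeyFile", "/usr/afs/etc/rxkad.keytab"]


def snapshot(paths):
    """Map each path to (mtime, size, sha256 hex), or None if unreadable."""
    result = {}
    for path in paths:
        try:
            st = os.stat(path)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            result[path] = (st.st_mtime, st.st_size, digest)
        except OSError:
            # Missing or unreadable file; record that fact as None.
            result[path] = None
    return result


def changed_paths(before, after):
    """Return the sorted paths whose snapshot entry differs."""
    return sorted(p for p in after if before.get(p) != after[p])
```

Calling snapshot(KEY_FILES) from cron (as a user that can read the files) and logging any non-empty changed_paths() result with a timestamp should be enough to correlate key changes with the ubik log messages.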

-- 
Andrew Deason
adeason@sinenomine.net