[OpenAFS] Re: DB quorum problems on 1.4/1.6 mixed cell

Andrew Deason adeason@sinenomine.net
Tue, 11 Feb 2014 10:38:29 -0600


On Tue, 11 Feb 2014 09:50:35 +0000
Arne Wiebalck <Arne.Wiebalck@cern.ch> wrote:

> I am currently managing write trans 1392106724.-1892301558
[...]
> Note that the sync site has gone to Recovery state f and that the time
> at which the last vote was received on the other two servers has quite
> a time gap which gets larger with time. Is the negative trans ID ok?

There was a bug where negative transaction IDs were not handled
correctly, which caused transactions to be killed and made ubik think
it doesn't have quorum. It was fixed on 1.6 by commit
4c80871a16d6022c3d3e5edc0504208ddad49cc8 (gerrit 5751, 2647), and it
looks like that fix was included in 1.6.1.

Without that fix, the workaround is to periodically restart the
dbservers so the transaction id doesn't roll over and become negative.
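
To illustrate the rollover, here is a minimal Python sketch. It assumes
the second field of the transaction ID is a counter held in a signed
32-bit integer; that is an inference from the negative value quoted
above, not something checked against the ubik source. The helpers
parse_tid, as_int32 and as_uint32 are hypothetical names, not OpenAFS
functions.

```python
# Sketch only: assumes the part after the "." is a 32-bit signed counter.

def parse_tid(text):
    """Split an 'epoch.counter' transaction ID string into two ints."""
    epoch, counter = text.split(".", 1)
    return int(epoch), int(counter)

def as_uint32(value):
    """Return the unsigned 32-bit bit pattern of value."""
    return value & 0xFFFFFFFF

def as_int32(value):
    """Reinterpret the low 32 bits of value as two's-complement signed."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value

epoch, counter = parse_tid("1392106724.-1892301558")
print(counter < 0)           # True: the counter reads as negative
print(as_uint32(counter))    # 2402665738, past the signed limit 2**31 - 1

# One step past the largest positive value flips the sign.
print(as_int32(2**31 - 1))   # 2147483647
print(as_int32(2**31))       # -2147483648
```

Under that assumption, the quoted counter -1892301558 is the signed view
of 2402665738, i.e. the counter has already gone past 2**31 - 1.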

I assume that's what you're hitting, since as far as I'm aware it always
happens with negative transaction IDs. If you want other ways to verify
it: the bug can produce the "major synchronization error" (USYNC)
message, which is rarely seen outside of this issue. There's also a
thread describing a manifestation of the issue, in case it looks
familiar:
<http://lists.openafs.org/pipermail/openafs-info/2004-April/013225.html>.
That thread didn't identify the problem well enough to actually fix it
back then, though.
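
If you want to check for the negative-ID condition from a script, a
sketch like the following could flag it. It only knows the
"write trans <epoch>.<counter>" line format quoted earlier in this
message; how the text is captured (presumably from udebug output) and
the helper name negative_write_trans are assumptions.

```python
import re

# Matches the "write trans <epoch>.<counter>" form quoted in this thread.
TRANS_RE = re.compile(r"write trans (-?\d+)\.(-?\d+)")

def negative_write_trans(text):
    """Return (epoch, counter) pairs whose counter is negative."""
    hits = []
    for match in TRANS_RE.finditer(text):
        epoch, counter = int(match.group(1)), int(match.group(2))
        if counter < 0:
            hits.append((epoch, counter))
    return hits

sample = "I am currently managing write trans 1392106724.-1892301558"
print(negative_write_trans(sample))  # [(1392106724, -1892301558)]
```

An empty list only means no negative counter appeared in the text that
was scanned; it does not rule out other causes of quorum loss.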

> Is this a known issue? 
> 
> I had understood that it should be OK to run 1.4 and 1.6 file servers
> in parallel and that the DB servers could be updated after the file
> servers, but maybe that is not correct?

Mixing 1.4 and 1.6 servers is fine (some sites have or had fileservers
much, much older than that :). I don't think there's much of a reason to
do dbservers 'before' or 'after' the others (though I'm not thinking too
hard about it), but they are usually seen as more critical, so they
probably do tend to get updated last.

If the issue you're experiencing is the thing I mentioned above, it
doesn't have anything to do with the version of the fileservers (and I
don't think anything would; there's no interaction between the
fileservers and dbservers for those operations). If you saw it only when
moving volumes between 1.4 and 1.6 servers, as far as I know you're just
lucky :)

-- 
Andrew Deason
adeason@sinenomine.net