[OpenAFS-devel] The ubik transaction ID rollover problem
Andrew Deason
adeason@sinenomine.net
Mon, 30 Aug 2010 13:12:41 -0500
As a few people have probably noticed at some point or another, the
transaction IDs that ubik uses can roll over and become negative. This
can confuse ubik and make the relevant database (usually the VLDB) at
least sporadically, if not completely, unusable. I think
<http://www.openafs.org/pipermail/openafs-info/2004-April/013225.html>
is one such instance around 6 years ago.
While this is not all too surprising, I've never really fully known why
negative trans IDs cause things to break, except that it probably had
something to do with the dubious greater-than test in
urecovery_CheckTid.
Recently I think I've found why this happens, but the intent of current
ubik behavior can be confusing, so input from others more familiar with
the original ubik design would be helpful in fixing this. (Derrick has
already been helpful so far in jabber, but there's more here.)
The basic problem is that, during a write, we send a VOTE_Beacon message
to other sites using a trans id based on dbase->writeTidCounter, and we
send a DISK_Begin message to other sites using a trans id based on
dbase->tidCounter. (see ubeacon_Interact and BeginTrans)
As far as I can make out, writeTidCounter is incremented by 2 every
write transaction, and tidCounter is incremented by 2 every transaction.
So, typically, when tidCounter is large enough to roll over and be
negative, writeTidCounter is a relatively small positive number.
So when the current transaction on a non-sync-site has a negative
counter (since it is based on tidCounter), and it receives a beacon with
a positive counter (since it is based on writeTidCounter), SVOTE_Beacon
will invalidate the current transaction since the received trans ID has
a greater counter than the current trans ID. The current transaction
will then abort on the next RPC for that transaction.
My first question about this was "what is the transaction ID in
VOTE_Beacon even for?" but I think that was answered well enough by
conversing with derrick in jabber. I _think_ this is just an added
safety check to abort any write transaction if we receive a beacon from
another sync site that has another write transaction in progress. The
atid parameter in SVOTE_Beacon can probably just be ignored without ill
effects (though I haven't tried it yet), but it's one less safety check.
So, the next question is "what transaction ID should VOTE_Beacon really
be using?" I believe it is supposed to be sending the transaction ID for
the current in-progress write transaction if there is one, so we need to
be recording what that transaction was. It seems like maybe
writeTidCounter should be this, but it's not. So, we should either be
setting writeTidCounter to tidCounter on every write transaction, or we
should make DISK_Begin send writeTidCounter on new writes. I'm inclined
to go with the former, as that already happens under a UBIK_PAUSE
define; we could just make it unconditional.
Now, the last question I have is about urecovery_CheckTid itself. Since
this is supposed to check if the given tid matches the "current" tid, I
have always been confused why it checks if the given tid counter is
greater than the current tid counter, instead of just not-equal-to. If
the test was a simply inequality test, I think the above issue would
have been much more apparent and been found sooner.
So, based on all of this, I propose to do three things:
1. Set writeTidCounter = tidCounter on beginning write transactions
instead of incrementing writeTidCounter by 2.
2. Change urecovery_CheckTid to strictly check if the given transaction
id is _equal_ to the current transaction id, not just if it is equal
or happened before the current transaction.
3. To accomodate ubik sites without the change in "1.", change
SVOTE_Beacon to only check the epoch of the given transaction id,
and ignore the tid counter. Otherwise, we will abort valid
transactions due to other sites giving us a bogus transaction id in
VOTE_Beacon messages. This relaxes the supposed safety check
somewhat, but if other sites are sending bogus trans id counters, we
can't really check them.
Objections and corrections welcome. Any changes will be going through
review in gerrit, too, of course, but I haven't implemented any of this
yet, so... if there's objections to this conceptually, it'd be best to
get them out sooner.
--
Andrew Deason
adeason@sinenomine.net