[OpenAFS-devel] Re: The ubik transaction ID rollover problem

Jeffrey Hutzelman jhutz@cmu.edu
Fri, 03 Sep 2010 17:49:26 -0400


--On Friday, September 03, 2010 04:00:00 PM -0500 Andrew Deason 
<adeason@sinenomine.net> wrote:

> I think we don't currently see this problem because if DBWRITING is set,
> we send a trans id counter that cannot be "wrong". Since we base it off
> of the writeTidCounter, which is always a very low positive number, it
> will always be below any active write transaction, and
> urecovery_CheckTid will not mark it as "wrong".
>
> If DBWRITING is not set, we send tidCounter+1, as you mention. If there
> is still no write transaction when it arrives, the trans id is not
> checked. If a write transaction has started in the meantime, it will
> have a higher transaction id than the one sent since it began after we
> sent the beacon. (Otherwise the sync site would have detected DBWRITING
> and would have sent writeTidCounter).

No, I think you're making an assumption of atomicity that is not true.  "It 
began" is a distributed state change which may not take effect everywhere 
at once, with respect to when our beacon is sent.  More so for the _end_ of 
a transaction, where we're transitioning in the opposite direction.  Fixing 
writeTidCounter may make this problem worse, as it will no longer tend to 
be much lower than tidCounter.

In addition, as we discussed on jabber, there are some rather significant 
thread-safety issues with pthreaded ubik.  One of those is that our 
examination of DBWRITING, tidCounter, and writeTidCounter is not atomic, 
and neither is the starting of a new local transaction atomic with respect 
to the main body of ubeacon_Interact().
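
To make the local half of that concrete, here is a rough sketch of reading
the flag and both counters under a single lock, with the start and end of a
local write transaction taking the same lock.  Everything in it is a
hypothetical stand-in (struct sketch_dbase, SKETCH_DBWRITING, the sketch_*
functions); it is not the real ubik structure or locking scheme, and it only
assumes the beacon logic described above (writeTidCounter when DBWRITING is
set, tidCounter+1 otherwise).  It does nothing for the distributed side of
the problem.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for ubik's DBWRITING flag. */
#define SKETCH_DBWRITING 0x1

/* Hypothetical, simplified database state; NOT the real ubik struct. */
struct sketch_dbase {
    pthread_mutex_t lock;      /* guards every field below */
    int flags;
    int32_t tidCounter;
    int32_t writeTidCounter;
};

/*
 * Choose the counter to send in a beacon.  The flag and both counters
 * are examined while holding one lock, so a local transaction cannot
 * start or end between the test of the flag and the read of a counter.
 */
static int32_t sketch_beacon_counter(struct sketch_dbase *db)
{
    int32_t counter;

    pthread_mutex_lock(&db->lock);
    if (db->flags & SKETCH_DBWRITING)
        counter = db->writeTidCounter;
    else
        counter = db->tidCounter + 1;
    pthread_mutex_unlock(&db->lock);
    return counter;
}

/* Start a local write transaction; counters and flag change together. */
static int32_t sketch_begin_write(struct sketch_dbase *db)
{
    int32_t tid;

    pthread_mutex_lock(&db->lock);
    db->tidCounter++;
    db->writeTidCounter = db->tidCounter;
    db->flags |= SKETCH_DBWRITING;
    tid = db->writeTidCounter;
    pthread_mutex_unlock(&db->lock);
    return tid;
}

/* End the local write transaction under the same lock. */
static void sketch_end_write(struct sketch_dbase *db)
{
    pthread_mutex_lock(&db->lock);
    db->flags &= ~SKETCH_DBWRITING;
    pthread_mutex_unlock(&db->lock);
}
```

Even with something like this, the value is only consistent at the moment it
is read; by the time the beacon reaches a remote site, a transaction may have
started or ended there.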

As I said, this is going to require some more thought.

-- Jeff