[OpenAFS-devel] Re: Cache corruption with RX busy code

Andrew Deason adeason@sinenomine.net
Thu, 18 Apr 2013 10:52:24 -0500


On Thu, 18 Apr 2013 07:49:28 -0400
chas williams - CONTRACTOR <chas@cmf.nrl.navy.mil> wrote:

> > I'm not looking at the code at the moment, but don't we get the serial
> > of the offending packet in the BUSY we receive? Therefore, we should be
> > able to ignore a BUSY packet if it does not reference the most recent
> > serial we've sent.
> 
> Even if we did, I think that a race between two new connections would
> both have the same starting serial number (even if the starting serial
> numbers were random, it would just make the problem more unlikely but
> still possible).

We have the CID and epoch, too. I meant all of that in the context of
the relevant connection, which we can identify.
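
A rough sketch of that check, in illustrative Python rather than RX's C
code. It assumes the BUSY packet carries the serial of the packet that
triggered it, which is exactly the part not yet verified against the
code. `ConnKey`, `BusyPacket`, `ref_serial`, and `ClientState` are
hypothetical names made up for this sketch, not RX structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnKey:
    """Identifies a connection by epoch and connection ID (CID)."""
    epoch: int
    cid: int


@dataclass
class BusyPacket:
    epoch: int
    cid: int
    # ASSUMPTION: serial of the packet that provoked this BUSY.
    ref_serial: int


class ClientState:
    def __init__(self):
        # Most recent serial sent on each connection.
        self.last_serial_sent = {}

    def record_send(self, key, serial):
        self.last_serial_sent[key] = serial

    def should_honor_busy(self, busy):
        """Honor a BUSY only if it refers to the latest packet we sent
        on the connection it names; otherwise treat it as stale."""
        key = ConnKey(busy.epoch, busy.cid)
        last = self.last_serial_sent.get(key)
        return last is not None and busy.ref_serial == last


state = ClientState()
state.record_send(ConnKey(epoch=1, cid=0x10), serial=7)
current = state.should_honor_busy(BusyPacket(1, 0x10, ref_serial=7))
stale = state.should_honor_busy(BusyPacket(1, 0x10, ref_serial=6))
other_conn = state.should_honor_busy(BusyPacket(2, 0x10, ref_serial=7))
```

Because the lookup is keyed on (epoch, CID), two new connections that
start at the same serial number are still tracked separately.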

> > I'm not really checking myself here and looking rather quickly, so I
> > may be remembering stuff entirely incorrectly. But if any of all
> > that makes any sense, eliminating such errors is impossible and we
> > just need to discard cache data for all uncertain cases.
> 
> This does seem to be the case.  RX doesn't have a three way handshake
> like TCP so I don't think this race is fixable without a protocol
> change.

That doesn't solve it, either; it just moves the problematic "did the
server get it" packet to the first data packet after the handshake.

Nothing "solves" this (beyond detecting that it might have happened).
It's the Two Generals' Problem: neither side can ever be sure its last
packet arrived, so it's not solvable.
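
To make the ambiguity concrete, here is a toy model in Python. It is not
RX code, and `run_exchange` and `must_discard_cache` are hypothetical
names invented for the illustration. It only shows that a lost request
and a lost reply look identical from the client's side, so the most the
client can do is detect the uncertain case and drop the affected cache
data.

```python
def run_exchange(request_delivered, reply_delivered):
    """Model one request/reply exchange over a lossy channel.

    Returns (server_applied, client_saw_reply).
    """
    # The server applies the operation only if the request arrived.
    server_applied = request_delivered
    # The client sees a reply only if both directions succeeded.
    client_saw_reply = request_delivered and reply_delivered
    return server_applied, client_saw_reply


def must_discard_cache(client_saw_reply):
    """Without a reply the client cannot know what the server did,
    so the conservative policy is to discard the cached data."""
    return not client_saw_reply


lost_request = run_exchange(request_delivered=False, reply_delivered=False)
lost_reply = run_exchange(request_delivered=True, reply_delivered=False)

# Same client-side observation, different server-side state.
same_view = lost_request[1] == lost_reply[1]
different_state = lost_request[0] != lost_reply[0]
```

Adding a handshake in front does not change this picture; the same two
cases show up for the first data packet sent after it.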

-- 
Andrew Deason
adeason@sinenomine.net