[OpenAFS-devel] Cache corruption with RX busy code

Simon Wilkinson simonxwilkinson@gmail.com
Thu, 18 Apr 2013 15:40:26 +0100


On 17 Apr 2013, at 20:40, Benjamin Kaduk wrote:

> On Sat, 13 Apr 2013, Andrew Deason wrote:
>> Can't the race below still happen with this old behavior? Say you =
retry
>> the call 6 times before timing out (pulling numbers out of the air; I
>> don't remember how many it typically takes). The first 5 result in a
>> BUSY response. On the 6th, the server receives the packet and the =
call
>> channel is clear, but before it gets an ACK to the client, the client
>> times out the call. And the same thing you describe happens; the =
server
>> processes the request but the client thinks it failed.
>=20
> My quick thought experiment agrees with Andrew that the race is still =
possible with the old behavior.

Yes, the race is completely possible with the old behaviour. However, =
with the old behaviour the client only cancels a call when it times out. =
That takes in the order of 50 seconds, and (with an RTT of 20ms) =
requires 10 retries. With idle dead disabled, or a longer idle dead =
timeout, it will take much longer.

With the new behaviour, a single packet call on a fresh channel will be =
cancelled after a single retry (requiring about 60ms).

When the server receives a packet on a busy channel, it tries to tear =
down the existing call, by setting an RX call error. With the old =
behaviour we had 50 or so seconds for the application thread to notice =
this error, and terminate the call. With the new behaviour, the server =
has 40ms.

So, a race that was extremely unlikely becomes more probable.

> It seems rather like we should be invalidating the cache "whenever the =
client thinks an RPC failed" (but we can be clever about it for RPCs =
that don't change anything, etc.).


The critical thing is that we should be invalidating the cache whenever =
an RPC that changes server state is failed due to a timeout. We've =
already discussed adding this behaviour to deal with the idle dead mess =
- we just need to add it elsewhere.

Then, there is an additional question - is the hit you take from cache =
invalidation worth terminating a call for? That's going to be a trade =
off, but I suspect that the current behaviour just doesn't allow a =
server enough time to tidy up.

Cheers,

Simon