[OpenAFS-devel] Re: Cache corruption with RX busy code

Wed, 17 Apr 2013 15:40:38 -0400 (EDT)

On Sat, 13 Apr 2013, Andrew Deason wrote:

> On Fri, 12 Apr 2013 20:59:27 +0100
> Simon Wilkinson <sxw@your-file-system.com> wrote:
>
>> Various things can cause a client and server to have differing views
>> on the available call channels. When the client attempts to use a call
>> channel that the server thinks is in use, the server responds with a
>> BUSY packet. Originally, the client would just ignore this. It would
>> then look like the server wasn't responding, and the client would keep
>> retrying on that channel until either the call timed out, or the
>> channel on the server was freed.
>
> Can't the race below still happen with this old behavior? Say you retry
> the call 6 times before timing out (pulling numbers out of the air; I
> don't remember how many it typically takes). The first 5 result in a
> BUSY response. On the 6th, the server receives the packet and the call
> channel is clear, but before it gets an ACK to the client, the client
> times out the call. And the same thing you describe happens; the server
> processes the request but the client thinks it failed.

My quick thought experiment agrees with Andrew that the race is still 
possible with the old behavior.
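
To make that thought experiment concrete, here is a minimal toy model of Andrew's scenario. It is only a sketch, not Rx code: `ToyServer`, `old_client_call`, and the attempt counts are hypothetical names and numbers invented for illustration, and a BUSY reply is modelled the way the old client treated it (as silence).

```python
from dataclasses import dataclass


@dataclass
class ToyServer:
    """Hypothetical stand-in for a server with one contested call channel."""
    busy_replies_left: int   # attempts that still find the channel in use
    processed: bool = False  # set once the request has actually been executed

    def receive(self) -> str:
        if self.busy_replies_left > 0:
            self.busy_replies_left -= 1
            return "BUSY"
        self.processed = True
        return "ACK"


def old_client_call(server: ToyServer, max_attempts: int,
                    ack_arrives_in_time: bool) -> bool:
    """Return True if the client believes the call succeeded.

    Models the old behavior: BUSY is ignored, so it looks the same to the
    client as no response at all, and the client simply retries.
    """
    for _ in range(max_attempts):
        reply = server.receive()
        if reply == "ACK" and ack_arrives_in_time:
            return True
        # BUSY, or an ACK that is dropped or late: keep retrying.
    return False  # the client times out the call


# Five BUSY replies, then the sixth attempt is processed but the ACK is late.
server = ToyServer(busy_replies_left=5)
client_thinks_ok = old_client_call(server, max_attempts=6,
                                   ack_arrives_in_time=False)
print(client_thinks_ok, server.processed)
```

The client ends up believing the call failed while the server has processed it, which is the same disagreement as in the race Simon described.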

>
>> The question is whether just adding more cases where we invalidate the
>> cache is the right approach, or whether we should reconsider the BUSY
>> behaviour.
>
> If what I described in the first paragraph indeed applies, it seems like
> we'd need to invalidate cache items on any network error (not just idle
> dead, or busy, or whatever). Even a normal DEAD error for the 1st packet
> of an RPC doesn't guarantee that the server never received anything;
> imagine the case where, say, all of the server's ACKs get dropped.
>
> I'm not really checking myself here and looking rather quickly, so I may
> be remembering stuff entirely incorrectly. But if any of all that makes
> any sense, eliminating such errors is impossible and we just need to
> discard cache data for all uncertain cases.

It seems we should be invalidating the cache "whenever the client thinks
an RPC failed" (though we can be clever about it for RPCs that don't
change anything, etc.).
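
As a rough sketch of that policy (not the cache manager's actual code), something like the following. `ToyCache`, `RpcFailed`, `run_rpc`, and the `READ_ONLY_RPCS` set are all hypothetical names made up for this example, and the classification of which RPCs are read-only is an assumption for illustration.

```python
class RpcFailed(Exception):
    """Hypothetical: any error after which the client thinks the RPC failed."""


# Assumed classification for illustration only: calls that cannot change
# server state, so a failure leaves cached data no less trustworthy.
READ_ONLY_RPCS = {"FetchStatus", "FetchData"}


class ToyCache:
    """Hypothetical stand-in for the client cache, keyed by file ID."""

    def __init__(self):
        self.entries = {}

    def invalidate(self, fid):
        self.entries.pop(fid, None)


def run_rpc(cache, fid, rpc_name, send):
    """Run send(); on failure, discard cached data unless the RPC is read-only."""
    try:
        return send()
    except RpcFailed:
        # The server may or may not have processed the request, so the
        # cached copy for this file can no longer be trusted.
        if rpc_name not in READ_ONLY_RPCS:
            cache.invalidate(fid)
        raise


def always_fails():
    raise RpcFailed("timed out")


cache = ToyCache()
cache.entries["fid-1"] = b"cached data"
try:
    run_rpc(cache, "fid-1", "StoreData", always_fails)
except RpcFailed:
    pass
print("fid-1" in cache.entries)
```

The point is that the invalidation decision keys off "the client saw a failure" rather than off which particular error (idle dead, busy, DEAD) was reported.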

-Ben