[OpenAFS-devel] Re: Cache corruption with RX busy code

Sat, 13 Apr 2013 01:36:31 -0500

On Fri, 12 Apr 2013 20:59:27 +0100
Simon Wilkinson <sxw@your-file-system.com> wrote:

> Various things can cause a client and server to have differing views
> on the available call channels. When the client attempts to use a call
> channel that the server thinks is in use, the server responds with a
> BUSY packet. Originally, the client would just ignore this. It would
> then look like the server wasn't responding, and the client would keep
> retrying on that channel until either the call timed out, or the
> channel on the server was freed.

Can't the race below still happen with this old behavior? Say you retry
the call 6 times before timing out (pulling numbers out of the air; I
don't remember how many it typically takes). The first 5 result in a
BUSY response. On the 6th, the server receives the packet and the call
channel is clear, but before the server's ACK reaches the client, the
client times out the call. Then the same thing you describe happens: the
server processes the request but the client thinks it failed.
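
To make that scenario concrete, here is a toy model of it. Everything in
it is made up for illustration (the names, the attempt counts, and the
simplifying assumption that an ACK is only lost on the final attempt);
it is not RX code.

```c
#include <stdbool.h>

/* Hypothetical result of one simulated call; not an RX structure. */
struct race_result {
    bool server_processed;  /* server accepted and ran the request */
    bool client_succeeded;  /* client saw an ACK before giving up */
};

/*
 * Toy model of the old "ignore BUSY and keep retrying" behavior.
 * max_attempts:    how many times the client sends the 1st packet
 * free_at_attempt: first attempt at which the server channel is free
 * ack_in_time:     whether the ACK for the final attempt arrives
 *                  before the client times out the call
 * ASSUMPTION: if the server accepts an earlier attempt, the client is
 * still retrying and eventually sees an ACK.
 */
static struct race_result
simulate_old_behavior(int max_attempts, int free_at_attempt, bool ack_in_time)
{
    struct race_result r = { false, false };
    int attempt;

    for (attempt = 1; attempt <= max_attempts; attempt++) {
        if (attempt < free_at_attempt) {
            /* Server answers BUSY; the old client ignores it and retries. */
            continue;
        }
        /* Channel is free: the server processes the request. */
        r.server_processed = true;
        if (attempt < max_attempts || ack_in_time) {
            r.client_succeeded = true;
        }
        break;
    }
    return r;
}
```

With simulate_old_behavior(6, 6, false), the server has processed the
request but the client reports failure, which is the mismatch in question.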

> The race is as follows:
> 
> Client                                        Server
>         
> Sends 1st pkt of call to server                
>                                         Receives 1st packet, but channel busy
>                                             sets error on old call
>                                             sends BUSY packet to client
> RTT expires
>     resends 1st pkt
>                                         Old call terminates
> Receives BUSY packet
>     sets call busy flag

[sorry if my client messes up the formatting, hopefully you get where
I'm referencing]

I'm not looking at the code at the moment, but don't we get the serial
of the offending packet in the BUSY we receive? Therefore, we should be
able to ignore a BUSY packet if it does not reference the most recent
serial we've sent.
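
A rough sketch of the check I have in mind. All names here are
hypothetical, and it assumes the BUSY packet really does identify the
serial of the packet that triggered it, which I have not verified:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified per-call state; not the real struct rx_call. */
struct call_state {
    uint32_t last_serial_sent;  /* serial of the newest packet we sent */
    bool busy;                  /* we believe the server channel is busy */
};

/*
 * Hypothetical view of a received BUSY packet.
 * ASSUMPTION: it carries the serial of the packet that provoked it.
 */
struct busy_packet {
    uint32_t offending_serial;
};

/* Returns true if the BUSY was acted on, false if ignored as stale. */
static bool
handle_busy(struct call_state *call, const struct busy_packet *bp)
{
    if (bp->offending_serial != call->last_serial_sent) {
        /* The BUSY refers to an older transmission; we have resent
         * since then, so the channel may have been freed. Ignore it. */
        return false;
    }
    call->busy = true;
    return true;
}
```

In the race quoted above, the BUSY would name the serial of the first
transmission while last_serial_sent is already that of the resend, so
the busy flag would not be set.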

> The question is whether just adding more cases where we invalidate the
> cache is the right approach, or whether we should reconsider the BUSY
> behaviour.

If what I described in the first paragraph indeed applies, it seems like
we'd need to invalidate cache items on any network error (not just idle
dead, or busy, or whatever). Even a normal DEAD error for the 1st packet
of an RPC doesn't guarantee that the server never received anything;
imagine the case where, say, all of the server's ACKs get dropped.

I haven't checked any of this against the code and am only looking
quickly, so I may be remembering things entirely incorrectly. But if the
above holds, eliminating such errors is impossible, and we just need to
discard cache data in every case where the outcome is uncertain.
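
Roughly, the policy would look like the sketch below. The outcome names
and the cache structure are invented for illustration and are not
OpenAFS error codes; treating an explicit server abort as a definite
answer is also an assumption on my part.

```c
#include <stdbool.h>

/* Hypothetical outcome classes for an RPC; not real OpenAFS codes. */
enum rpc_outcome {
    RPC_OK,            /* reply received */
    RPC_SERVER_ABORT,  /* explicit abort from the server (assumed definite) */
    RPC_TIMED_OUT,     /* no usable response before the call timed out */
    RPC_BUSY,          /* call given up after BUSY responses */
    RPC_DEAD           /* server considered unreachable */
};

/* Hypothetical cache entry with only a validity flag. */
struct cache_entry {
    bool valid;
};

/* True when we cannot know whether the server processed the request. */
static bool
outcome_is_uncertain(enum rpc_outcome o)
{
    switch (o) {
    case RPC_OK:
    case RPC_SERVER_ABORT:
        return false;
    default:
        /* Any network-level failure: the request may have been
         * processed even though we never saw a response. */
        return true;
    }
}

/* Discard the cached data whenever the outcome is uncertain. */
static void
finish_rpc(struct cache_entry *entry, enum rpc_outcome o)
{
    if (outcome_is_uncertain(o)) {
        entry->valid = false;
    }
}
```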

-- 
Andrew Deason
adeason@sinenomine.net