[OpenAFS] R/W replication

Ted Anderson ota@transarc.com
Wed, 14 Feb 2001 09:41:43 -0500 (EST)

On 13 Feb 2001 10:29:56 -0500 Derek Atkins <warlord@MIT.EDU> wrote:
> I doubt that RW clones will ever happen.  It would imply huge
> consistency issues.  Who's job is it to make sure data is consistent
> across all the servers?

Well probably not, but we can imagine...

On Wed, 14 Feb 2001 08:03:33 -0600 Nathan Neulinger <nneul@umr.edu> wrote:
> Right, but what do you do in that circumstance?
> [client a][server b] ---/  /--- [client c][server d]
> if you have rw replicates on both servers, and the link between fails
> - a and b think they are fine, and c and d think they are fine. Who
> takes over as master? And what tells the other server that it is no
> longer master.
> What you'd probably have to do is something like forcing a quorum on
> which rw volume was master, which would mean that rw volumes on small
> distant networks would not be able to take over. ...

For R/W replication of file data I've always like the idea of using the
technique of spliting files into N shares where only M (M<N) are needed
to reconstruct the data.  This technique is akin to ECC used in memory
and RAID5 used by disks to provide redunancy without simply making
copies (mirroring).  You can achieve resistance to single failures at a
very small cost in extra storage.  This strategy is used by
MojoNation[1] and OceanStore[2].

What is interesting about this in the context of handling consistency in
general and network partitions in particular, is that these shares can
automatically implement quorum.  If M is selected to be N/2 < M < N, say
N=8 and M=5, then reads are impossible unless a majority of the servers
are up and writes can't complete until a majority of the servers have
obtained the new data.  The extra delay in sending the data could be
managed by doing more agressive write-behind and, for clients with a
thin network connection, asking the server to do the splitting and
sending.  Though in the limit of M = N-1 the extra bandwidth requirement
is very small.

This approach doesn't address the directory consistency issue and there
is the small problem of fitting it into the AFS protocol...


[1] http://MojoNation.net
[2] http://oceanstore.cs.berkeley.edu