[OpenAFS] R/W replication
Shafeeq Sinnamohideen
shafeeq@cs.cmu.edu
Wed, 14 Feb 2001 17:09:33 -0500 (EST)
On Wed, 14 Feb 2001, Dirk Heinrichs wrote:
> Nathan Neulinger wrote:
> >
> > Dirk Heinrichs wrote:
> > Right, but what do you do in that circumstance?
> >
> > [client a][server b] ---/ /--- [client c][server d]
> >
> > if you have rw replicates on both servers, and the link between fails -
> > a and b think they are fine, and c and d think they are fine. Who takes
> > over as master? And what tells the other server that it is no longer
> > master.
> Ok, if it is the link that fails, and not one of the servers, you are
> right. You would need some kind of conflict resolution. I wonder how the
> Coda folks solved this. They support both disconnected operation and rw
> replication. I think they use some kind of transaction processing
> system, similar to databases. Would be nice to hear from some practical
> experience with this (maybe I'll have to ask on their mailinglist).
>
Coda does write-write by having every client send writes to all servers
that it can reach. Each server keeps track of where the modification
came from and which other servers know about it. The servers normally
don't need to communicate with each other because the the client tells
them where else it sent the operation and where else it succeeded.
When reading, the client first checks the version on each of the servers
it can reach. If they're the same, it gets the data from one of them. If
they're different, the cient considers it a conflict. Simple conflicts,
like when some servers have missed updates, are handled automatically
since the client can determine from the versions what happend and force
the right data to the servers that missed it.
If client A wrote to B & C to D while partitioned, there is no clear right
version of that file. The client that first tries to read after the
partition is fixed will kick the conflict up to a user-level,
application-specific resolver, which can try to take both versions and
produce a valid merging, and if that fails, up to the user. Some
file formats, like mailboxes, are easier to merge than others.
Directories are handled as a special case. The servers log directory
operations, and when a client detects version skew on a directory, the
servers exchange logs amongst themselves and apply the equivalent missing
operations to get back to a consistent state. In the case no equivalent
exists (same update made on both sides of the partition) this also gets
flagged as a conflict that has to be fixed by a user-level tool. The
directories and the logs are maintained in a Recoverable Virtual Memory
package that provides transactional properties for VM to deal with server
crashes or hw failure. Entries get removed from the logs when it is known
that all servers have seen that operation, so they only grow when there is
an actual disconnection.
Thus the Coda scheme is optimistic, as it does not require data to be
committed everywhere for an operation to succeed, and lazy, as it only
tries to resynchronize when some client observes that there was a
divergence. The major downside of the lazyness is that the client that
first sees a divergence may not be one of the ones that caused it.
Client disconnection is more straightforwards. The client logs all
operations when disconnected and replays them on reconnection. The same
conflict resolution process happens if an connected update happend while
disconnected.
In our experience with Coda, with about 20 users, it has been fairly
stable for average workstation use. We still find bugs every now and then,
but almost never lose data. Most of the problems now occur when clients
are at the end of modem, DSL, or other lossy links and are constantly
switching between connected, disconnected and weakly-connected modes. The
administrative tools are also not as good as AFS's but we'd like to
improve in this area.
Shafeeq