[OpenAFS] Further TransArc -> OpenAFS musings/planning

Wed, 16 Aug 2006 17:07:50 -0400

Ken Hornstein <kenh@cmf.nrl.navy.mil> writes:
> AFAIK, Ubik's concept is a "shared disk file"; at the level replication
> is handled, it has no idea what the underlying database format is.
> So while I also prefer to have Ubik copy it over, just using cp or
> whatever should be fine (all of the AFS databases use network byte
> order internally).

This is completely correct.  For what it's worth, it is completely
safe to use "cp" or whatever to replicate your database, *provided*
you don't have any db servers running (or at least no server process
on your target host and no sync sites) when you do this, and your copy
completes successfully.  It is also completely safe (and takes fewer
brain cells) to let ubik do it for you.

You can use "openssl sha1" (or "sum" or whatever) to ensure your
copy succeeded; a copy is only identical to the original if it
matches byte for byte.  Of course it is possible to have a database
which is logically equivalent and nearly identical except for the
disk label (an epoch + counter value pair which ubik updates on
every write).
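
If you'd rather script the check than eyeball two digests, something
along these lines should do.  This is only a sketch: the function names
are made up for illustration, and it assumes both copies are readable
as ordinary files from wherever you run it.

```python
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read until f.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copies_match(original, copy):
    """True only if the two files are bytewise identical (same SHA-1)."""
    return sha1_of_file(original) == sha1_of_file(copy)
```

Run it with the db servers stopped, same as the copy itself, so that
neither file changes underneath you.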

It *is* possible to clobber your database if you rely on ubik
to propagate and screw up adding or moving servers.  The catch
is if you bring up enough servers to establish a quorum, but don't
give them a valid copy of the database, they'll create a new
empty one.  That new copy will overwrite any older copy you might
subsequently supply.  It's a wise idea to make a backup of your
ubik database before you make this kind of change.  Of course
you should probably be making periodic offline copies for disaster
recovery purposes anyways.

In case you're curious:

The ubik file proper actually consists of a 64-byte header (mostly
unused), followed by the contents of the database.  ubik proper
presents a byte-seekable, random-access byte array to the application,
but actually tracks and propagates data changes in terms of buffers
of 1024 bytes.  Since those buffers are offset by 64 bytes, this is
slightly non-optimal for modern filesystems.  The 64-byte header
contains the ubik database label plus some other trivial stuff.
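
To make the alignment point concrete, here is a small sketch of the
offset arithmetic.  The 64 and 1024 come from the description above;
the function names and the 4096-byte filesystem block size are
assumptions picked for illustration, not anything taken from the ubik
source.

```python
UBIK_HEADER_SIZE = 64    # header at the front of the file
UBIK_BUFFER_SIZE = 1024  # unit in which changes are tracked

def file_offset(app_offset):
    """Map an application-visible offset to an offset in the disk file."""
    return UBIK_HEADER_SIZE + app_offset

def buffer_index(app_offset):
    """Which 1024-byte buffer an application offset falls in."""
    return app_offset // UBIK_BUFFER_SIZE

def buffer_file_range(index):
    """Half-open [start, end) byte range of a buffer within the disk file."""
    start = UBIK_HEADER_SIZE + index * UBIK_BUFFER_SIZE
    return start, start + UBIK_BUFFER_SIZE

def fs_blocks_touched(index, fs_block_size=4096):
    """Filesystem blocks covered by one buffer (block size is an assumption)."""
    start, end = buffer_file_range(index)
    return list(range(start // fs_block_size, (end - 1) // fs_block_size + 1))
```

With 4096-byte blocks, every fourth buffer straddles two filesystem
blocks because of the 64-byte shift; without the header each buffer
would sit inside exactly one block.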

Most of the various openafs databases are "hash" tables -- they consist
of a fixed size header with some counts & such, one or more hash tables
consisting entirely of links, and then fixed sized records whose
size varies by the database.  In the case of ptserver, those records
are 192 bytes each.  A viced which contains a lot of groups or users
may overflow into a number of additional chained records.
There are no provisions to dynamically resize the hash table
structure in an existing database.
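
A toy model of that shape, in case it helps.  This is emphatically not
the ptserver on-disk layout: the table size, the hash function, and the
record fields are all invented for the example.  It only shows what a
fixed number of buckets with linked records means for lookups as the
database grows.

```python
HASH_SIZE = 8  # hypothetical; fixed when the database is created

class ToyDb:
    def __init__(self):
        # Each bucket holds a link (record index); 0 means "no record".
        self.buckets = [0] * HASH_SIZE
        # Slot 0 is reserved so that 0 can serve as the null link.
        self.records = [None]

    def _hash(self, key):
        return key % HASH_SIZE

    def insert(self, key, value):
        b = self._hash(key)
        # New record goes at the head of its bucket's chain.
        self.records.append({"key": key, "value": value,
                             "next": self.buckets[b]})
        self.buckets[b] = len(self.records) - 1

    def lookup(self, key):
        """Return (value, records_visited); value is None if not found."""
        idx = self.buckets[self._hash(key)]
        visited = 0
        while idx:
            rec = self.records[idx]
            visited += 1
            if rec["key"] == key:
                return rec["value"], visited
            idx = rec["next"]
        return None, visited
```

Since the bucket count never changes, the chains just keep getting
longer as entries are added, and a miss has to walk the whole chain.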

When ubik establishes a sync site, the sync site is in charge of
acquiring the "definitive" database.  It will pull it from another db
host if necessary, then propagate it back out to everybody, then
relabel the database on all machines.  There's short-circuit logic to
avoid propagating the database to machines that already have the
current copy.  There is also a "redo" log which allows each host to
provide proper commit/abort semantics and recover cleanly from
unexpected system termination.  The process uses rx (of course), and
copies to one machine at a time.  If your database is large enough that
network throughput might be an issue, some external offline means of
syncing data might be faster.  This is particularly true if you have
many sites and can keep multiple network paths busy.
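
A rough model of the pull/push and short-circuit steps, for the
curious.  This is a sketch and not ubik's code: it assumes labels are
ordered by epoch first and counter second, it leaves out the relabel
step, quorum voting and the log entirely, and the host names and
function name are made up.

```python
def plan_propagation(labels, sync_site):
    """labels maps host -> (epoch, counter); returns (best_label, steps)."""
    # Tuples compare epoch first, then counter (assumed ordering).
    best_host = max(labels, key=lambda h: labels[h])
    best = labels[best_host]
    steps = []
    # The sync site fetches the best copy if it does not already have it.
    if labels[sync_site] != best:
        steps.append(("pull", best_host, sync_site))
    # Then it sends to the others, one at a time, skipping current hosts.
    for host in sorted(labels):
        if host != sync_site and labels[host] != best:
            steps.append(("push", sync_site, host))
    return best, steps
```

For example, with labels {"db1": (200, 1), "db2": (100, 7),
"db3": (200, 1)} and db1 as sync site, the only step is a push from db1
to db2.  Each step is a full database transfer, which is where the
throughput concern comes from.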

				-Marcus Watts