[AFS3-std] first draft: ubik update proposal

Derrick Brashear shadow@gmail.com
Tue, 15 Feb 2011 14:07:34 -0500


On Tue, Feb 15, 2011 at 1:48 PM, Jeffrey Hutzelman <jhutz@cmu.edu> wrote:
> --On Tuesday, February 15, 2011 01:07:50 AM -0500 Derrick Brashear
> <shadow@gmail.com> wrote:
>
>>> I'm not clear on how snapshotting interacts with GetFile/SendFile and
>>> active operations. I think in practice the mechanism you need is one
>>> that allows you to "freeze" the target's databases so that active
>>> transactions read from the frozen copy, while sendfile prepares a "new"
>>> copy; note that there can be no write transactions, since writes happen
>>> only on the sync site and these calls are made only by the sync site and
>>> never to itself. Having done a snapshot and sent some new files, it must
>>> be possible to either commit the new files or discard them; recovery
>>> should only do the commit operation if it is still sync site.
>>
>> the original intent of getfilediff was for some future use, not at this
>> time.
>>
>> sendfilediff is an optimization. just because you're recovering
>> doesn't mean the extant quorum can't continue taking writes. so i take
>> writes and when sendfile to you finishes, i stop taking writes, send
>> *only* a diff, and then commit and resume taking writes, not unlike a
>> volume release.
>
> First, properly, "recovering" is something that only the sync site does.
> Other sites don't "recover"; they simply do what they're told.

which, for the purpose of this discussion i refer to as "recovering";
the master site says "take this"

if the "this" you are taking is not recovering you, i'm not really
sure what to call it.

> Still, your
> point is taken -- the sync site can send the bulk of the database while
> still handling write transactions, and then do an incremental update of some
> sort at the end.

right, that's the goal here.
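
a rough sketch of that flow from the sync site's side, in case it helps.
this is a toy model and not ubik code: the SyncSite class, its method
names, and the dict-of-blocks "database" are all made up for
illustration, and deletes are not modeled.

```python
# Hypothetical model only: none of these names exist in ubik.
class SyncSite:
    def __init__(self, blocks):
        self.blocks = dict(blocks)    # block number -> contents
        self.dirty = set()            # blocks written since the bulk send began
        self.accepting_writes = True

    def write(self, block, data):
        if not self.accepting_writes:
            raise RuntimeError("writes are paused while the diff is sent")
        self.blocks[block] = data
        self.dirty.add(block)

    def begin_bulk_send(self):
        # Start tracking changes, then hand back the copy to ship in bulk.
        self.dirty.clear()
        return dict(self.blocks)

    def finish_with_diff(self):
        # Stop taking writes and return only what changed since the bulk copy.
        self.accepting_writes = False
        return {b: self.blocks[b] for b in self.dirty}

    def resume(self):
        self.dirty.clear()
        self.accepting_writes = True


sync = SyncSite({0: b"a", 1: b"b"})
remote = sync.begin_bulk_send()   # stands in for the slow sendfile
sync.write(1, b"B")               # quorum keeps taking writes meanwhile
sync.write(2, b"c")
diff = sync.finish_with_diff()    # writes paused, only the diff goes out
remote.update(diff)
sync.resume()                     # commit done, writes resume
```

the window where writes are refused only covers the diff, not the whole
database transfer.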

> However, I think you will discover you need an operation which throws away
> changes since the snapshot, because as soon as you allow not only for
> multiple files but also for the sync site to keep taking updates during
> sendfile, there is the possibility that the sync site will stop being sync
> site, and need to abort any sends it has in progress. Previously this was
> not an issue, because even though the SendFile took time to run, it was an
> atomic operation with respect to anything that might modify the database on
> either side.

at the end which is having its database updated, you mean?
so e.g. an RPC which at the end of sending files to a site, does
either a commit or abort of the data sent for the recovery process.
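
something like the following on the receiving end, as a minimal sketch.
the Replica class and the begin_receive/receive/end_receive names are
hypothetical and not proposed RPC names; the point is only that readers
keep seeing the old copy until the sender says commit, and an abort
throws the staged data away.

```python
# Hypothetical model only: none of these names exist in ubik.
class Replica:
    def __init__(self, blocks):
        self.committed = dict(blocks)  # what read transactions see
        self.staged = None             # data received but not yet committed

    def begin_receive(self):
        self.staged = {}

    def receive(self, blocks):
        # Used for both the bulk copy and any later diff.
        if self.staged is None:
            raise RuntimeError("no receive in progress")
        self.staged.update(blocks)

    def read(self, block):
        # Reads are always served from the committed copy.
        return self.committed.get(block)

    def end_receive(self, commit):
        if self.staged is None:
            raise RuntimeError("no receive in progress")
        if commit:
            self.committed = self.staged
        self.staged = None             # on abort the staged data is dropped


replica = Replica({0: b"old"})

# Sender loses sync-site status part way through: abort.
replica.begin_receive()
replica.receive({0: b"new", 1: b"x"})
seen_during_send = replica.read(0)
replica.end_receive(commit=False)
after_abort = dict(replica.committed)

# Sender is still sync site at the end: commit.
replica.begin_receive()
replica.receive({0: b"new"})
replica.end_receive(commit=True)
after_commit = dict(replica.committed)
```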

and assuming this is openafs-specific (which it seems like a
reasonable thing for it to be; it's certainly not client-facing for
any of these changes, and mixing ubik versions would be a mess)
should we move this discussion to openafs-devel?

-- 
Derrick