[OpenAFS-devel] convergence of RxOSD, Extended Call Backs, Byte Range Locking, etc.

Matt W. Benjamin matt@linuxbox.com
Thu, 23 Jul 2009 11:54:01 -0400 (EDT)


Remarks inline.

----- "Hartmut Reuter" <reuter@rzg.mpg.de> wrote:

> Tom Keiser wrote:
> > my forthcoming paper, which I outlined in the Gerrit commentary:
> >
> > 1) extended callbacks cannot be implemented
>
> Why not?

I think potentially, the issue is this (two issues, one real, one
apparent and projected onto xcb).

(I'm doing some thinking aloud here, and with imperfect knowledge of
Tom's objections--please forgive any misstatements.)

Issue 1:  Apparent Operation Order, Conflicting Stores

Tom's concern involves non-atomicity of dataversion increments.  As I
suggested in my afsbpw 2009 talk, what matters is that there is a
consistent, -apparent- operation order (here dataversion order) shared
by all parties.  I would expect that the fileserver (not OSD servers)
will continue to be responsible for extended callback delivery, and
will do so for StoreData operations on receipt of the final store
which succeeds the operations on the involved OSD(s).
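
(To make "apparent order" concrete, here is a minimal sketch in C of
what I have in mind at the fileserver.  Every name below is mine, for
illustration only--nothing here is from the xcb draft or the OpenAFS
tree.)

#include <stdint.h>
#include <pthread.h>

/* Hypothetical delivery routine; stands in for however the fileserver
 * pushes extended callbacks to interested clients. */
extern void send_xcb_storedata(uint64_t fid, uint64_t offset,
                               uint64_t length, uint64_t dv);

static pthread_mutex_t dv_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t current_dv;

/* Called on receipt of the final store, after the OSD operations have
 * completed.  Incrementing DV and queuing the notification under one
 * lock publishes a single DV order, whatever order the OSD writes
 * actually completed in. */
uint64_t
commit_store(uint64_t fid, uint64_t offset, uint64_t length)
{
    uint64_t dv;

    pthread_mutex_lock(&dv_lock);
    dv = ++current_dv;
    send_xcb_storedata(fid, offset, length, dv);
    pthread_mutex_unlock(&dv_lock);
    return dv;
}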

As Tom notes later, if two clients A and B were attempting to write,
"nearly simultaneously," the same range R in the same file F, on a
mirrored volume, the final stored state might represent partly the
data stored by A, and partly the data stored by B--this fact will, Tom
is inferring, not actually be known by the coordinating fileserver.
To each change corresponds an increment of DV(F).  Consider that A and
B had cached data in R.  If A and B and the server support extended
callbacks, then A and B will each receive a sequence of two extended
callback StoreData notifications, each with new DVs, the changed
ranges listed, identified by the client that originated the change.
As noted, since the coordinating fileserver doesn't actually know what
bytes were written in the interval, it cannot send the correct range
invalidates to the clients.  If it attempts to send the range
invalidates corresponding to the contributing changes, there is a risk
that either A, B, or even both A and B will incorrectly retain stale
data on receipt of the callback.

However.  This problem isn't insoluble.  It appears to me that it can
be solved at two levels (i.e., potentially, solved once initially, and
a better solution implemented later):

1. it can be solved through protocol extension (more below)
2. it can be solved through conservative inference at the coordinating
   fileserver, perhaps with very limited protocol enhancement (an
   additional xcb flag)

In 2, the fileserver, knowing that F is on a mirrored OSD, infers that
the data in all of R is potentially invalid at A and B, and therefore
sends to both clients a single notification, from arbitrary origin
(could be A or B, or 0), with a flag indicating that the range is
strictly invalidated, without consideration of origin or DV.  The new
DV is the highest DV in the interval.  The state of potentially cached
data of F outside R is not affected (may be retained).
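
Sketched in C, option 2 might look like the following at the
fileserver.  The structure layout, the AFSXCB_STRICT_INVALIDATE flag,
and the delivery routine are hypothetical names of mine, not anything
from the xcb draft:

#include <stdint.h>

#define AFSXCB_STRICT_INVALIDATE 0x01 /* drop range regardless of origin/DV */

struct xcb_storedata {
    uint64_t fid;      /* file F */
    uint64_t offset;   /* start of affected range R */
    uint64_t length;   /* length of R */
    uint64_t new_dv;   /* DV the clients should adopt for F */
    uint32_t origin;   /* originating client, or 0 for unknown/arbitrary */
    uint32_t flags;
};

/* Hypothetical delivery routine: pushes the notification to every
 * client holding a callback on the file. */
extern void xcb_notify_all_interested(const struct xcb_storedata *n);

/* Two stores raced on a mirrored OSD over [off, off+len).  The server
 * cannot attribute bytes in the range to either writer, so it sends
 * one strict invalidate carrying the highest DV in the interval.
 * Cached data outside the range is untouched and may be retained. */
static void
notify_conflicting_mirrored_store(uint64_t fid, uint64_t off,
                                  uint64_t len, uint64_t dv_a,
                                  uint64_t dv_b)
{
    struct xcb_storedata n;

    n.fid    = fid;
    n.offset = off;
    n.length = len;
    n.new_dv = (dv_a > dv_b) ? dv_a : dv_b;
    n.origin = 0;
    n.flags  = AFSXCB_STRICT_INVALIDATE;
    xcb_notify_all_interested(&n);
}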

Issue 2:  Read Instability, Intersecting Store (Dirty Read)

Here, A is a client initiating a store on F, and B is a client
interested in data in a range R of F, which it has not cached.  These
operations are executed "nearly simultaneously."  In legacy AFS3, with
operations coordinated at the store site, B's view of DV(F) when its
read on R completes is of the data last stably stored at that DV.  In
OSD, that data may be more recent than the DV.  That data may be
inconsistent with respect to the state of subranges of R that may be
on different OSDs.

However.  All changes are still coordinated on a single fileserver,
and that fileserver is the one responsible for delivery of
notifications, as described.  In the scenario just sketched, B's dirty
read will be -undone-, and its view of the correct DV(F) corrected, as
soon as A's coordinating store completes.  B will receive an extended
callback StoreData notification on the affected range of A's logical
store.  It will be forced to re-read everything in the range it
remains interested in.  Any data it had read originally that was
inconsistent will be invalidated.  So this is not, in fact, a problem
with extended callback delivery at all--it is a consistency change,
discussed below.  (It is, as noted by Hartmut and/or Felix, one that
clients can reliably contain via existing and proposed locking
mechanisms.)
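
For illustration, the client side of that, reusing the hypothetical
xcb_storedata structure and flag from the sketch above; the chunk list
is a stand-in for real cache manager state, not OpenAFS internals:

struct cache_chunk {
    uint64_t offset, length;
    uint64_t dv;               /* DV at which this chunk was fetched */
    int      valid;
    struct cache_chunk *next;
};

static void
handle_xcb_storedata(struct cache_chunk *chunks,
                     const struct xcb_storedata *n)
{
    uint64_t nend = n->offset + n->length;
    struct cache_chunk *c;

    for (c = chunks; c != NULL; c = c->next) {
        uint64_t cend = c->offset + c->length;

        if (c->offset >= nend || n->offset >= cend)
            continue;          /* outside R: may be retained */

        /* Inside R: drop the chunk if the notification is a strict
         * invalidate, or if our copy predates the notified DV.  The
         * next read re-fetches, undoing any dirty read. */
        if ((n->flags & AFSXCB_STRICT_INVALIDATE) || c->dv < n->new_dv)
            c->valid = 0;
    }
}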

Tom, can you provide additional clarification about your concern as
regards extended callbacks?

>
> > 2) mandatory locking semantics cannot be implemented
>
> Why not? Once it is implemented for FetchData and StoreData the same
> code could be used in GetOSDlocation.

I believe it can be implemented, but it requires protocol enhancement,
probably a reservation-based mechanism coordinated (initially) through
the fileserver.  Also, clearly, mandatory lock enforcement is -NOT- in
AFS3 semantics, by definition.  Something new is being elaborated, but
it has already been proposed for community review in all versions of
the byte-range locking draft.  Which is not to say that draft cannot
be revised, if appropriate.
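
To suggest a shape (no more), a reservation check at the fileserver
might look like this in C.  The table and the function names are my
invention; nothing here is from the byte-range locking draft:

#include <stdint.h>

#define MAX_RESV 64

struct resv {
    uint64_t fid;
    uint64_t offset, length;
    uint32_t holder;     /* client holding the reservation */
    int      exclusive;  /* write reservation */
};

static struct resv resv_table[MAX_RESV];
static int resv_count;

/* Called from FetchData/StoreData -- and, per Hartmut's point, the
 * same check can gate GetOSDlocation before a client ever talks to
 * the OSD. */
static int
resv_permits(uint64_t fid, uint64_t off, uint64_t len,
             uint32_t client, int writing)
{
    int i;

    for (i = 0; i < resv_count; i++) {
        struct resv *r = &resv_table[i];

        if (r->fid != fid || r->holder == client)
            continue;
        if (off < r->offset + r->length && r->offset < off + len) {
            /* Overlapping reservation held by someone else: always
             * deny a write; deny a read only if it is exclusive. */
            if (writing || r->exclusive)
                return 0;
        }
    }
    return 1;
}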

>
> > 4) lack of write-atomicity entirely changes the meaning of DV in
> > the protocol
>=20
> If you want read and write from different clients at the same time
> in a way that can produce inconsistencies you should use advisory
> locking.  Also without OSD an intermediate StoreData because of the
> cache becoming full may lead to an intermediate data version which
> from the application's point of view is inconsistent.

I think I more agree with the "then use locking" response than with
the objection, but not unequivocally.  Tom's use of language here
("entirely changes the meaning") appears to me to be an attempt to
nail down a meaning for DV that it never had--but I say this as
someone who would like to incorporate stronger, negotiated semantics
in the AFS protocol.

In my view, long term, AFS protocol extension is required, not merely
to deliver enhanced (or reduced) semantics, but, in fact, a) well
defined, and b) negotiable semantics, such that clients and servers
know and agree on the semantics that hold for operations on specific
objects.  I think we must consider that going forward, the
capabilities of different implementations and provisioning choices
necessarily mean coexistence of objects whose consistency guarantees
are different.  A key objective for us, in design of future protocol,
should be to make this fact visible and useful in the protocol.  As I
state again further on, I do not think that "OSDs have this fixed
(reduced) consistency" is adequate, long term, though it seems silly
to do anything but accept that until the alternative is available.
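
As a toy illustration of what negotiable semantics could mean on the
wire, a sketch; every flag and name below is hypothetical:

#include <stdint.h>

#define SEM_ATOMIC_STORE   0x01 /* whole-store atomicity (legacy-AFS3-like) */
#define SEM_RANGE_CALLBACK 0x02 /* extended (range) callbacks supported */
#define SEM_MANDATORY_LOCK 0x04 /* mandatory byte-range locking enforced */

/* The client proposes what it can handle; the server answers with the
 * subset that actually holds for this object (an OSD-backed file, for
 * example, might not offer SEM_ATOMIC_STORE).  Both sides then know
 * exactly which guarantees are in force. */
uint32_t
negotiate_semantics(uint32_t client_wants, uint32_t object_offers)
{
    return client_wants & object_offers;
}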

>
> > 5) the data "mirroring" capability is race-prone, and thus not
> > consistent

What I understand this to mean is that, for example, if two clients A
and B were attempting to write, "nearly simultaneously," the same
range in the same file F, on a mirrored volume, the final stored state
might represent partly the data stored by A, and partly the data
stored by B, and also, as noted earlier, that data read by B
overlapping with a store by A may not reflect a consistent state of F
when A's store completes (and during the store interval, B may have
different data for F than do other clients which had cached F at the
same DV).

It appears to me that current AFS3 implementations don't permit these
scenarios to happen, and to that degree, they are not permitted under
AFS3 semantics.  We could argue about whether that's true, however,
even in general (and the two cases are distinct).  But even if we take
the strict view, we have, as Hartmut notes, not established that the
semantics of AFS3+OSD need be those of legacy AFS3 in all respects.

>
> > 7) insufficient information is communicated on the wire to support
> > the distributed transaction management necessary to ensure
> > atomicity of various operations
>
> I admit reading and writing data from/to OSDs is not as atomic as
> doing the same to a classical AFS fileserver. And I think to make it
> atomic would require substantial overhead and even more complexity.
> Therefore I think files stored in OSDs never should and never will
> replace normal AFS files totally, but this technique should be used
> where the usage pattern of the files does not require such atomicity.

I do think that, relative to enhanced semantics options we will wish
to support in future protocol revisions, there should be a fuller
discussion.  Again, the topic feels clearly forward-looking to me--not
about the current semantics of rxOSD.  I think that, as the discussion
has already hinted, the most interesting areas to examine first are in
the direction of negotiated protocol levels, supporting mandatory
locking, reservations, IO hints, etc.

Matt

--

Matt Benjamin

The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI  48104

http://linuxbox.com

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309