[AFS3-std] Re: [OpenAFS-devel] convergence of RxOSD, Extended Call Backs, Byte Range Locking, etc.
Matt W. Benjamin
matt@linuxbox.com
Thu, 23 Jul 2009 11:54:01 -0400 (EDT)
Remarks inline.
----- "Hartmut Reuter" <reuter@rzg.mpg.de> wrote:
> Tom Keiser wrote:
> > my forthcoming paper, which I outlined in the Gerrit commentary:
> >
> > 1) extended callbacks cannot be implemented
>
> Why not?
I think the issue is potentially this (two issues, one real, one
apparent and projected onto xcb). (I'm thinking aloud here, with
imperfect knowledge of Tom's objections; please forgive any
misstatements.)
Issue 1: Apparent Operation Order, Conflicting Stores
Tom's concern involves the non-atomicity of data version increments.
As I suggested in my afsbpw 2009 talk, what matters is that there is a
consistent, -apparent- operation order (here, data version order)
shared by all parties. I would expect that the fileserver (not the OSD
servers) will continue to be responsible for extended callback
delivery, and will do so for StoreData operations on receipt of the
final store that concludes the operations on the involved OSD(s).
As Tom notes later, if two clients A and B were attempting to write,
"nearly simultaneously," the same range R in the same file F, on a
mirrored volume, the final stored state might represent partly the
data stored by A and partly the data stored by B--this fact, Tom is
inferring, will not actually be known by the coordinating fileserver.
Each change corresponds to an increment of DV(F). Consider that A and
B had cached data in R. If A, B, and the server support extended
callbacks, then A and B will each receive a sequence of two extended
callback StoreData notifications, each with a new DV and the changed
ranges listed, identified by the client that originated the change.
As noted, since the coordinating fileserver doesn't actually know what
bytes were written in the interval, it cannot send the correct range
invalidates to the clients. If it attempts to send the range
invalidates corresponding to the contributing changes, there is a risk
that A, B, or even both will incorrectly retain stale data on receipt
of the callback.
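To make the shape of those notifications concrete, here is a minimal
sketch, in C, of what a StoreData notification could carry; the type
and field names are my own illustration, not the wire format of the
extended callback draft:

    #include <stdint.h>

    /* Illustrative only: these names are assumptions, not the wire
     * format of the extended callback draft. */
    struct xcb_range {
        uint64_t offset;            /* start of a changed byte range */
        uint64_t length;            /* length of that range */
    };

    struct xcb_storedata_note {
        uint64_t new_dv;            /* DV(F) after this store */
        uint32_t origin;            /* client that made the change */
        uint32_t n_ranges;          /* number of entries in ranges[] */
        struct xcb_range ranges[1]; /* the changed ranges, DV-ordered */
    };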
However, this problem isn't insoluble. It appears to me that it can be
solved at two levels (i.e., potentially, solved once initially, with a
better solution implemented later):

1. it can be solved through protocol extension (more below)
2. it can be solved through conservative inference at the coordinating
fileserver, perhaps with a very limited protocol enhancement (an
additional xcb flag)
In 2, the fileserver, knowing that F is on a mirrored OSD, infers that
the data in all of R is potentially invalid at A and B, and therefore
sends both clients a single notification, from arbitrary origin (could
be A or B, or 0), with a flag indicating that the range is strictly
invalidated, without consideration of origin or DV. The new DV is the
highest DV in the interval. The state of any cached data of F outside
R is not affected (it may be retained).
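A minimal sketch of how option 2 might look at the fileserver, under
assumed names (XCB_STRICT_INVALIDATE, xcb_send_storedata); nothing
here is from an existing draft or the OpenAFS tree:

    #include <stdint.h>

    struct client;                    /* opaque: a connected cache manager */

    #define XCB_STRICT_INVALIDATE 0x1 /* hypothetical flag: drop cached
                                       * data in the range regardless of
                                       * origin or cached DV */

    /* Assumed helper that ships one extended callback StoreData
     * notification to one client. */
    extern void xcb_send_storedata(struct client *c, uint32_t origin,
                                   uint64_t offset, uint64_t length,
                                   uint64_t dv, uint32_t flags);

    /* Conservative inference for a file on a mirrored OSD: the server
     * cannot know which bytes of R each store contributed, so it sends
     * a single strict invalidate covering all of R, from arbitrary
     * origin (here 0), carrying the highest DV of the interval.
     * Cached data outside R is untouched. */
    static void
    notify_mirrored_store(struct client **watchers, int n_watchers,
                          uint64_t offset, uint64_t length,
                          uint64_t highest_dv)
    {
        int i;
        for (i = 0; i < n_watchers; i++)
            xcb_send_storedata(watchers[i], 0 /* origin */, offset,
                               length, highest_dv,
                               XCB_STRICT_INVALIDATE);
    }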
Issue 2: Read Instability, Intersecting Store (Dirty Read)
Here, A is a client initiating a store on F, and B is a client
interested in data in a range R of a file F, which it has not cached.
These operations are executed "nearly simultaneously." In legacy AFS3,
with operations coordinated at the store site, B's view of F, at the
DV(F) it observes when its read on R completes, is the data last
stably stored at that DV. In OSD, the data returned may be more recent
than the DV, and may be inconsistent with respect to the state of
subranges of R that reside on different OSDs.
However, all changes are still coordinated on a single fileserver, and
that fileserver is the one responsible for delivery of notifications,
as described. In the scenario just sketched, B's dirty read will be
-undone-, and its view of the correct DV(F) corrected, as soon as A's
coordinating store completes. B will receive an extended callback
StoreData notification on the affected range of A's logical store. It
will be forced to re-read everything in the range it remains
interested in; any data it had originally read that was inconsistent
will be invalidated. So this is not, in fact, a problem with extended
callback delivery at all--it is a consistency change, discussed below.
(It is, as noted by Hartmut and/or Felix, one that clients can
reliably contain via existing and proposed locking mechanisms.)
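For illustration, a sketch of how a cache manager could react to such
a notification; the helpers (cache_invalidate_range, cache_note_dv)
are assumptions, not the actual OpenAFS cache manager API:

    #include <stdint.h>

    struct vcache;                  /* opaque: cached state for file F */

    /* Assumed cache helpers; not the OpenAFS cache manager API. */
    extern void cache_invalidate_range(struct vcache *vc,
                                       uint64_t offset, uint64_t length);
    extern void cache_note_dv(struct vcache *vc, uint64_t dv);

    /* On receipt of an extended callback StoreData notification for F:
     * drop any cached data overlapping the stored range, so a dirty
     * read taken while the store was in flight is undone, and record
     * the new DV so subsequent reads fetch consistent data. */
    static void
    handle_xcb_storedata(struct vcache *vc, uint64_t offset,
                         uint64_t length, uint64_t new_dv)
    {
        cache_invalidate_range(vc, offset, length);
        cache_note_dv(vc, new_dv);
        /* Data cached outside [offset, offset+length) at the prior DV
         * remains valid and may be retained. */
    }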
Tom, can you provide additional clarification about your concern as
regards extended callbacks?
>
> > 2) mandatory locking semantics cannot be implemented
>=20
> Why not? Once it is implemented for FetchData and StoreData the same
> code could be used in GetOSDlocation.
I believe it can be implemented, but it requires protocol enhancement,
probably a reservation-based mechanism coordinated (initially) through
the fileserver. Also, clearly, mandatory lock enforcement is -NOT- in
AFS3 semantics, by definition. Something new is being elaborated, but
it has already been proposed for community review in all versions of
the byte-range locking draft--which is not to say that the draft
cannot be revised, if appropriate.
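As a rough sketch of the reservation idea, with entirely assumed names
(the byte-range locking draft defines no such structures): before
issuing OSD locations for a store, the fileserver would test the
requested range against outstanding reservations:

    #include <stdint.h>

    /* Hypothetical reservation record held at the coordinating
     * fileserver; none of these names come from the byte-range
     * locking draft. */
    struct brl_reservation {
        uint32_t holder;            /* client holding the reservation */
        uint64_t offset, length;    /* reserved byte range */
        int exclusive;              /* mandatory write reservation? */
        struct brl_reservation *next;
    };

    /* Returns nonzero if a store of [offset, offset+length) by
     * 'client' conflicts with an existing reservation.  A fileserver
     * enforcing mandatory semantics would refuse to issue OSD
     * locations (e.g. in GetOSDlocation) while this returns nonzero. */
    static int
    brl_conflicts(const struct brl_reservation *list, uint32_t client,
                  uint64_t offset, uint64_t length)
    {
        const struct brl_reservation *r;
        for (r = list; r != NULL; r = r->next) {
            if (r->holder == client)
                continue;           /* own reservation never conflicts */
            if (r->exclusive &&
                offset < r->offset + r->length &&
                r->offset < offset + length)
                return 1;           /* ranges overlap */
        }
        return 0;
    }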
>
> > 4) lack of write-atomicity entirely changes the meaning of DV in
> > the protocol
>
> If you want reads and writes from different clients at the same time
> in a way that can produce inconsistencies, you should use advisory
> locking. Also, without OSD, an intermediate StoreData because the
> cache is becoming full may lead to an intermediate data version
> which, from the application's point of view, is inconsistent.
I think I agree more with the "then use locking" response than with
the objection, but not unequivocally. Tom's use of language here
("entirely changes the meaning") appears to me to be an attempt to
nail down a meaning for DV that it never had--but I say this as
someone who would like to incorporate stronger, negotiated semantics
in the AFS protocol.
In my view, long term, AFS protocol extension is required, not merely
to deliver enhanced (or reduced) semantics, but, in fact, a)
well-defined, and b) negotiable semantics, such that clients and
servers know and agree on the semantics that hold for operations on
specific objects. I think we must consider that, going forward, the
capabilities of different implementations and provisioning choices
necessarily mean coexistence of objects whose consistency guarantees
are different. A key objective for us, in the design of future
protocol, should be to make this fact visible and useful in the
protocol. As I state again further on, I do not think that "OSDs have
this fixed (reduced) consistency" is adequate, long term, though it
seems silly to do anything but accept that until the alternative is
available.
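To suggest what negotiable semantics might look like, a sketch of
per-object consistency capability bits intersected by client and
server; the bit names and the negotiation shape are purely
illustrative:

    #include <stdint.h>

    /* Illustrative consistency capability bits; not from any current
     * draft.  A server advertises what it can guarantee for an
     * object, a client advertises what it requires, and both proceed
     * under the intersection. */
    #define SEM_ATOMIC_STORES    0x01  /* stores are all-or-nothing */
    #define SEM_TOTAL_DV_ORDER   0x02  /* all parties see one DV order */
    #define SEM_MANDATORY_LOCKS  0x04  /* byte-range locks enforced */
    #define SEM_RANGE_CALLBACKS  0x08  /* extended (range) callbacks */

    static uint32_t
    negotiate_semantics(uint32_t server_offers, uint32_t client_requires)
    {
        /* The client can then refuse the object, fall back to
         * whole-file semantics, or proceed, knowing exactly which
         * guarantees hold. */
        return server_offers & client_requires;
    }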
>
> > 5) the data "mirroring" capability is race-prone, and thus not
> > consistent
What I understand this to mean is that, for example, if two clients A
and B were attempting to write, "nearly simultaneously," the same
range in the same file, on a mirrored volume, the final stored state
might represent partly the data stored by A and partly the data stored
by B; and also, as noted earlier, that data read by B overlapping with
a store by A may not reflect a consistent state of F when A's store
completes (and during the store interval, B may have different data
for F than do other clients which had cached F at the same DV).
It appears to me that current AFS3 implementations don't permit these
scenarios to happen, and to that degree, they are not permitted under
AFS3 semantics. We could argue about whether that's true, however,
even in general (and the two cases are distinct). But even if we take
the strict view, we have, as Hartmut notes, not established that the
semantics of AFS3+OSD need be those of legacy AFS3 in all respects.
>
> > 7) insufficient information is communicated on the wire to support
> > the distributed transaction management necessary to ensure
> > atomicity of various operations
>
> I admit reading and writing data from/to OSDs is not as atomic as
> doing the same to a classical AFS fileserver, and I think to make it
> atomic would require substantial overhead and even more complexity.
> Therefore I think files stored in OSDs never should and never will
> replace normal AFS files totally, but this technique should be used
> where the usage pattern of the files does not require such atomicity.
I do think that, relative to the enhanced semantics options we will
wish to support in future protocol revisions, there should be a fuller
discussion. Again, the topic feels clearly forward-looking to me--not
about the current semantics of rxOSD. As the discussion has already
hinted, I think the most interesting areas to examine first are in the
direction of negotiated protocol levels, supporting mandatory locking,
reservations, IO hints, etc.
Matt
--
Matt Benjamin
The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI 48104
http://linuxbox.com
tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309