[OpenAFS-devel] convergence of RxOSD, Extended Call Backs, Byte Range Locking, etc.

Tom Keiser tkeiser@sinenomine.net
Thu, 23 Jul 2009 11:20:33 -0400


On Thu, Jul 23, 2009 at 4:51 AM, Hartmut Reuter<reuter@rzg.mpg.de> wrote:
> Tom Keiser wrote:
>> Hi All,
>>
>> The other day, we had a small discussion on Gerrit patch #70 regarding
>> adoption of several RxOSD protocol bits by OpenAFS.  Simon and Derrick
>> both suggested that I should move the discussion over to -devel so
>> that it reaches a broader audience.  As a bit of background, I'm
>> writing a white paper regarding the RxOSD protocol.  It's not quite
>> ready for release, but I get the sense that we need to start the
>> discussion sooner rather than later.  Here are several key issues from
>> my forthcoming paper, which I outlined in the Gerrit commentary:
>>
>> 1) extended callbacks cannot be implemented
>
> Why not?

Use of storeMini to increment the DV does not pass the necessary
byte-range invalidation information on to the callback layer.  We can't
use the data passed into GetOSDlocation because there's no way to
correlate the calls.
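
To make the gap concrete, here is a rough strawman of my own (not taken
from any draft) of the minimum a byte-range extended callback would have
to carry so a cache manager could invalidate only the affected range; a
storeMini-driven DV bump supplies none of these fields:

#include <stdint.h>

/* Strawman only: no such structure exists in the current XCB or RxOSD
 * material.  It lists the information the callback layer would need but
 * never receives when storeMini bumps the DV. */
struct xcb_range_invalidate {
    uint32_t volume;    /* FID of the affected file */
    uint32_t vnode;
    uint32_t unique;
    uint64_t offset;    /* first byte made stale */
    uint64_t length;    /* number of stale bytes */
    uint64_t new_dv;    /* data version after the store completed */
};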

>
>> 2) mandatory locking semantics cannot be implemented
>
> Why not? Once it is implemented for FetchData and StoreData the same
> code could be used in GetOSDlocation.
>

JHutz already explained this quite well.  Putting his remarks into the
RxOSD context:

using the same logic in GetOSDlocation fails to meet mandatory locking
criteria for these reasons:

* GetOSDlocation has no knowledge of the status of in-flight I/O transactions
* read I/Os do not notify the metadata server of their completion, so
protocol changes are required (a sketch of the missing check follows below)
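
As a concrete illustration, here is a rough C sketch of the overlap
check a metadata server would have to perform before granting a
mandatory lock.  The in-flight I/O table it walks is hypothetical; the
current protocol cannot populate it, precisely because read I/Os never
report completion and GetOSDlocation only marks the start of I/O:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical record of one in-flight I/O transaction; nothing in the
 * current RxOSD protocol lets the metadata server maintain this list. */
struct inflight_io {
    uint64_t offset;
    uint64_t length;
    int      is_write;
    struct inflight_io *next;
};

/* Returns nonzero if [offset, offset+length) overlaps an in-flight I/O
 * that conflicts with the requested lock mode. */
static int
lock_conflicts_with_inflight(const struct inflight_io *head,
                             uint64_t offset, uint64_t length,
                             int exclusive)
{
    const struct inflight_io *io;

    for (io = head; io != NULL; io = io->next) {
        int overlap = io->offset < offset + length &&
                      offset < io->offset + io->length;
        if (overlap && (exclusive || io->is_write))
            return 1;
    }
    return 0;
}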


>> 3) lack of read-atomicity means that fetches from read-only clones can
>> now return intermediate values, thus clone transitions are no longer
>> atomic
>
> The fileserver does a copy-on-write for files in OSD in exactly the same
> way as it does for normal AFS files. Also, the rxosd doesn't allow write
> RPCs from clients to files with a link count > 1. During the volserver
> operation GetOSDmetadata returns busy, and after an update the usual
> callbacks are sent to the client and also honored for OSD files
> (VerifyCache).

The Lamport diagrams necessary to prove this are complicated, and I
haven't had time to vet them.  I should like to defer this issue.

>
>> 4) lack of write-atomicity entirely changes the meaning of DV in the protocol
>
> If you want to read and write from different clients at the same time in
> a way that can produce inconsistencies, you should use advisory locking.
> Also without OSD, an intermediate StoreData because the cache is becoming
> full may lead to an intermediate data version which, from the
> application's point of view, is inconsistent.
>

This isn't about whether advisory locking should be used.  I think we
all agree that's a requirement for consistency at the application
level (unless we were to standardize and implement the protocol
primitives necessary to support atomic operations, such as conditional
stores).  Rather, the issue I want to raise is the significant
redefinition of DV in RxOSD.  I think it's important to decompose this
problem into two orthogonal issues.

First, at the _protocol_ level, afs3 is strongly cache coherent.  The
ability to make arbitrary-length and -offset I/Os allows the protocol
to neatly side-step the problems that come with standard modulo 2^N
cache blocking models.  As a result, it is entirely possible to
implement an afs3 client which provides for fully coherent access to
the backing store.  The proviso is that said client must only flush
changed byte ranges.

Moving to the real world, let's grant up-front that typical afs3
clients have a relatively weak cache coherence model.  Unlike DFS's
MESI-like coherence token mechanism, we do not have a means to enforce
store-store atomicity (because we like to flush file data in chunks,
and VM subsystems can only notify us of changes on page granularity).
However, we do have sufficient infrastructure in-place to provide for
store-load atomicity (sans async XCB).  RxOSD removes our ability to
ensure store-load atomicity as well.

Another important point to consider is that from a data read
perspective, DV no longer represents an atomic point-in-time snapshot
of the file bit-string.  Rather, DV merely becomes a means of
asserting that something changed at some undefined point in the past.
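
To illustrate what is lost, consider the coherence check every cache
manager performs; the sketch below is simplified and uses invented
names, not the actual OpenAFS structures.  In classical afs3 the fetch
reply carries the data and the DV those bytes belong to in one atomic
reply, so the equality below is a point-in-time guarantee.  Under RxOSD
the bytes come from the OSD and the DV from the fileserver in separate,
unordered steps, so the same equality only says something was unchanged
at some earlier instant:

#include <stdint.h>

/* Simplified, hypothetical per-file status a client caches. */
struct file_status {
    uint64_t data_version;    /* DV the cached bytes correspond to */
};

static int
cached_data_still_valid(uint64_t cached_dv, const struct file_status *st)
{
    /* Classical afs3: equal DVs imply the cached bytes are an exact
     * point-in-time image of the file.  Under RxOSD, equal DVs no
     * longer imply that, because the data path and the DV are not
     * observed atomically. */
    return cached_dv == st->data_version;
}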

While I don't love the existing afs3 coherence model, I think there
needs to be a serious discussion about whether/how it should evolve.
This isn't the right forum for such a discussion.  We desperately need
a protocol draft so that we can take this discussion over to
afs3-standardization, and hopefully make forward progress.


>> 5) the data "mirroring" capability is race-prone, and thus not consisten=
t
>
>
>> 6) the GetOSDlocation RPC conflates a number of operations, and should be split
>
> Basically it is used to prepare I/O to OSDs for both cases: read and
> write. I don't know whether you mean this. It also has kind of a debug
> interface for "fs osdmetadata -cm <file>" to allow you to see what the
> client gets returned on GetOSDlocation.

I mean that the procedure conflates the following operations:

1) start of I/O
2) fetch of osd location metadata
3) conditionally, allocation of backing store

I can understand the desire for (1) and (3) to be combined into a
single RPC.  They are both non-idempotent operations, and it is an
opportunity to save round-trips.  However, (2) is an idempotent
operation, and there is significant value in allowing the client to
cache this metadata for certain access patterns.  Simply sending an
osd location metadata version ordinal in a hypothetical osd
start-of-I/O RPC would even make this cache coherent.
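
To sketch what I mean (the split RPCs and names like the version ordinal
below are invented for illustration; none of this is in the current
protocol), a client could cache the idempotent location metadata and
refresh it only when a start-of-I/O reply advertises a newer ordinal:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical cached OSD location metadata for one file. */
struct osd_location {
    uint32_t osd_id;
    uint64_t object_id;
};

struct osd_loc_cache_entry {
    int      valid;
    uint64_t loc_version;      /* version ordinal of the cached metadata */
    struct osd_location loc;
};

/* Return the cached location, refetching via the (hypothetical)
 * idempotent metadata RPC only when the start-of-I/O reply carried a
 * newer version ordinal than what we have cached. */
static const struct osd_location *
get_location(struct osd_loc_cache_entry *ent, uint64_t advertised_version,
             int (*fetch_location)(struct osd_location *out,
                                   uint64_t *version_out))
{
    if (!ent->valid || ent->loc_version < advertised_version) {
        if (fetch_location(&ent->loc, &ent->loc_version) != 0)
            return NULL;
        ent->valid = 1;
    }
    return &ent->loc;
}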


>
> The program get_osd_location which is called by GetOSDlocation is also
> used for the legacy interface to serve I/O to OSD for old clients. If
> you use the HSM functionality of AFS/OSD, a file which is presently
> off-line (on tape in an underlying HSM system) must be brought back to
> an on-line OSD before the client can access it. In the OSD-aware client
> this is done already when the file is opened, letting the application
> wait in the open system call. Only for the legacy interface must it be
> done during the FetchData or StoreData RPC. Therefore this
> functionality was also put into get_osd_location.
> Is that what you mean?
>
>> 7) insufficient information is communicated on the wire to support the
>> distributed transaction management necessary to ensure atomicity of
>> various operations
>
> I admit reading and writing data from/to OSDs is not as atomic as doing
> the same to a classical AFS fileserver. And I think to make it atomic
> would require substantial overhead and even more complexity. Therefore I
> think files stored in OSDs never should and never will totally replace
> normal AFS files, but this technique should be used where the usage
> pattern of the files does not require such atomicity. In our case e.g.
> all user home directories, software ... remain in normal AFS files. But
> our cell contains a great many long-term archives for data produced by
> experiments and other sources (digitized photo libraries, audio and
> video documents) which typically are written only once. So for this
> kind of data atomicity is not required at all.

I think we are ceding important problem space by building a false
dichotomy.  It is certainly possible to build a start-of-I/O RPC
which, depending upon policy bits, conditionally allocates a
transaction identifier, and further mandates that the client send an
end-of-I/O RPC to the metadata server.  If the metadata server doesn't
push such a mandate, fully-concurrent access is allowed.  If it does
allocate a transaction ID, then strong coherence and atomicity
guarantees can be made.  This would have no performance impact (modulo
a few conditional branches) on sites wishing to do fully-concurrent
I/O.  The upside is it would allow sites to deploy osd in environments
where strong coherence and atomicity are requirements.
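
Roughly, the client-side flow I have in mind looks like the sketch
below.  StartIO, EndIO, and the TXN_REQUIRED policy bit are invented
names; the point is only that strong coherence costs one conditional
branch plus, for sites that opt in, one extra RPC at end of I/O:

#include <stdint.h>

#define TXN_REQUIRED 0x1       /* hypothetical policy bit from the server */

struct start_io_reply {
    uint32_t policy_flags;
    uint64_t txn_id;           /* meaningful only if TXN_REQUIRED is set */
};

/* start_io/do_io/end_io stand in for the hypothetical RPCs and for the
 * actual data transfer against the OSD. */
static int
osd_io(int (*start_io)(struct start_io_reply *r),
       int (*do_io)(void),
       int (*end_io)(uint64_t txn_id))
{
    struct start_io_reply r;
    int code, ecode;

    if ((code = start_io(&r)) != 0)
        return code;

    code = do_io();

    /* Only sites whose policy demands strong coherence pay for this RPC. */
    if (r.policy_flags & TXN_REQUIRED) {
        ecode = end_io(r.txn_id);
        if (code == 0)
            code = ecode;
    }
    return code;
}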

>
>> 8) there is no means to support real-time parity computation in
>> support of data redundancy
>
> What should that look like? BTW, AFS/OSD keeps md5 checksums for archival
> copies of files in OSD. This feature already proved to be very useful
> when we had a problem with the underlying HSM system DSM-HSM.
>

I don't think it particularly matters what the parity mechanism looks
like.  My core point is that nothing higher than raid-0 is a
possibility until the metadata server has the ability to mediate all
in-flight I/O transactions.


>> 9) osd location metadata should be cacheable
>
> It is implemented only for embedded shared filesystems (GPFS or Lustre
> /vicep partitions accessed directly from the client). I admit that
> especially in the case of reading files it could reduce the number of
> RPCs sent to the fileserver, because each chunk still requires separate
> RPCs. However, my thought on this point is that it would be better to
> allow the client to prefetch a reasonable number of chunks in a single
> RPC.
>

That works fine for deterministic access patterns.  What happens when
there is significant non-determinism, or the patterns are simply too
complicated to discern?


>> 10) the wire protocol is insufficient to support any notion of data journalling
>
> What kind of journaling do you have in mind here?
>

Data journalling normally implies transactional data update semantics
(e.g. MVCC).  In order for something like this to be implemented,
there must be some notion of distributed transactions, and a
distributed transaction manager.
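
For a rough idea of the scope (nothing below exists in the current
protocol; the field names are invented), such a transaction manager
would at minimum have to track something like the following per update,
plus a prepare/commit exchange with every participating OSD:

#include <stdint.h>

enum txn_state { TXN_PREPARED, TXN_COMMITTED, TXN_ABORTED };

/* Hypothetical record a distributed transaction manager would keep for
 * one MVCC-style data update spanning one or more OSDs. */
struct osd_txn_record {
    uint64_t txn_id;          /* allocated by the metadata server */
    uint64_t base_dv;         /* data version the update was derived from */
    uint64_t commit_dv;       /* data version visible once committed */
    uint32_t n_participants;  /* OSDs that must acknowledge the commit */
    enum txn_state state;
};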

Regards,

-Tom