[OpenAFS-devel] convergence of RxOSD, Extended Call Backs, Byte Range Locking, etc.

Hartmut Reuter reuter@rzg.mpg.de
Thu, 23 Jul 2009 10:51:21 +0200


Tom Keiser wrote:
> Hi All,
> 
> The other day, we had a small discussion on Gerrit patch #70 regarding
> adoption of several RxOSD protocol bits by OpenAFS.  Simon and Derrick
> both suggested that I should move the discussion over to -devel so
> that it reaches a broader audience.  As a bit of background, I'm
> writing a white paper regarding the RxOSD protocol.  It's not quite
> ready for release, but I get the sense that we need to start the
> discussion sooner rather than later.  Here are several key issues from
> my forthcoming paper, which I outlined in the Gerrit commentary:
> 
> 1) extended callbacks cannot be implemented

Why not?

> 2) mandatory locking semantics cannot be implemented

Why not? Once it is implemented for FetchData and StoreData, the same
code could be used in GetOSDlocation.
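
A minimal sketch of that idea, assuming a hypothetical helper
CheckMandatoryLock() that the FetchData/StoreData handlers and
GetOSDlocation could all call; the names and structures are invented
for illustration and are not the real OpenAFS code:

#include <errno.h>
#include <stdint.h>

struct byte_range_lock {
    uint64_t offset;
    uint64_t length;
    int exclusive;              /* write lock? */
    int32_t owner;              /* lock holder */
};

/* Return 0 if the requested range may be accessed, EWOULDBLOCK if a
 * conflicting mandatory lock is held by another caller. */
static int
CheckMandatoryLock(const struct byte_range_lock *locks, int nlocks,
                   uint64_t offset, uint64_t length,
                   int writing, int32_t caller)
{
    int i;
    for (i = 0; i < nlocks; i++) {
        const struct byte_range_lock *l = &locks[i];
        int overlap = offset < l->offset + l->length
                   && l->offset < offset + length;
        if (overlap && l->owner != caller && (writing || l->exclusive))
            return EWOULDBLOCK;
    }
    return 0;
}

Calling the same routine from the FetchData, StoreData and
GetOSDlocation handlers would keep the locking semantics identical on
all three paths.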

> 3) lack of read-atomicity means that read-only clones fetches can now
> return intermediate values, thus clone transitions are no longer
> atomic

The fileserver does copy-on-write for files in OSD in exactly the same
way as it does for normal AFS files. In addition, the rxosd does not
allow write RPCs from clients to files with a link count > 1. During a
volserver operation GetOSDmetadata returns busy, and after an update
the usual callbacks are sent to the client and are honored for OSD
files as well (VerifyCache).
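
Roughly, the guard on the rxosd side could look like the following
sketch; the names are made up here and the real code is organized
differently:

#include <errno.h>

/* Illustrative only: refuse client writes to an OSD object that is
 * still shared with a read-only clone (link count > 1), so updates
 * must go through the fileserver's copy-on-write path first. */
struct osd_object {
    int link_count;
};

static int
osd_write_allowed(const struct osd_object *obj)
{
    return (obj->link_count > 1) ? EACCES : 0;
}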

> 4) lack of write-atomicity entirely changes the meaning of DV in the protocol

If you want to read and write from different clients at the same time
in a way that can produce inconsistencies, you should use advisory
locking. Even without OSD, an intermediate StoreData caused by the
cache becoming full may lead to an intermediate data version which is
inconsistent from the application's point of view.
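
For reference, this is the kind of advisory locking meant here, shown
as a plain POSIX byte-range lock as an application would request it
(how the cache manager maps it onto AFS RPCs is a separate question):

#include <fcntl.h>
#include <unistd.h>

/* Take an advisory write lock on bytes [off, off+len) before updating
 * them, so cooperating clients never see a half-written range. */
static int
lock_range(int fd, off_t off, off_t len)
{
    struct flock fl;
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = off;
    fl.l_len = len;
    return fcntl(fd, F_SETLKW, &fl);   /* blocks until the range is free */
}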

> 5) the data "mirroring" capability is race-prone, and thus not consistent


> 6) the GetOSDlocation RPC conflates a number of operations, and should be split

Basically it is used to prepare I/O to OSDs for both cases, read and
write; I don't know whether that is what you mean. It also has a kind
of debug interface for "fs osdmetadata -cm <file>" which lets you see
what the client gets returned by GetOSDlocation.

The program get_osd_location, which is called by GetOSDlocation, is
also used by the legacy interface to serve I/O to OSD for old clients.
If you use the HSM functionality of AFS/OSD, a file which is presently
off-line (on tape in an underlying HSM system) must be brought back to
an on-line OSD before the client can access it. The OSD-aware client
does this when the file is opened by the application, letting the
application wait in the open system call. For the legacy interface it
can only be done during the FetchData or StoreData RPC, so this
functionality was also put into get_osd_location. Is that what you
mean?
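
A rough sketch of that legacy path, with invented names; it only shows
where the wait would sit, not the real implementation:

#include <stdint.h>

/* Illustrative pseudo-code only; none of these names come from the
 * real AFS/OSD implementation. */
struct vnode_info;                                   /* opaque per-file state */

int file_is_offline(struct vnode_info *vn);          /* still on tape? */
int bring_online_via_get_osd_location(struct vnode_info *vn);
int serve_data_from_osd(struct vnode_info *vn, uint64_t off, uint64_t len);

static int
legacy_fetch_data(struct vnode_info *vn, uint64_t offset, uint64_t length)
{
    if (file_is_offline(vn)) {
        /* The OSD-aware client waits in open(); a legacy client can
         * only wait here, inside the FetchData or StoreData RPC. */
        int code = bring_online_via_get_osd_location(vn);
        if (code)
            return code;
    }
    return serve_data_from_osd(vn, offset, length);
}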

> 7) insufficient information is communicated on the wire to support the
> distributed transaction management necessary to ensure atomicity of
> various operations

I admit that reading and writing data from/to OSDs is not as atomic as
doing the same to a classical AFS fileserver, and I think making it
atomic would require substantial overhead and even more complexity.
Therefore I think files stored in OSDs never should and never will
replace normal AFS files entirely; this technique should be used where
the usage pattern of the files does not require such atomicity. In our
case, e.g., all user home directories, software, and so on remain in
normal AFS files. But our cell contains very many long-term archives
for data produced by experiments and other sources (digitized photo
libraries, audio and video documents) which typically are written only
once. For this kind of data atomicity is not required at all.

> 8) there is no means to support real-time parity computation in
> support of data redundancy

What should that look like? BTW, AFS/OSD keeps md5 checksums for
archival copies of files in OSD. This feature has already proved to be
very useful when we had a problem with the underlying HSM system,
DSM-HSM.
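
To illustrate the checksum idea (this is not the actual AFS/OSD code):
the digest can be computed on the fly while the archival copy is
streamed, e.g. with OpenSSL's MD5 routines, and stored with the archive
metadata for later verification:

#include <openssl/md5.h>
#include <stdio.h>

/* Sketch: compute the md5 of a file while it is read for archiving, so
 * the digest can be kept with the archival copy and checked on restore. */
static int
md5_of_stream(FILE *in, unsigned char digest[MD5_DIGEST_LENGTH])
{
    MD5_CTX ctx;
    unsigned char buf[64 * 1024];
    size_t n;

    MD5_Init(&ctx);
    while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
        MD5_Update(&ctx, buf, n);    /* same buffer that goes to the OSD */
    MD5_Final(digest, &ctx);
    return ferror(in) ? -1 : 0;
}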

> 9) osd location metadata should be cacheable

It is implemented only for embedded shared filesystems (GPFS or Lustre
/vicep partitions accessed directly from the client). I admit that,
especially in the case of reading files, it could reduce the number of
RPCs sent to the fileserver, because at the moment each chunk still
requires separate RPCs. However, my thought on this point is that it
would be better to allow the client to prefetch a reasonable number of
chunks with a single RPC.
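
One possible shape for such a prefetch, sketched as C structures in the
spirit of an XDR reply; the names and fields are invented here and are
not part of the existing interface:

#include <stdint.h>

/* Hypothetical reply for a "give me the next N chunks" RPC, so the
 * cache manager does not need one round trip per chunk. */
struct chunk_location {
    uint64_t offset;        /* start of the chunk in the file */
    uint32_t length;        /* chunk length */
    uint32_t osd_id;        /* which OSD serves it */
    uint64_t object_id;     /* object on that OSD */
};

struct chunk_location_list {
    uint32_t nchunks;                   /* how many entries follow */
    struct chunk_location chunks[32];   /* capped per reply */
};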

> 10) the wire protocol is insufficient to support any notion of data journalling

What kind of journalling do you have in mind here?

> 
> Many of these issues will eventually need to be discussed on
> afs3-standardization.  Lacking a formal internet draft, I suspect
> there may be some value in starting a discussion here.  At the very
> least, it may help us with dependency analysis of the major
> enhancements in the pipeline.  Coming out of these ten points, I see a
> few major classes of issues that will require discussion and planning:
> 
> a) convergence of RxOSD with other protocol changes (XCB, byte-range
> locking, perhaps others)
> b) changes to cache coherence, especially DV semantics
> c) tackling the thorny issue of distributed transactions
> d) future-proofing (distributed RAID-n, journalling, rw repl, etc.)
> e) protocol design issues (RXAFS_GetOSDlocation, the means by which
> location data is XDR encoded, etc.)
> f) reference implementation code issues (DAFS integration, MP-fastness
> of metadata server extensions, etc.)
> 
> 
> -Tom
> _______________________________________________
> OpenAFS-devel mailing list
> OpenAFS-devel@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-devel


-Hartmut
-----------------------------------------------------------------
Hartmut Reuter                  e-mail 		reuter@rzg.mpg.de
			   	phone 		 +49-89-3299-1328
			   	fax   		 +49-89-3299-1301
RZG (Rechenzentrum Garching)   	web    http://www.rzg.mpg.de/~hwr
Computing Center of the Max-Planck-Gesellschaft (MPG) and the
Institut fuer Plasmaphysik (IPP)
-----------------------------------------------------------------