I/O parallelism: was: [AFS3-std] Re: [OpenAFS-devel] convergence of RxOSD, Extended Call Backs, Byte Range

Hartmut Reuter reuter@rzg.mpg.de
Tue, 08 Sep 2009 13:26:16 +0200


Matt W. Benjamin wrote:
> Hi,
> 
> I wanted to continue this conversation, re-raising topics of I/O
> parallelism and data consistency, and opening the general topic of
> error detection and recovery.
> 
> This note looks forward to the later mail in which the signatures for
> new begin/end IO routines are described e.g., RXAFS_StartAsyncFetch,
> as well as back to Tom's original list of issues.
> 
> First, some function prototypes have been sent to this list, but not,
> I think the full set of what Hartmut, you have been working on, plus,
> I don't think Tom, you have provided feedback as yet on what has been
> posted.
> 
> Second, I think it very likely that we are clearly trying to have a
> discussion that includes rxOSD in the form it's incorporated into
> OpenAFS, but potentially opening up forward-looking discussion on
> future protocol concepts, and, where possible, potential overlap that
> will take us further in that direction.
> 
> I think it was clear that there was some consensus on the Begin/End
> IO operations, but I still have questions, on reviewing the thread.
> 
> I. I may be leaving things out (help solicited) or seem way-out-there
> (sorry), but I'll attempt to formalize my current questions and
> reactions about what's been specified so far:
> 
> 1. considering the protocol post begin/end I/O transactions (which
> include offset and range information), can we clarify that there is
> an ability for different clients (or the same client) to carry out
> non-overlapping I/O operations in parallel?  I think it was
> implicitly clear that this is provided for--is that self-evident?
>

Tom's main concern about asynchronous I/O as it was implemented in this
summer's version of OpenAFS/OSD was that it would break the concept of
data versions, because it allowed someone to modify the data while
someone else was reading it. In the meantime this problem has been
solved by implementing RXAFS_StartAsyncFetch, RXAFS_ExtendAsyncFetch,
RXAFS_EndAsyncFetch and RXAFS_StartAsyncStore, RXAFS_ExtendAsyncStore,
RXAFS_EndAsyncStore, and this code is already in production in our cell.
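
From the client side the new calls pair up roughly like this. This is a
simplified sketch only; the argument lists are abbreviated and the
prototypes actually posted to the list are authoritative:

    /* Simplified sketch: offset/length and the transaction id are shown
     * for illustration, not necessarily with the exact real fields. */
    #include <rx/rx.h>          /* struct rx_connection */
    #include <afs/stds.h>       /* afs_int32, afs_uint64 */
    #include <afs/afsint.h>     /* AFSFid */

    afs_int32 RXAFS_StartAsyncFetch (struct rx_connection *conn, AFSFid *Fid,
                                     afs_uint64 offset, afs_uint64 length,
                                     afs_uint64 *transid);
    afs_int32 RXAFS_ExtendAsyncFetch(struct rx_connection *conn, AFSFid *Fid,
                                     afs_uint64 transid);
    afs_int32 RXAFS_EndAsyncFetch   (struct rx_connection *conn, AFSFid *Fid,
                                     afs_uint64 transid);

    afs_int32 RXAFS_StartAsyncStore (struct rx_connection *conn, AFSFid *Fid,
                                     afs_uint64 offset, afs_uint64 length,
                                     afs_uint64 *transid);
    afs_int32 RXAFS_ExtendAsyncStore(struct rx_connection *conn, AFSFid *Fid,
                                     afs_uint64 transid);
    afs_int32 RXAFS_EndAsyncStore   (struct rx_connection *conn, AFSFid *Fid,
                                     afs_uint64 transid);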

The logic is rather simple: before starting any fetch or store operation
(synchronous or asynchronous), a lock is taken. For synchronous RPCs the
lock is released at the end; for asynchronous ones it is held until the
client issues the EndAsync... RPC.

If the file is already locked, the StartAsync... RPCs wait for the lock
to be released.
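
In pseudo-code the whole serialization is no more than this (a minimal
sketch with made-up helper names, not the actual fileserver code):

    #include <pthread.h>
    #include <afs/afsint.h>     /* AFSFid */

    /* Hypothetical helpers: the real code keeps its own per-file state. */
    extern int  file_is_locked(const AFSFid *fid);
    extern void lock_file(const AFSFid *fid);
    extern void unlock_file(const AFSFid *fid);

    static pthread_mutex_t transLock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  transCond = PTHREAD_COND_INITIALIZER;

    /* StartAsyncFetch/StartAsyncStore: wait until the file is free, then
     * take the lock; it stays held until the matching EndAsync... RPC.
     * A synchronous FetchData/StoreData does the same around its one RPC. */
    static void
    start_io(const AFSFid *fid)
    {
        pthread_mutex_lock(&transLock);
        while (file_is_locked(fid))
            pthread_cond_wait(&transCond, &transLock);
        lock_file(fid);
        pthread_mutex_unlock(&transLock);
    }

    /* EndAsyncFetch/EndAsyncStore (or the end of a synchronous RPC):
     * drop the lock and wake up waiting StartAsync... calls. */
    static void
    end_io(const AFSFid *fid)
    {
        pthread_mutex_lock(&transLock);
        unlock_file(fid);
        pthread_cond_broadcast(&transCond);
        pthread_mutex_unlock(&transLock);
    }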

Right now the ranges are transmitted and stored in the appropriate
struct, but they are not yet evaluated; that means the whole file is
locked. More sophisticated techniques could be implemented in the
future, however.
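
If byte-range locking is implemented some day, evaluating the stored
ranges boils down to a simple overlap test, roughly like this (sketch
only, hypothetical names):

    #include <afs/stds.h>       /* afs_uint64 */

    /* Two requests on the same file would only have to wait for each
     * other if their byte ranges intersect and at least one of them is
     * a store.  Ranges are half-open intervals [offset, offset+length). */
    static int
    ranges_conflict(afs_uint64 off1, afs_uint64 len1, int store1,
                    afs_uint64 off2, afs_uint64 len2, int store2)
    {
        if (!store1 && !store2)
            return 0;               /* two fetches never conflict */
        return off1 < off2 + len2 && off2 < off1 + len1;
    }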

> a) if yes, then I would like to raise the question -when- should such
> operations be allowed (see II.3)
> 
> 2. final disposition on data version;  I think it is agreed that the
> coordinating file server assigns successive data versions as and when
> (possibly parallel) mutating I/O operations complete--is that
> agreed/self-evident?
> 
> 3. the current proposal (read as above) allows I/O operations on a
> contiguous byte range to be atomic, but not others--for example, I
> could imagine a structured I/O description which allows for a
> sequence of operations, on one or more files, to constitute a
> transaction.  I can imagine file server/OSD implementations which
> would make a fuller transactional semantics useful.  Would it be out
> of the question to future proof in this direction?
> 
> II.  Hartmut, one of your most recent mails raised the issue of
> handling non-completing I/O operations, and I think that provides a
> good segue to my next questions:
> 
> 1. what is the overall data consistency guarantee for current OSD
> volumes, and looking forward, extended ones we might define?
> 
> a. since the coordinating file server allocates data versions post
> hoc, a server acting as an OSD no longer has a mechanism to track
> them;  I can imagine ways forward, though I am not certain I
> understand all the issues

The OSD server "rxosd" is a completely stupid and passive device. It
doesn't know anything about data versions; it is just external storage.

> 
> 1) clients could send data version information with every component
> I/O operation--this costs nothing, and provides information which may
> be used for reliable I/O strategies, error state identification, and
> recovery
> 
> 2) an additional operation finalizing -component- I/O operations
> could be added, which transfers the final data version to OSDs
> 
> b. any mutating I/O operation which fails to complete, or to complete
> successfully, puts the distributed system in an inconsistent state

This is true, but it can happen with a classical AFS file as well. If a
StoreData doesn't send the promised amount of data, for whatever reason,
we have an inconsistency. The fileserver gives that new, incomplete
version of the file a new version number, and there is no way to get the
old version back, unless it had been copied on write immediately before.
The situation with OSDs isn't any worse; only the chance of something
strange happening is higher because of the larger number of partners
involved.

> 
> 1) a not uncommon problem is likely to be component I/O operation
> failure due to network partitioning--recovery for this case seems
> possible, in that the client could re-try the operation on the
> coordinating fileserver, and if successful, complete the I/O
> transaction as normal;  perhaps it already does this?
> 
> (but what if it fails?  see II.3)
> 
> 2) Integrity checking
> 
> 2a) Tom raised the question of data checksumming;  Currently, I
> believe we rely on rx packet checksums and integrity checking to
> accomplish reliability guarantees.  Newer filesystems such as ZFS
> have raised the visibility of data checksum operations.  Logically it
> seems possible to identify approaches which checksum data only in
> component I/O, and others which involve the coordinating file
> servers.  Is it possible that we should incorporate placeholders for
> this?  Tom, do you have ideas about what would be required?
> 
> (and, what if it fails?  see II.3)
> 
> 2b) Tom raised the question of support for parity computation in
> support of raidn OSD implementation.  Clearly this could make the
> protocol substantially more useful for some applications.  Is it
> possible that we should incorporate placeholders for this?  Tom, do
> you have ideas about what would be required?

That would have to happen in the cache manager, I suppose, and could be
quite slow/expensive. All modern RAID systems use specialized chip sets
to do it, certainly because normal computer CPUs are not fast enough.
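
Just to give an idea of the work involved: for RAID-5 style parity the
cache manager would have to touch every data byte at least once more per
stripe, something like this plain C sketch (no hardware assist,
illustrative only):

    #include <stddef.h>

    /* XOR parity over nblocks data blocks of blocksize bytes each.
     * Every byte written to the OSDs would pass through a loop like
     * this; RAID controllers do the same thing in specialized hardware. */
    static void
    xor_parity(unsigned char *parity, unsigned char *const data[],
               int nblocks, size_t blocksize)
    {
        size_t i;
        int b;

        for (i = 0; i < blocksize; i++) {
            unsigned char p = 0;
            for (b = 0; b < nblocks; b++)
                p ^= data[b][i];
            parity[i] = p;
        }
    }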

Hartmut

> 
> (and, what if it fails?  see II.3)
> 
> 3) Tom raised the question of support for data journalling.  Perhaps
> this seems way-out-there, but in fact, the current generation of
> local file system technology actually supports low-cost point-in-time
> snapshot functionality.  In that context, I think I can imagine
> (achievable) implementations supporting a protocol in which the
> transaction boundary is extended to the component I/O servers (OSDs)
> and commit/rollback semantics were implemented on it.  I think it
> would be valuable to distinguish between strong and weak
> transactions--the former supporting full nesting and isolation
> guarantees (which possibly are not provided by any local file system
> implementation today), and the latter rolling back to a consistent
> state but potentially lacking isolation (and so rolling back
> concurrent operations), and being much easier and less costly (both
> senses) to implement.  I sense that rollback would be controlled by
> coordinating file server, either at the request of a client holding
> an open transaction, or itself, on detecting failed, mutating
> operations.
> 
> 
> Thanks,
> 
> 
> Matt
> 
> 
> (Thanks to Tom for read-through and feedback, all errors mine.)
> 
> 
> ----- "Hartmut Reuter" <reuter@rzg.mpg.de> wrote:
> 
>> Tom's idea to have a Start-of-I/O-rpc and a Stop-I/O-rpc to enforce
>>  data consistency is great. I think it would not be very difficult
>> to implement this.
>> 
>> Caching of the information returned by GetOSDlocation could reduce 
>> traffic on the wire, but is not really essential. So if we still do
>> one GetOSDlocation per I/O we can use GetOSDlocation as
>> Start-of-I/O-rpc.
>> 
>> So for write I would propose that the fileserver has to keep the 
>> information about Fid, offset, length, host, and time in a table or
>>  chain and keep it there until the storeMini has happened. So also 
>> extended callbacks for file ranges would become possible. For the 
>> write case storeMini would function as End-of-I/O-rpc.
>> 
>> While the entry for write exists all incoming GetOSDlocation RPCs 
>> have to wait. This is the same behavior as happens for FetchData or
>> StoreData while another StoreData has the write lock on the vnode.
>> 
>> 
>> Up to this point everything would work fine also with the clients
>> out here in our cell.
>> 
>> However, there could be reads still under way while a new write is 
>> starting. It's not that probable because unfortunately reads are 
>> always for single chunks only, but it's still possible. To protect
>> also these reads requires an End-of-I/O-rpc for read. A new bit in
>> the flag used in GetOSDlocation could indicate that the client
>> promises to send an appropriate RPC at the end of a read operation.
>> 
>> With this flag set, GetOSDlocation would also create (or find an
>> existing) entry in the aforementioned table or chain. A field
>> readers would be incremented and after the I/O is finished
>> decremented by the End-of-I/O-rpc. As long as there are readers
>> write requests have to wait.
>> 
>> The legacy interface for non rxosd prepared clients, of course,
>> would have to honor this table as well. But here things are easier
>> because everything happens within a single rpc (FetchData or
>> StoreData).
>> 
>> An open question is how the fileserver should handle missing 
>> End-of-I/O-rpcs. Therefore the timestamp field. The 
>> FiveMinuteCheckLWP could look for timed-out transactions....
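
For concreteness, one entry in that table or chain would carry roughly
the following fields (an illustrative sketch only; the names are not
taken from real code):

    #include <afs/stds.h>       /* afs_int32, afs_uint32, afs_uint64 */
    #include <afs/afsint.h>     /* AFSFid */

    /* Sketch of one entry in the fileserver's table of active I/O
     * transactions; field names are assumptions, not the real struct. */
    struct asyncTransaction {
        struct asyncTransaction *next;
        AFSFid fid;                 /* which file */
        afs_uint64 offset, length;  /* range announced by the client */
        afs_uint32 host;            /* client that started the I/O */
        afs_uint32 time;            /* start time, so FiveMinuteCheckLWP
                                     * can expire abandoned transactions */
        afs_int32  readers;         /* concurrent fetches; a write has to
                                     * wait until this drops to zero */
    };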
>> 
>> -Hartmut 
> 


-- 
-----------------------------------------------------------------
Hartmut Reuter                  e-mail 		reuter@rzg.mpg.de
			   	phone 		 +49-89-3299-1328
			   	fax   		 +49-89-3299-1301
RZG (Rechenzentrum Garching)   	web    http://www.rzg.mpg.de/~hwr
Computing Center of the Max-Planck-Gesellschaft (MPG) and the
Institut fuer Plasmaphysik (IPP)
-----------------------------------------------------------------