I/O parallelism: was: [AFS3-std] Re: [OpenAFS-devel] convergence of RxOSD, Extended Call Backs, Byte Range

Matt W. Benjamin matt@linuxbox.com
Sun, 6 Sep 2009 17:07:58 -0400 (EDT)


Hi,

I wanted to continue this conversation, re-raising the topics of I/O parallelism and data consistency, and opening the general topic of error detection and recovery.

This note looks forward to the later mail in which the signatures for new begin/end I/O routines (e.g., RXAFS_StartAsyncFetch) are described, as well as back to Tom's original list of issues.

First, some function prototypes have been sent to this list, but not, I think, the full set of what you, Hartmut, have been working on; and I don't think you, Tom, have yet provided feedback on what has been posted.

Second, I think we are clearly trying to have a discussion that includes rxOSD in the form in which it is incorporated into OpenAFS, while potentially opening up forward-looking discussion of future protocol concepts and, where possible, identifying overlap that will take us further in that direction.

I think there was some consensus on the Begin/End I/O operations, but on reviewing the thread I still have questions.

I. I may be leaving things out (help solicited) or seem way-out-there (sorry), but I'll attempt to formalize my current questions and reactions to what has been specified so far:

1. considering the protocol once begin/end I/O transactions (which include offset and range information) are added, can we clarify that different clients (or the same client) are able to carry out non-overlapping I/O operations in parallel?  I think it was implicitly clear that this is provided for--is that self-evident?  (A sketch of how the signatures might carry the needed range information follows below.)

a) if yes, then I would like to raise the question of -when- such operations should be allowed (see II.3)
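
To make the question concrete, here is a minimal C sketch--the routine names echo what has been mentioned on the list, but every field and parameter here is my assumption, not the prototypes actually posted--of begin/end RPCs carrying enough range information for the fileserver to admit non-overlapping operations in parallel:

    /* Hypothetical sketch only; names and parameters are assumed,
     * not taken from the posted prototypes. */
    struct AsyncParams {
        afs_uint64 offset;      /* start of the byte range under I/O */
        afs_uint64 length;      /* extent of the range */
        afs_uint32 flags;       /* e.g., read vs. mutating I/O */
    };

    afs_int32 RXAFS_StartAsyncFetch(struct rx_call *call, AFSFid *fid,
                                    struct AsyncParams *params,
                                    afs_uint64 *transId);
    afs_int32 RXAFS_EndAsyncFetch(struct rx_call *call, AFSFid *fid,
                                  afs_uint64 transId);

With offset and length present in the start call, the fileserver can make a new range wait only when it overlaps one already in flight.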

2. final disposition on data version: I think it is agreed that the coordinating file server assigns successive data versions as and when (possibly parallel) mutating I/O operations complete--is that agreed/self-evident?
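
In other words--a minimal sketch, assuming nothing more than a per-vnode lock and counter, not anything in the posted code--the completion path might look like:

    #include <pthread.h>

    /* Assumed illustration only: the coordinating fileserver serializes
     * data-version assignment even when the I/O itself ran in parallel,
     * so versions follow completion order. */
    struct vnode_state {
        pthread_mutex_t lock;
        unsigned long long dataVersion;
    };

    static unsigned long long
    complete_mutating_io(struct vnode_state *vn)
    {
        unsigned long long dv;

        pthread_mutex_lock(&vn->lock);   /* completions are serialized */
        dv = ++vn->dataVersion;          /* next version, completion order */
        pthread_mutex_unlock(&vn->lock);
        return dv;
    }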

3. the current proposal (read as above) allows I/O operations on a contiguous byte range to be atomic, but not others--for example, I could imagine a structured I/O description which allows a sequence of operations, on one or more files, to constitute a transaction.  I can imagine file server/OSD implementations which would make fuller transactional semantics useful.  Would it be out of the question to future-proof in this direction?  (A speculative sketch follows.)
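
Purely as speculative future-proofing--every name here is invented for illustration--such a structured description might look like:

    /* Speculative sketch: group several byte-range operations, possibly
     * on several files, into one transaction to be applied atomically
     * or rolled back as a unit. */
    struct IoOp {
        AFSFid     fid;        /* target file */
        afs_uint64 offset;
        afs_uint64 length;
        afs_uint32 opType;     /* fetch or store */
    };

    struct IoTransaction {
        afs_uint64   transId;
        afs_uint32   nOps;
        struct IoOp *ops;      /* applied in order, all-or-nothing */
    };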

II.  Hartmut, one of your most recent mails raised the issue of handling non-completing I/O operations, and I think that provides a good segue to my next questions:

1. what is the overall data consistency guarantee for current OSD volumes and, looking forward, for extended ones we might define?

a. since the coordinating file server allocates data versions post hoc, a server acting as an OSD no longer has a mechanism to track them; I can imagine ways forward, though I am not certain I understand all the issues:

1) clients could send data version information with every component I/O operation--this costs nothing, and provides information which may be used for reliable I/O strategies, error state identification, and recovery

2) an additional operation finalizing -component- I/O operations could be added, which transfers the final data version to the OSDs (both ideas are sketched below)
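
A minimal sketch combining both ideas; the RPC names and parameters here are my assumptions, not part of any posted interface:

    /* Hypothetical: a component write carries the data version the
     * client believes is current (idea 1), and a finalize RPC pushes
     * the fileserver-assigned final version down to the OSD (idea 2). */
    afs_int32 RXOSD_writeRange(struct rx_call *call, AFSFid *fid,
                               afs_uint64 offset, afs_uint64 length,
                               afs_uint64 expectedDataVersion);

    afs_int32 RXOSD_FinalizeIo(struct rx_call *call, AFSFid *fid,
                               afs_uint64 transId,
                               afs_uint64 finalDataVersion);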

b. any mutating I/O operation which fails to complete, or to complete successfully, puts the distributed system in an inconsistent state

1) a not uncommon problem is likely to be component I/O operation failure due to network partitioning--recovery for this case seems possible, in that the client could retry the operation on the coordinating fileserver and, if successful, complete the I/O transaction as normal; perhaps it already does this?

(but what if it fails?  see II.3)

2) Integrity checking

2a) Tom raised the question of data checksumming.  Currently, I believe we rely on rx packet checksums and integrity checking to accomplish reliability guarantees.  Newer filesystems such as ZFS have raised the visibility of data checksum operations.  Logically it seems possible to identify approaches which checksum data only in component I/O, and others which involve the coordinating file servers.  Is it possible that we should incorporate placeholders for this?  Tom, do you have ideas about what would be required?  (One possible placeholder shape is sketched below.)

(and, what if it fails?  see II.3)
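
One possible shape for such a placeholder; the structure and the checksum types are assumptions for illustration only:

    /* Hypothetical placeholder: a component I/O request or reply could
     * carry a checksum over the transferred range, verifiable end to
     * end and independent of rx packet checksums. */
    struct IoIntegrity {
        afs_uint32 checksumType;   /* e.g., 0 = none, 1 = CRC32c, 2 = SHA-1 */
        afs_uint32 checksumLen;    /* bytes used in checksum[] */
        char       checksum[64];   /* large enough for the types above */
    };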

2b) Tom raised the question of support for parity computation in support of RAID-n OSD implementations.  Clearly this could make the protocol substantially more useful for some applications.  Is it possible that we should incorporate placeholders for this?  Tom, do you have ideas about what would be required?  (A toy illustration follows below.)

(and, what if it fails?  see II.3)
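
As a toy illustration only, with nothing protocol-specific assumed: RAID-5-style parity over n data stripes is just the bytewise XOR of the stripes, which an OSD or client could compute or verify if the protocol reserved a place for it:

    #include <stddef.h>
    #include <string.h>

    /* Toy example: compute RAID-5-style parity as the bytewise XOR of
     * nStripes equal-length data stripes. */
    static void
    xor_parity(const unsigned char **stripes, int nStripes,
               size_t stripeLen, unsigned char *parity)
    {
        int s;
        size_t i;

        memset(parity, 0, stripeLen);
        for (s = 0; s < nStripes; s++)
            for (i = 0; i < stripeLen; i++)
                parity[i] ^= stripes[s][i];
    }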

3) Tom raised the question of support for data journalling.  Perhaps this seems way-out-there, but in fact the current generation of local file system technology supports low-cost point-in-time snapshot functionality.  In that context, I can imagine (achievable) implementations supporting a protocol in which the transaction boundary is extended to the component I/O servers (OSDs), with commit/rollback semantics implemented on top of it.  I think it would be valuable to distinguish between strong and weak transactions--the former supporting full nesting and isolation guarantees (which possibly are not provided by any local file system implementation today); the latter rolling back to a consistent state but potentially lacking isolation (and so rolling back concurrent operations), and being much easier and less costly (in both senses) to implement.  I sense that rollback would be controlled by the coordinating file server, either at the request of a client holding an open transaction, or on its own initiative, on detecting failed mutating operations.  (A speculative sketch of such RPCs follows.)
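
A speculative sketch of weak-transaction RPCs; all names are invented, and the mapping onto local snapshots is my assumption rather than anything specified so far:

    /* Hypothetical: the coordinating fileserver drives commit/rollback;
     * an OSD maps StartTrans onto a low-cost local snapshot and
     * AbortTrans onto a rollback to that snapshot. */
    afs_int32 RXOSD_StartTrans(struct rx_call *call, AFSFid *fid,
                               afs_uint64 transId);  /* take snapshot */
    afs_int32 RXOSD_CommitTrans(struct rx_call *call,
                                afs_uint64 transId); /* discard snapshot */
    afs_int32 RXOSD_AbortTrans(struct rx_call *call,
                               afs_uint64 transId);  /* restore snapshot */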


Thanks,


Matt


(Thanks to Tom for read-through and feedback, all errors mine.)


----- "Hartmut Reuter" <reuter@rzg.mpg.de> wrote:

> Tom's idea to have a Start-of-I/O-rpc and a Stop-I/O-rpc to enforce
> data
> consistency is great. I think it would not be very difficult to
> implement this.
>
> Caching of the information returned by GetOSDlocation could reduce
> traffic on the wire, but is not really essential. So if we still do
> one GetOSDlocation per I/O we can use GetOSDlocation as the
> Start-of-I/O-rpc.
>
> So for write I would propose that the fileserver has to keep the
> information about Fid, offset, length, host, and time in a table or
> chain and keep it there until the storeMini has happened. So extended
> callbacks for file ranges would also become possible. For the write
> case storeMini would function as the End-of-I/O-rpc.
>
> While the entry for write exists, all incoming GetOSDlocation RPCs
> have to wait. This is the same behavior as for FetchData or StoreData
> while another StoreData has the write lock on the vnode.
>
> Up to this point everything would work fine also with the clients out
> here in our cell.
>
> However, there could be reads still under way while a new write is
> starting. It's not that probable, because unfortunately reads are
> always for single chunks only, but it's still possible. Protecting
> these reads as well requires an End-of-I/O-rpc for read. A new bit in
> the flag used in GetOSDlocation could indicate that the client
> promises to send an appropriate rpc at the end of a read operation.
>
> With this flag set, GetOSDlocation would also create (or find an
> existing) entry in the aforementioned table or chain. A readers field
> would be incremented, and decremented by the End-of-I/O-rpc after the
> I/O is finished. As long as there are readers, write requests have to
> wait.
>
> The legacy interface for clients not prepared for rxosd would, of
> course, have to honor this table as well. But here things are easier
> because everything happens within a single rpc (FetchData or
> StoreData).
>
> An open question is how the fileserver should handle missing
> End-of-I/O-rpcs; hence the timestamp field. The FiveMinuteCheckLWP
> could look for timed-out transactions....
>
> -Hartmut
> -----------------------------------------------------------------
> Hartmut Reuter                  e-mail                 reuter@rzg.mpg.de
>                                    phone                  +49-89-3299-1328
>                                    fax                    +49-89-3299-1301
> RZG (Rechenzentrum Garching)           web    http://www.rzg.mpg.de/~hwr
> Computing Center of the Max-Planck-Gesellschaft (MPG) and the
> Institut fuer Plasmaphysik (IPP)
> -----------------------------------------------------------------
> _______________________________________________
> OpenAFS-devel mailing list
> OpenAFS-devel@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-devel

-- 

Matt Benjamin

The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI  48104

http://linuxbox.com

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309

