[OpenAFS] afs semantics

Sat, 10 Jun 2006 21:37:06 -0400

On Saturday, June 10, 2006 07:40:25 AM -0400 Jeffrey Altman 
<jaltman@secure-endpoints.com> wrote:

> Adam Megacz wrote:
>> The people who write darcs (an incredibly powerful/flexible version
>> control system) are looking into making sure that it works properly on
>> AFS, and were looking for an authoritative, official statement of
>> exactly how AFS file semantics differ from UNIX semantics:
>>
>>   http://bugs.darcs.net/issue117
>>
>> Specifically, can anybody comment on these points?
>>
>>   1. If two processes on different clients both attempt to
>>      open(O_CREAT|O_EXCL), does AFS guarantee that no more than one of
>>      them will succeed?
>
> It should.  There have been recent reports that this may not be true
> on some platforms because of a bug.  However, insufficient data
> has been collected to determine if in fact this is a bug.  If it is a
> bug I suspect it is a bug in the client on some but not all platforms.

Yes; modulo bugs, two simultaneous exclusive creates of the same name in 
the same directory will not both succeed.  Actually, I think the bug to 
which Jeff is referring is not a violation of this guarantee -- it results 
in _neither_ of the creates succeeding.
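
To make the guarantee concrete, here is a minimal sketch of an exclusive
create used as a lock file.  The helper name try_create_exclusive is made
up for illustration and is not taken from darcs or OpenAFS.  Only EEXIST
is treated as "lost the race"; any other error is passed to the caller,
since a create can fail for unrelated reasons.

```python
import errno
import os
import tempfile


def try_create_exclusive(path):
    """Return True if this call created path, False if it already existed."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except OSError as e:
        if e.errno == errno.EEXIST:
            return False  # the name already exists: someone else created it
        raise  # any other failure is not the "already exists" case
    os.close(fd)
    return True


if __name__ == "__main__":
    # Hypothetical lock path in a scratch directory, for demonstration only.
    lock = os.path.join(tempfile.mkdtemp(), "lock")
    print(try_create_exclusive(lock))
    print(try_create_exclusive(lock))
```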

>>   2. If two processes both attempt to rename() the same [source] file,
>>      does AFS guarantee that exactly one of them succeeds?
>
> It should.  Again, if this is not true it would be a bug in the client.

Well, it guarantees that _at most_ one succeeds.  Of course, it is possible 
for both operations to fail for reasons having nothing to do with the race. 
However, assuming one succeeds, the other should fail with ENOENT.

Note that this only works when renaming a file within a volume, and only if 
the rename would not result in a single file having links in more than one 
directory.  A rename() call that would violate these constraints will 
instead return EXDEV.
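
A caller that wants to tell these outcomes apart might do something like
the following.  This is only a sketch; the function name claim_by_rename
and the result strings are invented for illustration.

```python
import errno
import os


def claim_by_rename(src, dst):
    """Try to rename src to dst and report the outcome as a short string."""
    try:
        os.rename(src, dst)
    except OSError as e:
        if e.errno == errno.ENOENT:
            # The source (or some path component) is missing; in a race,
            # this is what the losing rename should see.
            return "lost"
        if e.errno == errno.EXDEV:
            # Crossing volumes, or a file with links in another directory.
            return "exdev"
        raise  # failures unrelated to the race go to the caller
    return "won"
```

A result of "exdev" means the caller has to fall back to some other
strategy, such as copying the file and removing the original, which is
no longer atomic.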

>>   3. If client "A" makes two inode-level changes (creat, remove,
>>      rename, etc), is it ever possible for client "B" to see the
>>      second change before the first one?
>
> Not possible.  AFS does not distribute changes to clients.  It simply
> notifies clients that the known state of the object has changed.  The
> client could find out about the first change or the combination of the
> first and second changes, but never the second and not the first.

I'm not sure what you mean here by "inode-level"; the system calls 
mentioned are characterized by the fact that they result in changes to 
directory contents.  For operations with this property, and with respect to 
a single directory, Jeff's analysis is correct.  Changes to the 
authoritative copy of a directory are always performed at the fileserver, 
never by clients, and these changes are serialized.  Clients never receive 
partial directory updates from the fileserver; if a directory changes, the 
client must fetch a complete new copy of the directory.  This fetch is 
always done in a single RPC, so the client always receives a complete, 
self-consistent copy of the directory.  If a single client makes two 
changes in some particular order, other clients will always see the changes 
in the same order, because no version of the directory ever existed which 
contains only the second change.

Note that this guarantee applies only with respect to any one directory. 
For changes to multiple directories, a considerably more complex analysis 
is required, and depending on the situation, changes might become visible 
to other clients in the "wrong" order.

> It is possible to test whether or not a path is located within AFS by
> using "fs whichcell <path>".  If it returns successfully, you have an
> AFS path.  If not, then not.  Darcs might want to test this to determine
> whether or not alternative behaviors should be used.

Of course, it's also possible to do this using the corresponding pioctl, if 
you are willing to grow a dependency on AFS libraries or libkafs.
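
For the command-line approach, a rough sketch follows.  It assumes that
"fs whichcell" exits with a nonzero status for a path outside AFS, which
should be verified on the platforms you care about.  The function name
is_afs_path is made up, and a machine with no "fs" binary at all is
treated as "not AFS".

```python
import subprocess


def is_afs_path(path):
    """Guess whether path is in AFS by running "fs whichcell <path>"."""
    try:
        result = subprocess.run(
            ["fs", "whichcell", path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return False  # "fs" is missing or cannot be run: assume not AFS
    # Assumption: a zero exit status means the path is in AFS.
    return result.returncode == 0
```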

Some more direct responses to the questions Juliusz is actually asking:

> Thanks, although I'd prefer authoritative docs to a FAQ entry.

There is no authoritative documentation at this level, and there never has 
been.  The FAQ is the closest you're going to get, but if you ask precise 
questions on openafs-devel, you're likely to get authoritative answers.

Note that the afs3-standardization list is about the AFS 3 _protocol_, and 
in fact is primarily about extending that protocol and resolving 
ambiguities in a consistent way, so as to maintain interoperability.  While 
a complete protocol specification would be nice, writing it does not seem 
to be high on anyone's to-do list.  What this list is explicitly _not_ 
about is defining the behavior and semantics of any particular 
implementation, including OpenAFS.  So, the semantics of the UNIX system 
call interface with respect to AFS are out of scope.

>> Hard links:                                             [ User ]
>>
>>       In AFS, hard links (eg: ln old new) are only valid within a
>>       directory.
>
> This will definitely break ``darcs get'' and ``optimize --relink''
> (anything else?).  We can work around the issue, but you'll have to
> tell us in what way link(2) fails when the above constraint is
> violated.

Attempts to create hard links between files in different AFS directories 
will fail with EXDEV.  For the case of links in different volumes, this 
check is done early, in the client (though of course, the link attempt 
still fails even if the client skips that check).  For attempts to create 
a link in a 
different directory in the same volume, the check is done fairly late, and 
so you're more likely to get errors like EACCES or EISDIR, if those apply.

Note that you can rename a file from one directory to another within the 
same volume, as long as the file does not have more than one existing link. 
Attempts to rename a file with multiple links, or to rename a file into a 
different volume, will fail with EXDEV.
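
One possible workaround, sketched here under the assumption that a plain
copy is an acceptable substitute for a hard link, is to fall back only
when link(2) reports EXDEV.  The name link_or_copy is hypothetical.
Errors such as EACCES or EISDIR are deliberately not treated as a reason
to copy, since they indicate a different problem.

```python
import errno
import os
import shutil


def link_or_copy(src, dst):
    """Hard-link src to dst, copying instead on EXDEV; return what was done."""
    try:
        os.link(src, dst)
        return "link"
    except OSError as e:
        if e.errno != errno.EXDEV:
            raise  # EACCES, EISDIR, etc. are real errors, not a cue to copy
    shutil.copy2(src, dst)  # copies file contents and basic metadata
    return "copy"
```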

>  - how does open(O_CREAT | O_EXCL) work?

It works as advertised - if the specified file already exists, the 
operation fails with EEXIST.

>   - is link(2) consistent w.r.t. link and open?
>   - is rename(2) consistent w.r.t. rename and open?

I'm not sure what is meant here by "consistent".  Where I come from, an 
operation that is "consistent" is one that never transitions from a valid 
state to an invalid one.  All of open, rename, and link have this property, 
and all of them obey the same rules with respect to what are considered 
valid states of the filesystem.

>       AFS does not support byte-range locking within a file,
>       although lockf() and fcntl() calls will return 0 (success).
>
> This is careless.  I fully agree that SVR4-style locks are brain-
> damaged beyond hope, but fcntl(2) over AFS should fail with ENOSYS
> rather than returning success!

We try fairly hard to support whatever locking interfaces are available on 
any given platform.  Regardless of the interface used, AFS supports 
whole-file locking, both between processes on the same client system and 
between client systems.  It does not implement partial-file locking at all, 
because applications which actually _rely_ on fine-grained locking also 
tend to rely on such locks to act as fine-grained data consistency 
barriers, and such semantics would be quite difficult for AFS to support 
between clients.
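
An application that wants to stay within what AFS supports can simply
ask for a lock on the whole file.  The following is a minimal sketch
using the fcntl-style interface; the function name lock_whole_file is
invented, and the file descriptor is assumed to be open for writing, as
an exclusive fcntl lock requires.

```python
import fcntl


def lock_whole_file(fd, blocking=False):
    """Take an exclusive lock on the whole file; return False if it is busy."""
    flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
    try:
        # Default start 0 and length 0 cover the file from its first byte
        # to end of file, i.e. a whole-file lock and not a byte range.
        fcntl.lockf(fd, flags)
    except (BlockingIOError, PermissionError):
        return False  # EAGAIN or EACCES: another process holds the lock
    return True
```

The lock is released with fcntl.lockf(fd, fcntl.LOCK_UN), or when the
file is closed.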

> Adam, I really need authoritative documentation on
>
>   - consistency properties of AFS;
>   - restrictions of Unix system calls on AFS.

There is no complete, authoritative documentation on these issues.  As I 
mentioned above, someone asking specific, well-defined questions on the 
openafs-devel list (openafs-devel@openafs.org) would be likely to get 
authoritative answers.

-- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
   Sr. Research Systems Programmer
   School of Computer Science - Research Computing Facility
   Carnegie Mellon University - Pittsburgh, PA