[OpenAFS-devel] linux-and-locks-cleanup-20070202 crashes linux kernels older than 2.6.17 (see RT #53457)

Thu, 08 Feb 2007 17:54:55 -0500

Christopher Allen Wing <wingc@engin.umich.edu> writes:
...
> I think it would be possible to do something like this on linux by 
> primarily using the local linux locking code, and having a helper function 
> that attempts to change the lock state on the AFS server to:
> 
>  	single read lock
>  	single write lock
>  	no lock
> 
> upon request.  But it seems there aren't reliable means to do this. 
> There is no race-free way to transition between a read lock and write lock 
> and vice versa.  If there is an extended network failure any lock on the 
> server will time out and then, potentially, local locks might remain in 
> place- do we then have to call back into the kernel and kill all the local 
> locks?
> 
> 
> I don't know what type of kernel APIs are available on other types of 
> unix.  It would be nice if we didn't have to rewrite an entire posix file 
> locking layer in openafs, but rather, re-use kernel APIs where possible.

The problem with reusing kernel APIs is that they've evolved independently,
so they are incompatible with each other.  (Then there's Linux, which
managed to evolve APIs that are incompatible with itself.)  Additionally, to
really work right, the locking should somehow interact with file
flushing, and also with the server.  In-kernel APIs are not guaranteed
to have useful hooks for that, because local filesystem semantics do
not need that functionality.

Additionally, AFS needs to have byte range locking code in one
other place: the file server.  This can't use any kernel's locking
code; it must have its own userspace implementation.  The fileserver
already has code to track which clients are up, and what files those
clients are interested in - this is what the callback code does.
So the fileserver code should probably interact with the callback
code to manage lock recovery.  Obviously, it's possible for a client
to lose locks through network failure and there are other failures
that have to be accounted for.  There are multiple cases here where
it's simply not possible to exactly account for POSIX semantics.
DFS and NFSv4 have already had to deal with these; one easy
choice would be to try to have the same or similar behavior.
ESTALE seems to be one popular approach.
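
To make the fileserver side a bit more concrete, here is a rough userspace
sketch of a byte-range lock table, in Python for brevity.  Nothing here is
OpenAFS code: RangeLock, LockTable, try_lock and drop_client are made-up
names, and drop_client only stands in for whatever the callback code would
do once it decides a client is down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeLock:
    client: str      # hypothetical client/host identifier
    start: int       # first byte covered
    end: int         # one past the last byte covered
    exclusive: bool  # True for a write lock, False for a read lock


def overlaps(a, b):
    """Half-open ranges [start, end) overlap if each starts before the other ends."""
    return a.start < b.end and b.start < a.end


def conflicts(held, wanted):
    """Locks conflict when they overlap, belong to different clients,
    and at least one of them is a write lock."""
    return (held.client != wanted.client
            and overlaps(held, wanted)
            and (held.exclusive or wanted.exclusive))


class LockTable:
    """Per-file table of granted byte-range locks (illustrative only)."""

    def __init__(self):
        self.locks = []

    def try_lock(self, wanted):
        """Grant the lock and return True, or return False on conflict."""
        if any(conflicts(held, wanted) for held in self.locks):
            return False
        self.locks.append(wanted)
        return True

    def drop_client(self, client):
        """Discard every lock held by one client; return how many were dropped."""
        before = len(self.locks)
        self.locks = [l for l in self.locks if l.client != client]
        return before - len(self.locks)
```

What the client should see after its locks have been dropped (ESTALE or
something else) is deliberately left out of the sketch.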

The file flushing problem might not be obvious.  People lock files to
ensure parts of the file are consistent (i.e., other things can't stomp
on those parts).  Generic (i.e., not AFS-aware) applications that do file
locking assume that file locks are sufficient to ensure data
consistency between all copies of the application sharing that data.
Generic applications *might* use fsync/fdatasync/msync in addition, but
only for system crash integrity protection; applications *do not* in
general use these to ensure runtime data consistency.  So, for a local
filesystem, in-kernel locking is sufficient and "trivial".  With a
remote filesystem that does no caching (NFSv2), in-kernel locking +
some network API is sufficient.  With a remote filesystem that does
caching, it's important to ensure that once the lock is acquired, reads
after that retrieve consistent data (AFS callbacks more or less do
that), and that when a write lock is released, any writes completed
under the lock are visible to any other client that acquires a read or
write lock after that on that data - i.e., that the file data is flushed
out to the server.
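
A toy model of that last case, in Python for brevity.  FakeServer and
CachingClient are invented for illustration and assume a whole-file lock;
dropping the cache on lock() is the crudest possible stand-in for what
callbacks do, and nothing here arbitrates between clients.

```python
class FakeServer:
    """Stand-in for the fileserver: just holds the file bytes."""

    def __init__(self, data):
        self.data = bytearray(data)


class CachingClient:
    """Client that caches the whole file and writes back on unlock."""

    def __init__(self, server):
        self.server = server
        self.cache = None   # cached copy of the file, or None if invalid
        self.dirty = False  # True if the cache holds unflushed writes

    def _fill(self):
        if self.cache is None:
            self.cache = bytearray(self.server.data)

    def _flush(self):
        if self.dirty:
            self.server.data[:] = self.cache
            self.dirty = False

    def lock(self):
        # Push out anything pending, then drop the cache so that reads
        # made under the lock refetch current data from the server.
        self._flush()
        self.cache = None

    def read(self, off, n):
        self._fill()
        return bytes(self.cache[off:off + n])

    def write(self, off, buf):
        self._fill()
        self.cache[off:off + len(buf)] = buf
        self.dirty = True

    def unlock(self):
        # Writes made under the lock must reach the server before
        # anyone else can acquire the lock and read them.
        self._flush()
```

Without the flush in unlock(), a second client could take the lock and
still read the old bytes from the server, which is exactly the case generic
applications do not expect.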

For AFS, there is one other additional fun issue.  Cache manager
chunks aren't guaranteed to align with byte range locks.  Consider
the case where two clients have byte range locks over separate pieces
of one file chunk, at the same time.  You can't have both clients
write their chunk out - that would negate the value of the write
lock.  Each client has to somehow know that it must only write the
portion of the chunk that it owns.
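
One way to picture "only write the portion it owns", as a sketch in Python.
writable_pieces is a hypothetical helper, ranges are half-open (start, end)
byte offsets, and the 64k chunk size in the usage below is only an example.

```python
def writable_pieces(chunk_start, chunk_len, owned):
    """Return the parts of one cache chunk covered by this client's write locks.

    owned is a list of (start, end) half-open byte ranges held under write
    lock.  The result is a sorted list of non-overlapping (start, end)
    ranges clipped to the chunk; only these bytes should be stored back.
    """
    chunk_end = chunk_start + chunk_len
    pieces = []
    for start, end in sorted(owned):
        lo = max(start, chunk_start)
        hi = min(end, chunk_end)
        if lo >= hi:
            continue  # this lock does not touch the chunk
        if pieces and lo <= pieces[-1][1]:
            # Overlaps or abuts the previous piece: merge them.
            pieces[-1] = (pieces[-1][0], max(pieces[-1][1], hi))
        else:
            pieces.append((lo, hi))
    return pieces


# Two clients hold write locks on different parts of the same 64k chunk.
client_a = writable_pieces(0, 65536, [(0, 100)])
client_b = writable_pieces(0, 65536, [(4096, 8192)])
```

Each client then stores only its own pieces, so neither overwrites bytes
that the other holds under lock.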

So, for AFS, we would like to (eventually) have:
	cache manager byte-range file locking
		that has hooks to do file data flushing
		that can fragment chunks
	fileserver byte-range file locking
		that presumably interacts with callbacks

So, here's the roadmap.

Phase 0 "the past".  Whole-file locking semantics only.
Phase 1 "now".  Fileserver only supports whole-file locking.
	A few clients support local byte-range locking, backed up
	by serverside whole-file locking.
Phase 2 "soon".  Fileserver only does whole-file locking.
	All clients support "generic" local byte-range locking,
	backed up by serverside whole-file locking and
	some sort of simplistic file flushing when a write
	lock is released.
Phase 3 "not so soon".  Fileserver supports byte-range locking.
	Clients do byte-range locking with chunk fragmentation
	and flushing.
Phase 4 "even later".  Clients do opportunistic file locking
	and server can ask to get ranges back from client.
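
For phases 1 and 2, "backed up by serverside whole-file locking" boils down
to something like the sketch below (Python for brevity).  server_lock_needed
is a made-up name, and the sketch only picks the target state; it does
nothing about the read/write transition race raised in the quoted message.

```python
def server_lock_needed(local_locks):
    """Map a client's local byte-range locks onto one whole-file server lock.

    local_locks is a list of (start, end, exclusive) tuples for one file.
    Returns "none", "read", or "write".
    """
    if not local_locks:
        return "none"
    if any(exclusive for _start, _end, exclusive in local_locks):
        return "write"
    return "read"
```

The client would recompute this whenever a local lock is added or dropped,
and talk to the fileserver only when the answer changes.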

Possibly also:
per-volume server-side or per-PAG cache manager settings that tune how
file locking works.

					-Marcus