[AFS3-std] Re: [OpenAFS-devel] convergence of RxOSD, Extended Call Backs, Byte Range Locking, etc.

Fri, 24 Jul 2009 11:39:44 -0400

Hartmut Reuter wrote:

> I can't use the vnode->lock for this kind of locking, anyway, because
> the End-of-I/O-rpc wouldn't run in the same thread. So I have planned a
> counter for ongoing reads (write can only start if that came down to 0)
> a counter for waiters (to know whether End-of-I/O-rpc has to wake
> someone or just can free the struct) and, of course a writer field which
> contains the ip-address of the writing client or 0 if there is no write
> in progress.
> 
> But all these are implementation details which have nothing to do with
> the AFS3 protocol and can be changed later if it seems appropriate.
> 
> -Hartmut

Hartmut:

The issue to which Jeff Hutzelman is referring concerns RXAFS_SetLock,
RXAFS_ReleaseLock, and RXAFS_ExtendLock.  As you know, these RPCs are
used to manage the CM-FS transactions for file locks.  A CM requests a
lock with SetLock, extends the lifetime of the lock every five minutes
with ExtendLock, and releases the lock with ReleaseLock.

The problem is that there is no magic cookie, lockId, or transactionId
returned as part of the SetLock call.  Therefore, when the FS receives
an ExtendLock or ReleaseLock call, it does not know whether the request
came from the CM that issued the original SetLock.

An ExtendLock can be issued and will succeed as long as the lock count
is non-zero.  If a client that holds no lock is issuing ExtendLock calls
on a FID, those calls will fail until another client obtains a read
lock, at which point the extension will succeed even though no lock was
ever issued to the extending client.

Similarly, a ReleaseLock can be issued and will succeed on a FID even
when there is no outstanding lock issued to the CM performing the
release.
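
To make the two failure modes above concrete, here is a minimal sketch
in Python.  It is a deliberately simplified model and not the actual
file server code: the class name, the method names, and the string used
as a FID are all made up for illustration, and the only state kept per
FID is a count of outstanding read locks.

```python
class CountOnlyFileServer:
    """Simplified model: per-FID lock state is just a count of holders."""

    def __init__(self):
        self.lock_count = {}  # fid -> number of outstanding read locks

    def set_lock(self, client, fid):
        # Nothing is returned that identifies this particular lock instance.
        self.lock_count[fid] = self.lock_count.get(fid, 0) + 1

    def extend_lock(self, client, fid):
        # 'client' is ignored: the only check is that the count is non-zero.
        return self.lock_count.get(fid, 0) > 0

    def release_lock(self, client, fid):
        # 'client' is ignored here too, so any caller can drop a lock.
        if self.lock_count.get(fid, 0) == 0:
            return False
        self.lock_count[fid] -= 1
        return True


fs = CountOnlyFileServer()
fid = "example-fid"  # placeholder, not a real AFS FID

b_before = fs.extend_lock("B", fid)     # no locks exist yet
fs.set_lock("A", fid)                   # A obtains a read lock
b_after = fs.extend_lock("B", fid)      # B never called set_lock
b_release = fs.release_lock("B", fid)   # B releases a lock it never held
a_extend = fs.extend_lock("A", fid)     # A's own lock is now gone
print(b_before, b_after, b_release, a_extend)
```

In this model B's ExtendLock fails until A takes a lock and then
succeeds, B's ReleaseLock removes A's lock, and A's next legitimate
ExtendLock fails.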

We have seen these problems in practice.  A CM is issued a lock and
then gets disconnected from the network for longer than five minutes
(perhaps due to a suspend).  The lock for that CM should have been
dropped, but the CM is unaware of this; when it wakes, it attempts to
ExtendLock and eventually ReleaseLock, causing the lock counts to get
out of sync.  We have also seen buggy clients that keep issuing
ExtendLocks even after they have issued a ReleaseLock.

Now that we have UUIDs for most clients (UUIDs are not required), we
can mitigate the problem by tracking which clients currently hold locks
and when those locks will expire.  However, it cannot be fixed entirely
this way.

The proper way to address this is for SetLock to return some identifier
for the lock that can be used to ensure that when an ExtendLock or
ReleaseLock is sent, it applies only to the one instance of a lock that
was issued and not to any others.
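
As a rough illustration of that idea, the sketch below has SetLock hand
back an identifier that ExtendLock and ReleaseLock must present.  This
is only a sketch under stated assumptions, not a proposed wire format:
the class and method names are hypothetical, the identifier is a plain
counter (a real design would have to decide what the identifier is and
how it is protected), time is passed in explicitly to keep the example
deterministic, and write locks are ignored.

```python
import itertools

LOCK_LIFETIME = 300  # seconds; the five-minute lifetime mentioned earlier


class TokenFileServer:
    """Hypothetical model: each issued lock instance has its own id."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._locks = {}  # (fid, lock_id) -> expiry time

    def _expire(self, now):
        # Drop every lock instance whose lifetime has run out.
        for key in [k for k, exp in self._locks.items() if exp <= now]:
            del self._locks[key]

    def set_lock(self, fid, now):
        self._expire(now)
        lock_id = next(self._next_id)
        self._locks[(fid, lock_id)] = now + LOCK_LIFETIME
        return lock_id  # the caller must present this id later

    def extend_lock(self, fid, lock_id, now):
        self._expire(now)
        if (fid, lock_id) not in self._locks:
            return False  # unknown or expired instance
        self._locks[(fid, lock_id)] = now + LOCK_LIFETIME
        return True

    def release_lock(self, fid, lock_id, now):
        self._expire(now)
        return self._locks.pop((fid, lock_id), None) is not None

    def lock_count(self, fid, now):
        self._expire(now)
        return sum(1 for f, _ in self._locks if f == fid)


fs = TokenFileServer()
fid = "example-fid"  # placeholder, not a real AFS FID

a = fs.set_lock(fid, now=0)        # client A gets a lock, then suspends
b = fs.set_lock(fid, now=400)      # A's lock has expired; B gets a lock
stale_extend = fs.extend_lock(fid, a, now=401)    # A wakes up
stale_release = fs.release_lock(fid, a, now=402)
b_extend = fs.extend_lock(fid, b, now=403)
print(stale_extend, stale_release, b_extend, fs.lock_count(fid, now=404))
```

Here the suspended client's late ExtendLock and ReleaseLock are both
rejected because they name an instance that no longer exists, and the
other client's lock and the lock count are left untouched.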

The
RXAFS_OSD_StartFetchData/RXAFS_OSD_ExtendFetchData/RXAFS_OSD_EndFetchData
and
RXAFS_OSD_StartStoreData/RXAFS_OSD_ExtendStoreData/RXAFS_OSD_EndStoreData
RPCs are going to have exactly the same issue as the
SetLock/ExtendLock/ReleaseLock RPCs.  Jeff's point is that we must not
repeat the mistakes of our past.

Jeffrey Altman