[OpenAFS-devel] flock Input/output error

Simon Wilkinson sxw@inf.ed.ac.uk
Wed, 11 Aug 2010 23:34:53 +0100


On 11 Aug 2010, at 17:21, Simon Wilkinson wrote:
>
> Once you've applied this, I would be interested to know what error
> your test now returns ...

I'm still interested in the error code you're seeing, but on further
analysis, I think I've identified two problems. They're both related to
race conditions in the way that we enrol AFS locks with the kernel's
local lock management system (we do this so that the kernel can handle
byte-range locks on the local machine for us).

The first is that locks and unlocks can race against each other. On a
lock we do SetAFSLock, SetKernelLock. On unlock we do ReleaseAFSLock,
ReleaseKernelLock. However, we don't hold any locks on the file whilst
we do so. Multiple calls to set a lock are safe, as SetAFSLock
serialises them. However, a lock and an unlock may race each other. In
this case we have:

Process A                 Process B
SetAFSLock
SetKernelLock
....
ReleaseAFSLock
                          SetAFSLock
                          SetKernelLock
ReleaseKernelLock

Process B can't get the kernel lock, despite the fact that it has the
AFS lock, because process A hasn't released its kernel lock yet. So
process B gets an error back.
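
As a rough illustration, the interleaving above can be replayed in a toy
Python model. This is not the OpenAFS code: the TwoLayerLock class and
its method names are made up to mirror the four steps, and both layers
are modelled as simple non-blocking try-locks, which is a simplification.

```python
class TwoLayerLock:
    """Toy model of one file: a server-side (AFS) lock plus a local kernel lock."""

    def __init__(self):
        self.afs_owner = None      # who holds the fileserver lock
        self.kernel_owner = None   # who holds the local kernel lock

    def set_afs_lock(self, who):
        if self.afs_owner is not None:
            return False
        self.afs_owner = who
        return True

    def set_kernel_lock(self, who):
        if self.kernel_owner is not None:
            return False
        self.kernel_owner = who
        return True

    def release_afs_lock(self, who):
        if self.afs_owner == who:
            self.afs_owner = None

    def release_kernel_lock(self, who):
        if self.kernel_owner == who:
            self.kernel_owner = None


def racy_interleaving():
    """Replay the schedule from the diagram; return (process, step, ok) tuples."""
    lock = TwoLayerLock()
    trace = []
    trace.append(("A", "SetAFSLock", lock.set_afs_lock("A")))
    trace.append(("A", "SetKernelLock", lock.set_kernel_lock("A")))
    # A starts to unlock: the AFS lock goes first ...
    lock.release_afs_lock("A")
    # ... and B runs before A has reached ReleaseKernelLock.
    trace.append(("B", "SetAFSLock", lock.set_afs_lock("B")))
    trace.append(("B", "SetKernelLock", lock.set_kernel_lock("B")))
    lock.release_kernel_lock("A")
    return trace


if __name__ == "__main__":
    for process, step, ok in racy_interleaving():
        print(process, step, "ok" if ok else "FAILED")
```

In this model B's SetAFSLock succeeds and its SetKernelLock fails, which
is the window described above. Nothing in the model holds a per-file
lock across the two steps, so the gap between them is left open.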

The second problem is a similar race, but related to what happens when
we close a file handle. We don't actually clean up any of the kernel
file locks ourselves - instead, we let the kernel do so when it disposes
of the file descriptor. However, we do release any fileserver locks
that we might have. Between us releasing the fileserver locks, and the
kernel freeing its locks, there's an opportunity for another process to
gain a fileserver lock, but not a local one, and you'll get an error
back there.
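
A stress test along the following lines might exercise that close path.
It is only a guess at the shape of such a test, not the test from this
thread: the function names, worker count and iteration count are all
arbitrary, and it is Unix-only. Each child takes an exclusive flock and
then closes the descriptor without unlocking, so the lock is dropped as
part of the close.

```python
import fcntl
import os


def lock_and_close(path, iterations):
    """Take an exclusive flock, then close without unlocking. Return error count."""
    failures = 0
    for _ in range(iterations):
        fd = os.open(path, os.O_RDWR)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX)   # blocks until the lock is granted
        except OSError:
            failures += 1                    # e.g. EIO, as in the subject line
        finally:
            os.close(fd)                     # the lock is released by the close
    return failures


def stress(path, workers=4, iterations=200):
    """Fork `workers` children; return how many of them saw a flock error."""
    pids = []
    for _ in range(workers):
        pid = os.fork()
        if pid == 0:
            code = 2                         # reported if anything unexpected happens
            try:
                code = 1 if lock_and_close(path, iterations) else 0
            finally:
                os._exit(code)               # never fall back into the parent's code
        pids.append(pid)
    bad_children = 0
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        if not (os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0):
            bad_children += 1
    return bad_children
```

On a local filesystem this should return 0. Whether it returns non-zero
for a file in AFS would depend on how often another process gets in
between the two releases.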

I think that it's the second problem that your test is hitting. Sadly
this problem is the harder one to fix, as it requires refactoring the
way that we interface with the Linux lock management code.

Cheers,

Simon.