[OpenAFS-devel] Re: [master] Change I3fb76654: (openafs) Adjust afs_lockctl to compensate for byte-range lock fixes

Anders Kaseorg andersk@MIT.EDU
Sun, 21 Feb 2010 17:18:24 -0500 (EST)


On Sun, 21 Feb 2010, Derrick Brashear (Code Review) wrote:
> this is common, not merely linux; the change in 49b7bbdd3b45df694fadbef48=
f9ed99d9bfe07b9 doesn't appear to play with the length other than for linux=
, though.
>=20
> is this change deliberate for all platforms? rationale?
>=20
> To respond, visit http://gerrit.openafs.org/1353

[I can=E2=80=99t currently sign into Gerrit, so I=E2=80=99m replying by ema=
il for now. =20
I=E2=80=99ll copy this to Gerrit when my account gets fixed.]

Yes, my reasoning is as follows:

The Linux off-by-one bug fixed by 49b7bbd has been present since the=20
beginning of time (87c10e8 Initial IBM OpenAFS 1.0 tree).  That bug caused=
=20
l_len to be set to OFFSET_MAX rather than 0 for whole-file locks, which=20
would have made it _appear_ as if the afs_lockctl code needs to check for=
=20
both 0 and various 0x7fffffff-like values.  To be safe, those checks were=
=20
implemented in the platform-independent code.

If there were other platforms that also have the off-by-one bug, they=20
would also need to be fixed, but I don=E2=80=99t think there are.  Other pl=
atforms=20
are given an l_len directly, instead of needing to compute one from fl_end=
=20
- fl_start + 1.

Now, as it turns out, there is also a buggy application that sets l_len =3D=
=20
0x7fffffffffffffff.  But on Linux, thanks to the off-by-one bug,=20
afs_lockctl was seeing the wrong value 0x7ffffffffffffffe.  Instead of=20
fixing the off-by-one bug, a (platform-independent!) check for that value=
=20
was added in commits 0188224 lock-mask-64bit-negative-1-for-java-20070212,=
=20
65d8923 java-locking-redux-20070214, 226c1ee java-lock-fix-200702310.  So=
=20
that check needs to stay, but it should be adjusted to check the right=20
value 0x7fffffffffffffff.  No platform other than Linux would have been=20
seeing 0x7ffffffffffffffe.

If there were buggy applications that set l_len =3D 0x7fffffff, then we=E2=
=80=99d=20
also need to check for that value.  However, that would have shown up on=20
Linux as 0x7ffffffe until very recently, and a check for 0x7ffffffe has=20
never been found to be necessary.  Furthermore, checking for 0x7fffffff=20
may be wrong on =E2=89=A5 2 GiB files, so I left it out.  But I could go ei=
ther=20
way on this.

(The _real_ fix for this giant mess would be to stop ignoring byte-range=20
locks, so that we don=E2=80=99t need heuristics to detect them.  Hopefully =
some=20
day the protocol will be extended to deal with byte-range locks, but until=
=20
then, it would still be nice to treat byte-range locks as byte-range locks=
=20
locally and send a whole-file lock to the server, rather than risking=20
corruption from other machines.)

Anders