[OpenAFS] Re: Possible cache corruption with linux client and 1.6.1 fileserver
Richard Brittain
Richard.Brittain@dartmouth.edu
Tue, 13 Nov 2012 11:26:33 -0500 (EST)
On Tue, 13 Nov 2012, Richard Brittain wrote:
> While testing new client installs, I've got a regular habit of banging hard
> on my fileservers and checking the md5sum of a bunch of random files. I came
> across an odd error recently with this scenario:
>
> - Client (doesn't seem to matter what platform) writes a bunch of largish
> files to the fileserver.
>
> - Linux client tries to read the same files while they are still being
> written. Mostly this results in premature EOF, but eventually the whole
> file can be read and the checksum is correct.
>
> - Occasionally the short read leaves corrupt blocks in the cache, which the
> local client thinks are good, so even when the complete file is available
> the checksum is wrong. Running 'cmp' between the bad file and a copy of the
> original shows a similar number of changed bytes (~4k) regardless of the
> size of the file.
More testing shows that every time I recreate this scenario, it is the
first 4kB of the file that has been replaced by nulls. The initial tests
were confusing because some of my test files already contain nulls.
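
Something like the following makes the pattern easy to see ('good' and
'bad' are placeholder names for a clean copy of the file and the
mismatching copy read through the cache):

  cmp -l good bad | tail -1     # highest differing byte offset; here every
                                # difference falls within the first 4096 bytes
  dd if=bad bs=4096 count=1 2>/dev/null | od -A d -t x1 | head
                                # the first 4kB of the bad copy is all zero bytes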
> - If I run 'fs flushvolume' on the client and recompute the md5sum, it
> always checks out fine, so the fileserver has the correct data.
>
> Tested with 1.6.1 client on RHEL5 and RHEL6, 1.6.1 fileserver on RHEL5 and
> RHEL6. Reasonably reproducible, although the locations in the files might
> change. Small files don't show problems, but I never get partial reads on
> them. If I'm patient and let the files finish copying to the server, there
> is never a problem.
>
>
> Richard
>
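
For anyone who wants to try this, the test boils down to roughly the
following; the cell name, paths and file size are made up, and the writer
and reader would normally be on different clients:

  # make a local original and note its checksum
  dd if=/dev/urandom of=/tmp/orig bs=1M count=500
  md5sum /tmp/orig

  # writer: copy it into AFS
  cp /tmp/orig /afs/my.cell/test/orig &

  # reader (the Linux client under test): keep checksumming while the
  # copy is still in flight; early passes hit short reads / wrong sums
  for i in $(seq 1 60); do md5sum /afs/my.cell/test/orig; sleep 2; done

  # once the copy has finished, if the checksum still doesn't match:
  fs flushvolume -path /afs/my.cell/test
  md5sum /afs/my.cell/test/orig   # matches again, so the server copy is fine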
--
Richard Brittain, Research Computing Group,
Computing Services, 37 Dewey Field Road, HB6219
Dartmouth College, Hanover NH 03755
Richard.Brittain@dartmouth.edu 6-2085