[OpenAFS] Re: Possible cache corruption with linux client and 1.6.1 fileserver

Richard Brittain Richard.Brittain@dartmouth.edu
Tue, 13 Nov 2012 11:26:33 -0500 (EST)

On Tue, 13 Nov 2012, Richard Brittain wrote:

> While testing new client installs, I've got a regular habit of banging hard 
> on my fileservers and checking the md5sum of a bunch of random files. I came 
> across an odd error recently with this scenario:
> - Client (doesn't seem to matter what platform) writes a bunch of largish 
> files to fileserver.
> - Linux client tries to read same files before they have finished writing.
> Mostly this results in premature EOF, but eventually the whole file can be 
> read and the checksum is correct.
> - Occasionally the short file results in corrupt blocks in cache, which the 
> local client thinks are good, and when the complete file is available, the 
> checksum is wrong.  Running 'cmp' between the bad file and a copy of the 
> original shows a similar number of changed bytes (~4k) regardless of size of 
> file.

More testing shows that every time I reproduce this scenario, it is the 
first 4kB of the file that has been replaced by nulls.  The initial test 
was confusing because some of my test files legitimately contain nulls.
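
For anyone who wants to check their own files for the same pattern, here is 
a minimal sketch of a block-by-block comparison.  It is not the script I 
used; the function name diff_blocks and the 4096-byte block size are 
assumptions chosen to match the ~4kB damage described above.  It reports the 
offset of each differing block and whether the suspect copy holds only nulls 
there.

```python
BLOCK = 4096  # assumed block size, matching the ~4kB of damage seen


def diff_blocks(bad_path, good_path, block=BLOCK):
    """Return [(offset, all_nulls), ...] for every block that differs.

    all_nulls is True when the block read from bad_path is non-empty
    and consists entirely of zero bytes.
    """
    result = []
    with open(bad_path, "rb") as bad, open(good_path, "rb") as good:
        offset = 0
        while True:
            b = bad.read(block)
            g = good.read(block)
            if not b and not g:
                break  # both files exhausted
            if b != g:
                all_nulls = len(b) > 0 and b == bytes(len(b))
                result.append((offset, all_nulls))
            offset += block
    return result
```

A result of [(0, True)] would correspond to what I see here: only the first 
block differs, and it is all nulls in the cached copy.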

> - Run 'fs flushvolume' on the client, and recompute md5sum, and it always 
> checks out fine, so the fileserver has correct data.
> Tested with 1.6.1 client on RHEL5 and RHEL6, 1.6.1 fileserver on RHEL5 and 
> RHEL6.  Reasonably reproducible, although the locations in the files might 
> change.  Small files don't show problems, but I never get partial reads on 
> them.  If I'm patient and let the files finish copying to the server, there 
> is never a problem.
> Richard
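
The flush-and-recheck step quoted above can be sketched roughly as follows.  
This is an illustration under stated assumptions, not my actual test 
harness: md5_of, afs_flush and classify are hypothetical names, and the 
expected checksum is assumed to have been computed from the original file 
before it was copied into AFS.  The flush is passed in as a parameter so the 
logic can be tried without an AFS client.

```python
import hashlib
import subprocess


def md5_of(path, chunk=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for data in iter(lambda: f.read(chunk), b""):
            h.update(data)
    return h.hexdigest()


def afs_flush(path):
    # Runs 'fs flushvolume' on the volume containing path, so the next
    # read has to fetch the data from the fileserver again.
    subprocess.check_call(["fs", "flushvolume", "-path", path])


def classify(path, expected_md5, flush=afs_flush):
    """Return 'ok', 'cache' (bad before flush, good after) or 'still_bad'."""
    if md5_of(path) == expected_md5:
        return "ok"
    flush(path)
    if md5_of(path) == expected_md5:
        return "cache"
    return "still_bad"
```

In my tests every bad file would fall into the 'cache' case: wrong checksum 
before the flush, correct afterwards.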

Richard Brittain,  Research Computing Group,
                    Computing Services, 37 Dewey Field Road, HB6219
                    Dartmouth College, Hanover NH 03755
Richard.Brittain@dartmouth.edu 6-2085