[OpenAFS] Re: File corruption, 1.4.1 & 1.4.4 on linux clients

cball@bu.edu cball@bu.edu
Mon, 21 May 2007 15:29:16 -0400 (EDT)


On Thu, 26 Apr 2007 cball@bu.edu wrote:
> > >>> On Tue, 24 Apr 2007 cball@bu.edu wrote:
> > >>> We are serving up a virus .dat file to mail relays via AFS
> > >>> readonly. The file is periodically updated, the volume where it
> > >>> lives is re-released hourly whether update occurred or not.  Read
> > >>> activity is constant.
> > >>>
> > >>> When vos release occurs [...]
> > >>>
> > >>> At erratic intervals, the virus scanner on one of our mail relay
> > >>> systems will choke on the database file reporting that it's
> > >>> invalid.  When this happens, the file remains invalid until a
> > >>> re-release occurs or a manual fs flush is invoked.
>
> Our client is using 256k cache files.  The null bytes start and end on a
> cache block boundary which happens to be the last full size block,
> contents of the last [ partially full ] block matches correctly.
>
> At the moment we're planning to save stat info as well as a copy of the
> cache directory before stopping all related activity, flushing the bad
> file, and comparing cache status before and after; idea is to try and
> determine whether the null block makes it to disk cache.

We caught another occurrence of this problem.  As is typical for this
issue, only one of several systems using the same file was affected.
The file had not changed prior to the volume's release.

The observed cache subsystem activity reflects exactly what you'd expect
from examining the invalid version of the file:

- While the binary file was invalid, we shut down all processes using it,
as well as the listener that starts new scans, and made a copy of the
cache directory.

- Flushed the bad file and made a second copy of the cache directory.  Diff
showed that 39 cache files changed from Size: 262144 to Size: 0 and one
cache file from Size: 185018 to Size: 0.

- Ran a single scan and made a third copy of the cache directory.  Diff
showed 40 cache files went from Size: 0 to Size: 262144 and one from
Size: 0 to Size: 185018.
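
For anyone wanting to repeat this comparison without keeping full copies
of the cache directory, here is a minimal sketch that records only the
file sizes and reports which ones changed between two snapshots.  The
function names are made up for illustration, and it assumes the cache
files are ordinary files under the cache directory (possibly in
subdirectories) that the user running it can stat.

```python
import os

def snapshot_sizes(cache_dir):
    """Map each file under cache_dir (by relative path) to its size in bytes."""
    sizes = {}
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            # lstat so that a symlink is measured itself, not its target
            sizes[os.path.relpath(path, cache_dir)] = os.lstat(path).st_size
    return sizes

def diff_sizes(before, after):
    """Return {path: (old_size, new_size)} for every file whose size changed.

    A file that is missing from one snapshot is reported with None there.
    """
    changed = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changed[name] = (old, new)
    return changed
```

Take one snapshot while the file is invalid, one after the flush, and one
after the next scan, then call diff_sizes() on consecutive pairs.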


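The flush emptied only 39 full-size cache files, yet the next scan
refilled 40, so one of them was already at Size: 0 while the file was
invalid.

So the problem is that a cache file which should have 262144 bytes of
content was erroneously cleared when the volume was released.  The cache
file is still listed as a component of the AFS file, thus the block of
null data near the end of the file.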
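
To confirm from the client side that a bad copy shows this signature, a
rough sketch like the following reads the file in chunk-sized pieces and
reports the offset of every piece that is entirely null.  The function
name is hypothetical, and the 262144 default is simply the cache chunk
size we use here; adjust it for other configurations.  Note that it will
also flag a chunk that legitimately contains only zeros.

```python
CHUNK_SIZE = 262144  # 256k cache chunk size used on our clients (assumption elsewhere)

def find_null_chunks(path, chunk_size=CHUNK_SIZE):
    """Return the byte offsets of chunk-aligned blocks that are all null bytes."""
    offsets = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(chunk_size)
            if not block:
                break  # end of file
            # The block is all nulls when every byte in it is zero
            if block.count(0) == len(block):
                offsets.append(offset)
            offset += len(block)
    return offsets
```

If the file really spans 40 full chunks plus a 185018-byte tail, as the
diffs suggest, a cleared last full chunk would show up at offset
39 * 262144 = 10223616.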

Charles Ball
Boston University