[OpenAFS] Re: File corruption, 1.4.1 & 1.4.4 on linux clients

cball@bu.edu cball@bu.edu
Thu, 26 Apr 2007 15:42:25 -0400 (EDT)


On Thu, 26 Apr 2007, Derrick J Brashear wrote:

> On Thu, 26 Apr 2007 cball@bu.edu wrote:
>
> > On Wed, 25 Apr 2007, Derrick J Brashear wrote:
> >
> >> On Tue, 24 Apr 2007 cball@bu.edu wrote:
> >>
> >>> We are serving up a virus .dat file to mail relays via an AFS read-only
> >>> volume.  The file is periodically updated, and the volume where it lives
> >>> is re-released hourly whether or not an update occurred.  Read activity
> >>> is constant.
> >>>
> >>> When vos release occurs, the fileserver logs a message like this:
> >>>
> >>> Mon Apr 23 17:04:28 2007 fssync: volume 536959020 restored; breaking all
> >>> call backs
> >>>
> >>> [ normal behavior ]
> >>>
> >>> At erratic intervals, the virus scanner on one of our mail relay systems
> >>> will choke on the database file, reporting that it's invalid.  When this
> >>> happens, the file remains invalid until a re-release occurs or a manual fs
> >>> flush is invoked.
> >>
> >> Let me guess, it's mmap()ed by whatever is using it, directly in /afs?
> >
> > The file is not being mmap()ed.  If I wasn't clear, the affected client
> > system is consistent about serving up the corrupted file with null bytes
> > to new and old processes until the cached version of the file is flushed.
>
> If it were mmap()ed, there is a way the file can basically get pinned due
> to the dentry cache in 1.4.1, at least if I recall the specifics correctly.

Is this likely to have persisted in 1.4.4 (where we're currently having
the problem)?

A system call trace of the only program which uses the file shows no use
of mmap associated with any file accessed in AFS.
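
In case it's useful, something along these lines is how one could
double-check an strace-style log for mmap of AFS file descriptors
(untested sketch; "uvscan.trace" is just a placeholder name for the
trace file):

#!/usr/bin/env python
# Sketch: flag mmap() calls in an strace-style log whose fd argument came
# from an open() of an /afs path.  The log name is just a placeholder;
# openat() and fd reuse after close() aren't handled.
import re, sys

logfile = sys.argv[1] if len(sys.argv) > 1 else "uvscan.trace"
afs_fds = set()

for line in open(logfile):
    m = re.search(r'open\("(/afs/[^"]*)"[^)]*\)\s*=\s*(\d+)', line)
    if m:
        afs_fds.add(m.group(2))
        continue
    m = re.search(r'mmap2?\([^,]*,[^,]*,[^,]*,[^,]*,\s*(\d+),', line)
    if m and m.group(1) in afs_fds:
        print("mmap of an AFS fd: " + line.rstrip())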

We did just observe that there are several hung processes which have (had?)
the file open.  Could this be relevant?

% ps

F S UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY TIME  CMD
4 S root  32163  1  0  76   0 -  1892 afs_os Apr15 ? 00:00:00 uvscan-5.10.00

% lsof -p 32163
[...]
uvscan  32163 root    4r  unknown
/afs/bu.edu/sun4_57/IT/uvscan-5.10/config/scan.dat (deleted) (stat: No
such file or directory)
uvscan  32163 root    5r      DIR   0,18    2048 1479278597
/afs/bu.edu/sun4_57/IT/uvscan-5.10/config
uvscan  32163 root    6r  unknown
/afs/bu.edu/sun4_57/IT/uvscan-5.10/config/names.dat (deleted) (stat: No
such file or directory)
uvscan  32163 root    7r      DIR   0,18    2048 1479278597
/afs/bu.edu/sun4_57/IT/uvscan-5.10/config
uvscan  32163 root    8r  unknown
/afs/bu.edu/sun4_57/IT/uvscan-5.10/config/clean.dat (deleted) (stat: No
such file or directory)
uvscan  32163 root    9r      DIR   0,18    2048 1479671811
/afs/bu.edu/x86_bulnx20/IT/uvscan-5.10.00/config



Our client is using 256k cache files.  The null bytes start and end on
cache block boundaries; the null region happens to be the last full-size
block, and the contents of the last [ partially full ] block match
correctly.
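
For the record, the boundary alignment can be checked with something like
the following (untested sketch; 256k is our configured chunk size, and the
scan.dat path is just the copy we happened to examine):

#!/usr/bin/env python
# Sketch: report runs of null bytes in a copy of the corrupted file and
# whether each run starts and ends on a 256k cache-chunk boundary.
# CHUNK and PATH reflect our local values; pass another path as argv[1].
import sys

CHUNK = 256 * 1024
PATH = "/afs/bu.edu/sun4_57/IT/uvscan-5.10/config/scan.dat"

def null_runs(data):
    # yield (start, end) offsets of each maximal run of zero bytes
    start = None
    for i, byte in enumerate(data):
        if byte == 0:
            if start is None:
                start = i
        elif start is not None:
            yield (start, i)
            start = None
    if start is not None:
        yield (start, len(data))

path = sys.argv[1] if len(sys.argv) > 1 else PATH
data = bytearray(open(path, "rb").read())
for start, end in null_runs(data):
    print("nulls %d-%d (%d bytes): start %s, end %s chunk boundary"
          % (start, end, end - start,
             "on" if start % CHUNK == 0 else "off",
             "on" if end % CHUNK == 0 else "off"))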

At the moment we're planning to save stat info as well as a copy of the
cache directory before stopping all related activity, flushing the bad
file, and comparing cache status before and after; the idea is to try to
determine whether the null block makes it to the disk cache.
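
The snapshot would be along these lines (untested sketch; /usr/vice/cache
is an assumption about this client's cacheinfo, and the V<number> naming
is what we see for the chunk files):

#!/usr/bin/env python
# Sketch: record size, mtime and an md5 for every V<number> chunk file
# under the disk cache, so listings taken before and after "fs flush" can
# be diffed.  /usr/vice/cache is an assumption about this client's
# cacheinfo setting.
import hashlib, os, sys

cachedir = sys.argv[1] if len(sys.argv) > 1 else "/usr/vice/cache"

entries = []
for dirpath, dirnames, filenames in os.walk(cachedir):
    for name in filenames:
        # chunk files are named V<number>, possibly under Dnnn subdirs
        if not (name.startswith("V") and name[1:].isdigit()):
            continue
        path = os.path.join(dirpath, name)
        st = os.stat(path)
        digest = hashlib.md5(open(path, "rb").read()).hexdigest()
        entries.append((path, st.st_size, int(st.st_mtime), digest))

for path, size, mtime, digest in sorted(entries):
    print("%s %10d %d %s" % (path, size, mtime, digest))

We'd run it once before and once after the fs flush, then diff the two
listings to see whether the chunk holding the null block changed on disk
or whether the nulls were only ever in memory.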

Is there any persistence to the cache V-files associated with a file when
it's flushed and reloaded?  We've observed that the first V-file comes
back with the same name, but we haven't yet tracked down the rest of the
files for a 10MB file.

Thanks,
-Charles