[OpenAFS] File corruption, 1.4.1 & 1.4.4 on linux clients

cball@bu.edu cball@bu.edu
Tue, 24 Apr 2007 14:28:44 -0400 (EDT)

We are serving a virus-scanner .dat file to mail relays from an AFS
read-only volume.  The file is updated periodically, and the volume where
it lives is re-released hourly whether or not an update occurred.  Read
activity is constant.

When vos release occurs, the fileserver logs a message like this:

Mon Apr 23 17:04:28 2007 fssync: volume 536959020 restored; breaking all call backs

[ normal behavior ]

At erratic intervals, the virus scanner on one of our mail relay systems
chokes on the database file, reporting that it is invalid.  When this
happens, the file remains invalid until the next re-release or until
someone runs fs flush by hand.
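
As a stopgap, the manual check-and-flush step could be automated. The
following is only a minimal sketch under stated assumptions: it assumes a
known-good SHA-256 digest of the current .dat file is available from
somewhere outside the client cache (where that comes from is not covered
here), and the function names are hypothetical. It works around the
symptom; it does not explain the corruption.

```python
import hashlib
import subprocess


def sha256_of(path, chunk_size=64 * 1024):
    """Return the SHA-256 hex digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def afs_flush(path):
    """Run `fs flush <path>` so the client discards its cached copy."""
    subprocess.run(["fs", "flush", path], check=True)


def check_and_flush(path, expected_digest, flush=afs_flush):
    """Return True if the file matches the expected digest.

    On a mismatch, call flush(path) and return False. The caller should
    re-read and re-check the file afterwards. expected_digest is assumed
    to come from a trusted copy outside the AFS client cache.
    """
    if sha256_of(path) == expected_digest:
        return True
    flush(path)
    return False
```

The flush callable is a parameter so the check can be exercised on a
machine without an AFS client.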

This is currently happening on a Red Hat Enterprise Linux 4 system with
OpenAFS 1.4.4:

$ uname -a
Linux relay7.bu.edu 2.6.9-42.0.10.ELsmp #1 SMP Tue Feb 27 10:11:19 EST 2007 i686 i686 i386 GNU/Linux

It was also happening on a Fedora Core system running OpenAFS 1.4.1.
We've moved the volume from 1.2.13 to 1.4.2 Solaris servers.  The size,
version, and activity level of the fileserver have no obvious impact.

The client systems affected are IBM 335 and 336 1U machines.

Examination of the most recent bad file shows that 21149 bytes are null.
In all cases, we've experienced null blocks; the first time it was
observed, the null area was approximately 64K in size.

% ls -l scan.dat.good scan.dat.bad
-rw-rw-rw-   1 32766    root     10351796 Apr 24 11:32 scan.dat.bad
-rw-rw-rw-   1 32766    root     10351796 Apr 24 11:32 scan.dat.good

% cmp -l scan.dat.good scan.dat.bad | head -5
9961473  13   0
9961474 351   0
9961475 140   0
9961476 253   0
9961477 205   0

% cmp -l scan.dat.good scan.dat.bad | wc -l
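
To see where each null area starts and how long it is, rather than just
counting differing bytes, something like the sketch below could be run on
the two copies. The function name and the 64K chunk size used for
reporting are assumptions for illustration (the actual client chunk size
may differ). Note that cmp -l offsets are 1-based, so the first
difference above, 9961473, is byte 9961472 counting from zero, which is
exactly 152 * 65536. That may or may not be significant.

```python
def zeroed_runs(good, bad):
    """Return (offset, length) for each zero-filled run in `bad` that
    differs from `good` in at least one byte.

    Offsets are 0-based. A run is a maximal stretch of 0x00 bytes in
    `bad`, so it can include bytes that are legitimately zero in `good`
    and slightly overstate the damaged area.
    """
    runs = []
    start = None
    differs = False
    n = min(len(good), len(bad))
    for i in range(n):
        if bad[i] == 0:
            if start is None:
                start, differs = i, False
            if good[i] != 0:
                differs = True
        else:
            if start is not None and differs:
                runs.append((start, i - start))
            start = None
    if start is not None and differs:
        runs.append((start, n - start))
    return runs


def describe(runs, chunk_size=64 * 1024):
    """Format each run with its position relative to an assumed chunk size."""
    return [
        f"offset {off} = chunk {off // chunk_size} + {off % chunk_size}, "
        f"length {length}"
        for off, length in runs
    ]


# Usage (file names taken from the listing above):
#   good = open("scan.dat.good", "rb").read()
#   bad = open("scan.dat.bad", "rb").read()
#   print("\n".join(describe(zeroed_runs(good, bad))))
```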

cmdebug entry for this file:

** Cache entry @ 0xca94dd00 for 2.536959020.66.30528 [bu.edu]
        10351796 bytes  DV            5  refcnt     2
    callback f7da4ec0   expires 1177438028
    5 opens     0 writers
    normal file
    states (0x5), stat'd, read-only
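
For anyone reading the entry above, a small sketch for decoding two of
its fields. It assumes the identifier after "for" is laid out as
cell.volume.vnode.unique and that "expires" is a Unix timestamp in
seconds; both are assumptions about cmdebug's output format, and the
helper names are hypothetical.

```python
from datetime import datetime, timezone


def parse_fid(fid):
    """Split an assumed 'cell.volume.vnode.unique' string into integers."""
    cell, volume, vnode, unique = (int(part) for part in fid.split("."))
    return {"cell": cell, "volume": volume, "vnode": vnode, "unique": unique}


def expiry_utc(epoch_seconds):
    """Interpret the callback expiry as seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


fid = parse_fid("2.536959020.66.30528")
print(fid["volume"])                        # 536959020
print(expiry_utc(1177438028).isoformat())   # 2007-04-24T18:07:08+00:00
```

Under those assumptions, the volume ID matches the one in the fileserver
log message, and the expiry corresponds to 14:07:08 EDT on Apr 24.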

Any suggestions to help with tracking this down would be appreciated.

-Charles Ball
Boston University