[OpenAFS] File corruption, 1.4.1 & 1.4.4 on linux clients
Tue, 24 Apr 2007 14:28:44 -0400 (EDT)
We are serving up a virus .dat file to mail relays via AFS readonly.
The file is updated periodically, and the volume where it lives is re-released
hourly whether or not an update occurred. Read activity is constant.
When vos release occurs, the fileserver logs a message like this:
Mon Apr 23 17:04:28 2007 fssync: volume 536959020 restored; breaking all
[ normal behavior ]
At erratic intervals, the virus scanner on one of our mail relay systems
will choke on the database file, reporting that it's invalid. When this
happens, the file remains invalid until a re-release occurs or a manual fs
flush is invoked.
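
As a stopgap, the manual fs flush could be automated. The following is only a sketch under stated assumptions: it assumes a known-good SHA-256 digest for the file is available from somewhere that does not pass through the same AFS cache (that channel is hypothetical and not part of our setup), and the helper names are made up for illustration. It only shells out to "fs flush -path <file>".

```python
import hashlib
import subprocess


def sha256_of(path):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB blocks so a large .dat file is not held in memory.
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def flush_if_corrupt(path, expected_sha256, run=subprocess.run):
    """Flush the AFS cache entry for `path` if its digest is wrong.

    `expected_sha256` is assumed to come from a trusted source outside
    the AFS cache (hypothetical). `run` is injectable for testing.
    Returns True if a flush was issued, False if the file looked fine.
    """
    if sha256_of(path) == expected_sha256:
        return False
    # Same effect as running `fs flush` by hand on the file.
    run(["fs", "flush", "-path", path], check=True)
    return True
```

This hides the symptom rather than explaining it, so it is worth logging every flush it issues, with a timestamp, to correlate against the fileserver's release messages.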
This is currently happening on a Redhat Enterprise Linux 4 system with
$ uname -a
Linux relay7.bu.edu 2.6.9-42.0.10.ELsmp #1 SMP Tue Feb 27 10:11:19 EST
2007 i686 i686 i386 GNU/Linux
It was also happening on a Fedora Core system running openafs 1.4.1.
We've moved the volume from 1.2.13 to 1.4.2 Solaris servers. The size,
version, and activity level of the fileserver have no obvious impact.
The client systems affected are IBM 335 and 336 1U machines.
Examination of the most recent bad file shows that 21149 bytes are null.
In all cases, we've experienced null blocks; the first time it was
observed, the null area size was approximately 64K.
% ls -l scan.dat.good scan.dat.bad
-rw-rw-rw- 1 32766 root 10351796 Apr 24 11:32 scan.dat.bad
-rw-rw-rw- 1 32766 root 10351796 Apr 24 11:32 scan.dat.good
% cmp -l scan.dat.good scan.dat.bad | head -5
9961473 13 0
9961474 351 0
9961475 140 0
9961476 253 0
9961477 205 0
% cmp -l scan.dat.good scan.dat.bad | wc -l
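
When no good copy is at hand for cmp, the null region can also be located from the bad file alone. This is a minimal sketch, not something we have run in production; the function name and the 1024-byte threshold are arbitrary choices for illustration. It reads the whole file into memory, which is acceptable for a file of about 10 MB.

```python
import re


def find_null_runs(data, min_len=1024):
    """Return (offset, length) for each run of NUL bytes >= min_len.

    Offsets are 0-based. Short runs are ignored because binary files
    commonly contain a few consecutive zero bytes legitimately.
    """
    runs = []
    for match in re.finditer(rb"\x00+", data):
        length = match.end() - match.start()
        if length >= min_len:
            runs.append((match.start(), length))
    return runs


def find_null_runs_in_file(path, min_len=1024):
    """Convenience wrapper that loads `path` and scans it."""
    with open(path, "rb") as f:
        return find_null_runs(f.read(), min_len)
```

For example, find_null_runs_in_file("scan.dat.bad") would list each suspicious region. Note that cmp -l prints 1-based offsets, so the first differing byte shown above is at 0-based offset 9961472, which is a multiple of 65536.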
cmdebug entry for this file:
** Cache entry @ 0xca94dd00 for 2.536959020.66.30528 [bu.edu]
10351796 bytes DV 5 refcnt 2
callback f7da4ec0 expires 1177438028
5 opens 0 writers
states (0x5), stat'd, read-only
Any suggestions to help with tracking this down would be appreciated.