[OpenAFS] Tracking down AFS Fileserver corruption

Mon, 28 Nov 2011 13:13:32 -0500

Folks,

I'm deploying new OpenAFS 1.6.0 DAFS file servers on fully updated RHEL
6.1 servers and I've stumbled across a data corruption problem.  The ext4
filesystems on the vice mounts are not getting corrupted, just the AFS
volume data.

Our /vicep[ab] mounts are provided by an EMC Clariion SAN array using
the PowerPath driver.  Each of the two vice mounts has 4 paths and is
not partitioned.  I've directly formatted the /dev/emcpower[ab] block
devices as ext4.  Of course, each /dev/emcpowerX device is mounted on
/vicepX.
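
For anyone wanting to confirm the same layout on their own server, here
is a minimal sketch that lists which device backs each /vicepX mount.
It assumes the standard Linux /proc/mounts format (device, mount point,
filesystem type, ...); the function name vice_mounts is hypothetical and
not part of OpenAFS.

```python
def vice_mounts(mounts_text):
    """Return {mount_point: (device, fstype)} for every /vicepX entry.

    mounts_text is the content of /proc/mounts (or any text in the
    same whitespace-separated format).
    """
    found = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[1].startswith("/vicep"):
            found[fields[1]] = (fields[0], fields[2])
    return found
```

On a live system this could be called as
vice_mounts(open("/proc/mounts").read()); with the setup described
above, /vicepa should map to /dev/emcpowera with type ext4.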

Every hour our OCS Inventory agent runs, and it eventually calls
"fdisk -l" to get statistics for the storage on the server.  When I was
moving test volumes to the new server and the agent ran fdisk -l, the
kernel would print:

    Nov 28 13:01:39 xxx kernel: sdc: unknown partition table
    Nov 28 13:01:39 xxx kernel: sde: unknown partition table
    Nov 28 13:01:49 xxx kernel: sdc: unknown partition table
    Nov 28 13:01:49 xxx kernel: sde: unknown partition table

and the volume being moved at that exact time would be corrupt.  Usually
the server would soon detect this and salvage the volume, but the level
of corruption has varied.

The above messages and corruption only seem to happen when volume moves
are in progress.  Running fdisk -l on an idle server produces no
messages.

Other things cause the above messages to be re-printed, such as running
fsck -yf /dev/emcpowera.  They also occur during the early hours of the
morning, apparently triggered by a cron job I've not tracked down yet.
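
To line these messages up with volume move times or cron schedules, a
rough sketch like the following could group them by timestamp.  It
assumes syslog lines in the format shown above; the name
partition_rescans is hypothetical, and the log location
(/var/log/messages is the usual RHEL default) may differ on other
setups.

```python
import re
from collections import defaultdict

# Matches syslog lines such as:
#   Nov 28 13:01:39 xxx kernel: sdc: unknown partition table
PATTERN = re.compile(
    r"^\s*(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"(?P<dev>\w+):\s+unknown partition table"
)


def partition_rescans(lines):
    """Map each timestamp to the devices reported at that time.

    lines is any iterable of log lines, e.g. an open file object.
    Lines that do not match the pattern are ignored.
    """
    events = defaultdict(list)
    for line in lines:
        match = PATTERN.match(line)
        if match:
            events[match.group("stamp")].append(match.group("dev"))
    return dict(events)
```

Feeding it an open log file, for example
partition_rescans(open("/var/log/messages")), gives one entry per
second in which the kernel re-read a partition table, which can then be
compared by hand against the times of vos move runs and cron entries.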

I need some help in figuring out what is causing the corruption and,
more importantly, how to fix things.

Thanks,
Jack Neely

-- 
Jack Neely <jjneely@ncsu.edu>
Linux Czar, OIT Campus Linux Services
Office of Information Technology, NC State University
GPG Fingerprint: 1917 5AC1 E828 9337 7AA4  EA6B 213B 765F 3B6A 5B89