[OpenAFS] Tracking down AFS Fileserver corruption

Stephan Wiesand stephan.wiesand@desy.de
Mon, 28 Nov 2011 20:34:00 +0100


Hi Jack,

no help, just a few dumb questions inline:

On Nov 28, 2011, at 19:13, Jack Neely wrote:

> Folks,
> 
> I'm deploying new OpenAFS 1.6.0 DAFS file servers on fully updated RHEL
> 6.1 servers and I've stumbled across a data corruption problem.  My
> ext4 filesystems on the vice mounts are not getting corrupted, just the
> AFS volume data.
> 
> Our /vicep[ab] mounts are provided by an EMC Clariion SAN array using
> the PowerPath driver.  Each of the two vice mounts has 4 paths and is
> not partitioned.  I've directly formatted the /dev/emcpower[ab] block
> devices as ext4.  Of course, the /dev/emcpowerX device is mounted on
> /vicepX.

emcpower{a,b} map to sd{c,e}?
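
Something like this should confirm the mapping and that there really is no
partition table on the pseudo devices (a rough, untested sketch, assuming the
stock PowerPath and util-linux tools):

    powermt display dev=all       # which sdX paths back each emcpower device
    blkid /dev/emcpowera /dev/emcpowerb   # should show ext4 on the devices, no pX entries
    cat /proc/partitions          # emcpowera/b and sdc/sde, with no partition rows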

> Every hour our OCS Inventory agent runs, which eventually runs "fdisk -l"
> to get statistics for the storage on the server.  When I was moving test
> volumes to the new server and the agent ran fdisk -l, the kernel would
> print:
> 
>    Nov 28 13:01:39 xxx kernel: sdc: unknown partition table
>    Nov 28 13:01:39 xxx kernel: sde: unknown partition table
>    Nov 28 13:01:49 xxx kernel: sdc: unknown partition table
>    Nov 28 13:01:49 xxx kernel: sde: unknown partition table

If the devices aren't partitioned, why would it ever find a partition table?
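
One way to see what fdisk -l actually does to the devices when the agent runs
would be to watch the ioctls and udev events while it happens; a rough,
untested sketch:

    # watch kernel/udev events while the agent (or a manual fdisk -l) runs
    udevadm monitor --kernel --udev

    # in another shell: which ioctls does fdisk issue against the device?
    strace -e trace=ioctl fdisk -l /dev/emcpowera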

This may have changed, but Red Hat used to not support setups with
filesystems on unpartitioned block devices, I believe.

> and the volume being moved at that exact time would be corrupt.  Usually
> the server would soon detect this and salvage the volume, but the level
> of corruption has varied.

I don't have experience with running 1.6 servers in production yet, but
since the AFS fileserver runs entirely in userland, it should not cause
this kind of corruption. That being said, there's an open BZ regarding
ext4 corruption due to Ceph userland processes...
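
After one of these events, the state of the affected volumes can at least be
checked with the usual tools; roughly (server and volume names below are
made up):

    vos listvol newfs.example.org a   # made-up server name; off-line volumes stand out
    vos examine test.volume           # made-up volume name; shows its on-line status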

> The above messages and corruption only seem to happen when volume moves
> are in progress.  Running fdisk -l on an idle server produces no
> messages.

Any messages if you run bonnie++ or iozone on the filesystem when the
agent runs?
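
A crude way to test that, assuming bonnie++ (or just dd) is installed,
might be:

    # sustained writes onto the vice partition...
    bonnie++ -d /vicepa -u root

    # ...while poking the devices the same way the inventory agent does
    while true; do fdisk -l > /dev/null 2>&1; sleep 5; done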

> Other things cause the above messages to be re-printed, such as running
> fsck -yf /dev/emcpowera.

Is this safe to do on a mounted ext4 filesystem?
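
A read-only check would at least avoid writing to the mounted filesystem,
e.g.:

    e2fsck -fn /dev/emcpowera   # -n: open read-only, answer "no" to everything

(Results on a mounted, busy filesystem aren't reliable either way, but at
least -n won't modify anything.)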

>  They occur during the early hours of the morning as well, from
> something that appears to be related to a cron job I've not tracked
> down yet.
> 
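
The early-morning culprit should be easy to pin down from the logs; roughly
(paths below are the RHEL defaults):

    grep "unknown partition table" /var/log/messages   # exact timestamps
    less /var/log/cron                 # what crond started around those times
    ls /etc/cron.daily/ /etc/cron.d/   # the usual suspects for nightly jobs
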
> I need some help in figuring out what is causing the corruption and,
> more importantly, how to fix things.

If the AFS fileserver could be run under a different account than root,
one could be completely confident it's not the culprit. As things are,
I'm only 99% confident...
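
Short of that, tracing the block device during a move should show which
processes actually submit the writes; something like this (assuming the
blktrace package is installed and debugfs is mounted):

    mount -t debugfs none /sys/kernel/debug          # blktrace needs debugfs
    blktrace -d /dev/emcpowera -o - | blkparse -i -  # shows the PID/command issuing each request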

Best regards,
	Stephan
> 
> Thanks,
> Jack Neely
> 
> -- 
> Jack Neely <jjneely@ncsu.edu>
> Linux Czar, OIT Campus Linux Services
> Office of Information Technology, NC State University
> GPG Fingerprint: 1917 5AC1 E828 9337 7AA4  EA6B 213B 765F 3B6A 5B89
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info

-- 
Stephan Wiesand
DESY -DV-
Platanenallee 6
15738 Zeuthen, Germany