[OpenAFS] Tracking down AFS Fileserver corruption
Stephan Wiesand
stephan.wiesand@desy.de
Mon, 28 Nov 2011 20:34:00 +0100
Hi Jack,
no help, just a few dumb questions inline:
On Nov 28, 2011, at 19:13 , Jack Neely wrote:
> Folks,
>
> I'm deploying new OpenAFS 1.6.0 DAFS file servers on fully updated RHEL
> 6.1 servers and I've stumbled across a data corruption problem. My ext4
> filesystems on the vice mounts are not getting corrupted, just the AFS
> volume data.
>
> Our /vicep[ab] mounts are provided by an EMC Clariion SAN array using
> the PowerPath driver. Each of the two vice mounts has 4 paths and is
> not partitioned. I've directly formatted the /dev/emcpower[ab] block
> devices as ext4. Of course, the /dev/emcpowerX device is mounted on
> /vicepX.
emcpower{a,b} map to sd{c,e}?
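If I remember the PowerPath tooling right, something like this should show
which native sd devices sit behind each emcpower pseudo device (syntax from
memory, so please double-check):

  powermt display dev=all    # lists each emcpower device and its native sd paths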
> Every hour our OCS Inventory agent runs, which eventually runs "fdisk -l"
> to get statistics for the storage on the server. When I was moving test
> volumes to the new server and the agent ran fdisk -l, the kernel would
> print:
>
> Nov 28 13:01:39 xxx kernel: sdc: unknown partition table
> Nov 28 13:01:39 xxx kernel: sde: unknown partition table
> Nov 28 13:01:49 xxx kernel: sdc: unknown partition table
> Nov 28 13:01:49 xxx kernel: sde: unknown partition table
If the devices aren't partitioned, why would fdisk ever find a partition
table? This may have changed, but I believe Red Hat used not to support
setups with filesystems on unpartitioned block devices.
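A quick, read-only sanity check, assuming the device names from your post,
would be to look at what actually sits where a partition table would live:

  blkid /dev/emcpowera                                      # should report TYPE="ext4"
  dd if=/dev/emcpowera bs=512 count=1 | hexdump -C | head   # raw view of the first sector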
> and the volume being moved at that exact time would be corrupt. Usually
> the server would soon detect this and salvage the volume, but the level
> of corruption has varied.
I don't have experience with running 1.6 servers in production yet, but
since the AFS fileserver runs entirely in userland, it should not cause
this kind of corruption. That being said, there's an open BZ regarding
ext4 corruption due to Ceph userland processes...
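If you want to see whether actual file data changes during a move, rather
than just volume metadata the salvager complains about, checksumming a test
volume's contents through the client before and after might help. The paths
below are placeholders:

  cd /afs/.your.cell/path/to/testvol && find . -type f -exec md5sum {} + > /tmp/before.md5
  # ... vos move the volume ...
  md5sum -c /tmp/before.md5    # re-run from the same client path after the move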
> The above messages and corruption only seem to happen when volume moves
> are in progress. Running fdisk -l on an idle server produces no
> messages.
Any messages if you run bonnie++ or iozone on the filesystem when the
agent runs?
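Roughly what I have in mind, to take AFS out of the picture entirely (the
size and directory are just placeholders):

  # in one shell: generate heavy I/O on the vice partition
  bonnie++ -d /vicepa/stress -u root -s 16384    # size in MiB

  # in another shell: poke the devices the way the inventory agent does
  while true; do fdisk -l >/dev/null; sleep 10; done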
> Other things cause the above messages to be re-printed, such as running
> fsck -yf /dev/emcpowera.
Is this safe to do on a mounted ext4 filesystem?
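My understanding is that fsck -yf on a mounted, writable ext4 filesystem is
itself a good way to corrupt it. If you have to check a mounted filesystem
at all, at least keep it read-only:

  fsck -fn /dev/emcpowera    # -n: open read-only and answer "no" to all prompts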
> They occur during the early hours of the
> morning as well, from something that appears to be related to a cron job
> I've not tracked down yet.
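To track down the cron job, correlating the timestamps with /var/log/cron
and grepping the cron directories usually narrows it down quickly (paths as
on stock RHEL):

  grep -rl fdisk /etc/cron* /var/spool/cron 2>/dev/null
  grep CMD /var/log/cron | less    # look at the entries around the time the messages appear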
>
> I need some help in figuring out what is causing the corruption and,
> more importantly, how to fix things.
If the AFS fileserver could be run under a different account than root,
one could be completely confident it's not the culprit. As things are,
I'm only 99% confident...
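Short of that, watching the block device during a volume move might at least
show which process issues the unexpected writes. A rough sketch, assuming the
blktrace tools are installed and the device name from your post:

  blktrace -d /dev/emcpowera -o - | blkparse -i -    # the trace includes the PID/command issuing each request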
Best regards,
Stephan
>
> Thanks,
> Jack Neely
>
> --
> Jack Neely <jjneely@ncsu.edu>
> Linux Czar, OIT Campus Linux Services
> Office of Information Technology, NC State University
> GPG Fingerprint: 1917 5AC1 E828 9337 7AA4 EA6B 213B 765F 3B6A 5B89
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info
--
Stephan Wiesand
DESY -DV-
Platanenallee 6
15738 Zeuthen, Germany