[OpenAFS] openafs volume corruption when partition usage above 95%
Thossaporn (Pommm) Phetruphant
pommm@yannix.com
Mon, 6 Jan 2020 18:58:37 +0700
Hi,
We are using openafs-server 1.6.15 on Ubuntu 16.04.
Our cell consists of 9 large storage servers each running MD-RAID6, for
a total of 550TB of storage.
We occasionally experience strange behaviour when partitions approach
about 95% full.
For example, on one server we have a 60TB partition where corruption
starts occurring once usage reaches about 57TB, i.e. with 3TB still available.
Some volumes on this partition become corrupted in the sense that their
RO volume id changes to an invalid value.
For example, 'vos examine p.xxx.001' gave us the following prior to the
corruption:

p.xxx.001                         536870981 RW       3459 K  On-line
    storage1.aaa.com /vicepa
    RWrite  536870981 ROnly  536870982 Backup
Note how the ROnly volume id equals the RWrite volume id plus 1.
But as the partition filled up with other volumes, 'vos examine p.xxx.001'
showed the following for this volume's VLDB entry:

p.xxx.001                         536870981 RW       3459 K  On-line
    storage1.aaa.com /vicepa
    RWrite  536870981 ROnly  536154372 Backup
Note that the ROnly volume id has changed to 536154372, which is
actually the RW id of another volume on that partition.
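
For what it's worth, a quick-and-dirty check along the following lines
might flag other affected entries. It relies on our own convention that
every ROnly id is exactly the RWrite id plus 1, and on parsing the usual
'vos listvldb' output, so it is only a sketch (server and partition names
are placeholders):

    vos listvldb -server storage1.aaa.com -partition vicepa |
      awk '
        /^[^ ]/ { name = $1 }                 # remember the volume name line
        /RWrite:/ && /ROnly:/ {
          for (i = 1; i <= NF; i++) {
            if ($i == "RWrite:") rw = $(i + 1)
            if ($i == "ROnly:")  ro = $(i + 1)
          }
          if (ro + 0 != rw + 1)
            print "suspect: " name " RWrite " rw " ROnly " ro
        }'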
Salvaging the volume does not fix this. The only way we have found to
correct the issue is to copy the RW volume's data out to another partition
and then zap the volume.
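
Roughly, the recovery looks like the commands below; the exact steps may
vary, and the server, partition and file names are placeholders (with
536870981 standing in for the affected RW id):

    vos dump -id p.xxx.001 -file /tmp/p.xxx.001.dump \
        -server storage1.aaa.com -partition vicepa
    vos zap -server storage1.aaa.com -partition vicepa -id 536870981
    vos delentry -id p.xxx.001
    vos restore -server storage1.aaa.com -partition vicepb \
        -name p.xxx.001 -file /tmp/p.xxx.001.dump
    vos release p.xxx.001    # recreate the RO clone if one is wanted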
But the real question is: why is this happening? And why does it only
happen when partition usage goes above roughly 95%, even though the
partition never has less than 3TB free?
Has anyone else encountered something like this? Does anyone have
suggestions on where to look, whether it is a configuration issue or
something we might be doing wrong that could cause this?
Sincerely,
Pommm