[OpenAFS] openafs volume corruption when partition usage above 95%

Thossaporn (Pommm) Phetruphant pommm@yannix.com
Mon, 6 Jan 2020 18:58:37 +0700


Hi,

We are using openafs-server 1.6.15 on Ubuntu 16.04.
Our cell consists of 9 large storage servers each running MD-RAID6, for 
a total of 550TB of storage.
We occasionally see strange behaviour when partitions approach about 95% full.
For example, on one server we have a 60TB partition where corruption starts 
occurring once usage reaches roughly 57TB, i.e. with about 3TB still free.
Some volumes on this partition become corrupted in the sense that their 
ROnly volume ID changes to an invalid value.

For example, 'vos examine p.xxx.001' gave us the following prior to the 
corruption:
p.xxx.001 536870981 RW 3459 K On-line
storage1.aaa.com /vicepa
RWrite 536870981 ROnly 536870982 Backup

Note how the ROnly volume id equals the RWrite volume id plus 1.
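As a side note, the VLDB's view of a volume's IDs can be cross-checked 
against the on-disk volume headers that the fileserver reports, with 
something like the following (server, partition and volume names are the 
same example ones as above):

# The VLDB's view of the volume's IDs
vos listvldb -name p.xxx.001

# The volume headers the fileserver reports for /vicepa
vos listvol -server storage1.aaa.com -partition a -long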

But as the partition filled up with other volumes, 'vos examine p.xxx.001' 
showed the following for this volume's VLDB entry:
p.xxx.001 536870981 RW 3459 K On-line
storage1.aaa.com /vicepa
RWrite 536870981 ROnly 536154372 Backup

Note that the ROnly volume ID has changed to 536154372, which is actually 
the RWrite ID of another volume on that partition.
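For reference, an ID like that can be traced back to its owner by examining 
it directly, since vos examine also accepts a numeric volume ID:

vos examine 536154372

This prints the name and VLDB entry of whichever volume the ID belongs to.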
Salvaging the volume does not fix this. The only way we have found to 
correct the issue is to copy the RW volume's data out to another partition 
and zap the volume.
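For concreteness, one way to do that with standard vos commands is roughly 
the following; the dump file path and the target partition b are just 
placeholders, and the server and volume names are the example ones from 
above:

# Copy the RW data out to a dump file, addressing the volume by its RW ID
vos dump -id 536870981 -file /tmp/p.xxx.001.dump

# Remove the damaged on-disk copy (vos zap does not touch the VLDB)
vos zap -server storage1.aaa.com -partition a -id 536870981

# Drop the now-stale VLDB entry
vos delentry -id p.xxx.001

# Restore the data onto another partition, which recreates the VLDB entry
vos restore -server storage1.aaa.com -partition b -name p.xxx.001 -file /tmp/p.xxx.001.dump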

But the real question is: why is this happening? And why would it only 
happen when partition usage goes above 95%, even though the partition never 
has less than 3TB available?

Has anyone else encountered something like this? Does anyone have 
suggestions on where to look, whether for a configuration issue or for 
something we might be doing wrong that could be causing this?

Sincerely,

Pommm