[OpenAFS] VLDB corruption causes mount points to go to other volumes.

Thu, 21 Feb 2019 16:58:54 +0700

Hi everyone,

I have 3 VLDB/PTS servers and 13 file servers on my network. All are on 
the same subnet, at the same location.
For the 2nd time we have hit a corrupted VLDB: when we 'cd' into a 
mount point, we end up in a different volume.

Example:
live.D1 is mounted at /afs/domain/live/data1
live.D2 is mounted at /afs/domain/live/data2
root.cell is mounted at /afs/domain

cd /afs/domain/live/data1

'fs exa .' shows the volume named 'live.D2' mounted at this mount point

'ls' shows the data in data2

or

cd /afs/domain

'fs exa .' shows the volume named 'live.D1' mounted at this mount point

'ls' shows the data in data1
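
For reference, a rough sketch of how the check above could be scripted. It compares the volume name stored in the mount point ('fs lsmount') with the volume the client actually lands in ('fs examine'). The function names are hypothetical. I am assuming the usual output lines "... is a mount point for volume '#name'" and "Volume status for vid = N named name", and that the mount points carry no cell prefix. A '.readonly' suffix is stripped, since a regular mount point can legitimately resolve to the read-only clone.

```shell
#!/bin/sh
# Hypothetical helpers; they only parse text, so the output formats
# matched below are assumptions about what 'fs' prints.

# Volume name stored in the mount point, read from 'fs lsmount' output:
#   'DIR' is a mount point for volume '#live.D1'
mount_target() {
    sed -n "s/.* is a mount point for volume '[#%]\(.*\)'\$/\1/p"
}

# Volume the client actually reached, read from 'fs examine' output:
#   Volume status for vid = NNN named live.D1
examined_volume() {
    sed -n 's/^Volume status for vid = [0-9]* named \([^ ]*\).*$/\1/p' |
        sed 's/\.readonly$//'
}

# Compare both names for one directory and report a mismatch.
check_mount() {
    dir=$1
    want=$(fs lsmount "$dir" | mount_target)
    got=$(fs examine "$dir" | examined_volume)
    if [ "$want" = "$got" ]; then
        echo "OK $dir -> $got"
    else
        echo "MISMATCH $dir: mount point names '$want', client reached '$got'"
        return 1
    fi
}
```

Usage would be something like 'check_mount /afs/domain/live/data1' for each mount point of interest.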

At first I thought NTP was getting out of sync, but it is not.
I have 1 GPS NTP stratum 1 server and 2 NTP stratum 2 servers on my network; 
Nagios and Cacti report no NTP downtime during this event.

'vldb_check -database /var/lib/openafs/db/vldb.DB0' shows 'root.cell 
(xxxxxxxxxx) has no RW volume', and ~10 other volumes also show 'has no RW volume'.
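
A minimal sketch of how this vldb_check run could be turned into an automated alarm, for example right after each database backup. The function names are hypothetical, and it only counts the 'has no RW volume' message quoted above. Pointing it at a copy of the database rather than the live file is my assumption of the safer choice.

```shell
#!/bin/sh
# Hypothetical helpers for spotting the symptom early.

# Count 'has no RW volume' lines on stdin.
# grep -c exits non-zero when the count is 0, so '|| true' keeps that quiet.
count_no_rw() {
    grep -c 'has no RW volume' || true
}

# Run vldb_check on one database file and warn if the symptom appears.
check_vldb() {
    db=$1
    n=$(vldb_check -database "$db" 2>&1 | count_no_rw)
    if [ "$n" -gt 0 ]; then
        echo "WARNING: $n 'has no RW volume' entries in $db"
        return 1
    fi
    echo "OK: $db"
}
```

For example, 'check_vldb /path/to/copy/of/vldb.DB0' could run from cron, with a non-zero exit status feeding Nagios.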

I back up the VLDB hourly, so it can be recovered fast enough, but 
this is the 2nd time this has happened.
Does anyone know why this would happen?  How can we prevent it?

Best regards,

Pommm