[OpenAFS] VLDB corruption causes mount point to go to another volume.
Thossaporn (Pommm) Phetruphant
pommm@yannix.com
Thu, 21 Feb 2019 16:58:54 +0700
Hi everyone,
I have 3 vldb/pts servers and 13 file servers in my network. All are on
the same subnet, same location.
For the second time, we have run into a corrupted VLDB: when we 'cd' into a
mount point, we end up in a different volume.
Example:
live.D1 is mounted at /afs/domain/live/data1
live.D2 is mounted at /afs/domain/live/data2
root.cell is at /afs/domain
cd /afs/domain/live/data1
'fs exa .' shows the volume named 'live.D2' mounted at this mount point
'ls' shows the data in data2
or
cd /afs/domain
'fs exa .' shows the volume named 'live.D1' mounted at this mount point
'ls' shows the data in data1
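The symptom in the example can be checked by script. The following is only a minimal sketch, not a tested tool: it compares the volume name stored in a mount point (from 'fs lsmount') with the name of the volume actually reached (from 'fs examine'). The output formats matched by the two regular expressions are assumptions about how these commands print their results, and the helper names (mounted_volume, served_volume, names_agree) are hypothetical.

```python
import re
import subprocess
import sys

# Assumed 'fs lsmount' output shape:
#   '<dir>' is a mount point for volume '#<volume>'   (or '%<volume>',
#   optionally with a '<cell>:' prefix before the volume name)
LSMOUNT_RE = re.compile(r"is a mount point for volume '[#%](?:[^':]+:)?([^']+)'")

# Assumed 'fs examine' output shape:
#   Volume status for vid = <id> named <volume>
EXAMINE_RE = re.compile(r"Volume status for vid = \d+ named (\S+)")


def base_name(volume):
    """Strip a .readonly/.backup suffix so RW and RO names compare equal."""
    for suffix in (".readonly", ".backup"):
        if volume.endswith(suffix):
            return volume[: -len(suffix)]
    return volume


def mounted_volume(lsmount_output):
    """Volume name recorded in the mount point, or None if not parsable."""
    m = LSMOUNT_RE.search(lsmount_output)
    return m.group(1) if m else None


def served_volume(examine_output):
    """Name of the volume actually reached, or None if not parsable."""
    m = EXAMINE_RE.search(examine_output)
    return m.group(1) if m else None


def names_agree(lsmount_output, examine_output):
    """True when the mount point and the reached volume name the same volume."""
    want = mounted_volume(lsmount_output)
    got = served_volume(examine_output)
    if want is None or got is None:
        raise ValueError("could not parse command output")
    return base_name(want) == base_name(got)


def run(cmd):
    """Run a command and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


if __name__ == "__main__":
    # Usage: pass mount point paths as arguments, e.g. /afs/domain/live/data1
    for path in sys.argv[1:]:
        ok = names_agree(run(["fs", "lsmount", path]), run(["fs", "examine", path]))
        print(path, "OK" if ok else "MISMATCH")
```

Run from cron against a few known mount points, a check like this might flag the mismatch earlier than a user noticing wrong data, assuming the two commands print what the sketch expects.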
At first I thought NTP had gotten out of sync, but it had not.
I have one GPS NTP stratum 1 server and two NTP stratum 2 servers on my
network; Nagios and Cacti reported no NTP downtime during this event.
'vldb_check -database /var/lib/openafs/db/vldb.DB0' shows 'root.cell
(xxxxxxxxxx) has no RW volume', and about 10 other volumes also show
'has no RW volume'.
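To catch this state between the hourly backups, the vldb_check output could be scanned automatically. This is a rough sketch under assumptions: it only looks for the 'has no RW volume' text quoted above and does not rely on any other detail of the vldb_check output or exit status. The function names are hypothetical.

```python
import subprocess
import sys

# Message text as quoted in this post; nothing else about the format is assumed.
MARKER = "has no RW volume"


def missing_rw_lines(output):
    """Return the output lines that report a volume without an RW entry."""
    return [line.strip() for line in output.splitlines() if MARKER in line]


def run_vldb_check(db_path):
    """Run vldb_check on a database file and return stdout plus stderr."""
    proc = subprocess.run(
        ["vldb_check", "-database", db_path],
        capture_output=True,
        text=True,
        check=False,
    )
    return proc.stdout + proc.stderr


if __name__ == "__main__":
    # Usage: pass the database path, e.g. /var/lib/openafs/db/vldb.DB0
    if len(sys.argv) > 1:
        hits = missing_rw_lines(run_vldb_check(sys.argv[1]))
        for line in hits:
            print(line)
        # Non-zero exit lets a monitoring system such as Nagios raise an alert.
        sys.exit(1 if hits else 0)
```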
I back up the VLDB hourly, so it can be recovered quickly, but
this is the second time this has happened.
Does anyone know why this would happen? How can we prevent it?
Best regards,
Pommm