[OpenAFS] VLDB corruption cause mount point go to other volume.

Thossaporn (Pommm) Phetruphant pommm@yannix.com
Fri, 22 Feb 2019 23:46:56 +0700


Hi Mark Vitale,

Thank you for your fast reply.
I understand that mount points are kept in the volume itself, not in the VLDB.
I tried stopping and starting the VLDB server before replacing vldb.DB0
from backup, but it didn't help.
Somehow, the issue was resolved by replacing the VLDB from backup.

 >What version of AFS are you using for your vlservers, fileservers, and 
cache managers (clients)? And what operating system and version do your 
clients run on?

I'm using Ubuntu 16.04 x64 on both servers and clients.

Linux 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 
x86_64 x86_64 x86_64 GNU/Linux
openafs-client 1.6.18-1ubuntu1
openafs-fileserver 1.6.18-1ubuntu1
openafs-krb5 1.6.18-1ubuntu1


 >Are you running vldb_check against a live VLDB?
Yes. I checked the live VLDB, the one that was sending mount points to the wrong volumes:

$ vldb_check -database /var/lib/openafs/db/vldb.DB0

address 0xhhhhhhh  root.cell (xxxxxxxxxx) has no RW volume

But checking the backup VLDB from 12 hours earlier shows no errors.

$ vldb_check -database /backup/openafs/live/db/vldb.DB0.yyyymmdd-hhmm

Result: no issues.


 >Do the VLDB entries for the apparently corrupted volumes change 
frequently?

No. We have 200-300 volume creates and releases daily. According to our
event log, this has happened 2 times in the past 3 years.

 >Are you taking any steps to ensure the VLDB is not changing when you 
back it up?

Oh, I'm not preventing this. I just copy the live VLDB from the sync site,
/var/lib/openafs/db/vldb.DB0, and send it to a non-AFS backup server.
In our emergency recovery tests, a backup made this way has always worked.
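
A minimal sketch of one way to make the copy step more defensive, assuming
only POSIX sh, cp, and cksum. The helper name snapshot_file and the retry
count are hypothetical. It only detects the file changing at the byte level
while it is being copied; it does not stop the vlserver from writing, and it
is not a substitute for quiescing the database.

```shell
#!/bin/sh
# Sketch: copy a file and confirm the source did not change during the copy.
# snapshot_file is a hypothetical helper, not part of OpenAFS.
snapshot_file() {
    src=$1
    dst=$2
    tries=${3:-3}                  # how many attempts before giving up
    i=0
    while [ "$i" -lt "$tries" ]; do
        before=$(cksum < "$src") || return 1   # checksum before copying
        cp "$src" "$dst" || return 1
        after=$(cksum < "$src") || return 1    # checksum after copying
        copied=$(cksum < "$dst") || return 1   # checksum of the copy itself
        if [ "$before" = "$after" ] && [ "$before" = "$copied" ]; then
            return 0               # source was stable and the copy matches
        fi
        i=$((i + 1))
        sleep 1                    # source changed mid-copy; try again
    done
    rm -f "$dst"                   # never leave a suspect copy behind
    return 1
}
```

A possible use would be to snapshot /var/lib/openafs/db/vldb.DB0 to a
temporary file, run vldb_check -database against that copy, and only ship
it to the backup server if both steps succeed.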

 >Could you provide more details about the steps you take to recover 
your VLDB?

Stop the openafs-fileserver service on all 3 VLDB servers:
$ service openafs-fileserver stop

Remove vldb.DB0 from /var/lib/openafs/db/:
$ rm /var/lib/openafs/db/vldb.DB0

Copy the backup vldb.DB0 from the non-AFS backup server to the VLDB server.
Repeat this step on all 3 VLDB servers:
$ scp backupserver:/backup/openafs/live/db/vldb.DB0.yyyymmdd-hhmm /var/lib/openafs/db/vldb.DB0

Start the openafs-fileserver service on all 3 VLDB servers:
$ service openafs-fileserver start

Wait 2-3 minutes and check that a sync site has been elected.
At this point everything is back to normal.
We then ran syncserv and syncvldb to pick up the actual volume changes on
each file server.
After that, the volumes on all 13 servers matched the VLDB.
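
The per-server part of the steps above could be sketched as a small script.
This is only an illustration under stated assumptions: restore_vldb and the
RUN variable are hypothetical names, the script prints the commands by
default instead of running them, and the scp destination is written out as
vldb.DB0 on the assumption that the restored file must end up under that
name.

```shell
#!/bin/sh
# Sketch of the restore steps for ONE database server.
# RUN defaults to "echo", so commands are printed rather than executed.
# Setting RUN to an empty string would execute them for real.
RUN=${RUN-echo}

restore_vldb() {
    backup=$1                          # scp source of the backup copy
    dbdir=${2:-/var/lib/openafs/db}    # database directory from the steps above

    $RUN service openafs-fileserver stop
    $RUN rm "$dbdir/vldb.DB0"
    $RUN scp "$backup" "$dbdir/vldb.DB0"   # assumed: land under the live name
    $RUN service openafs-fileserver start
}
```

Called as restore_vldb backupserver:/backup/openafs/live/db/vldb.DB0.yyyymmdd-hhmm
on each of the 3 servers, this prints the four commands in order. The sync
site election afterwards can be inspected with udebug against the vlserver
port, for example udebug -server <dbserver> -port 7003.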

At this point my concern is what may have caused this, so that I can look
for ways to prevent it.


Best regards,

Pommm

  

On 2/21/19 9:34 PM, Mark Vitale wrote:
> Pomm,
>
> Thank you for your report.  Could you provide some more details (inline below)?
>
>> On Feb 21, 2019, at 4:58 AM, Thossaporn (Pommm) Phetruphant<pommm@yannix.com>  wrote:
>>
>> I have 3 vldb/pts servers and 13 file servers in my network. All are on the same subnet, same location.
>> This is the 2nd time we have encountered a corrupted VLDB, where 'cd' into a mount point goes to a different volume.
>>
>> Example:
>> live.D1 mount at /afs/domain/live/data1
>> live.D2 mount at /afs/domain/live/data2
>> root.cell is at /afs/domain
>>
>>
>> cd /afs/domain/live/data1
>>
>> 'fs exa .' shows the volume named 'live.D2' mounted at this mount point
>>
>> 'ls' shows the data in data2
>>
>> or
>>
>> cd /afs/domain
>>
>> 'fs exa .' shows the volume named 'live.D1' mounted at this mount point
>>
>> 'ls' shows the data in data1
> Mount point information is stored in the fileserver vice partitions, not in the VLDB.
> What version of AFS are you using for your vlservers, fileservers, and cache managers (clients)?
> And what operating system and version do your clients run on?
>
>> <snip>
>>
>> 'vldb_check -database /var/lib/openafs/db/vldb.DB0' shows 'root.cell (xxxxxxxxxx) has no RW volume', and ~10 other volumes also show 'has no RW volume'
>>
>> I back up the VLDB hourly, so it can be recovered fast enough, but this is the 2nd time this has happened.
>> Does anyone know why this would happen?  How can we prevent it?
> Are you running vldb_check against a live VLDB?
> Do the VLDB entries for the apparently corrupted volumes change frequently?
> Are you taking any steps to ensure the VLDB is not changing when you back it up?
> Could you provide more details about the steps you take to recover your VLDB?
>
> Regards,
> --
> Mark Vitale
> mvitale@sinenomine.net
>
>
>