[OpenAFS] Change in volume status during vos dump in OpenAFS 1.6.x
Andy Malato
andym@oak.njit.edu
Wed, 13 Mar 2013 13:01:32 -0400 (EDT)
Hello Everyone,
We recently installed OpenAFS 1.6.2 on one of our fileservers in preparation
for migrating the rest of our cell to the latest 1.6.x release. One of the
driving factors behind upgrading to 1.6.x is to support volumes larger than 2TB.
Currently, the rest of the servers in our cell are running a mixture of 1.4.x
releases. The database servers are all running 1.4.5.
Like most other sites, we dump our volumes daily to disk using 'vos dump' so
that they can be backed up by our enterprise backup system. While dumping
the volumes on the fileserver running 1.6.2, we noticed that the volume
status reported during a dump differs from what we see under 1.4.x.
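A minimal sketch of such a daily dump step follows. It is an illustration only, not the actual job. It assumes Python 3.9+, that it runs as root on a server holding the cell's KeyFile (needed for `-localauth`), and a hypothetical `/dumps` staging directory. It dumps the `.backup` clone, which matches the transcripts below, where the busy volume is the Backup ID. It only builds the command lines; a real job would hand each one to `subprocess.run()`.

```python
import shlex
from pathlib import PurePosixPath

# Hypothetical staging directory for dump files; substitute your own.
DUMP_DIR = PurePosixPath("/dumps")


def dump_command(volume: str, dump_dir: PurePosixPath = DUMP_DIR) -> list[str]:
    """Build a full 'vos dump' command for the .backup clone of a volume."""
    # Dump the backup clone so the RW volume stays available to users.
    backup = volume if volume.endswith(".backup") else f"{volume}.backup"
    return [
        "vos", "dump",
        "-id", backup,
        "-time", "0",  # 0 requests a full dump
        "-file", str(dump_dir / f"{backup}.dump"),
        "-localauth",  # authenticate with the server's KeyFile (run as root)
    ]


if __name__ == "__main__":
    # Volume names taken from the 1.6.2 listing below.
    for vol in ["test.volume.3", "test.volume.4", "test.volume.5"]:
        print(shlex.join(dump_command(vol)))
```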
When a 'vos dump' is performed on a volume that lives on a 1.4.x fileserver,
a 'vos ex' and 'vos listvol' have the following behavior:
root@fileserver02:# vos ex 537142259
**** Volume 537142259 is busy ****
RWrite: 537142257 Backup: 537142259
number of sites -> 1
server fileserver02 partition /vicepb RW Site
root@fileserver02:# vos listvol localhost b -local
Total number of volumes on server localhost partition /vicepb: 6
my.volume.6 537142257 RW 21191001 K On-line
my.volume.7 536995501 RW 2362268 K On-line
my.volume.7.backup 536995532 BK 2362268 K On-line
my.volume.8 537089944 RW 268280 K On-line
my.volume.8.backup 537089946 BK 268280 K On-line
**** Volume 537142259 is busy ****
However, on a fileserver running 1.6.2, running 'vos ex' against the
volume being dumped reports that the volume does not exist.
Furthermore, 'vos listvol' on the partition shows
'**** Could not attach volume 537466433 ****'.
root@fileserver05:# vos ex 537466433
Could not fetch the information about volume 537466433 from the server
: No such device
Volume does not exist on server fileserver05 as indicated by the VLDB
Dump only information from VLDB
test.volume.5
RWrite: 537466431 Backup: 537466433
number of sites -> 1
server fileserver05 partition /vicepa RW Site
root@fileserver05:# vos listvol localhost -local
Total number of volumes on server localhost partition /vicepa: 6
test.volume.3 537465393 RW 4 K On-line
test.volume.3.backup 537465395 BK 4 K On-line
test.volume.4 537465396 RW 1539693624 K On-line
test.volume.4.backup 537465398 BK 1539693624 K On-line
test.volume.5 537466431 RW 99958788 K On-line
**** Could not attach volume 537466433 ****
So, was this change in behavior from 1.4.x to 1.6.x intentional, or are we
encountering a bug? Could it be caused by our database servers still
running 1.4.5?
We have scripts that periodically run 'vos listvol' across all our
fileservers and look for volumes that could not be attached or are offline.
This is one of the ways we monitor the availability of our volumes. With the
new behavior in 1.6.x, however, there is no easy way to tell at first glance
whether a volume has an actual problem or is simply in the process of being
dumped.
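A minimal sketch of that kind of check, assuming Python 3.9+ and the 'vos listvol' output format shown in the transcripts above. It separates "is busy" lines (what 1.4.x prints during a dump) from "Could not attach" lines (what 1.6.2 prints). The `dumping` parameter is hypothetical: it assumes the dump job can report which volume IDs it is currently processing, which is one possible way to tell a dump apart from a genuinely broken volume.

```python
import re

# Status lines for problem volumes, as they appear in the transcripts above.
BUSY_RE = re.compile(r"^\*{4} Volume (\d+) is busy \*{4}$")
NOATTACH_RE = re.compile(r"^\*{4} Could not attach volume (\d+) \*{4}$")
# Normal entry: name, ID, type, size, "K", status (e.g. On-line).
ENTRY_RE = re.compile(r"^(\S+)\s+(\d+)\s+(RW|RO|BK)\s+\d+\s+K\s+(\S+)$")


def classify_listvol(output: str, dumping: frozenset[int] = frozenset()) -> dict:
    """Sort volume IDs from 'vos listvol' output into status buckets.

    `dumping` is a hypothetical set of volume IDs the dump job reports as
    in progress; "Could not attach" for those IDs is treated as expected.
    """
    result = {"busy": [], "not_attached": [], "dumping": [], "offline": []}
    for line in output.splitlines():
        line = line.strip()
        if m := BUSY_RE.match(line):
            result["busy"].append(int(m.group(1)))
        elif m := NOATTACH_RE.match(line):
            vid = int(m.group(1))
            bucket = "dumping" if vid in dumping else "not_attached"
            result[bucket].append(vid)
        elif (m := ENTRY_RE.match(line)) and m.group(4) != "On-line":
            # Any listed volume whose status is not On-line.
            result["offline"].append(int(m.group(2)))
    return result
```

Run against the 1.6.2 listing above, the backup volume 537466433 lands in `not_attached` unless it is passed in `dumping`, which is exactly the ambiguity described: without outside knowledge from the dump job, the listing alone cannot distinguish the two cases.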
Thanks.