[OpenAFS] Change in volume status during vos dump in OpenAFS 1.6.x

Andy Malato andym@oak.njit.edu
Wed, 13 Mar 2013 13:01:32 -0400 (EDT)

Hello Everyone,

We recently installed OpenAFS 1.6.2 on one of our fileservers in preparation 
for migrating the rest of our cell to the latest 1.6.x release.   One of the 
driving factors behind upgrading to 1.6.x is to support volumes larger than 2TB.

Currently, the rest of the servers in our cell are running a mixture of 1.4.x 
releases.  The database servers are all running 1.4.5.

Like most other sites, we dump our volumes daily to disk using 'vos dump' so 
that they can be backed up using our enterprise backup system.  While 
performing a dump of the volumes on the fileserver running 1.6.2 we noticed a 
changed behavior in the volume status (from what occurs in 1.4.x) while a 
dump is in progress.
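For reference, the nightly dump-to-disk step amounts to something like the 
sketch below.  The volume names and the /backups path are placeholders, not 
our real configuration, and DRY_RUN=1 just prints the vos commands so the 
logic can be exercised without an AFS client:

```shell
#!/bin/sh
# Sketch of a nightly dump-to-disk loop.  Volume names and the
# /backups destination are assumptions for illustration only.
DRY_RUN=1

dump_volume() {
    vol="$1"
    dest="/backups/${vol}.dump"
    if [ "$DRY_RUN" = 1 ]; then
        # Print the command rather than running it.
        echo "vos dump -id $vol -file $dest -localauth"
    else
        vos dump -id "$vol" -file "$dest" -localauth
    fi
}

for vol in my.volume.6 my.volume.7; do
    dump_volume "$vol"
done
```

The resulting dump files on disk are then picked up by the enterprise 
backup system.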

When a 'vos dump' is performed on a volume that lives on a 1.4.x fileserver, 
a 'vos ex' and 'vos listvol' have the following behavior:

root@fileserver02:# vos ex 537142259
  	**** Volume 537142259 is busy ****

      		RWrite: 537142257     Backup: 537142259
          	number of sites -> 1
  		   server fileserver02 partition /vicepb RW Site

root@fileserver02:# vos listvol localhost b -local

  	Total number of volumes on server localhost partition /vicepb: 6
  	my.volume.6                    	  537142257 RW   21191001 K On-line
  	my.volume.7                       536995501 RW    2362268 K On-line
  	my.volume.7.backup                536995532 BK    2362268 K On-line
  	my.volume.8                       537089944 RW     268280 K On-line
  	my.volume.8.backup                537089946 BK     268280 K On-line
  	**** Volume 537142259 is busy ****

However, on a fileserver running 1.6.2, a 'vos ex' against the 
volume being dumped reports that the volume does not exist.  
Furthermore, a vos listvol on the partition shows: 
'**** Could not attach volume 537466433 ****'.

root@fileserver05:# vos ex 537466433

  	Could not fetch the information about volume 537466433 from the server
  	: No such device
  	Volume does not exist on server fileserver05 as indicated by the VLDB

  	Dump only information from VLDB

      	    RWrite: 537466431     Backup: 537466433
              number of sites -> 1
  	       server fileserver05 partition /vicepa RW Site

root@fileserver05:# vos listvol localhost -local

  	Total number of volumes on server localhost partition /vicepa: 6
  	test.volume.3		          537465393 RW          4 K On-line
  	test.volume.3.backup      	  537465395 BK          4 K On-line
  	test.volume.4	                  537465396 RW 1539693624 K On-line
  	test.volume.4.backup              537465398 BK 1539693624 K On-line
  	test.volume.5                     537466431 RW   99958788 K On-line
  	**** Could not attach volume 537466433 ****

So was this change in behavior from 1.4.x to 1.6.x intentional, or are we
encountering a bug?  Perhaps it is caused by our DB servers still 
being at 1.4.5?

We have scripts that periodically run a vos listvol across all our fileservers 
and look for volumes that could not be attached or are offline.  This is 
one of the ways in which we monitor the availability of our volumes.  But 
with the new behavior in 1.6.x, there is no easy way at first glance to 
distinguish whether there is an actual problem with the volume or whether it 
is in the process of being dumped.
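To illustrate the kind of pattern match the monitoring scripts rely on, here 
is a stripped-down sketch run against canned listvol output (the real scripts 
pipe 'vos listvol <server>' instead of a here-string):

```shell
#!/bin/sh
# Rough sketch of the availability check, run against sample output.
sample='my.volume.6                       537142257 RW   21191001 K On-line
**** Volume 537142259 is busy ****
**** Could not attach volume 537466433 ****'

# On 1.4.x a dump in progress showed "is busy", so only "Could not
# attach" or Off-line meant real trouble.  On 1.6.x the volume being
# dumped also surfaces as "Could not attach", which this kind of
# check cannot tell apart from a genuinely broken volume.
echo "$sample" | grep -E 'Could not attach|Off-line|is busy'
```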