[OpenAFS] Unable to 'move' volume....volume ID too large / cloned volume not findable?

Jeffrey Altman jaltman@secure-endpoints.com
Sun, 22 Mar 2009 15:39:38 -0400

Giovanni Bracco wrote:
> As I wrote in my posting, at that time (2002) my institution was using the 
> Transarc version of AFS and the reaction from  Transarc team was ...to 
> provide us with a patched version of AFS, not to correct the issue. That 
> version of course was not compatible with OpenAFS due to the large value of 
> the VolIDs existing at that point in our cell. 

The patched version of AFS fixed the issue.  The issue is that in some
locations in the source a Volume Id is an unsigned 32-bit value and in
others (most notably clone ids) the value is a signed 32-bit value.
If a signed value is increased beyond 2^31-1 it will wrap and become a
negative value.  There is no condition under which a negative value will
be greater than Max Volume Id.
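
To make the wrap concrete, here is a minimal sketch in C. It is not
OpenAFS source; the helper names (store_in_signed_field,
signed_id_exceeds_max, unsigned_id_exceeds_max) are hypothetical and
exist only for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: what a signed 32-bit field holds after an
 * unsigned 32-bit volume id is stored in it bit-for-bit.  memcpy is
 * used so the reinterpretation is well defined (int32_t is two's
 * complement by definition). */
static int32_t store_in_signed_field(uint32_t id)
{
    int32_t out;
    memcpy(&out, &id, sizeof out);
    return out;
}

/* Signed comparison, as done where the id is a signed value. */
static int signed_id_exceeds_max(int32_t id, int32_t max_id)
{
    return id > max_id;
}

/* The same comparison with both operands unsigned. */
static int unsigned_id_exceeds_max(uint32_t id, uint32_t max_id)
{
    return id > max_id;
}
```

With an id of 2^31 (one past 2^31-1), store_in_signed_field() yields a
negative number, so signed_id_exceeds_max() reports false against a max
of 2^31-1, while unsigned_id_exceeds_max() reports true for the same
bit patterns.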

I'm sure that the fix that IBM implemented for you in 2002 was to change
all of the Volume Id fields so that they are unsigned 32-bit values.
IBM does not provide their internal bug reports and patches to OpenAFS
so we never knew about the issue.

> To perform the migration to OpenAFS 3 years later we had to go through a
> volume renumbering campaign (more than 1000 volumes) plus an ad-hoc
> modification of the vl database to reset the MaxVolID to a value supported by
> OpenAFS. At that point do you think we should have submitted a bug on a
> mysterious event that happened three years before on the Transarc AFS version?

You had to do this because OpenAFS did not have the patch that IBM
created and we didn't know that we needed to implement it ourselves.

> From the follow-up of the thread (postings by Hartmut Reuter and Rainer
> Toebbicke) I see that the "strange" big jump in the VolID still happens and
> surely the issue should be solved.

There are several locations where unsigned and signed 32-bit variables
containing volume ids are mixed either for comparison or computation.
The computation of the new maxvolid value is one such place where this
takes place.  It is quite likely that the mixture of signed and unsigned
values resulted in signed 32-bit overflow which in turn resulted in an
incorrect comparison and then assignment.  This in turn would result in
the big jump.
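
As an illustration only, here is one way a signed/unsigned mix can
produce such a jump. This is a sketch, not the OpenAFS maxvolid code;
update_max_mixed and update_max_unsigned are hypothetical names, and the
assumption is that an id which has already gone negative in a signed
field is later compared against an unsigned maximum.

```c
#include <stdint.h>

/* Hypothetical mixed-type update: the candidate id arrives through a
 * signed field, the running maximum is unsigned.  The conversion is
 * written out explicitly here; C applies the same conversion
 * implicitly when an int32_t is compared with a uint32_t on platforms
 * where int is 32 bits wide. */
static uint32_t update_max_mixed(uint32_t max_id, int32_t seen_id)
{
    uint32_t widened = (uint32_t)seen_id;  /* negative -> huge value */
    if (widened > max_id)
        max_id = widened;                  /* the "big jump" */
    return max_id;
}

/* The same update with every id unsigned: no negative value can
 * appear, so the maximum only moves to ids that really exist. */
static uint32_t update_max_unsigned(uint32_t max_id, uint32_t seen_id)
{
    return seen_id > max_id ? seen_id : max_id;
}
```

For example, update_max_mixed(536870912u, -2) returns 4294967294,
because -2 converts to 2^32-2 before the comparison.  With all ids
unsigned there is no negative intermediate to misinterpret.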

I have a patch attached to ticket 124510 which will (I hope) make all
references to volume ids unsigned (except in the cache manager) and
avoid the problems with signed overflow conditions.

I suspect this patch is similar to what IBM applied to their source
tree in 2002.

Jeffrey Altman