[OpenAFS-devel] volume-dont-artificially-untimeout-vlservers-20061218 updated?

Christopher Allen Wing wingc@engin.umich.edu
Fri, 2 Feb 2007 17:34:40 -0500 (EST)


Jeff:


On Tue, 30 Jan 2007, Jeffrey Hutzelman wrote:

>> I think that the botched version of the CVS delta:
>>
>>  	volume-dont-artificially-untimeout-vlservers-20061218
>> 
>> is crashing some of our AFS clients.  I noticed that the fixed version of
>> this patch made it into CVS yesterday.
>
> This doesn't surprise me much; I suspected that this might cause issues for 
> any client which actually saw a vlserver go down.

We saw kernel crashes with a backtrace like this:

 	(crash)
 	InstallVolumeEntry()
 	afs_SetupVolume()
 	afs_NewVolumeByName()
 	...
 	...

I think what happened to us was that when the defective code inside 
afs_NewVolumeByName() ran, it left garbage in the newly allocated 
struct {,n,u}vldbentry.  The following loop inside InstallVolumeEntry() 
then iterates up to whatever garbage value was in the nServers field of 
the structure, which could be a billion or more:

     /* Step through the VLDB entry making sure each server listed is there */
     for (i = 0, j = 0; i < ve->nServers; i++) {
         if (((ve->serverFlags[i] & mask) == 0)
             || (ve->serverFlags[i] & VLSF_DONTUSE)) {
             continue;           /* wrong volume or  don't use this volume */
         }


While executing code inside that loop, a kernel watchdog would eventually 
trigger, I think simply because of the time spent looping in the kernel.  
Unhelpfully, this just made the whole machine hang hard.
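
To illustrate the failure mode, here is a minimal sketch of a defensive 
bound on the server count before looping.  This is not the fix that went 
into CVS and it is not OpenAFS code: the struct, the SKETCH_* constants, 
and sketch_count_usable() are all hypothetical stand-ins, and the array 
size is an assumed value chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real OpenAFS definitions. */
#define SKETCH_MAX_SERVERS 13   /* assumed capacity of the flags array */
#define SKETCH_DONTUSE     0x20 /* assumed "do not use this server" bit */

struct sketch_vldbentry {
    int nServers;
    int serverFlags[SKETCH_MAX_SERVERS];
};

/*
 * Count the usable servers in an entry.  Returns -1 if nServers is
 * outside the range the fixed-size array can hold, so that a garbage
 * count is rejected instead of driving a very long loop.
 */
int sketch_count_usable(const struct sketch_vldbentry *ve, int mask)
{
    int i, n = 0;

    assert(ve != NULL);
    if (ve->nServers < 0 || ve->nServers > SKETCH_MAX_SERVERS)
        return -1;              /* garbage count: refuse to loop */

    for (i = 0; i < ve->nServers; i++) {
        if ((ve->serverFlags[i] & mask) == 0
            || (ve->serverFlags[i] & SKETCH_DONTUSE))
            continue;           /* wrong volume type or marked unusable */
        n++;
    }
    return n;
}
```

With a check like this, an uninitialized entry would produce an error 
return rather than a loop long enough to trip the watchdog.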

> Fortunately, we had the 
> chance to fix it before 1.4.3 final.  I'm glad there are people out there 
> deploying release candidates.

The way I see it, if anything goes wrong with the code I'm using, chances 
are I'd just be asked to upgrade to the next release (candidate) anyway.

I do browse CVS frequently (mainly via the openafs.org web interface), 
and I try to read the details of which deltas have gone in before 
deciding what to deploy.

>> My question is, why doesn't the delta name change in this case?
>
> Because the gatekeepers chose to treat it as part of the same delta. 
> Personally, I wish they wouldn't do this, especially when there's a release 
> in between.  It also confuses wdelta and some other tools if there happen to 
> have been any other commits to the affected files between the two parts of 
> the delta.

Ok.

> Currently, wdelta's sort-by-date uses the timestamp that is part of the delta 
> name.  Most of the time, this works fine.  Actually using the timestamp of 
> the last commit would be harder, because we'd have to inspect the CVS data 
> for each affected file to find the timestamps.  I'll look into it, but no 
> promises at this point.
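
As a rough illustration of sorting by the timestamp embedded in the delta 
name, the sketch below pulls a trailing YYYYMMDD stamp out of a name.  It 
is not wdelta's actual code; sketch_delta_date() is a hypothetical helper, 
and it assumes the name ends in a hyphen followed by eight digits.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Return the trailing YYYYMMDD stamp of a delta name as a long,
 * or -1 if the name does not end in '-' followed by eight digits.
 */
long sketch_delta_date(const char *name)
{
    size_t len;
    const char *p;
    long date = 0;
    int i;

    assert(name != NULL);
    len = strlen(name);
    if (len < 9 || name[len - 9] != '-')
        return -1;

    p = name + len - 8;         /* start of the eight-digit stamp */
    for (i = 0; i < 8; i++) {
        if (!isdigit((unsigned char)p[i]))
            return -1;
        date = date * 10 + (p[i] - '0');
    }
    return date;
}
```

Under this scheme, a later commit made under the same delta name still 
sorts by the date in the name (20061218 here), not by the date it was 
actually committed.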

I understand.  Thanks for looking into this,

Chris Wing
wingc@engin.umich.edu