[OpenAFS-devel] volume-dont-artificially-untimeout-vlservers-20061218 updated?
Christopher Allen Wing
wingc@engin.umich.edu
Fri, 2 Feb 2007 17:34:40 -0500 (EST)
Jeff:
On Tue, 30 Jan 2007, Jeffrey Hutzelman wrote:
>> I think that the botched version of the CVS delta:
>>
>> volume-dont-artificially-untimeout-vlservers-20061218
>>
>> is crashing some of our AFS clients. I noticed that the fixed version of
>> this patch made it into CVS yesterday.
>
> This doesn't surprise me much; I suspected that this might cause issues for
> any client which actually saw a vlserver go down.
We saw kernel crashes with a backtrace like this:
(crash)
InstallVolumeEntry()
afs_SetupVolume()
afs_NewVolumeByName()
...
...
I think what happened to us was that when the defective code inside
afs_NewVolumeByName() ran, it left garbage in the newly allocated
struct {,n,u}vldbentry. The following loop inside InstallVolumeEntry()
then iterated up to a billion times, or however many the garbage value in
the structure's nServers field called for:
    /* Step through the VLDB entry making sure each server listed is there */
    for (i = 0, j = 0; i < ve->nServers; i++) {
        if (((ve->serverFlags[i] & mask) == 0)
            || (ve->serverFlags[i] & VLSF_DONTUSE)) {
            continue;    /* wrong volume or don't use this volume */
        }
While that loop was running, a kernel watchdog would eventually trigger,
which unhelpfully made the whole machine hang hard. I think the watchdog
fired simply because of the time spent looping in the kernel.
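To illustrate the failure mode, here is a minimal standalone sketch. It is
not the OpenAFS code and not the fix that went into CVS. The struct, the
SKETCH_* constants and their values, and the function name are all
hypothetical stand-ins. It only shows how a sanity check on nServers would
keep a garbage count from driving the loop.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the names and values are assumptions for this sketch. */
#define SKETCH_MAXSERVERS 13      /* assumed upper bound on servers per entry */
#define SKETCH_DONTUSE    0x20    /* assumed "don't use this server" flag bit */

struct sketch_vldbentry {
    int nServers;                          /* number of valid slots below */
    int serverFlags[SKETCH_MAXSERVERS];    /* per-server flag bits */
};

/*
 * Count the servers whose flags match `mask` and are not marked don't-use.
 * Returns -1 instead of looping when nServers is outside the plausible
 * range, which is what an uninitialized entry would most likely contain.
 */
int sketch_count_usable(const struct sketch_vldbentry *ve, int mask)
{
    int i, j;

    if (ve == NULL || ve->nServers < 0 || ve->nServers > SKETCH_MAXSERVERS)
        return -1;    /* garbage count: refuse to walk the array */

    for (i = 0, j = 0; i < ve->nServers; i++) {
        if (((ve->serverFlags[i] & mask) == 0)
            || (ve->serverFlags[i] & SKETCH_DONTUSE)) {
            continue;    /* wrong volume type or don't use this server */
        }
        j++;
    }
    return j;
}
```

With an entry whose nServers holds something like 1000000000, the function
returns -1 immediately rather than spinning in the loop.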
> Fortunately, we had the
> chance to fix it before 1.4.3 final. I'm glad there are people out there
> deploying release candidates.
The way I see it, if anything goes wrong with the code I'm using, chances
are I'd just be asked to upgrade to the next release (candidate) anyway.
I do browse through the CVS frequently (mainly via the openafs.org web
interface), and I try to read the details on which deltas have gone in
before deciding what to deploy.
>> My question is, why doesn't the delta name change in this case?
>
> Because the gatekeepers chose to treat it as part of the same delta.
> Personally, I wish they wouldn't do this, especially when there's a release
> in between. It also confuses wdelta and some other tools if there happen to
> have been any other commits to the affected files between the two parts of
> the delta.
Ok.
> Currently, wdelta's sort-by-date uses the timestamp that is part of the delta
> name. Most of the time, this works fine. Actually using the timestamp of
> the last commit would be harder, because we'd have to inspect the CVS data
> for each affected file to find the timestamps. I'll look into it, but no
> promises at this point.
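As a rough illustration of sorting by the timestamp embedded in a delta
name, the sketch below pulls a trailing -YYYYMMDD suffix out of a name such
as the one in the subject line. This is not wdelta's code; the function name
and the assumption that the date is always the last eight digits after a
hyphen are mine.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return the trailing YYYYMMDD of a delta name as a
 * number usable as a sort key, or -1 if the name does not end in a hyphen
 * followed by exactly eight digits.
 */
long sketch_delta_date(const char *name)
{
    size_t len = strlen(name);
    const char *p;
    size_t k;

    if (len < 9)
        return -1;            /* too short to hold "-YYYYMMDD" */
    p = name + len - 8;       /* start of the candidate date */
    if (p[-1] != '-')
        return -1;
    for (k = 0; k < 8; k++) {
        if (!isdigit((unsigned char)p[k]))
            return -1;
    }
    return strtol(p, NULL, 10);
}
```

A key derived this way reflects only the date in the name, so a later
commit filed under the same delta name sorts with the original date.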
I understand. Thanks for looking into this,
Chris Wing
wingc@engin.umich.edu