[OpenAFS] 1.4.4 client on EL3: panic in afs_HashOutDcache

Derrick J Brashear shadow@dementia.org
Thu, 12 Apr 2007 10:46:01 -0400 (EDT)


On Thu, 12 Apr 2007, Stephan Wiesand wrote:

> On Wed, 11 Apr 2007, Derrick J Brashear wrote:
>
>> On Wed, 11 Apr 2007, Stephan Wiesand wrote:
>> 
>>> One of our systems panicked two times within 2 hours yesterday, at the 
>>> same location in the OpenAFS client. I attached the kernel's last words 
>>> below.
>>> 
>>> This is an SL3 system, kernel 2.4.21-47.0.1.ELsmp, i686. The client build 
>>> has two patches on top of 1.4.4: linux-task-pointer-safety-20070320 from 
>>> CVS, and the one from
>>> https://lists.openafs.org/pipermail/openafs-devel/2007-March/014985.html
[]
>> so basically you appear to have an unhashed dcache entry. Either there's a 
>> locking bug or something is becoming erroneously unhashed.
>> 
>> How reproducible is it?
>
> Good news: it is reproducible. The user confessed that he'd run "less than 
> 20" parallel rsyncs transferring data to our cell. The files are a mixture of 
> data and log files, with typical sizes of 15MB and 100kB.
>
> So I set up a dozen rsyncs to copy this data into another volume, and after 
> some 9 hours got the panic you find below.
>
> I'm going to repeat this exercise now, and will also try to make the panic 
> happen earlier (more rsyncs, read data from a faster source - any other
> ideas?).
>
> Just wondering what to do next then.

I'm thinking about a patch. I have something else I need to deal with but 
I will try to work something up after. There's a 3rd possibility, namely 
the missing object being mishashed. Instead of panicking, we can presumably 
just iterate over everything and dump state.
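
As a rough illustration of that idea (not the actual patch, and not the 
OpenAFS code), here is a minimal userspace sketch. The names struct entry, 
table, bucket_of, insert_at, find_anywhere, dump_state and unhash are all 
hypothetical stand-ins; the real client uses its own dcache structures, 
hash functions and locking, none of which are modelled here. The point is 
only that when the entry is not on its expected chain, a full scan can tell 
"mishashed" apart from "unhashed" before the state is dumped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NBUCKETS 8              /* hypothetical table size */

/* Hypothetical stand-in for a cache entry; NOT the OpenAFS struct dcache. */
struct entry {
    unsigned key;
    struct entry *next;
};

static struct entry *table[NBUCKETS];

static unsigned bucket_of(unsigned key) { return key % NBUCKETS; }

/* Link e at the head of the given chain. The bucket is passed explicitly
 * so that a mishashed entry can be simulated. */
static void insert_at(struct entry *e, unsigned bucket)
{
    e->next = table[bucket];
    table[bucket] = e;
}

/* Scan every chain for e; return its bucket index, or -1 if absent. */
static int find_anywhere(const struct entry *e)
{
    for (unsigned b = 0; b < NBUCKETS; b++)
        for (const struct entry *p = table[b]; p != NULL; p = p->next)
            if (p == e)
                return (int)b;
    return -1;
}

/* Print the keys on every chain. */
static void dump_state(FILE *out)
{
    for (unsigned b = 0; b < NBUCKETS; b++) {
        fprintf(out, "bucket %u:", b);
        for (const struct entry *p = table[b]; p != NULL; p = p->next)
            fprintf(out, " %u", p->key);
        fputc('\n', out);
    }
}

enum unhash_result { UNHASH_OK, UNHASH_MISHASHED, UNHASH_MISSING };

/* Remove e from its expected chain. If it is not there, report where it
 * actually is (if anywhere) and dump the table instead of aborting. */
static enum unhash_result unhash(struct entry *e, FILE *out)
{
    assert(e != NULL);
    unsigned want = bucket_of(e->key);

    for (struct entry **pp = &table[want]; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == e) {
            *pp = e->next;
            e->next = NULL;
            return UNHASH_OK;
        }
    }

    /* This is the spot where a panic would otherwise go. */
    int found = find_anywhere(e);
    fprintf(out, "key %u not on expected chain %u: ", e->key, want);
    if (found >= 0)
        fprintf(out, "found on chain %d (mishashed)\n", found);
    else
        fprintf(out, "not on any chain (unhashed)\n");
    dump_state(out);
    return found >= 0 ? UNHASH_MISHASHED : UNHASH_MISSING;
}
```

In this sketch, an entry inserted with insert_at(&e, bucket_of(e.key)) 
unhashes normally, one inserted on the wrong chain is reported as 
mishashed, and one never inserted is reported as unhashed. Returning a 
result code rather than aborting is just a choice made for the sketch, so 
the caller decides what to do after the dump.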

I suppose the other possibility would be to get a kernel crash dump, but 
it's sort of cumbersome to move those around, so unless you're comfortable 
with a debugger on a kernel dump, that's probably a non-starter.

Derrick