[OpenAFS] 1.4.4 client on EL3: panic in afs_HashOutDcache
Derrick J Brashear
shadow@dementia.org
Thu, 12 Apr 2007 10:46:01 -0400 (EDT)
On Thu, 12 Apr 2007, Stephan Wiesand wrote:
> On Wed, 11 Apr 2007, Derrick J Brashear wrote:
>
>> On Wed, 11 Apr 2007, Stephan Wiesand wrote:
>>
>>> One of our systems panicked two times within 2 hours yesterday, at the
>>> same location in the OpenAFS client. I attached the kernel's last words
>>> below.
>>>
>>> This is an SL3 system, kernel 2.4.21-47.0.1.ELsmp, i686. The client build
>>> has two patches on top of 1.4.4: linux-task-pointer-safety-20070320 from
>>> CVS, and the one from
>>> https://lists.openafs.org/pipermail/openafs-devel/2007-March/014985.html
[]
>> so basically you appear to have an unhashed dcache entry. Either there's a
>> locking bug or something is becoming erroneously unhashed.
>>
>> How reproducible is it?
>
> Good news: it is reproducible. The user confessed that he'd run "less than
> 20" parallel rsyncs transferring data to our cell. The files are a mixture of
> data and log files, with typical sizes of 15MB and 100kB.
>
> So I set up a dozen rsyncs to copy this data into another volume, and after
> some 9 hours got the panic you find below.
>
> I'm going to repeat this exercise now, and will also try to make the panic
> happen earlier (more rsyncs, read data from a faster source - any other
> ideas?).
>
> Just wondering what to do next then.
I'm thinking about a patch. I have something else I need to deal with first,
but I will try to work something up afterwards. There's a 3rd possibility,
namely that the missing object is mishashed. Presumably, instead of
panicking, we can just iterate over everything and dump state.
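
Roughly, the "iterate everything and dump state" idea could look like the
sketch below. This is a toy hash table in plain C, not the actual dcache
code: every name in it (struct entry, NBUCKETS, hash_in, hash_out,
dump_table) is made up for illustration, and it ignores locking entirely.
It only shows how a failed removal can tell "in the wrong chain" apart from
"in no chain at all" instead of panicking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NBUCKETS 8                 /* toy table size (hypothetical) */

struct entry {                     /* stand-in for a cache entry */
    unsigned key;
    struct entry *next;            /* hash chain link */
};

enum hashout_result { HASHOUT_OK, HASHOUT_MISHASHED, HASHOUT_UNHASHED };

static struct entry *table[NBUCKETS];

static unsigned bucket_of(unsigned key) { return key % NBUCKETS; }

/* Insert e at the head of the chain its current key hashes to. */
static void hash_in(struct entry *e)
{
    assert(e != NULL);
    unsigned b = bucket_of(e->key);
    e->next = table[b];
    table[b] = e;
}

/* Walk every chain; return the bucket holding e, or -1 if none does. */
static int find_anywhere(const struct entry *e)
{
    for (int b = 0; b < NBUCKETS; b++)
        for (const struct entry *p = table[b]; p != NULL; p = p->next)
            if (p == e)
                return b;
    return -1;
}

/* Print every chain so the state can be inspected after the fact. */
static void dump_table(FILE *log)
{
    for (int b = 0; b < NBUCKETS; b++) {
        fprintf(log, "bucket %d:", b);
        for (const struct entry *p = table[b]; p != NULL; p = p->next)
            fprintf(log, " [%p key=%u]", (void *)p, p->key);
        fprintf(log, "\n");
    }
}

/* Remove e from its chain. Where a panic would otherwise happen,
 * scan the whole table, dump it, and report what was found. */
static enum hashout_result hash_out(struct entry *e, FILE *log)
{
    assert(e != NULL);
    unsigned b = bucket_of(e->key);

    for (struct entry **pp = &table[b]; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == e) {
            *pp = e->next;         /* unlink from the expected chain */
            e->next = NULL;
            return HASHOUT_OK;
        }
    }

    int found = find_anywhere(e);
    fprintf(log, "hash_out: entry %p key=%u not in bucket %u (found in %d)\n",
            (void *)e, e->key, b, found);
    dump_table(log);
    return found >= 0 ? HASHOUT_MISHASHED : HASHOUT_UNHASHED;
}
```

With something like this, calling hash_out() twice on the same entry reports
HASHOUT_UNHASHED the second time, while changing an entry's key after
hash_in() makes hash_out() report HASHOUT_MISHASHED. In both cases the dump
shows what the chains looked like at that moment.
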
I suppose the other possibility would be to get a kernel crash dump, but
it's sort of cumbersome to move those around, so unless you're comfortable
with a debugger on a kernel dump, that's probably a non-starter.
Derrick