[OpenAFS] afsd causes crash for openafs1.2 and kernel 2.4.7 (fwd)

Derrick J Brashear <shadow@dementia.org>
Sat, 6 Oct 2001 10:00:03 -0400 (EDT)


On Fri, 21 Sep 2001, Derrick J Brashear wrote:

> On Fri, 21 Sep 2001 Warren.Yenson@morganstanley.com wrote:
> 
> > We have seen that we can repeatedly crash a Linux box running 2.4.7 and
> > OpenAFS 1.2 by doing operations in /afs that open a large number of files
> > or directories (e.g. du -sk).
> 
> It's an issue which has been known to me at least for some time, but which
> I have not yet been able to track (somewhat more due to lack of time than
> necessarily anything else) but it's near the top of my priority list at
> this point.
> 
> > Since we don't see this on our regular (Transarc) AFS on Solaris I'm
> > wondering if anyone knows of some kind of leak in this version of OpenAFS.
> 
> It's Linux-specific but (I believe provably) not 2.4 specific.

After considerable research I can tell you that:
- this issue exists in 2.2
- this issue is not SMP-specific
- this issue exists in the OpenAFS 1.0 base (essentially AFS 3.6)
- it's likely later IBM AFS versions have the same problem, but that is
  outside the scope of this.

Basically the problem is we're currently trying to play like everyone else
in the dentry universe, but in our case we have a fixed-size pool of
inodes (which to us look like vcache entries). So when d_add wires down a
directory inode, this effectively cuts into the pool of inodes available
to us. The default pool size is 300 entries, and we allow up to twice that
to be in use before we panic.
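
As a rough illustration of the pool arithmetic described above, here is a
small userspace C model. It is only a sketch under stated assumptions: it is
not OpenAFS or kernel code, every name in it (model_pool, model_pool_get,
and so on) is hypothetical, and it simplifies the real recycling logic. The
only figures taken from the post are the default of 300 and the ceiling of
twice that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Userspace model only. Every name here is hypothetical and exists
 * neither in OpenAFS nor in the Linux kernel. */
#define MODEL_DEFAULT_ENTRIES 300                      /* default quoted above */
#define MODEL_MAX_ENTRIES (2 * MODEL_DEFAULT_ENTRIES)  /* ceiling before "panic" */

struct model_pool {
    bool in_use[MODEL_MAX_ENTRIES];
    bool pinned[MODEL_MAX_ENTRIES];  /* wired down by a dcache entry */
};

void model_pool_init(struct model_pool *p)
{
    assert(p != NULL);
    memset(p, 0, sizeof(*p));
}

/* Hand out an entry, optionally pinning it. Returns the slot index, or
 * -1 at the point where the real client would panic. */
int model_pool_get(struct model_pool *p, bool pin)
{
    int i;

    assert(p != NULL);

    /* Prefer a slot that has never been handed out. */
    for (i = 0; i < MODEL_MAX_ENTRIES; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = true;
            p->pinned[i] = pin;
            return i;
        }
    }
    /* Otherwise recycle any slot that is not wired down. */
    for (i = 0; i < MODEL_MAX_ENTRIES; i++) {
        if (!p->pinned[i]) {
            p->pinned[i] = pin;
            return i;
        }
    }
    return -1;  /* every slot is pinned: nothing left to recycle */
}
```

In this model, unpinned requests can go on indefinitely because old slots
are recycled, while pinned requests start failing once all
MODEL_MAX_ENTRIES slots are wired down, which is the situation a large
du -sk run can provoke.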

Realistically the leak seems to affect only volume root vnodes. However,
after some thought I believe the correct fix, given how dangerous it is in
our world to have too many entries wired down, is to wire down none of
them.

Currently we call d_add in afs_linux_lookup(). We cannot remove this d_add,
as that would preclude us from returning valid data. However, there is a
way to avoid actually keeping dcache entries: immediately after the call to
d_add, a call to d_drop on the same dcache entry is added. The performance
implication is that we lose the dcache; however, the AFS cache still has
this data, so not dcaching does not cause any extra network traffic.
Extensive testing proves this fixes the problem.
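
The add-then-drop ordering can be sketched in userspace C as follows. This
is an illustrative model, not the committed patch: model_dentry,
model_d_add, model_d_drop and model_lookup are hypothetical stand-ins, and
the real kernel functions d_add() and d_drop() operate on struct dentry and
struct inode. The point it tries to show is that the caller still gets a
dentry with valid inode data, while the entry does not stay hashed in the
dcache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry: just the two facts we care about. */
struct model_dentry {
    bool hashed;    /* findable through the dcache hash */
    int  inode_id;  /* -1 means no inode attached */
};

/* Model of d_add: attach the inode and hash the entry. */
void model_d_add(struct model_dentry *d, int inode_id)
{
    assert(d != NULL);
    d->inode_id = inode_id;
    d->hashed = true;
}

/* Model of d_drop: unhash the entry but leave the inode attached. */
void model_d_drop(struct model_dentry *d)
{
    assert(d != NULL);
    d->hashed = false;
}

/* Model of the lookup path. With drop_after_add set, the result is
 * valid for the caller but is never left in the dcache. */
struct model_dentry model_lookup(int inode_id, bool drop_after_add)
{
    struct model_dentry d = { false, -1 };

    model_d_add(&d, inode_id);
    if (drop_after_add)
        model_d_drop(&d);
    return d;
}
```

Calling model_lookup(42, true) returns an entry whose inode_id is 42 and
whose hashed flag is false; with false as the second argument the entry
stays hashed, which corresponds to the behaviour before the fix.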

The fix will be committed shortly and will be in the next OpenAFS version,
which, in the interest of making people happy who run Linux releases done
since OpenAFS 1.2.1, should be available within the next week. (And of
course, once this is done, releases should be driven far less frequently
by Linux kernel changes.)

-D