[OpenAFS] console messages: "all buffers locked"
Simon Wilkinson
sxw@inf.ed.ac.uk
Mon, 26 Oct 2009 21:17:23 +0000
On 26 Oct 2009, at 15:15, Rainer Toebbicke wrote:
>
> What I forgot to mention was that during that test those zillion
> files are eventually removed. While unlinking, the dentry_cache
> shrinks, whereas to my surprise the afs_inode_cache doesn't.
We only actually release space in the afs_inode_cache when the kernel
asks us to destroy the inode, and I think it will only do so when it
believes it is running low on memory. 1.4.11 also contains changes
(the dynamic inodes code) which mean that we try less hard to release
inodes unless we are forced into doing so.
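To illustrate what I mean, here is a rough sketch of the general
Linux VFS pattern involved - the 'example_*' names are invented for
illustration, not our actual code, but it shows why the slab space
only comes back when ->destroy_inode runs:

/* Illustrative sketch of the standard Linux VFS pattern (invented
 * "example_" names, not the actual OpenAFS source): inodes live in a
 * dedicated slab cache, and a slot is only handed back to the slab
 * when the VFS calls ->destroy_inode, which normally happens when the
 * kernel reaps inodes under memory pressure, not at unlink time. */
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/init.h>

static struct kmem_cache *example_inode_cachep;

struct example_inode {
    struct inode vfs_inode;
    /* ... filesystem-private state ... */
};

static struct inode *example_alloc_inode(struct super_block *sb)
{
    struct example_inode *ei =
        kmem_cache_alloc(example_inode_cachep, GFP_KERNEL);
    /* (a real filesystem also initialises the embedded inode here) */
    return ei ? &ei->vfs_inode : NULL;
}

static void example_destroy_inode(struct inode *inode)
{
    /* The only point at which the slab slot is released, so the
     * cache stays large until the kernel decides to reap inodes. */
    kmem_cache_free(example_inode_cachep,
                    container_of(inode, struct example_inode, vfs_inode));
}

static const struct super_operations example_super_ops = {
    .alloc_inode   = example_alloc_inode,
    .destroy_inode = example_destroy_inode,
};

static int __init example_init(void)
{
    example_inode_cachep = kmem_cache_create("example_inode_cache",
                                sizeof(struct example_inode), 0,
                                SLAB_HWCACHE_ALIGN, NULL);
    return example_inode_cachep ? 0 : -ENOMEM;
}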
In terms of the buffer code, I spent a bit of a train journey today
looking into it. The error message you're getting means that
afs_newslot can't find a buffer that isn't in use to supply to the
'dir' package, which processes the directory structure. Buffers aren't
marked in use beyond a single call to the directory package (that is,
if it returns holding a buffer, then that's a bug). Linux has a
relatively low number of buffers configured (50), and the directory
code uses 2 buffers for some lookups, so this error would mean that
you have 25 or more directory operations occurring simultaneously.
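To make that arithmetic concrete, here is a minimal user-space sketch
of the discipline I'm describing - invented names and a deliberately
stripped-down pool, not the real OpenAFS structures:

/* A fixed pool with a per-buffer holder count. Once every entry has
 * lockers != 0, the next request fails: the "all buffers locked"
 * situation. 25 operations x 2 buffers each drains all 50. */
#include <stdio.h>

#define NBUFFERS 50

struct dirbuf {
    int lockers;    /* non-zero while a caller holds the buffer */
};

static struct dirbuf pool[NBUFFERS];

static struct dirbuf *buf_get(void)
{
    for (int i = 0; i < NBUFFERS; i++) {
        if (pool[i].lockers == 0) {
            pool[i].lockers++;
            return &pool[i];
        }
    }
    return NULL;    /* every buffer is in use */
}

static void buf_release(struct dirbuf *bp)
{
    bp->lockers--;  /* a missed release leaks the buffer forever,
                       which is the reference-count bug class I went
                       looking for below */
}

int main(void)
{
    struct dirbuf *held[NBUFFERS];
    int n = 0;

    for (int op = 0; op < 25; op++) {   /* 25 two-buffer lookups */
        held[n++] = buf_get();
        held[n++] = buf_get();
    }
    printf("51st request: %p\n", (void *)buf_get());   /* NULL */

    while (n > 0)
        buf_release(held[--n]);
    return 0;
}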
I find it hard to believe that it would be possible to get 25
processes all in the directory code at once (although, with a tree of
large directories and a massively parallel writer, it's not
impossible), so I started to look for unbalanced reference counts or
locking issues in the buffer and directory code.
I found one locking issue, which was fixed back in 2002 in
dir/buffer.c, but not in afs/afs_buffer.c. The fix for that is in
gerrit as 737. However, I think I've convinced myself that the GLOCK
serialises things sufficiently that this is purely a theoretical
problem - I'd be surprised if you were seeing this in practice, and if
you are, I think it would manifest itself in different ways.
The second issue that I found was with the way that newslot picks the
oldest buffer to replace. There is an int32 counter which is
incremented each time a buffer is accessed, with the current value
stored within that buffer as its 'accesstime'. If a buffer has a
stored accesstime of 0x7fffffff, then newslot will never evict that
buffer. I can't, however, see a practical way in which you can get 50
buffers into this position. The fix for this is gerrit #738
(http://gerrit.openafs.org/738). It might be worth giving this a
whirl and seeing if it helps.
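For the record, here is a simplified reconstruction of that eviction
problem - again with invented names, not the literal afs_buffer.c
code. The search seeds its 'lowest access time so far' with INT32_MAX
and uses a strict less-than, so a buffer already stamped 0x7fffffff
can never win the comparison, even when nobody holds it:

#include <stdint.h>
#include <stddef.h>

struct dirbuf {
    int lockers;            /* non-zero while a caller holds the buffer */
    int32_t accesstime;     /* global counter value at last access */
};

/* Pick the least-recently-used free buffer for reuse. */
struct dirbuf *pick_oldest(struct dirbuf *bufs, int nbufs)
{
    struct dirbuf *lp = NULL;
    int32_t lt = INT32_MAX;             /* 0x7fffffff */

    for (int i = 0; i < nbufs; i++) {
        if (bufs[i].lockers)
            continue;                   /* in use: not a candidate */
        if (bufs[i].accesstime < lt) {  /* strict '<': an accesstime of
                                           exactly 0x7fffffff never
                                           satisfies this, so that
                                           buffer is never selected */
            lp = &bufs[i];
            lt = bufs[i].accesstime;
        }
    }
    return lp;    /* NULL looks just like "all buffers locked" */
}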
All of that is a long-winded way of saying I don't really know what's
causing your issue. One key diagnostic question is whether the cache
manager continues to operate once it's run out of buffers. If we have
a reference count imbalance somewhere, then the machine will never
recover, and will report a lack of buffers for every operation it
performs. If the cache manager does recover, then it may just mean
that we need to look at either having a larger number of buffers, or
making our buffer allocation dynamic. Both should be pretty
straightforward, for Linux at least.
What happens to your clients once they've hit the error?
S.