[OpenAFS] console messages: "all buffers locked"

Simon Wilkinson sxw@inf.ed.ac.uk
Mon, 26 Oct 2009 21:17:23 +0000

On 26 Oct 2009, at 15:15, Rainer Toebbicke wrote:
> What I forgot to mention was that during that test those zillion  
> files are eventually removed. While unlinking, the dentry_cache  
> shrinks, whereas to my surprise the afs_inode_cache doesn't.

We only actually release space in the afs_inode_cache when the kernel  
asks us to destroy the inode. I think it will only do so when it  
believes it is running low on memory. 1.4.11 also contains changes  
(the dynamic inodes code) which mean that we try less hard to release  
unless we are forced into doing so.

In terms of the buffer code, I spent a bit of a train journey today  
looking into it. The error message you're getting means that  
afs_newslot can't find a buffer that isn't in use to supply to the  
'dir' package, to process the directory structure. Buffers aren't  
marked in use beyond a single call to the directory package (that is,  
if it returns holding a buffer, then that's a bug). Since Linux has a  
relatively low number of buffers configured (50), and the directory  
code uses 2 buffers for some lookups, this error would mean that you  
have 25 or more directory operations occurring simultaneously.

I find it hard to believe that it would be possible to get 25  
processes all in the directory code at once (although, with a tree of  
large directories and a massively parallel writer, it's not  
impossible), so I started to look for unbalanced reference counts, or  
locking issues in the buffer and directory code.

I found one locking issue, which was fixed back in 2002 in  
dir/buffer.c, but not in afs/afs_buffer.c. The fix for that is in gerrit  
as 737. However, I think I've convinced myself that the GLOCK  
serialises things sufficiently that this is purely a theoretical  
problem - I'd be surprised if you were seeing this in practice, and if  
you are, I think it would manifest itself in different ways.

The second issue that I found was with the way that newslot picks the  
oldest buffer to replace. There is an int32 counter, which is  
incremented each time a buffer is accessed, and the current value  
stored within that buffer as its 'accesstime'. If a buffer has a  
stored accesstime of 0x7fffffff, then newslot will never evict that  
buffer. I can't, however, see a practical way in which you can get 50  
buffers into this position. The fix for this is gerrit #738  
(http://gerrit.openafs.org/738). It might be worth giving it a whirl,  
and seeing if it helps.

All of that is a long-winded way of saying I don't really know what's  
causing your issue. One key diagnostic question is whether the cache  
manager continues to operate once it's run out of buffers. If we have  
a reference count imbalance somewhere, then the machine will never  
recover, and will report a lack of buffers for every operation it  
performs. If the cache manager does recover, then it may just mean  
that we need to look at either having a larger number of buffers, or  
making our buffer allocation dynamic. Both should be pretty  
straightforward, for Linux at least.

What happens to your clients once they've hit the error?