[OpenAFS] console messages: "all buffers locked"
Wed, 28 Oct 2009 00:20:15 +0000
On 27 Oct 2009, at 13:51, Rainer Toebbicke wrote:
> Simon Wilkinson schrieb:
>> All of that is a long winded way of saying I don't really know
>> what's causing your issue. One key diagnostic question is whether
>> the cache manager continues to operate once it's run out of
>> buffers. If we have a reference count imbalance somewhere, then the
>> machine will never recover, and will report a lack of buffers for
>> every operation it performs. If the cache manager does recover,
>> then it may just mean that we need to look at either having a
>> larger number of buffers, or making our buffer allocation dynamic.
>> Both should be pretty straightforward, for Linux at least.
>> What happens to your clients once they've hit the error?
> In two cases AFS continued to work. In two others, however, all
> AFS now stops after /afs, and eventually the looong lines with 'all
> buffers lockedall buffers lockedall buffers locked' (you could add a
> "\n" to your patch while you're at it) appear in the syslog.
It wouldn't surprise me if some codepaths tie themselves in knots when
DRead returns NULL - it's a rare enough occurrence (and one which used
to just panic, rather than printing the warning message) that it's
probably not been widely examined. The two that never manage to free
their lockers are interesting though - can you get cmdebug and
alt-sysreq-t output from them while they're stuck? (Please send that
privately, or to RT, rather than to the list.)
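
To illustrate the kind of codepath I have in mind, here is a minimal
sketch, not taken from the OpenAFS tree. buf_read() and buf_release()
are hypothetical stand-ins for a DRead-like call and its matching
release, and the two-entry pool is only there to make the failure easy
to hit. The point is that a caller holding one buffer has to drop it
again when a second read returns NULL; otherwise the count never
recovers.

```c
#include <assert.h>
#include <stddef.h>

#define NBUF 2                      /* deliberately tiny pool */

/* Hypothetical buffer header; not the real OpenAFS structure. */
struct buf {
    int refcount;                   /* 0 means the buffer is free */
    int page;                       /* page currently held */
};

static struct buf pool[NBUF];

/* Stand-in for a DRead-like call: returns a referenced buffer,
 * or NULL when every buffer is already held. */
struct buf *buf_read(int page)
{
    for (int i = 0; i < NBUF; i++) {
        if (pool[i].refcount == 0) {
            pool[i].refcount = 1;
            pool[i].page = page;
            return &pool[i];
        }
    }
    return NULL;                    /* the "all buffers locked" case */
}

/* Drop the reference taken by buf_read(). */
void buf_release(struct buf *b)
{
    b->refcount--;
}

/* Count buffers that are still referenced. */
int buffers_held(void)
{
    int n = 0;
    for (int i = 0; i < NBUF; i++)
        if (pool[i].refcount > 0)
            n++;
    return n;
}

/* A caller that needs two buffers at once. */
int use_two_pages(int page_a, int page_b)
{
    struct buf *a = buf_read(page_a);
    if (a == NULL)
        return -1;                  /* nothing held yet, safe to bail */

    struct buf *b = buf_read(page_b);
    if (b == NULL) {
        buf_release(a);             /* without this, 'a' is never freed */
        return -1;
    }

    /* ... work with both buffers here ... */

    buf_release(b);
    buf_release(a);
    return 0;
}
```

If the buf_release(a) in the failure branch is missing, every failed
call leaves one more buffer pinned, which would look like the machines
that never recover.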
I hopefully did add a \n, too.
> I'll see if I can crank the 50 up an order of magnitude and track
> the increases. However, this *is* a stress test with about 100
> parallel "jobs" per client, not yet necessarily a leak, and even 25
> simultaneous "FindBlobs" aren't unthinkable.
I suspect that ultimately, we're going to need to make the buffer
structures dynamically allocated, with some kind of high and low
watermark system. Each buffer takes up slightly more than 2k of memory
- so having a large number permanently allocated is a little
anti-social on platforms with limited memory.
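
As a rough sketch of what a watermark scheme could look like (this is
an illustration under my own assumptions, not existing OpenAFS code):
the pool keeps low_water buffers allocated at all times, grows on
demand up to high_water, and gives memory back once demand drops. All
names here (buf_pool, pool_get, pool_put and so on) are hypothetical,
and there is no locking, which a real cache manager would need. At
roughly 2k each, 50 buffers cost about 100k and 500 about 1M, which is
why the upper limit should only be reached under load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BUF_SIZE 2048               /* payload; assumed from "slightly more than 2k" */

/* Hypothetical buffer; not the real OpenAFS structure. */
struct buf {
    struct buf *next;               /* free-list link */
    char data[BUF_SIZE];
};

struct buf_pool {
    struct buf *free_list;          /* idle buffers kept for reuse */
    size_t nfree;                   /* number of idle buffers */
    size_t ntotal;                  /* idle plus in-use buffers */
    size_t low_water;               /* always keep this many allocated */
    size_t high_water;              /* never allocate more than this */
};

/* Free every idle buffer; in-use buffers are the caller's problem. */
void pool_destroy(struct buf_pool *p)
{
    while (p->free_list != NULL) {
        struct buf *b = p->free_list;
        p->free_list = b->next;
        free(b);
        p->nfree--;
        p->ntotal--;
    }
}

/* Preallocate low_water buffers. Returns 0 on success, -1 on error. */
int pool_init(struct buf_pool *p, size_t low, size_t high)
{
    if (low > high)
        return -1;
    p->free_list = NULL;
    p->nfree = 0;
    p->ntotal = 0;
    p->low_water = low;
    p->high_water = high;
    while (p->ntotal < low) {
        struct buf *b = malloc(sizeof(*b));
        if (b == NULL) {
            pool_destroy(p);
            return -1;
        }
        b->next = p->free_list;
        p->free_list = b;
        p->nfree++;
        p->ntotal++;
    }
    return 0;
}

/* Hand out a buffer: reuse an idle one, else grow up to high_water. */
struct buf *pool_get(struct buf_pool *p)
{
    struct buf *b = p->free_list;
    if (b != NULL) {
        p->free_list = b->next;
        p->nfree--;
        b->next = NULL;
        return b;
    }
    if (p->ntotal >= p->high_water)
        return NULL;                /* caller sees "all buffers locked" */
    b = malloc(sizeof(*b));
    if (b != NULL) {
        b->next = NULL;
        p->ntotal++;
    }
    return b;
}

/* Return a buffer: release memory while above low_water, else keep it. */
void pool_put(struct buf_pool *p, struct buf *b)
{
    if (p->ntotal > p->low_water) {
        free(b);
        p->ntotal--;
        return;
    }
    b->next = p->free_list;
    p->free_list = b;
    p->nfree++;
}
```

Freeing immediately above the low watermark is the simplest policy; a
real implementation would probably want some hysteresis so that a
steady load near the limit doesn't bounce through malloc and free on
every request.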