[OpenAFS] Suspect AFS bottlenecks on a web server

Jason Edgecombe jason@rampaginggeek.com
Sat, 21 Nov 2009 10:10:45 -0500

Simon Wilkinson wrote:
> On 18 Nov 2009, at 23:51, Jason Edgecombe wrote:
>> Nate Gordon wrote:
>>> As someone who also runs AFS as the backend to a webserver, I can
>>> understand your problems.  My problems stem more specifically from
>>> PHP on AFS and that PHP the language feels it is necessary to perform
>>> lots and lots of trivial stat operations.  I have theorized that
>>> there are some global locking issues
> This is the crux of the problem. Sadly, the AFS kernel module has a 
> single global lock, which it uses to prevent two processes from being 
> in the module at the same time. This does lead to contention, 
> especially around operations like lookup and getattr, which 
> applications expect to be low-cost. I do have a cunning plan to get 
> round this, but it's going to require a bit more thought, and a lot of 
> testing, before it's ready to see the light of day.
> In addition to our own global lock, we also hold the Linux Big Kernel 
> Lock around most of our VFS operations. This means that not only can 
> we never run concurrently, but a number of other kernel operations are 
> prevented from doing so, too. Matt Benjamin has done some work in the 
> 1.5 tree which suggests that we can get rid of the BKL when we're 
> using memory cache - I suspect that we may be able to generalise this 
> to remove it for many operations, even when the disk cache is in use.
>> Derrick, I have 1.4.10 with the 
>> STABLE14-background-fsync-consistency-issues patch already compiled 
>> and ready to deploy. Would that be new enough to consider debugging?
> If you are rolling out 1.4.10, then I would recommend that you disable 
> the dynamic vcache support in it. Whilst dynamic vcaches are a huge 
> improvement, the implementation in 1.4.10 aggressively minimises the 
> number of vcaches that AFS holds by invalidating the Linux directory 
> lookup cache every 5 minutes. If you are already seeing contention 
> problems on lookup then this is likely to make things worse, by 
> causing more time to be spent under the global lock.
1. My web server currently runs on Solaris.
2. How do I disable dynamic vcache support?
3. I'm rolling out 1.4.10 because I already have it compiled and
packaged for deployment (using the AFS package program, if that
matters). I have already tested these binaries on other machines.
Recompiling would mean restarting the whole test cycle.

I have some data that suggests I need to increase the size of the
vcache: it shows about a 5% miss rate, versus about 2% on the dcache.
I'll tweak the vcache size if the upgrade doesn't improve things.
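
As a rough sketch of the comparison above, a miss rate is assumed here to
be misses divided by total lookups. The counter values below are made-up
placeholders chosen only to reproduce the quoted percentages, not
measured data, and the function name is hypothetical.

```python
def miss_rate(hits: int, misses: int) -> float:
    """Return misses as a fraction of all lookups (0.0 if there were none)."""
    total = hits + misses
    if total == 0:
        return 0.0
    return misses / total


# Placeholder (hits, misses) pairs, NOT real measurements: they are picked
# so the output matches the rough 5% and 2% figures mentioned above.
counters = {
    "vcache": (9500, 500),
    "dcache": (9800, 200),
}

for name, (hits, misses) in counters.items():
    print(f"{name}: {miss_rate(hits, misses):.1%} miss rate")
```

On this reading, a vcache that misses noticeably more often than the
dcache is the cache worth enlarging first.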