[OpenAFS] Suspect AFS bottlenecks on a web server

Simon Wilkinson sxw@inf.ed.ac.uk
Sat, 21 Nov 2009 11:46:11 +0000

On 18 Nov 2009, at 23:51, Jason Edgecombe wrote:

> Nate Gordon wrote:
>> As someone who also runs AFS as the backend to a webserver, I can understand
>> your problems.  My problems stem more specifically from PHP on AFS and that
>> PHP the language feels it is necessary to perform lots and lots of trivial
>> stat operations.  I have theorized that there are some global locking issues

This is the crux of the problem. Sadly, the AFS kernel module has a  
single global lock, which it uses to prevent two processes from being  
in the module at the same time. This does lead to contention,  
especially around operations like lookup and getattr, which  
applications expect to be low-cost. I do have a cunning plan to get  
round this, but it's going to require a bit more thought, and a lot of  
testing, before it's ready to see the light of day.

In addition to our own global lock, we also hold the Linux Big Kernel  
Lock around most of our VFS operations. This means that not only can  
we never run concurrently, but a number of other kernel operations are  
prevented from doing so, too. Matt Benjamin has done some work in the  
1.5 tree which suggests that we can get rid of the BKL when we're  
using memory cache - I suspect that we may be able to generalise this  
to remove it for many operations, even when the disk cache is in use.
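
A rough way to check whether this kind of serialisation is what you are hitting is to time a batch of stat calls with an increasing number of threads, once against a directory in AFS and once against local disk. The script below is only a sketch under assumptions: the helper names (collect_paths, stat_rate) are made up for illustration, the directory to test is passed on the command line, and Python's own overhead plus the client's caches will blur the numbers. If elapsed time on AFS stays flat or gets worse as threads are added, while local disk improves, that is at least consistent with lookups and getattrs queueing behind one lock.

```python
import os
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def collect_paths(root, limit=10000):
    """Walk root and return up to `limit` file paths to stat."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
            if len(paths) >= limit:
                return paths
    return paths


def _stat_one(path):
    """Stat a single path; return True on success, False on any OS error."""
    try:
        os.stat(path)
        return True
    except OSError:
        return False


def stat_rate(paths, workers):
    """Stat every path using `workers` threads.

    Returns (number of successful stats, elapsed seconds).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        ok = sum(pool.map(_stat_one, paths))
    elapsed = time.perf_counter() - start
    return ok, elapsed


if __name__ == "__main__":
    # Hypothetical usage: python statbench.py /afs/your.cell/some/dir
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    paths = collect_paths(root)
    for workers in (1, 2, 4, 8):
        ok, elapsed = stat_rate(paths, workers)
        print(f"{workers} threads: {ok} stats in {elapsed:.3f}s")
```

Run it twice against each directory and ignore the first pass, since that one mostly measures cache fills rather than contention.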

> Derrick, I have 1.4.10 with the
> STABLE14-background-fsync-consistency-issues patch already compiled and
> ready to deploy. Would that be new enough to consider debugging?

If you are rolling out 1.4.10, then I would recommend that you disable  
the dynamic vcache support in it. Whilst dynamic vcaches are a huge  
improvement, the implementation in 1.4.10 aggressively minimises the  
number of vcaches that AFS holds by invalidating the Linux directory  
lookup cache every 5 minutes. If you are already seeing contention  
problems on lookup then this is likely to make things worse, by  
causing more time to be spent under the global lock.