[OpenAFS] Suspect AFS bottlenecks on a web server
Jason Edgecombe
jason@rampaginggeek.com
Sat, 21 Nov 2009 10:10:45 -0500
Simon Wilkinson wrote:
>
> On 18 Nov 2009, at 23:51, Jason Edgecombe wrote:
>
>> Nate Gordon wrote:
>>>
>>> As someone who also runs AFS as the backend to a webserver, I can
>>> understand your problems. My problems stem more specifically from PHP
>>> on AFS and that PHP the language feels it is necessary to perform lots
>>> and lots of trivial stat operations. I have theorized that there are
>>> some global locking issues
>
> This is the crux of the problem. Sadly, the AFS kernel module has a
> single global lock, which it uses to prevent two processes from being
> in the module at the same time. This does lead to contention,
> especially around operations like lookup and getattr, which
> applications expect to be low-cost. I do have a cunning plan to get
> round this, but it's going to require a bit more thought, and a lot of
> testing, before it's ready to see the light of day.
>
> In addition to our own global lock, we also hold the Linux Big Kernel
> Lock around most of our VFS operations. This means that not only can
> we never run concurrently, but a number of other kernel operations are
> prevented from doing so, too. Matt Benjamin has done some work in the
> 1.5 tree which suggests that we can get rid of the BKL when we're
> using memory cache - I suspect that we may be able to generalise this
> to remove it for many operations, even when the disk cache is in use.
>
>> Derrick, I have 1.4.10 with the
>> STABLE14-background-fsync-consistency-issues patch already compiled
>> and ready to deploy. Would that be new enough to consider debugging?
>
> If you are rolling out 1.4.10, then I would recommend that you disable
> the dynamic vcache support in it. Whilst dynamic vcaches are a huge
> improvement, the implementation in 1.4.10 aggressively minimises the
> number of vcaches that AFS holds by invalidating the Linux directory
> lookup cache every 5 minutes. If you are already seeing contention
> problems on lookup then this is likely to make things worse, by
> causing more time to be spent under the global lock.

1. My web server currently runs on Solaris.
2. How do I disable dynamic vcache support?
3. I'm rolling out 1.4.10 because I already have it compiled and packaged
   for deployment (using the AFS package program, if that matters). I have
   already tested these binaries on other machines, and recompiling would
   mean restarting the whole test cycle.

I have some data suggesting that I need to increase the size of the
vcache: it shows about a 5% miss rate, versus a 2% miss rate on the
dcache. I'll tweak the vcache size if the upgrade doesn't improve things.

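
For reference, a miss rate here is just misses divided by total lookups.
Below is a minimal Python sketch of that arithmetic. The function name is
made up for illustration, and the counter values are hypothetical, picked
only to reproduce the rough 5% and 2% figures; they are not the actual
counters from my server.

```python
def miss_rate(hits: int, misses: int) -> float:
    """Return the cache miss rate as a percentage of all lookups."""
    total = hits + misses
    if total == 0:
        # No lookups recorded yet, so report 0% rather than dividing by zero.
        return 0.0
    return 100.0 * misses / total


# Hypothetical counter values, chosen only to illustrate the arithmetic.
vcache_pct = miss_rate(hits=9500, misses=500)
dcache_pct = miss_rate(hits=9800, misses=200)

print(f"vcache miss rate: {vcache_pct:.1f}%")
print(f"dcache miss rate: {dcache_pct:.1f}%")
```

Comparing the two percentages side by side is what suggests the vcache,
rather than the dcache, is the one to enlarge first.
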
Thanks,
Jason