[OpenAFS-devel] fileserver performance bottlenecks

Simon Wilkinson sxw@inf.ed.ac.uk
Fri, 13 May 2011 23:32:51 +0100


On 12 May 2011, at 12:48, Anton Lundin wrote:

> The smp-scaling in the fileserver is really bad. Has anyone done any
> profiling on what is causing this? Is any work being done on this?

In general, recent work on the fileserver has been focussing on
correctness rather than on performance. We do have a number of results
that point at poor SMP scaling of both the 1.4.x and (sadly) the 1.6.x
fileservers. In particular, many workloads seem to benefit from having
a lower number of threads than the maximum permitted. This is obviously
not ideal.

As Derrick noted, the first thing would be to try the 1.6.0 prerelease
fileserver. There are substantial changes in various parts of the
fileserver in 1.6.x, even if you don't end up running demand attach. As
far as I'm aware, little benchmarking has been performed on these
changes, so it would be very interesting to see how both the demand
attach and normal fileservers perform in your tests.

What has received substantial performance attention in the 1.6.x series
is the RX transport protocol. We know that the RX that will ship in
1.6.x is substantially faster than that in 1.4.x. If you are on an i?86
platform, some of these performance improvements will only be apparent
if you build for an i586 (or i686) architecture.
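
The build target most likely matters because some of the new RX code
uses atomic operations, which GCC can only inline as single locked
instructions when building for i486/i586 or later; a plain i386 build
falls back to taking a mutex. That is an assumption on my part, not
something benchmarked - a rough sketch of the general pattern,
illustrative only and not the actual RX code:

#include <stdio.h>
#include <pthread.h>

/* Sketch of the general idea only -- not OpenAFS source.  On an
 * i586/i686 (or x86_64) target GCC inlines the __sync builtin as a
 * single locked instruction; on a plain i386 target we fall back to a
 * mutex around every update. */
struct counter {
    int value;
    pthread_mutex_t lock;               /* only used by the fallback */
};

static void
counter_inc(struct counter *c)
{
#if defined(__i586__) || defined(__i686__) || defined(__x86_64__)
    __sync_add_and_fetch(&c->value, 1); /* one locked add, no mutex */
#else
    pthread_mutex_lock(&c->lock);       /* i386 fallback path */
    c->value++;
    pthread_mutex_unlock(&c->lock);
#endif
}

int
main(void)
{
    struct counter c = { 0, PTHREAD_MUTEX_INITIALIZER };
    counter_inc(&c);
    printf("counter = %d\n", c.value);
    return 0;
}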

There are also a couple of RX "features" that will cause single-user
workloads to scale particularly badly.

Firstly, hot threads. In a typical dispatcher architecture, one thread
reads from the network and hands incoming packets off to worker threads
to handle. This obviously entails a context switch, and the data has to
be passed between threads. To avoid this, RX has "hot threads": the
thread which receives an incoming packet is the one which handles it,
and the next free thread then starts listening on the network. So, the
thread handling a given packet is constantly changing. Where there is a
substantial amount of context associated with a packet (connection
data, volume data, inode data, etc), and successive packets end up on
threads scheduled on different cores, all of that data is constantly
being shuffled between their caches. You might find, therefore, that
disabling hot threads actually improves your performance.
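
To make the difference concrete, here is a rough sketch of the two
models in plain C. The helpers are made-up stubs, not the real RX
routines:

#include <pthread.h>
#include <unistd.h>

/* Made-up stubs so the sketch stands alone -- the real RX code is
 * nothing like this. */
struct packet { int id; };

static struct packet *
read_packet_from_network(void)
{
    static struct packet p;
    sleep(1);                       /* pretend to wait for the network */
    return &p;
}

static void process_packet(struct packet *p) { (void)p; }
static void hand_off_to_worker(struct packet *p) { (void)p; }
static void promote_next_idle_thread_to_listener(void) { }

/* Classic dispatcher: one dedicated thread reads from the socket and
 * hands every packet to a worker.  Each packet costs a context switch,
 * and the packet plus its connection/volume/inode context may have to
 * migrate to another core's cache. */
static void *
dispatcher_listener(void *arg)
{
    (void)arg;
    for (;;) {
        struct packet *p = read_packet_from_network();
        hand_off_to_worker(p);      /* wake a worker to do the real work */
    }
    return NULL;
}

/* Hot threads: whichever thread reads the packet also services it, and
 * the next free thread takes over listening.  No hand-off, but the
 * listening role keeps hopping from thread to thread. */
static void *
hot_thread(void *arg)
{
    (void)arg;
    for (;;) {
        struct packet *p = read_packet_from_network(); /* we were listening */
        promote_next_idle_thread_to_listener();        /* someone else listens */
        process_packet(p);                             /* while we do the work */
    }
    return NULL;
}

int
main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, dispatcher_listener, NULL);
    pthread_create(&t2, NULL, hot_thread, NULL);
    sleep(3);                       /* let the sketch spin briefly, then exit */
    return 0;
}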

Secondly, the way we round-robin threads. In effect, we use an LRU
queue to schedule idle threads. If we have five threads A, B, C, D and
E, then packet 1 will be handled by A, whilst B becomes the listener;
packet 2 goes to B, and C starts listening; packet 3 goes to C, packet
4 to D, packet 5 to E, and packet 6 back to A again. On a machine with
128 threads and only 4 cores, there's a lot of churn here. Pulling the
last entry, rather than the first, from the idle queue would solve this
problem - I have a patch for this change, but don't currently have
access to any machines to test it on.
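
To spell out the difference, here is a toy simulation of the two
scheduling orders. This is not the fileserver code and not my patch,
and it assumes a lightly loaded server where each thread is back on the
idle queue before the next packet arrives:

#include <stdio.h>

#define NTHREADS 5
#define NPACKETS 6

int
main(void)
{
    int head = 0, i;

    /* Current behaviour: take the thread that has been idle longest
     * (front of the LRU queue), so the work rotates A B C D E A ... */
    printf("pop oldest idler:      ");
    for (i = 0; i < NPACKETS; i++) {
        printf("%c ", 'A' + head);      /* least recently used thread */
        head = (head + 1) % NTHREADS;   /* it rejoins at the tail later */
    }
    printf(" (every packet lands on a cold thread)\n");

    /* Proposed behaviour: take the thread that went idle most recently
     * (back of the queue), so the same cache-warm thread is reused. */
    printf("pop most recent idler: ");
    for (i = 0; i < NPACKETS; i++)
        printf("%c ", 'A' + (NTHREADS - 1));  /* always the same thread */
    printf(" (one warm thread does all the work)\n");

    return 0;
}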

It's worth noting that both of these are likely to be particular issues
in the single-user case. On busier fileservers (where the number of
connections is greater than the number of cores) there will inevitably
be churn, so I suspect that the performance degradation as cores come
online will be much less marked.

Cheers,

Simon.