[OpenAFS-devel] fileserver parameters

Tom Keiser <tkeiser@gmail.com>
Sat, 18 Jun 2005 23:03:57 -0400


On 6/16/05, Roland Kuhn <rkuhn@e18.physik.tu-muenchen.de> wrote:
> Dear experts!
>
> We have been fighting with fileserver performance for a long time.
> Once I got the advice to use the single threaded fileserver, which
> helped, but didn't get me more than 10MB/s. Now we upgraded to Debian
> sarge (openafs 1.3.81), which comes again with the threaded server.
> With default settings we get 1MB/s (the underlying RAID can easily
> deliver >200MB/s, which shows that the VM settings are okay). Now I
> tried with -L -vc 10000 -cb 100000 -udpsize 12800, which brings it
> back to about 6MB/s (all numbers with >>1 simultaneous clients
> reading). This is still a factor 30 below the capabilities of the
> RAID (okay, we only have 1Gb/s ethernet ;-) ). I've seen excessive
> context switch rates (>>100000/s), which obviously don't happen with
> the single threaded fileserver.
>
> So, can anybody comment on these numbers? Those are dual Opteron
> boxes with enough RAM, so please make some suggestions as to what
> options I should try to get closer to the real performance of a fileserver...

This is very interesting.  On much, much older hardware (a 2x 300MHz
Sun E450 running Solaris 10) I can get >15MB/s aggregate off a single
FC-AL disk with >>1 clients over gigE with absolutely no tweaking of
fileserver parameters.  Of course, there are many performance
bottlenecks in multithreading that are actually exacerbated by faster
CPUs, so the results you're seeing are plausible.

I'd be interested in seeing a comparison of 1.3.81 and 1.3.84
performance.  Several threading patches were integrated between these
revisions, and it would be interesting to see how they affect your
problem.  I know on SPARC they are making a difference, but that
doesn't necessarily correlate to amd64.

If you upgrade to 1.3.84, there is another fileserver option that you
will want to experiment with: -rxpck.  Sometime after 1.3.81,
thread-local packet queues were integrated, and they may reduce your
context switch rate due to less contention for the global packet queue
lock.  The default value for -rxpck will give you approximately 500
rx_packet structures.  I recommend trying several values in the range
1000-5000.  At some point, you will reach an optimal tradeoff between
a small value that fits within your cache hierarchy, and a large value
that reduces the number of transfers between the thread-local and
global rx_packet queues.  Before submitting the thread-local patch to
RT, I was only able to test on a few architectures, and I'd like to
get feedback for amd64.

Another option you might care to experiment with is -p.  IIRC, the
default will give you 12 worker threads.  It sounds like many of your
worker threads are busy handling calls, but are constantly contending
over locks, and blocking on i/o.  You will need to experiment with
this, but you may find that reducing the number of worker threads will
actually improve performance by forcing new calls to queue up, thereby
allowing your active calls to complete with less contention.  Of
course, this won't alleviate the problems caused by blocking i/o.
Reducing this value too far is dangerous because some calls have high
latencies (e.g. some calls make calls to the ptserver).
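
Putting the -rxpck and -p suggestions together, a test invocation
might look something like this (the values are purely illustrative;
keep your existing -vc/-cb/-udpsize settings and sweep -rxpck and -p
one at a time):

   fileserver -L -vc 10000 -cb 100000 -udpsize 12800 -rxpck 2000 -p 8

I'd move -rxpck up and down within the 1000-5000 range and -p down
from the default of 12 while watching throughput and the context
switch rate.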

Have you looked at the xstat results from your servers?  afsmonitor is
a great little tool, and it can even dump these results periodically
to a log.  This data could help us to understand your workload.
Seeing those numbers would also help us with suggesting changes to
parameters in the volume package.


And now I'll digress and talk about the more fundamental issues.  After
spending a lot of time with dtrace, here's my list of 5 major
bottlenecks in rx and the fileserver:

1) 1:1 mapping of calls to threads

The 1:1 mapping of calls to threads was a fine model for LWP, but with
kernel threads we're doing a lot of unnecessary context switches.
Fixing this will involve a major rewrite of rx.  Basically, we have to
move call state off the stack, and implement an asynchronous
event-driven model for assigning calls to threads as they are ready to
proceed.  This would also require switching to asynchronous i/o.
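
To make that a little more concrete, here's the general shape I have
in mind.  This is a generic pthreads sketch with made-up names, not rx
code: call state lives in a heap object with an explicit state field,
and a small pool of workers pulls runnable calls off a queue instead
of each call owning a thread for its whole lifetime.

#include <pthread.h>
#include <stddef.h>

enum call_state { CALL_WAITING, CALL_RUNNABLE, CALL_DONE };

struct call {
    enum call_state state;
    struct call *next;                  /* runnable-queue linkage */
    /* decoded arguments, i/o continuation state, etc. would live here */
};

static struct call *runnable_head;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cv = PTHREAD_COND_INITIALIZER;

/* called from the receive path once a call has enough data to proceed */
void call_make_runnable(struct call *c)
{
    pthread_mutex_lock(&q_lock);
    c->state = CALL_RUNNABLE;
    c->next = runnable_head;
    runnable_head = c;
    pthread_cond_signal(&q_cv);
    pthread_mutex_unlock(&q_lock);
}

/* one of ~ncpus workers: advance a call's state machine, then go back
 * to the queue instead of sleeping on that call's i/o */
void *call_worker(void *arg)
{
    struct call *c;
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (runnable_head == NULL)
            pthread_cond_wait(&q_cv, &q_lock);
        c = runnable_head;
        runnable_head = c->next;
        pthread_mutex_unlock(&q_lock);
        /* advance_call(c) would issue asynchronous i/o and return
         * without blocking; its completion handler would call
         * call_make_runnable() again */
    }
    return NULL;
}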

2) single-threaded on udp receive

At the moment, one thread at a time calls recvmsg(), so we're
essentially limited to one CPU on the receive side.  After recvmsg()
returns, this thread will parse the packet header, possibly call into
the security object associated with the conn, and then route it
through the rx call mux.  If it's a new call, and hot threads is
enabled, this means we need to signal a waiting worker thread so it
can become the new listener, and we will then begin processing the new
call.  Otherwise, we drop the packet into the appropriate call
struct's receive queue, signal the waiting worker, and go back to
blocking on recvmsg().
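
For anyone following along, the receive side boils down to a single
loop of roughly this shape (a stripped-down illustration, not the real
rx listener, which also has to decode the rx header, consult the
security object, and route the packet through the call mux):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>

/* minimal single-listener UDP loop; one datagram per recvmsg() call */
void listener_loop(int sock)
{
    char buf[1500];                     /* about one ethernet MTU */
    struct sockaddr_in from;
    struct iovec iov;
    struct msghdr msg;

    iov.iov_base = buf;
    iov.iov_len = sizeof(buf);

    for (;;) {
        memset(&msg, 0, sizeof(msg));
        msg.msg_name = &from;           /* peer address comes back here */
        msg.msg_namelen = sizeof(from);
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        if (recvmsg(sock, &msg, 0) < 0)
            continue;
        /* ...parse the rx header, find or create the call, hand the
         * packet to a worker (or become the worker, with hot threads)... */
    }
}

Everything in that loop happens on one thread at a time, which is why
one CPU ends up being the practical ceiling on the receive side.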

Depending on your fileserver's workload, hot threads can be a good
thing or a bad thing.  They reduce the latency of new call creation,
but the tradeoff is increased latency between recvmsg() syscalls
following a new call.  If you want new call latency to be low, then
hot threads should be on, but if you want to maximize server
throughput, then I would turn hot threads off.

On a related note, we incur a lot of extra mode switches to handle ACKs.

I'm working on a patch to scale the number of concurrent listeners up
to the number of CPUs.  So far, it just involves adding a few mutex
enter/exits, and changing the serverproc logic a little bit.

3) can only receive one datagram per syscall

Well, I have to blame the standards bodies here.  For POSIX asynch
i/o, nobody bothered to put a sockaddr field in the aiocb struct.  Oh
well.  There is one possible way to handle this right now.  I don't
know if others would approve of this, but I've been looking into the
possibility of using libafs to handle the fileserver's RX endpoint in
the kernel, and only return to userspace when we have a call that's
ready to proceed.  Of course, this would only work for platforms where
we can build libafs (not sure about Linux 2.6), but that's quite a few.
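
For reference, the required members of the POSIX aiocb are roughly the
following (paraphrased, not copied from any particular header); note
that there is no analogue of msghdr's msg_name, so an asynchronous
read on a UDP socket can never tell you which peer a datagram came
from:

/* POSIX <aio.h> control block (required members, roughly) */
struct aiocb {
    int             aio_fildes;     /* file descriptor */
    off_t           aio_offset;     /* file offset */
    volatile void  *aio_buf;        /* buffer */
    size_t          aio_nbytes;     /* length of transfer */
    int             aio_reqprio;    /* request priority offset */
    struct sigevent aio_sigevent;   /* completion notification */
    int             aio_lio_opcode; /* listio operation */
    /* ...no sockaddr / msg_name equivalent anywhere... */
};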

4) blocking i/o

Ten years ago, asynch i/o was new and not exactly ready for prime
time.  Well, times have changed.  I think moving to an event-driven
asynchronous i/o model will allow us to keep the number of threads
close to the number of CPUs, which should drastically reduce context
switches.
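
As a rough sketch of what I mean (plain POSIX aio with thread
notification; the names and error handling are simplified, and this is
not fileserver code), a worker would queue the read and immediately go
service another call:

#include <aio.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>

/* completion callback: runs when the read finishes, instead of a
 * worker thread sitting blocked in read()/readv() */
static void read_done(union sigval sv)
{
    struct aiocb *cb = sv.sival_ptr;
    if (aio_error(cb) == 0) {
        ssize_t n = aio_return(cb);
        /* hand (cb->aio_buf, n) back to the call's state machine */
        (void)n;
    }
    free(cb);
}

/* queue an asynchronous read and return immediately */
int start_read(int fd, off_t off, void *buf, size_t len)
{
    struct aiocb *cb = calloc(1, sizeof(*cb));
    if (cb == NULL)
        return -1;
    cb->aio_fildes = fd;
    cb->aio_offset = off;
    cb->aio_buf = buf;
    cb->aio_nbytes = len;
    cb->aio_sigevent.sigev_notify = SIGEV_THREAD;
    cb->aio_sigevent.sigev_notify_function = read_done;
    cb->aio_sigevent.sigev_value.sival_ptr = cb;
    return aio_read(cb);
}

With something along these lines the thread count can stay near the
number of CPUs, since nobody has to sit blocked in the kernel just to
wait for the disk.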

5) storedata_rxstyle() / fetchdata_rxstyle()

There is a lot of room for improvement here.  As others have pointed
out, we make way too many readv/writev syscalls per MB of data.  Part
of the problem here is that readv/writev only take 16 iovecs, and each
cbuffer is a little less than the size of an ethernet mtu.  Thus,
we're moving very little data per mode switch.
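
To put a back-of-the-envelope number on it: 16 iovecs at a bit under
1500 bytes of payload each is on the order of 22KB per
readv()/writev(), so pushing 1MB through fetchdata/storedata costs
something like 45-50 of those calls for the file i/o alone, before you
count the socket side.  (The exact payload per cbuffer depends on the
rx packet layout, so treat those figures as rough.)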

I'd like to hack together a way to store multiple cbuffers
contiguously in an expanded rx_packet struct, which would let us move
a lot more data in 16 iovecs, but the security trailer is in the way.
I've only been able to come up with two ways of handling this, while
preserving the current zero-copy behavior.  The first way is to
separate the trailer and make the payload contiguous when copying into
the process's virtual address space during a syscall.  The second way
is to avoid the mode switches altogether by adding two new syscalls to
the afs_syscall mux that are basically rx equivalents of the
sendfile() and recvfile() syscalls.  Of course, both of these methods
assume libafs.  I guess if we wanted to be as crazy as the nfs guys,
we could just port the whole fileserver over to the osi api, and dump
it into the kernel...

While I'm talking about storedata, copyonwrite() is going to become a
headache as largefile becomes more heavily used.  I guess fixing that
will involve a new vice format :(


All of this is rather theoretical, and I only have a little bit of
code thus far.  I hope to find the time to make some of these changes
happen, but these are ambitious suggestions.  I can't promise
anything, unless other people with lots of time volunteer... ;)
Comments?  Criticisms?  Volunteers?

Regards,

--
Tom Keiser
tkeiser@gmail.com