[OpenAFS] performance and udp buffers

Simon Wilkinson sxw@your-file-system.com
Sun, 18 Nov 2012 21:34:04 +0000


On 9 Oct 2012, at 10:24, Dan Van Der Ster wrote:
> We currently run fileservers with udpsize=2MB, and at that size we
> have a 30 client limit in our test environment. With a buffer
> size=8MB (increased kernel max with sysctl and fileserver option), we
> don't see any dropped UDP packets during our client-reading stress
> test, but still get some dropped packets if all clients write to the
> server. With a 16MB buffer we don't see any dropped packets at all in
> reading or writing.

This was discussed in Edinburgh as part of the CERN site report (which
I'd recommend to anyone interested in AFS server performance); however,
it's just occurred to me that nothing made it back to the list. As I've
been looking at this whole area in more detail for the work I'm doing
on YFS's RX stack, I thought it would be worth summarising what's
happening here.

Sizing the UDP buffer for RX is tricky because, unlike TCP, a single
UDP buffer has to be large enough to handle all of the currently
outstanding streams (TCP has a buffer per connection; we have a single
buffer per server).

In order to avoid any packet loss at all, the UDP buffer has to be big
enough to handle all of the packets which may be in flight at a
particular moment in time. For each simultaneous call that the server
can handle, there must be a full window's worth of packets available.
Simultaneous calls are determined by the number of threads in the
server - so for a typical OpenAFS installation, this is 128 (threads) x
32 (window size) = 4096 packets. Calls which are "waiting for a thread"
can then each consume a single packet (they only have a window size of
1 until an application thread is allocated and data starts being
consumed). You also need packets to be able to send pings and various
other RX housekeeping - typically 1 packet for each client that a
server has "recently" seen.

With the 1.4.x series, packet loss was a bad thing - the fast recovery
implementation was broken, so any packet loss at all would put the
connection back to square one and have a significant effect on
throughput. With 1.6.x, packet loss is generally dealt with through
fast recovery, and the impact on throughput is smaller. That said,
avoiding unnecessary packet loss is always good!

For a heavily loaded OpenAFS server with 128 threads, I would plan to
have a receive buffer of around 5000 packets. If you increase the
number of threads, you should similarly increase the size of your
packet buffers.

Converting that number of packets into a buffer size is a bit of a dark
art. I'm only going to discuss the situation on Linux.

The first wrinkle is that internally Linux takes your selected buffer
size and doubles it. So setting a buffer of 8Mbytes actually permits
16Mbytes of kernel memory to be used by the RX socket.
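
You can see this directly by asking the kernel what it actually
reserved. A minimal sketch (error handling omitted; note that the
request is silently capped at net.core.rmem_max unless you raise that
sysctl first, or use SO_RCVBUFFORCE as a privileged process):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int requested = 8 * 1024 * 1024;    /* ask for 8Mbytes */
        int actual = 0;
        socklen_t len = sizeof(actual);

        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested,
                   sizeof(requested));
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len);

        /* On Linux the reported value is double what was requested. */
        printf("requested %d, kernel reports %d\n", requested, actual);

        close(fd);
        return 0;
    }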

The second wrinkle is that allocations from this buffer are counted
according to the way in which memory is managed by your network card,
by the socket buffer, and by the kernel's allocator. Very roughly, each
packet will take the MTU of your network, plus the socket buffer
overhead, rounded up to the next bucket used by the kernel allocator.
With a standard ethernet, the MTU will be 1500 bytes. The socket buffer
overhead depends on your kernel architecture and configuration, but is
potentially around 600 bytes on x86_64. This is large enough to put
each packet allocation into the 4096 byte allocator bucket.
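
To make that concrete, here is a toy calculation of the per-packet
charge under those rough figures. The bucket() helper below is just an
illustration of power-of-two rounding, not how the kernel actually
sizes its allocations.

    #include <stdio.h>

    /* Round a size up to the next power-of-two bucket - a very crude
     * model of the kernel allocator's size classes. */
    static unsigned long bucket(unsigned long n)
    {
        unsigned long b = 32;
        while (b < n)
            b <<= 1;
        return b;
    }

    int main(void)
    {
        unsigned long mtu      = 1500;  /* standard ethernet          */
        unsigned long overhead = 600;   /* rough skb overhead, x86_64 */

        /* 1500 + 600 = 2100 bytes, landing in the 4096-byte bucket */
        printf("per-packet charge: %lu bytes\n",
               bucket(mtu + overhead));
        return 0;
    }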

So, setting a UDP buffer of 8Mbytes from user space is _just_ enough to
handle 4096 incoming RX packets on a standard ethernet (see the
back-of-the-envelope check after the list below). However, it doesn't
give you enough overhead to handle pings and other management packets.
16Mbytes should be plenty, providing that you don't:

a) Dramatically increase the number of threads on your fileserver
b) Increase the RX window size
c) Increase the ethernet frame size of your network (what impact this
has depends on the internals of your network card implementation)
d) Have a large number of 1.6.0 clients on your network
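
And the promised check, again treating the 4096-byte per-packet charge
from above as a rough assumption rather than an exact figure:

    #include <stdio.h>

    int main(void)
    {
        unsigned long requested  = 8UL * 1024 * 1024;  /* userspace   */
        unsigned long effective  = requested * 2;      /* doubled     */
        unsigned long per_packet = 4096;               /* from above  */

        /* 16Mbytes / 4096 bytes = 4096 packets - exactly the window
         * budget, with nothing spare for pings and housekeeping.   */
        printf("packets that fit: %lu\n", effective / per_packet);
        return 0;
    }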

To summarise, and to stress Dan's original point - if you're running
with the fileserver default buffer size (64k, 16 packets), or with the
standard Linux maximum buffer size (128k, 32 packets), you almost
certainly don't have enough buffer space for a loaded fileserver.

Hope that all helps!

Cheers,

Simon