[OpenAFS] 1.4.8, Rx Performance Improvements, and a Small Business Innovative Research grant

Jeffrey Altman jaltman@secure-endpoints.com
Fri, 03 Oct 2008 08:54:22 -0400


Rainer Toebbicke wrote:
> Just a few findings on RX from my previous (vain) attempts to make it
> "lightning fast" - perhaps they trigger ideas for whoever is still
> working on it or corrections from those who know better:
> 
> 1. as latency grows when crossing routers or even public networks, the
> default window of 32 packets is too small. On the other hand, the cost
> of handling the transmission queue grows with n**2, and even fast
> processors are quickly overwhelmed. Here's where "oprofile" is a
> valuable tool. Some of this can be reduced with queue hints, wisely
> posting retransmit events, and trying to avoid scanning the whole
> queue in several places;
> 
> 2. jumbograms are a pain: years ago we had a research network dropping
> fragmented packets and spent weeks pinning that down. Currently we
> suspect another one. Firewalls choke on them. They also increase the
> complexity of access lists in routers. And of course the probability
> of having to retransmit the whole jumbogram because one fragment got
> lost increases. What makes me frown is that it is apparently faster
> for the kernels to split and reassemble jumbograms on the fly than
> for Rx to do it, even though Rx has much more knowledge of the state;
> 
> 3. the path for handling an ACK packet is very long; I measured on
> the order of 10 microseconds on average on a modern processor. At
> over 100 MB/s you'd be handling ~50000 ACKs per second in a
> non-jumbogram configuration and have hardly any time left to send out
> new packets. A lot of that time is spent waiting for the call lock:
> even when that lock is released quickly (which it isn't in the
> standard implementation, as the code leisurely walks around with it
> for extended periods, though I experimented with a "release" flag),
> the detour through the scheduler slows things down dramatically. The
> lock structure should probably be revisited to make contention
> between the ack-receiving and transmitting threads less likely;
> 
> 4. slow start is implemented state-of-the-art; fast recovery, however,
> looks odd to me (actually: "nonexistent", though I may be fooled by
> some jumbogram smoke). When it comes to congestion avoidance, a lot
> of the research that went into TCP in the last ten years is obviously
> missing. I started experimenting with CUBIC in the hope that it helps
> to reduce retransmits and keep a constant flow; let's see;
> 
> 5. earlier this year we mentioned the handling of new calls, which is
> again a quadratic problem due to the mixture of service classes. This
> makes it impractical to allow thousands of waiting calls, which is a
> problem on a cluster with thousands of nodes.

6. the Rx statistics global lock is still a bottleneck.
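
On point 1, the quadratic cost comes from rescanning the transmit
queue from the head for every incoming ACK. A minimal sketch of the
kind of queue hint that removes it, assuming a singly linked transmit
queue ordered by sequence number (the names here are illustrative, not
the actual Rx structures):

    #include <stddef.h>

    struct pkt {
        struct pkt *next;
        unsigned int seq;       /* sequence number of this packet */
        int acked;
    };

    struct call {
        struct pkt *tq_head;    /* transmit queue, ordered by seq */
        struct pkt *tq_hint;    /* where the last ACK scan stopped */
        unsigned int hint_seq;  /* ACK point the hint corresponds to */
    };

    /* Mark every packet below first_unacked as acknowledged.
     * Resuming from the hint makes a window's worth of ACKs O(n)
     * overall instead of O(n**2). */
    static void
    ack_through(struct call *c, unsigned int first_unacked)
    {
        struct pkt *p;

        /* Start at the hint while it is still behind the new ACK
         * point; otherwise fall back to the head of the queue. */
        if (c->tq_hint != NULL && c->hint_seq <= first_unacked)
            p = c->tq_hint;
        else
            p = c->tq_head;

        while (p != NULL && p->seq < first_unacked) {
            p->acked = 1;
            p = p->next;
        }
        c->tq_hint = p;
        c->hint_seq = first_unacked;
    }

Real Rx ACK packets also carry selective per-packet information, so a
hint like this only shortens the cumulative part of the scan, but that
is where the n**2 shows up.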
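
To put rough numbers on point 3 (assuming ~1400-byte data packets and
an ACK for roughly every other packet, so this is an estimate rather
than a measurement):

    100 MB/s / 1400 bytes per packet   ~  71,000 packets/s
    one ACK per 1-2 data packets       ~  35,000-70,000 ACKs/s
    x ~10 us of processing per ACK     ~  0.35-0.70 s of CPU per second

In other words, a single core can spend from a third to well over half
of its time doing nothing but ACK processing, which is exactly the
"hardly any time left to send out new packets" effect described above.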
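
On point 4, for anyone who has not looked at CUBIC: the entire
congestion-avoidance growth rule reduces to one cubic function of the
time since the last loss event. This is the textbook formula with the
commonly used constants, not Rainer's actual patch:

    #include <math.h>

    #define CUBIC_C    0.4    /* scaling constant */
    #define CUBIC_BETA 0.7    /* window is cut to BETA * w_max on loss */

    /* Congestion window (in packets) t seconds after the last loss,
     * where w_max was the window size when the loss occurred. */
    static double
    cubic_window(double t, double w_max)
    {
        /* K is the time at which the window climbs back to w_max. */
        double k = cbrt(w_max * (1.0 - CUBIC_BETA) / CUBIC_C);
        double d = t - k;

        return CUBIC_C * d * d * d + w_max;
    }

The curve is concave below w_max and nearly flat around t = K, so the
flow lingers near the rate that last worked instead of sawtoothing the
way plain AIMD does; that plateau is what should help with keeping a
constant flow.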
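
Point 5 looks amenable to the same kind of fix: give each service
class its own FIFO of waiting calls so that accepting or dispatching a
call never scans a mixed list. A sketch under that assumption (again
illustrative names, not the current rx code):

    #include <stddef.h>

    struct waiting_call {
        struct waiting_call *next;
        /* ... the rest of the call state ... */
    };

    /* One queue per service class: enqueue and dequeue are O(1), so
     * thousands of waiting calls are no longer a problem. */
    struct service_queue {
        struct waiting_call *head;
        struct waiting_call *tail;
    };

    static void
    enqueue_call(struct service_queue *q, struct waiting_call *c)
    {
        c->next = NULL;
        if (q->tail != NULL)
            q->tail->next = c;
        else
            q->head = c;
        q->tail = c;
    }

    static struct waiting_call *
    dequeue_call(struct service_queue *q)
    {
        struct waiting_call *c = q->head;

        if (c != NULL) {
            q->head = c->next;
            if (q->head == NULL)
                q->tail = NULL;
        }
        return c;
    }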
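
As for the statistics lock in point 6, the usual cure is to shard the
counters: each thread updates a private copy with no lock at all, and
the global lock is taken only on the rare occasions the statistics are
read. A pattern sketch (gcc thread-local storage, hypothetical counter
names, not the existing rx_stats layout):

    /* Each thread bumps its own shard locklessly; a reader walks a
     * registry of shards (not shown) under the global lock and sums
     * them.  The hot path never touches the lock. */
    struct rx_stats_shard {
        unsigned long packets_sent;
        unsigned long packets_received;
    };

    static __thread struct rx_stats_shard my_stats;

    static void
    note_packet_sent(void)
    {
        my_stats.packets_sent++;    /* no lock, no contention */
    }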

> With those observations... does Rx-over-TCP look like a solution? On
> the packet-transmission side, probably, but the encapsulation very
> likely still demands significant processing power. And running a
> server with 10000 or 20000 TCP connections does not sound trivial
> either.

One of the problems that Rx/UDP has is that all of the processing must
be done by the CPUs; none of it can be offloaded to the network
adapters.  Modern network adapters and supporting drivers permit the
offloading of TCP stream processing.  Pushing this work into hardware
is considered critical to filling 10 gigabit network pipes.  Certainly
switching to an Rx/TCP model is the long-term future.
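
The contrast is visible right at the socket boundary. With TCP the
application hands the kernel one large buffer and the adapter can
segment and checksum it in hardware (TSO); with UDP, every wire packet
is a separate datagram that Rx must form and track itself. A
simplified illustration (Linux sockets, error handling omitted, not Rx
code):

    #include <stddef.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    #define CHUNK 1400   /* roughly one Rx packet's worth of payload */

    /* TCP: one call; segmentation, checksumming, and the per-segment
     * header work can all move into the NIC. */
    static void
    send_tcp(int sock, const char *buf, size_t len)
    {
        send(sock, buf, len, 0);
    }

    /* UDP (the Rx model): the sender cuts the stream into packets
     * itself -- one system call, one CPU-built datagram each. */
    static void
    send_udp(int sock, const struct sockaddr_in *dst,
             const char *buf, size_t len)
    {
        size_t off, n;

        for (off = 0; off < len; off += CHUNK) {
            n = len - off > CHUNK ? CHUNK : len - off;
            sendto(sock, buf + off, n, 0,
                   (const struct sockaddr *)dst, sizeof(*dst));
        }
    }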

At the same time, there are still improvements to be made to Rx/UDP
that will benefit existing clients.

> Voilà... my 0.02 €. Sorry for being verbose; I couldn't resist.

The community is very appreciative of all the effort you have put
into analyzing Rx over the years.  I look forward to your continued
involvement.

Jeffrey Altman