[OpenAFS] iperf vs rxperf in high latency network

xguan@reliancememory.com xguan@reliancememory.com
Thu, 8 Aug 2019 11:54:44 -0700

To make sure I captured all the explanations correctly, please allow me
to summarize my understanding:

Flow control over a high-latency, potentially congested link is a
fundamental challenge that both TCP and UDP+Rx face. Both the protocol
and the implementation can pose problems. I did not see an improvement
when enlarging the window size in rxperf for two reasons: first, I chose
too few data bytes to transfer; second, OpenAFS's Rx has implementation
limitations that become the limiting factor before the window size limit
kicks in. These are non-trivial to fix, as demonstrated by the 1.5.x
throughput "hiccup". AuriStor, however, fixed a significant number of
them in its proprietary Rx re-implementation.

One can borrow ideas and principles from algorithm research in TCP's
flow control to improve Rx throughput. I am not an expert on this topic,
but I wonder if the principles in Google's BBR algorithm can help
further improve Rx throughput, and I wonder if there is anything that
makes TCP fundamentally superior to UDP in implementing flow control.

When it comes to deployment strategy, there may be workarounds for the
high-latency limitation. Each of them, of course, has its own
limitations. I can probably use the technique mentioned below to
leverage TCP throughput in RO volume synchronization,
and wait until DPF becomes available in vos operations:

I can also adopt a small-home-volume, distributed-subfolder-volume
strategy that allows home volumes to move with relocated users across
the WAN but keeps subdirectory volumes at their respective geographic
locations. Users can pick the subdirectory that is closest to their
current location to work with. When combined with a version control
system that uses TCP for syncing, this can ease the burden of project
data syncing.

There is a commercial path that we can pursue with AuriStor or other
vendors. But I guess that is outside the scope of this mailing list.

Any other strategies that may help?

Thank you, Jeff!

Simon Guan

-----Original Message-----
From: Jeffrey E Altman <jaltman@auristor.com>
Sent: Wednesday, August 7, 2019 9:01 PM
To: xguan@reliancememory.com; openafs-info@openafs.org
Subject: Re: [OpenAFS] iperf vs rxperf in high latency network

On 8/7/2019 9:35 PM, xguan@reliancememory.com wrote:
> Hello,
> Can someone kindly explain again the possible reasons why Rx is so
> slow for a high latency (~230ms) link?

As Simon Wilkinson said on slide 5 of "RX Performance"


  "There's only two things wrong with RX
    * The protocol
    * The implementation"

This presentation was given at DESY on 5 Oct 2011.  Although there have
been some improvements in the OpenAFS RX implementation since then, the
fundamental issues described in that presentation still remain.

To explain slides 3 and 4: prior to the 1.5.53 release, the following
commit was merged, which increased the default maximum window size from
32 packets to 64 packets.

  commit 3feee9278bc8d0a22630508f3aca10835bf52866
  Date:   Thu May 8 22:24:52 2008 +0000


    we learned about the peer in a previous connection... retain the
    information and keep using it. widen the available window.
    makes rx perform better over high latency wans. needs to be present
    in both sides for maximal effect.

Then, prior to 1.5.66, this commit raised the maximum window size to 128:

  commit 310cec9933d1ff3a74bcbe716dba5ade9cc28d15
  Date:   Tue Sep 29 05:34:30 2009 -0400

    rx window size increase

    window size was previously pushed to 64; push to 128.

and then, prior to 1.5.78, which was just before the 1.6 release:

  commit a99e616d445d8b713934194ded2e23fe20777f9a
  Date:   Thu Sep 23 17:41:47 2010 +0100

    rx: Big windows make us sad

    The commit which took our Window size to 128 caused rxperf to run
    40 times slower than before. All of the recent rx improvements have
    reduced this to being around 2x slower than before, but we're still
    not ready for large window sizes.

    As 1.6 is nearing release, reset back to the old, fast, window size
    of 32. We can revist this as further performance improvements and
    restructuring happen on master.

After 1.6, AuriStor Inc. (then Your File System Inc.) continued to work
on reducing the overhead of RX packet processing.  Some of the results
were presented in Simon Wilkinson's 16 October 2012 talk entitled "AFS
Performance", slides 25 to 30.


The performance of OpenAFS 1.8 RX is roughly the same as the OpenAFS
master performance from slide 28.  The Experimental RX numbers were the
AuriStor RX stack at the time which was not contributed to OpenAFS.

Since 2012, AuriStor has addressed many of the issues raised in
the "RX Performance" presentation:

 0. Per-packet processing expense
 1. Bogus RTT calculations
 2. Bogus RTO implementation
 3. Lack of Congestion avoidance
 4. Incorrect window estimation when retransmitting
 5. Incorrect window handling during loss recovery
 6. Lock contention

The current AuriStor RX state machine implements SACK based loss
recovery as documented in RFC6675, with elements of New Reno from
RFC5682 on top of TCP-style congestion control elements as documented in
RFC5681. The new RX also implements RFC2861 style congestion window
validation.

When sending data, the RX peer implementing these changes will be more
likely to sustain the maximum available throughput while at the same
time improving fairness towards competing network data flows. The
improved estimation of available pipe capacity permits an increase in
the default maximum window size from 60 packets (84.6 KB) to 128 packets
(180.5 KB). The larger window size increases the per-call theoretical
maximum throughput on a 1ms RTT link from 693 Mbit/sec to 1478 Mbit/sec
and on a 30ms RTT link from 23.1 Mbit/sec to 49.39 Mbit/sec.
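
The per-call ceilings quoted above follow from sending at most one full window per round trip. A minimal sketch of that arithmetic, assuming a packet size of 1444 bytes (inferred here from 84.6 KB / 60 packets, not stated in the thread); the function name is made up for illustration. Under that assumption the 128-packet, 30ms case works out to roughly 49.3 Mbit/sec, slightly below the 49.39 quoted.

```python
def max_throughput_mbit(window_packets, packet_bytes, rtt_seconds):
    """Upper bound for one call: at most one full window per round trip."""
    return window_packets * packet_bytes * 8 / rtt_seconds / 1e6

# Assumption: 1444 bytes per packet, inferred from 84.6 KB / 60 packets.
PACKET_BYTES = 1444

for window in (60, 128):
    for rtt_ms in (1, 30):
        mbit = max_throughput_mbit(window, PACKET_BYTES, rtt_ms / 1000)
        print(f"{window:>3} packets, {rtt_ms:>2} ms RTT: {mbit:7.1f} Mbit/sec")
```

The same function applied to a 32-packet window of 1344-byte packets at 230ms RTT gives about 1.5 Mbit/sec.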

AuriStor RX also includes experimental support for RX windows larger
than 255 packets (360KB). This release extends the RX flow control state
machine to support windows larger than the Selective Acknowledgment
table. The new maximum of 65535 packets (90MB) could theoretically fill
a 100 Gbit/second pipe provided that the packet allocator and packet
queue management strategies could keep up.  Hint: at present, they do not.

To saturate a 60 Mbit/sec link with 230ms latency with rxmaxmtu set to
1344 requires a window size of approximately 1284 packets.
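
The window figure above is the bandwidth-delay product divided by the packet size. A short sketch of the calculation, treating the 1344-byte rxmaxmtu as the per-packet size (a simplification that ignores header overhead); the function name is hypothetical.

```python
import math

def packets_to_fill_pipe(link_bits_per_sec, rtt_seconds, packet_bytes):
    """Window (in packets) needed to keep the link busy for one full RTT."""
    bdp_bytes = link_bits_per_sec * rtt_seconds / 8  # bandwidth-delay product
    return math.ceil(bdp_bytes / packet_bytes)

# 60 Mbit/sec link, 230 ms RTT, 1344-byte packets
print(packets_to_fill_pipe(60e6, 0.230, 1344))  # 1284

# The ~30 Mbit/sec that iperf2 measured on the same path
print(packets_to_fill_pipe(30e6, 0.230, 1344))  # 642
```

Either way, the required window is far beyond both the 32-packet default and the 255-packet maximum discussed in this thread.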

> From a user perspective, I wonder if there is any *quick Rx code [...]
> that could help reduce the throughput gap of (iperf2 = 30Mb/s vs
> rxperf = 800Kb/s) for the following specific case.

Probably not.  AuriStor's RX is a significant re-implementation of the
protocol with one eye focused on backward compatibility and the other on
the future.

> We are considering the possibility of including two hosts ~230ms RTT [...]
> as server and client. I used iperf2 and rxperf to test throughput between
> the two. There is no other connection competing with the test. So this is
> different from a low-latency, thread or udp buffer exhaustion [...]
> iperf2's UDP test shows a bandwidth of ~30Mb/s without packet loss, though
> some of them have been re-ordered at the receiver side. Below 5 Mb/s, the
> receiver sees no packet re-ordering.  Above 30 Mb/s, packet loss is seen by
> the receiver. Test result is pretty consistent at multiple time points
> within 24 hours. UDP buffer size used by iperf is 208 KB. Write length is
> set at 1300 (-l 1300) which is below the path MTU.

Out of order packet delivery and packet loss have significant
performance impacts on OpenAFS RX.

> Interestingly, a quick skim through the iperf2 source code suggests that an
> iperf sender does not wait for the receiver's ack. It simply keeps
> write(mSettings->mSock, mBuf, mSettings->mBufLen) and timing it to [...]
> the numerical value for the throughput. It only checks, in the end, to see
> if the receiver complains about packet loss.

This is because iperf2 does not attempt any flow control or error
recovery and has no fairness model.  RX calls are sequenced data
flows that are modeled on the same principles as TCP.

> rxperf, on the other hand, only gets ~800 Kb/s. What makes it worse is that
> it does not seem to be dependent on the window size (-W 32~255), or [...]
> (-u default~512*1024). I tried to re-compile rxperf that has #define
> RXPERF_BUFSIZE (1024 * 1024 * 64) instead of the original (512 * 1024). I
> did not see a throughput improvement from going above -u 512K. [...]
> some packets are re-transmitted. If I reduce -W or -u to very small [...]
> I see some penalty.

Changing RXPERF_BUFSIZE to 64MB is not going to help when the total
bytes being sent per call is 1000KB.  Given how little data is being
sent per call and the fact that each RX call begins in "slow start" I
suspect that your test isn't growing the window size to 32 packets let
alone 255.
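
To illustrate why a short call leaves little room for the window to matter, here is a rough sketch of an idealized, loss-free slow start that doubles the window every round trip. This is a textbook TCP-style model, not a description of how OpenAFS RX actually grows its window; the initial window of one packet, the doubling rule, the reading of 1000KB as 1000 * 1024 bytes, and the function names are all assumptions.

```python
import math

def slow_start_rounds(total_bytes, packet_bytes, max_window, initial_window=1):
    """Packets sent in each round trip of an idealized, loss-free slow start."""
    remaining = math.ceil(total_bytes / packet_bytes)
    window = initial_window
    rounds = []
    while remaining > 0:
        sent = min(window, remaining)
        rounds.append(sent)
        remaining -= sent
        window = min(window * 2, max_window)  # double per RTT, capped
    return rounds

def average_mbit(total_bytes, rounds, rtt_seconds):
    """Average throughput when every round costs one full RTT."""
    return total_bytes * 8 / (len(rounds) * rtt_seconds) / 1e6

TOTAL_BYTES = 1000 * 1024  # assumption: "1000KB" read as 1000 * 1024 bytes

for max_window in (32, 255):
    rounds = slow_start_rounds(TOTAL_BYTES, 1344, max_window)
    full = sum(1 for sent in rounds if sent == max_window)
    print(f"max window {max_window:>3}: {len(rounds)} round trips, "
          f"{full} at full window, "
          f"{average_mbit(TOTAL_BYTES, rounds, 0.230):.2f} Mbit/sec average")
```

Even in this idealized model, a 255-packet limit is reached for only one of the ten round trips, so the call ends long before the larger window can pay off. With the reordering and retransmissions reported in this thread, a real call would presumably fare worse.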

> The theory goes if I have a 32-packet recv/send window (Ack Count) with 1344
> bytes of packet size and RTT=230ms, I should expect a theoretical upper
> bound of 32 x 8 x 1344 / 0.23 / 1000000 = 1.5 Mb/s. If the [...]
> Rx window size (32) is really the limiting factor of the throughput, then
> the throughput should increase when I increase the window size (-w) above 32
> and configure a sufficiently big kernel socket buffer size.

The fact that OpenAFS RX requires large kernel socket buffers to get
reasonable performance is a bad indication.  It means that for OpenAFS RX
it is better to deliver packets with long delays than to drop them and
permit timely congestion detection.

> I did not see either of the predictions by the theory above. I wonder if
> some light could be shed on:
> 1. What else may be the limiting factor in my case

Not enough data is being sent per call.  30MB are being sent by iperf2
and rxperf is sending 1000KB.  It's not an equivalent comparison.

> 2. If there is a quick way to increase recv/send window from 32 to 255 in Rx
> code without breaking other parts of AFS.

As shown in the commits specified above, it doesn't take much to
increase the default maximum window size.  However, performance is
unlikely to increase unless the root causes are addressed.

> 3. If there is any quick (maybe dirty) way to leverage the iperf2
> observation, relax the wait for ack as long as the received packets are in
> order and not lost (that is, get me up to 5Mb/s...)

Not without further violating the TCP Fairness principle.

> Thank you in advance.
>
> Ximeng (Simon) Guan, Ph.D.
> Director of Device Technology
> Reliance Memory
>

Jeffrey Altman