[OpenAFS] iperf vs rxperf in high latency network

Thu, 8 Aug 2019 19:55:43 -0500

On Thu, Aug 08, 2019 at 11:54:44AM -0700, xguan@reliancememory.com wrote:
> To make sure I captured all the explanations correctly, please allow me to summarize my understandings:

I think your understanding is basically correct (and thanks to Jeffrey for
the detailed explanation!).  A few more remarks inline, though none
particularly actionable.

> Flow control over a high-latency, potentially congested link is a fundamental challenge that both TCP and UDP+Rx face. Both protocol and implementation can pose a problem. The reason why I did not see an improvement when enlarging the window size in rxperf is that firstly I chose too few data bytes to transfer and secondly that OpenAFS's Rx has some implementation limitations that become a limiting factor before the window size limit kicks in. They are non-trivial to fix, as demonstrated in the 1.5.x throughput "hiccup". But AuriStor fixed a significant amount of it in its proprietary Rx re-implementation. 
> 
> One can borrow ideas and principles from algorithm research in TCP's flow control to improve Rx throughput. I am not an expert on this topic, but I wonder if the principles in Google's BBR algorithm can help further improve Rx throughput, and I wonder if there is anything that makes TCP fundamentally superior to UDP in implementing flow control. 

Congestion control algorithms are largely independent of TCP vs. UDP, and
many TCP stacks have a "modular" congestion controller, in which one
algorithm can be swapped out for another by configuration.  The current
draft version of IETF QUIC, which is UDP-based, is using what is
effectively TCP NewReno's congestion control algorithm:
https://tools.ietf.org/html/draft-ietf-quic-recovery-22 .  In principle,
BBR could also be adapted to Rx (or QUIC), but I expect the design
philosophy to have a substantial impedance mismatch with the current OpenAFS
Rx implementation.
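
To illustrate why the algorithm is independent of the transport, here is a
minimal sketch of a Reno-style congestion window (slow start, congestion
avoidance, halving on loss, in the spirit of RFC 5681).  The class name, the
method names and the default values are hypothetical and are not taken from
any TCP, QUIC or Rx implementation; the point is only that nothing in it
refers to a socket type.

```python
class RenoLikeController:
    """Transport-agnostic congestion window, counted in packets (sketch)."""

    def __init__(self, initial_window=1.0, ssthresh=64.0):
        # Both defaults are illustrative assumptions, not protocol constants.
        self.cwnd = initial_window
        self.ssthresh = ssthresh

    def on_ack(self):
        if self.cwnd < self.ssthresh:
            # Slow start: one extra packet per ACK, roughly doubling per RTT.
            self.cwnd += 1.0
        else:
            # Congestion avoidance: roughly one extra packet per RTT.
            self.cwnd += 1.0 / self.cwnd

    def on_loss(self):
        # Multiplicative decrease: halve the window, but keep at least 2.
        self.ssthresh = max(self.cwnd / 2.0, 2.0)
        self.cwnd = self.ssthresh
```

A sender built on TCP, QUIC or Rx could call on_ack() and on_loss() from its
own acknowledgment handling; swapping in a different algorithm means
replacing this one object.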

In general, how to achieve good performance on the so-called "high
bandwidth-delay product" links remains a difficult problem, with active
research efforts underway, as well as engineering work in SDOs like the
IETF.

-Ben

> When it comes to deployment strategy, there may be workarounds to the high-latency limitation. Each of them, of course, has limitations. I can probably use the technique mentioned below to leverage the TCP throughput in RO volume synchronization, 
> https://lists.openafs.org/pipermail/openafs-info/2018-August/042502.html
> and wait until DPF becomes available in vos operations:
> https://openafs-workshop.org/2019/schedule/faster-wan-volume-operations-with-dpf/
> 
> I can also adopt a small home volume, distributed subfolder volume strategy that allows home volumes to move with relocated users across WAN, but keep subdirectory volumes at their respective geographic location. Users can pick a subdirectory that is closest to their current location to work with. When combined with a version control system that syncs over TCP, the cost of syncing project data can be reduced. 
> 
> There is a commercial path that we can pursue with AuriStor or other vendors. But I guess that is out of the scope of this mailing list. 
> 
> Any other strategies that may help?
> 
> Thank you, Jeff!
> 
> Simon Guan
> 
> 
> -----Original Message-----
> From: Jeffrey E Altman <jaltman@auristor.com> 
> Sent: Wednesday, August 7, 2019 9:01 PM
> To: xguan@reliancememory.com; openafs-info@openafs.org
> Subject: Re: [OpenAFS] iperf vs rxperf in high latency network
> 
> On 8/7/2019 9:35 PM, xguan@reliancememory.com wrote:
> > Hello,
> > 
> > Can someone kindly explain again the possible reasons why Rx is so painfully
> > slow for a high latency (~230ms) link? 
> 
> As Simon Wilkinson said on slide 5 of "RX Performance"
> 
>   https://indico.desy.de/indico/event/4756/session/2/contribution/22
> 
>   "There's only two things wrong with RX
>     * The protocol
>     * The implementation"
> 
> This presentation was given at DESY on 5 Oct 2011.  Although there have
> been some improvements in the OpenAFS RX implementation since then the
> fundamental issues described in that presentation still remain.
> 
> To explain slides 3 and 4: prior to the 1.5.53 release, the following
> commit was merged, which increased the default maximum window size from
> 32 packets to 64 packets.
> 
>   commit 3feee9278bc8d0a22630508f3aca10835bf52866
>   Date:   Thu May 8 22:24:52 2008 +0000
> 
>     rx-retain-windowing-per-peer-20080508
> 
>     we learned about the peer in a previous connection... retain the
>     information and keep using it. widen the available window.
>     makes rx perform better over high latency wans. needs to be present
>     in both sides for maximal effect.
> 
> Then, prior to 1.5.66, this commit raised the maximum window size to 128:
> 
>   commit 310cec9933d1ff3a74bcbe716dba5ade9cc28d15
>   Date:   Tue Sep 29 05:34:30 2009 -0400
> 
>     rx window size increase
> 
>     window size was previously pushed to 64; push to 128.
> 
> and then prior to 1.5.78 which was just before the 1.6 release:
> 
>   commit a99e616d445d8b713934194ded2e23fe20777f9a
>   Date:   Thu Sep 23 17:41:47 2010 +0100
> 
>     rx: Big windows make us sad
> 
>     The commit which took our Window size to 128 caused rxperf to run
>     40 times slower than before. All of the recent rx improvements have
>     reduced this to being around 2x slower than before, but we're still
>     not ready for large window sizes.
> 
>     As 1.6 is nearing release, reset back to the old, fast, window size
>     of 32. We can revist this as further performance improvements and
>     restructuring happen on master.
> 
> After 1.6 AuriStor Inc. (then Your File System Inc.) continued to work
> on reducing the overhead of RX packet processing.  Some of the results
> were presented in Simon Wilkinson's 16 October 2012 talk entitled "AFS
> Performance" slides 25 to 30
> 
>   http://conferences.inf.ed.ac.uk/eakc2012/
> 
> The performance of OpenAFS 1.8 RX is roughly the same as the OpenAFS
> master performance from slide 28.  The Experimental RX numbers were the
> AuriStor RX stack at the time which was not contributed to OpenAFS.
> 
> Since 2012 AuriStor has addressed many of the issues raised in
> the "RX Performance" presentation
> 
>  0. Per-packet processing expense
>  1. Bogus RTT calculations
>  2. Bogus RTO implementation
>  3. Lack of Congestion avoidance
>  4. Incorrect window estimation when retransmitting
>  5. Incorrect window handling during loss recovery
>  6. Lock contention
> 
> The current AuriStor RX state machine implements SACK based loss
> recovery as documented in RFC6675, with elements of New Reno from
> RFC6582 on top of TCP-style congestion control elements as documented in
> RFC5681. The new RX also implements RFC2861 style congestion window
> validation.
> 
> When sending data the RX peer implementing these changes will be more
> likely to sustain the maximum available throughput while at the same
> time improving fairness towards competing network data flows. The
> improved estimation of available pipe capacity permits an increase in
> the default maximum window size from 60 packets (84.6 KB) to 128 packets
> (180.5 KB). The larger window size increases the per call theoretical
> maximum throughput on a 1ms RTT link from 693 Mbit/sec to 1478 Mbit/sec
> and on a 30ms RTT link from 23.1 Mbit/sec to 49.29 Mbit/sec.
> 
> AuriStor RX also includes experimental support for RX windows larger
> than 255 packets (360KB). This release extends the RX flow control state
> machine to support windows larger than the Selective Acknowledgment
> table. The new maximum of 65535 packets (90MB) could theoretically fill
> a 100 Gbit/second pipe provided that the packet allocator and packet
> queue management strategies could keep up.  Hint: at present, they don't.
> 
> To saturate a 60 Mbit/sec link with 230ms latency with rxmaxmtu set to
> 1344 requires a window size of approximately 1284 packets.
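
The window figure above is the bandwidth-delay product divided by the packet
size.  A minimal sketch of that arithmetic follows; the function name is
hypothetical, and it assumes every packet carries exactly rxmaxmtu bytes with
header overhead ignored.

```python
import math


def window_to_saturate(link_bits_per_sec, rtt_seconds, packet_bytes):
    """Packets that must be in flight to keep the link full (sketch)."""
    # Bandwidth-delay product: bytes in flight during one round trip.
    bdp_bytes = link_bits_per_sec * rtt_seconds / 8
    # Round up, since a fractional packet cannot be sent.
    return math.ceil(bdp_bytes / packet_bytes)


# 60 Mbit/sec link, 230 ms RTT, 1344-byte packets
print(window_to_saturate(60e6, 0.230, 1344))  # prints 1284
```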
> 
> > From a user perspective, I wonder if there is any *quick Rx code hacking*
> > that could help reduce the throughput gap of (iperf2 = 30Mb/s vs rxperf =
> > 800Kb/s) for the following specific case. 
> 
> Probably not.  AuriStor's RX is a significant re-implementation of the
> protocol with one eye focused on backward compatibility and the other on
> the future.
> 
> > We are considering the possibility of including two hosts ~230ms RTT apart
> > as server and client. I used iperf2 and rxperf to test throughput between
> > the two. There is no other connection competing with the test. So this is
> > different from a low-latency, thread or udp buffer exhaustion scenario. 
> > 
> > iperf2's UDP test shows a bandwidth of ~30Mb/s without packet loss, though
> > some of them have been re-ordered at the receiver side. Below 5 Mb/s, the
> > receiver sees no packet re-ordering.  Above 30 Mb/s, packet loss is seen by
> > the receiver. Test result is pretty consistent at multiple time points
> > within 24 hours. UDP buffer size used by iperf is 208 KB. Write length is
> > set at 1300 (-l 1300) which is below the path MTU. 
> 
> Out of order packet delivery and packet loss have significant
> performance impacts on OpenAFS RX.
> 
> > Interestingly, a quick skim through the iperf2 source code suggests that an
> > iperf sender does not wait for the receiver's ack. It simply keeps calling
> > write(mSettings->mSock, mBuf, mSettings->mBufLen) and timing it to extract
> > the numerical value for the throughput. It only checks, in the end, to see
> > if the receiver complains about packet loss. 
> 
> This is because iperf2 does not attempt any flow control or error
> recovery, and has no fairness model.  RX calls are sequenced data
> flows that are modeled on the same principles as TCP.
> 
> > rxperf, on the other hand, only gets ~800 Kb/s. What makes it worse is that
> > it does not seem to be dependent on the window size (-W 32~255), or udpsize
> > (-u default~512*1024). I tried to re-compile rxperf that has #define
> > RXPERF_BUFSIZE (1024 * 1024 * 64) instead of the original (512 * 1024). I
> > did not see a throughput improvement from going above -u 512K. Occasionally
> > some packets are re-transmitted. If I reduce -W or -u to very small values,
> > I see some penalty. 
> 
> Changing RXPERF_BUFSIZE to 64MB is not going to help when the total
> bytes being sent per call is 1000KB.  Given how little data is being
> sent per call and the fact that each RX call begins in "slow start" I
> suspect that your test isn't growing the window size to 32 packets let
> alone 255.
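
To get a feel for how much slow start costs on a short call, here is an
idealized model, not a description of the actual OpenAFS Rx state machine.
It assumes the window starts at one packet, doubles every round trip until
it reaches the configured maximum, and that nothing is lost or reordered.
The function name and the initial window are hypothetical.

```python
import math


def rtts_to_send(total_bytes, packet_bytes, max_window, initial_window=1):
    """Round trips needed to send total_bytes under idealized slow start."""
    packets_left = math.ceil(total_bytes / packet_bytes)
    cwnd = initial_window
    rtts = 0
    while packets_left > 0:
        # Send one window's worth of packets per round trip.
        packets_left -= min(cwnd, max_window)
        # Double the window each round trip, capped at the maximum.
        cwnd = min(cwnd * 2, max_window)
        rtts += 1
    return rtts


for max_window in (32, 255):
    rtts = rtts_to_send(1000 * 1024, 1344, max_window)
    print(max_window, rtts, round(rtts * 0.230, 2))  # window, RTTs, seconds
```

Under these assumptions a 1000KB call needs 28 round trips with a 32-packet
cap and 10 with a 255-packet cap, so at 230ms per round trip the call is
dominated by the ramp-up rather than by the steady-state window.  Real
behaviour will be worse whenever packets are lost or delivered out of order.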
> >
> >[snip]
> >
> > The theory goes if I have a 32-packet recv/send window (Ack Count) with 1344
> > bytes of packet size and RTT=230ms, I should expect a theoretical upper
> > bound of 32 x 8 x 1344 / 0.23 / 1000000 =  1.5 Mb/s. If the AFS-implemented
> > Rx windows size (32) is really the limiting factor of the throughput, then
> > the throughput should increase when I increase the window size (-w) above 32
> > and configure a sufficiently big kernel socket buffer size.
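
The upper bound quoted above is simply one full window delivered per round
trip.  A minimal sketch of that calculation, with a hypothetical function
name and header overhead ignored:

```python
def max_throughput_mbit(window_packets, packet_bytes, rtt_seconds):
    """Per-call ceiling when at most one window is delivered per RTT."""
    bits_per_rtt = window_packets * packet_bytes * 8
    return bits_per_rtt / rtt_seconds / 1e6


# 32-packet window, 1344-byte packets, 230 ms RTT
print(round(max_throughput_mbit(32, 1344, 0.230), 2))  # prints 1.5
```

The bound scales linearly with the window, so it only predicts the observed
throughput if the window actually in use reaches the configured maximum.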
> 
> The fact that OpenAFS RX requires large kernel socket buffers to get
> reasonable performance is a bad indication.  It means that for OpenAFS RX
> it is better to deliver packets with long delays than to drop them and
> permit timely congestion detection.
> 
> > I did not see either of the predictions by the theory above. I wonder if
> > some light could be shed on:
> > 
> > 1. What else may be the limiting factor in my case
> 
> Not enough data is being sent per call.  30MB are being sent by iperf2
> and rxperf is sending 1000KB.  It's not an equivalent comparison.
> 
> > 2. If there is a quick way to increase recv/send window from 32 to 255 in Rx
> > code without breaking other parts of AFS. 
> 
> As shown in the commits specified above, it doesn't take much to
> increase the default maximum window size.  However, performance is
> unlikely to increase unless the root causes are addressed.
> 
> > 3. If there is any quick (maybe dirty) way to leverage the iperf2
> > observation, relax the wait for ack as long as the received packets are in
> > order and not lost (that is, get me up to 5Mb/s...)
> 
> Not without further violating the TCP Fairness principle.
> 
> > Thank you in advance.
> > ==========================
> > Ximeng (Simon) Guan, Ph.D.
> > Director of Device Technology
> > Reliance Memory
> > ==========================
> 
> Jeffrey Altman
> 
> > 
> 