[OpenAFS] iperf vs rxperf in high latency network

Jeffrey E Altman jaltman@auristor.com
Thu, 8 Aug 2019 00:00:38 -0400


On 8/7/2019 9:35 PM, xguan@reliancememory.com wrote:
> Hello,
>
> Can someone kindly explain again the possible reasons why Rx is so painfully
> slow for a high latency (~230ms) link?

As Simon Wilkinson said on slide 5 of "RX Performance"

  https://indico.desy.de/indico/event/4756/session/2/contribution/22

  "There's only two things wrong with RX
    * The protocol
    * The implementation"

This presentation was given at DESY on 5 Oct 2011.  Although there have
been some improvements in the OpenAFS RX implementation since then, the
fundamental issues described in that presentation remain.

To explain slides 3 and 4: prior to the 1.5.53 release, the following
commit was merged, which increased the default maximum window size from
32 packets to 64 packets.

  commit 3feee9278bc8d0a22630508f3aca10835bf52866
  Date:   Thu May 8 22:24:52 2008 +0000

    rx-retain-windowing-per-peer-20080508

    we learned about the peer in a previous connection... retain the
    information and keep using it. widen the available window.
    makes rx perform better over high latency wans. needs to be present
    in both sides for maximal effect.

Then prior to 1.5.66 this commit raised the maximum window size to 128

  commit 310cec9933d1ff3a74bcbe716dba5ade9cc28d15
  Date:   Tue Sep 29 05:34:30 2009 -0400

    rx window size increase

    window size was previously pushed to 64; push to 128.

and then prior to 1.5.78, which was just before the 1.6 release:

  commit a99e616d445d8b713934194ded2e23fe20777f9a
  Date:   Thu Sep 23 17:41:47 2010 +0100

    rx: Big windows make us sad

    The commit which took our Window size to 128 caused rxperf to run
    40 times slower than before. All of the recent rx improvements have
    reduced this to being around 2x slower than before, but we're still
    not ready for large window sizes.

    As 1.6 is nearing release, reset back to the old, fast, window size
    of 32. We can revist this as further performance improvements and
    restructuring happen on master.

After 1.6, AuriStor Inc. (then Your File System Inc.) continued to work
on reducing the overhead of RX packet processing.  Some of the results
were presented in Simon Wilkinson's 16 October 2012 talk entitled "AFS
Performance", slides 25 to 30:

  http://conferences.inf.ed.ac.uk/eakc2012/

The performance of OpenAFS 1.8 RX is roughly the same as the OpenAFS
master performance from slide 28.  The Experimental RX numbers were from
the AuriStor RX stack at the time, which was not contributed to OpenAFS.

Since 2012 AuriStor has addressed many of the issues raised in
the "RX Performance" presentation

 0. Per-packet processing expense
 1. Bogus RTT calculations
 2. Bogus RTO implementation
 3. Lack of Congestion avoidance
 4. Incorrect window estimation when retransmitting
 5. Incorrect window handling during loss recovery
 6. Lock contention

The current AuriStor RX state machine implements SACK-based loss
recovery as documented in RFC6675, with elements of New Reno from
RFC6582 on top of TCP-style congestion control elements as documented in
RFC5681. The new RX also implements RFC2861-style congestion window
validation.

When sending data, an RX peer implementing these changes will be more
likely to sustain the maximum available throughput while at the same
time improving fairness towards competing network data flows. The
improved estimation of available pipe capacity permits an increase in
the default maximum window size from 60 packets (84.6 KB) to 128 packets
(180.5 KB). The larger window size increases the per-call theoretical
maximum throughput on a 1ms RTT link from 693 mbit/sec to 1478 mbit/sec
and on a 30ms RTT link from 23.1 mbit/sec to 49.29 mbit/sec.
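
As a rough check of the arithmetic above, the per-call ceiling is one
full window per round trip.  The sketch below is in Python; the
1444-byte packet size is an assumption inferred from "60 packets
(84.6 KB)", and the function name is made up for illustration.

```python
# Window-limited throughput: at most one full window is in flight per RTT.
# PACKET_BYTES = 1444 is an assumption inferred from "60 packets (84.6 KB)".
PACKET_BYTES = 1444


def max_throughput_mbit(window_packets, rtt_seconds, packet_bytes=PACKET_BYTES):
    """Upper bound on per-call throughput in Mbit/s (1 Mbit = 10**6 bits)."""
    bits_per_window = window_packets * packet_bytes * 8
    return bits_per_window / rtt_seconds / 1e6


if __name__ == "__main__":
    for rtt_ms in (1, 30, 230):
        for window in (60, 128):
            rate = max_throughput_mbit(window, rtt_ms / 1000.0)
            print(f"RTT {rtt_ms:>3} ms, window {window:>3}: {rate:8.2f} Mbit/s")
```

With a 230ms RTT the same bound works out to roughly 3.0 mbit/sec for
60 packets and 6.4 mbit/sec for 128 packets, which is why the window
size matters so much on a long link.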

AuriStor RX also includes experimental support for RX windows larger
than 255 packets (360KB). This release extends the RX flow control state
machine to support windows larger than the Selective Acknowledgment
table. The new maximum of 65535 packets (90MB) could theoretically fill
a 100 gbit/second pipe provided that the packet allocator and packet
queue management strategies could keep up.  Hint: at present, they don't.

To saturate a 60 Mbit/sec link with 230ms latency with rxmaxmtu set to
1344 requires a window size of approximately 1284 packets.
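
The figure above is the bandwidth-delay product divided by the packet
size.  A minimal Python sketch, assuming every packet carries the full
1344 bytes (the function name is hypothetical):

```python
import math


def window_to_fill_link(link_bits_per_sec, rtt_seconds, packet_bytes):
    """Packets that must be in flight to keep the link busy for one RTT."""
    # Bandwidth-delay product: bytes the link can hold during one round trip.
    bdp_bytes = link_bits_per_sec * rtt_seconds / 8
    return math.ceil(bdp_bytes / packet_bytes)


if __name__ == "__main__":
    # 60 Mbit/sec link, 230 ms RTT, 1344-byte packets (values from the text).
    print(window_to_fill_link(60e6, 0.230, 1344))
```

By the same calculation, matching the ~30Mb/s that iperf2 reached would
need about 642 packets in flight, far more than 32 or even 255.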

> From a user perspective, I wonder if there is any *quick Rx code hacking*
> that could help reduce the throughput gap of (iperf2 = 30Mb/s vs rxperf =
> 800Kb/s) for the following specific case.

Probably not.  AuriStor's RX is a significant re-implementation of the
protocol with one eye focused on backward compatibility and the other on
the future.

> We are considering the possibility of including two hosts ~230ms RTT apart
> as server and client. I used iperf2 and rxperf to test throughput between
> the two. There is no other connection competing with the test. So this is
> different from a low-latency, thread or udp buffer exhaustion scenario.
>
> iperf2's UDP test shows a bandwidth of ~30Mb/s without packet loss, though
> some of them have been re-ordered at the receiver side. Below 5 Mb/s, the
> receiver sees no packet re-ordering.  Above 30 Mb/s, packet loss is seen by
> the receiver. Test result is pretty consistent at multiple time points
> within 24 hours. UDP buffer size used by iperf is 208 KB. Write length is
> set at 1300 (-l 1300) which is below the path MTU.

Out of order packet delivery and packet loss have significant
performance impacts on OpenAFS RX.

> Interestingly, a quick skim through the iperf2 source code suggests that an
> iperf sender does not wait for the receiver's ack. It simply keeps
> write(mSettings->mSock, mBuf, mSettings->mBufLen) and timing it to extract
> the numerical value for the throughput. It only checks, in the end, to see
> if the receiver complains about packet loss.

This is because iperf2 does not attempt any flow control or error
recovery and applies no fairness model.  RX calls are sequenced data
flows that are modeled on the same principles as TCP.

> rxperf, on the other hand, only gets ~800 Kb/s. What makes it worse is that
> it does not seem to be dependent on the window size (-W 32~255), or udpsize
> (-u default~512*1024). I tried to re-compile rxperf that has #define
> RXPERF_BUFSIZE (1024 * 1024 * 64) instead of the original (512 * 1024). I
> did not see a throughput improvement from going above -u 512K. Occasionally
> some packets are re-transmitted. If I reduce -W or -u to very small values,
> I see some penalty.

Changing RXPERF_BUFSIZE to 64MB is not going to help when the total
bytes being sent per call is 1000KB.  Given how little data is being
sent per call and the fact that each RX call begins in "slow start" I
suspect that your test isn't growing the window size to 32 packets let
alone 255.
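
To illustrate why a short call is hurt by slow start, here is a toy
model in Python.  Its assumptions are illustrative only and are not a
description of the OpenAFS RX state machine: the window starts at one
packet, doubles every RTT with no loss until it reaches the maximum,
and 1000KB is taken as 1000 * 1024 bytes sent in 1344-byte packets.

```python
import math


def transfer_time_rtts(total_packets, max_window, initial_window=1):
    """Count RTTs needed when the window doubles each RTT up to max_window."""
    sent, window, rtts = 0, initial_window, 0
    while sent < total_packets:
        sent += window          # one window's worth of packets per round trip
        rtts += 1
        window = min(window * 2, max_window)
    return rtts


def effective_mbit(total_bytes, packet_bytes, max_window, rtt_seconds):
    """Average throughput in Mbit/s over the whole (idealized) transfer."""
    packets = math.ceil(total_bytes / packet_bytes)
    rtts = transfer_time_rtts(packets, max_window)
    return total_bytes * 8 / (rtts * rtt_seconds) / 1e6


if __name__ == "__main__":
    total = 1000 * 1024  # assumed meaning of "1000KB"
    for max_window in (32, 255):
        print(max_window, round(effective_mbit(total, 1344, max_window, 0.230), 2))
```

Under these idealized assumptions the call needs 28 round trips with a
32-packet maximum and 10 round trips with a 255-packet maximum, so most
of the call is spent ramping up.  Any loss or reordering would make a
real call slower than this model.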
>
>[snip]
>
> The theory goes if I have a 32-packet recv/send window (Ack Count) with 1344
> bytes of packet size and RTT=230ms, I should expect a theoretical upper
> bound of 32 x 8 x 1344 / 0.23 / 1000000 = 1.5 Mb/s. If the AFS-implemented
> Rx windows size (32) is really the limiting factor of the throughput, then
> the throughput should increase when I increase the window size (-w) above 32
> and configure a sufficiently big kernel socket buffer size.

The fact that OpenAFS RX requires large kernel socket buffers to get
reasonable performance is a bad indication.  It means that for OpenAFS RX
it is better to deliver packets with long delays than to drop them and
permit timely congestion detection.

> I did not see either of the predictions by the theory above. I wonder if
> some light could be shed on:
>
> 1. What else may be the limiting factor in my case

Not enough data is being sent per call.  30MB are being sent by iperf2
and rxperf is sending 1000KB.  It's not an equivalent comparison.

> 2. If there is a quick way to increase recv/send window from 32 to 255 in Rx
> code without breaking other parts of AFS.

As shown in the commits specified above, it doesn't take much to
increase the default maximum window size.  However, performance is
unlikely to increase unless the root causes are addressed.

> 3. If there is any quick (maybe dirty) way to leverage the iperf2
> observation, relax the wait for ack as long as the received packets are in
> order and not lost (that is, get me up to 5Mb/s...)

Not without further violating the TCP Fairness principle.

> Thank you in advance.
> ==========================
> Ximeng (Simon) Guan, Ph.D.
> Director of Device Technology
> Reliance Memory
> ==========================

Jeffrey Altman
