[OpenAFS-devel] Proposal: Extending the RX ACK SACK table
Jeffrey E Altman
jaltman@auristor.com
Wed, 27 Apr 2022 23:12:57 -0400
Since 20 July 2021, AuriStor has presented the RX Extended SACK Protocol at
the Fall 2021 HEPiX
https://indico.cern.ch/event/1078853/contributions/4583101/
and has privately received positive feedback on the implementation from three
parties familiar with the RX protocol. AuriStor has also extended its
implementation of RX Extended SACK to three extra SACK tables, for a total of
8192 packets (approximately 11MB).

No changes to the proposed protocol extension were required subsequent to the
20 July 2021 update to https://gerrit.openafs.org/#/c/14693.

AuriStor intends to ship these extensions to end users next month and will
offer to deliver an updated version of the Fall 2021 HEPiX presentation at the
June 2022 AFS Technology Workshop.

Jeffrey Altman
On 7/20/2021 2:14 AM, Jeffrey E Altman (jaltman@auristor.com) wrote:
> Throughout the history of AFS there has been recognition that growing
> the Rx window size is necessary to increase the throughput on high
> latency or fat pipes where the meaning of "high-latency" and "fat" have
> changed over time as networks have become faster. The maximum window
> sizes were increased in both IBM AFS 3.4 and 3.5 resulting in the
> current default OpenAFS Rx window size of 32 packets (44KB). Prior to
> the release of OpenAFS 1.6, there were efforts to grow the default Rx
> window size to 64 packets (88KB) in May 2008 and then to 128 packets
> (176KB) in Sept 2009 with the expectation that there would be an
> increase in throughput. These changes were reverted in Sept 2010 after
> the late Andrei Maslennikov presented his findings in Pilsen that
> OpenAFS 1.5.77 was 50-60% slower than 1.4.12.
>
> At DESY in 2011 Simon Wilkinson presented his findings and the
> improvements that were subsequently made to OpenAFS Rx to slightly
> improve the situation. Simon said at the time, "There's only two things
> wrong with RX: the protocol and the implementation". To sustain a
> 10 Gbit/second flow Rx needs to consistently process 175,000 DATA
> packets/second as well as the matching ACK packets. That requires not only
> highly efficient packet processing but it also requires the ability to
> maintain a full network pipe instead of stalling each time the DATA
> sender has filled the peer's advertised receive window.
>
> Over the last decade AuriStor has continued to invest in its Rx
> implementation in order to reduce the costs associated with DATA and ACK
> packet processing, more effectively measure the pipe's congestion
> window, more efficiently recover from packet loss, and improve
> fairness. These efforts have paid off in that AuriStor has been able
> to increase the default window size to 60 packets (82KB) in 2014, 128
> packets (176KB) in 2018, and 255 packets (351KB) in 2021.
>
> One of the reasons that filesystems such as Lustre and GPFS can achieve
> high throughput is because they support TCP window sizes of 8MB or
> larger. In order for AFS to match their performance Rx needs to
> support window sizes on the order of 6000 packets. The ACK packet's
> receiveWindow field has ample room to advertise larger window sizes as
> it is an unsigned 32-bit integer. In 2018 AuriStor removed the
> restriction that the maximum window size be limited by the number of
> packets that can be represented in the ACK packet's Selective
> Acknowledgement (SACK) table. There is TCP research that describes how
> to perform congestion avoidance when the SACK provides limited
> visibility into the state of the in-flight packets. However, it is
> always preferred to have access to SACK data for all of the in-flight
> packets.
>
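The window sizes quoted so far can be cross-checked with a little arithmetic. The sketch below assumes 1408 bytes of DATA per packet; that figure is not stated in the post, it is only inferred from "32 packets (44KB)". The function names are hypothetical.

```python
# Assumption: bytes of DATA per Rx packet, inferred from the post's own
# figure of 32 packets = 44KB. The post does not state this value.
BYTES_PER_PACKET = 44 * 1024 // 32  # 1408


def window_bytes(packets: int) -> int:
    """Bytes in flight when the window holds `packets` DATA packets."""
    return packets * BYTES_PER_PACKET


def packets_for_window(target_bytes: int) -> int:
    """Smallest packet count whose window covers `target_bytes`."""
    return -(-target_bytes // BYTES_PER_PACKET)  # ceiling division


if __name__ == "__main__":
    for packets in (32, 64, 128, 255, 4096, 8192):
        print(f"{packets:5d} packets = {window_bytes(packets) / 1024:9.1f} KB")
    print(packets_for_window(8 * 1024 * 1024), "packets cover an 8MB window")
```

Under that assumption an 8MB window needs a little under 6000 packets, which matches the "on the order of 6000 packets" estimate above.
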
> AuriStor is therefore proposing a backward compatible protocol extension
> which will permit incrementally growing the ACK packet's SACK table and
> address two other design weaknesses in the ACK packet: the inconsistent
> use of the 'previousPacket' field which makes it unusable and the lack
> of a count for the number of ACK trailer fields.
>
> There are three commits in OpenAFS Gerrit.
>
> "rx: compare RX_ACK_TYPE_ACK as a bit-field"
> https://gerrit.openafs.org/#/c/14465/ is a code change that ensures that
> OpenAFS Rx will only examine Bit-0 of each SACK table element. This
> permits Bit-1 through Bit-7 of each SACK element to be defined for
> future use when the rx_maxWindow is increased above 255 packets.
> AuriStor Rx already implements this behavior.
>
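A minimal sketch of the difference between comparing a SACK element by equality and testing only Bit-0. The constant values follow the OpenAFS rx.h definitions (NACK is 0, ACK is 1); the two helper functions are hypothetical names used only for illustration and are not taken from the Gerrit change.

```python
RX_ACK_TYPE_NACK = 0
RX_ACK_TYPE_ACK = 1


def acked_by_equality(element: int) -> bool:
    """Whole-octet comparison: breaks as soon as any other bit is set."""
    return element == RX_ACK_TYPE_ACK


def acked_by_bit0(element: int) -> bool:
    """Bit-field comparison: looks at Bit-0 only, ignores Bit-1..Bit-7."""
    return (element & RX_ACK_TYPE_ACK) != 0


if __name__ == "__main__":
    # Bit-0 and Bit-1 both set, as a sender using the upper bits might emit.
    element = 0b00000011
    print(acked_by_equality(element), acked_by_bit0(element))
```

With the bit-field test, a peer that only understands Bit-0 still reads the first packet's state correctly when the upper bits carry other information.
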
> "doc: rx-spec Update for accuracy with current Rx implementations"
> https://gerrit.openafs.org/#/c/14692/2 is an update to Nickolai
> Zeldovich's Rx Specification. I hope it improves the description of the
> protocol, corrects a number of misconceptions, and explains how it
> should be used. The Historical Implementation Notes section is
> particularly important in the context of ACK packet processing and
> possible extensions.
>
> "doc: rx-spec Document the Extended SACK Table protocol extension"
> https://gerrit.openafs.org/#/c/14693/2 describes the proposed
> EXTENDED-SACK ACK packet protocol extension which defines ACK packet
> Flags Bit-3 as EXTENDED-SACK when set in an ACK packet; Bit-3 currently
> only has meaning for DATA packets (MORE-PACKETS). When the
> EXTENDED-SACK flag is set the following is true:
>
> * The previousPacket field must be the largest DATA packet sequence number
>   accepted by the peer. This allows (previousPacket - firstPacket + 1) to
>   represent the number of DATA packets that should be represented in SACK
>   tables.
>
> * The SACK table can grow up to 256 octets instead of 255 octets by
>   leveraging one of the three unused octets between the SACK and the first
>   trailer.
>
> * The SACK table can represent the ACK/NACK state for up to 2048 DATA
>   packets using horizontal striping.
>
> * The second unused octet between the SACK and the first trailer is used
>   for a count of the number of unsigned 32-bit trailer fields. This will
>   permit future extensibility. The current value for this field is 4.
>
> * The third unused octet is a count of the number of additional SACK tables
>   which are appended after the final trailer field. Each SACK is variable
>   length and can grow up to 256 octets representing up to 2048 DATA
>   packets.
>
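The rules in the list above can be sketched as follows. The post does not spell out what "horizontal striping" means; the layout here is one plausible reading and is an assumption: bit b of octet i covers the packet at offset b * 256 + i, so Bit-0 of the 256 octets covers the first 256 packets, Bit-1 the next 256, and so on. The authoritative layout is the one in gerrit 14693. All function names are hypothetical.

```python
STRIPE_WIDTH = 256                     # octets in a full SACK table
PACKETS_PER_TABLE = STRIPE_WIDTH * 8   # 2048 packets per table


def packets_to_report(first_packet: int, previous_packet: int) -> int:
    """Number of DATA packets the SACK tables should describe."""
    return previous_packet - first_packet + 1


def tables_needed(n_packets: int) -> int:
    """SACK tables (first table plus extras) needed for n_packets."""
    return max(1, -(-n_packets // PACKETS_PER_TABLE))


def encode_table(acked_offsets) -> bytes:
    """Build one table from offsets (0..2047) of acknowledged packets.

    Assumed striping: offset = stripe * 256 + column, stored in bit
    `stripe` of octet `column`.
    """
    table = bytearray(STRIPE_WIDTH)
    for offset in acked_offsets:
        if not 0 <= offset < PACKETS_PER_TABLE:
            raise ValueError(f"offset {offset} does not fit in one table")
        stripe, column = divmod(offset, STRIPE_WIDTH)
        table[column] |= 1 << stripe
    return bytes(table)


def is_acked(table: bytes, offset: int) -> bool:
    """Read back the ACK/NACK state of one packet offset."""
    stripe, column = divmod(offset, STRIPE_WIDTH)
    return bool((table[column] >> stripe) & 1)
```

Under this assumed layout, a receiver that only examines Bit-0 of each octet still sees correct state for the first 256 packets, which is consistent with the bit-field change described for gerrit 14465.
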
> With these changes up to 2048 DATA packets can be represented by an ACK
> packet that fits within the minimum IPv4 MTU size and up to 8192 DATA
> packets can be represented by an ACK packet that fits within the minimum
> IPv6 MTU size. Larger window sizes can be represented with larger ACK
> packets, but 8192 DATA packets is 11MB, which should be more than
> sufficient for now.
>
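A rough size check of the two MTU claims. Every header size below is an assumption not stated in the post: a 28-octet Rx header, 18 octets of fixed ACK fields ahead of the SACK table, four 32-bit trailers, no security trailer and no IP options. "Minimum IPv4 MTU" is read here as the 576-octet datagram every IPv4 host must accept, and the IPv6 minimum MTU as 1280 octets. The function names are hypothetical.

```python
RX_HEADER = 28          # assumed Rx packet header size
ACK_FIXED_FIELDS = 18   # assumed: bufferSpace, maxSkew, firstPacket,
                        # previousPacket, serial, reason, nAcks
SACK_TABLE = 256        # one full SACK table
COUNT_OCTETS = 2        # trailer count and extra SACK table count
TRAILERS = 4 * 4        # four unsigned 32-bit trailer fields
UDP_HEADER = 8
IPV4_HEADER = 20        # no options
IPV6_HEADER = 40        # no extension headers
IPV4_MIN_DATAGRAM = 576
IPV6_MIN_MTU = 1280
PACKETS_PER_TABLE = SACK_TABLE * 8


def ack_datagram_size(extra_tables: int, ip_header: int) -> int:
    """Estimated IP datagram size of an ACK with `extra_tables` extra SACKs."""
    rx_payload = (RX_HEADER + ACK_FIXED_FIELDS + SACK_TABLE + COUNT_OCTETS
                  + TRAILERS + extra_tables * SACK_TABLE)
    return ip_header + UDP_HEADER + rx_payload


def max_packets(ip_header: int, limit: int) -> int:
    """Most DATA packets an ACK can describe while fitting within `limit`."""
    tables = 0
    # An ACK with `tables` extra tables carries tables + 1 tables in total,
    # so each successful fit raises the total table count by one.
    while ack_datagram_size(tables, ip_header) <= limit:
        tables += 1
    return tables * PACKETS_PER_TABLE


if __name__ == "__main__":
    print("IPv4:", max_packets(IPV4_HEADER, IPV4_MIN_DATAGRAM))
    print("IPv6:", max_packets(IPV6_HEADER, IPV6_MIN_MTU))
```

Under these assumptions the estimate agrees with the figures above: one table (2048 packets) fits in 576 octets, and four tables (8192 packets) fit in 1280 octets.
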
> Even though it is unlikely that OpenAFS Rx will be able to increase the
> default window sizes to benefit from these changes in the near term,
> there are still benefits to OpenAFS Rx implementing the EXTENDED-SACK
> flag and its associated meanings of previousPacket and the unused
> octets. As documented by gerrit 14692 the prior usage of
> previousPacket makes the field unusable as a means of detecting
> out-of-sequence ACK packets and having an accurate view of the leading
> edge of the in-flight window that has been received by the peer. The
> trailer and extra SACK counts provide much needed clarity of the ACK
> packet size before Path MTU discovery padding.
>
> AuriStor has implemented the EXTENDED-SACK proposal with up to one extra
> SACK table or 4096 DATA packets (5.5MB). With these changes AuriStor
> is prepared to ship a default window size of 4096 in our September 2021
> release provided that there is review from and consensus with the
> OpenAFS community.
>
> Your review and feedback will be appreciated. AuriStor is prepared to
> make changes as needed.
>
> Sincerely,
>
> Jeffrey Altman
>
>
>