[OpenAFS-devel] Proposal: Extending the RX ACK SACK table

Jeffrey E Altman jaltman@auristor.com
Tue, 20 Jul 2021 02:14:57 -0400


Throughout the history of AFS there has been recognition that growing
the Rx window size is necessary to increase the throughput on high
latency or fat pipes, where the meaning of "high-latency" and "fat" has
changed over time as networks have become faster.  The maximum window
sizes were increased in both IBM AFS 3.4 and 3.5, resulting in the
current default OpenAFS Rx window size of 32 packets (44KB).  Prior to
the release of OpenAFS 1.6, there were efforts to grow the default Rx
window size to 64 packets (88KB) in May 2008 and then to 128 packets
(176KB) in Sept 2009 with the expectation that there would be an
increase in throughput.  These changes were reverted in Sept 2010 after
the late Andrei Maslennikov presented his findings in Pilsen that
OpenAFS 1.5.77 was 50-60% slower than 1.4.12.

At DESY in 2011 Simon Wilkinson presented his findings and the
improvements that were subsequently made to OpenAFS Rx to slightly
improve the situation.  Simon said at the time, "There's only two things
wrong with RX: the protocol and the implementation".  To sustain a
10 Gbit/second flow Rx needs to consistently process 175,000 DATA
packets/second as well as the matching ACK packets.  That requires not
only highly efficient packet processing but also the ability to
maintain a full network pipe instead of stalling each time the DATA
sender has filled the peer's advertised receive window.
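
The stall described above puts a hard ceiling on throughput: a sender
that must stop when the window is full can move at most one window of
data per round trip.  A minimal sketch of that arithmetic follows.  The
1408-byte payload is inferred from the window figures quoted in this
message (44KB / 32 packets), the 10 ms round-trip time is purely an
illustrative assumption, and the function names are hypothetical.

```python
import math

# Assumption: DATA payload per packet, inferred from "32 packets (44KB)".
PAYLOAD_BYTES = 44 * 1024 // 32


def window_limited_bps(window_packets, rtt_seconds, payload=PAYLOAD_BYTES):
    """Upper bound on throughput when at most one window is sent per RTT."""
    return window_packets * payload * 8 / rtt_seconds


def packets_to_fill(link_bps, rtt_seconds, payload=PAYLOAD_BYTES):
    """Window (in packets) needed to keep a link busy for one full RTT."""
    return math.ceil(link_bps * rtt_seconds / (8 * payload))


rtt = 0.010  # hypothetical 10 ms round-trip time
print(f"32-packet window: {window_limited_bps(32, rtt) / 1e6:.1f} Mbit/s ceiling")
print(f"10 Gbit/s needs about {packets_to_fill(10e9, rtt)} packets in flight")
```

The ceiling depends only on the window and the round-trip time, so
faster packet processing alone cannot lift it.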

Over the last decade AuriStor has continued to invest in its Rx
implementation in order to reduce the costs associated with DATA and ACK
packet processing, more effectively measure the pipe's congestion
window, more efficiently recover from packet loss, and improve
fairness.  These efforts have paid off in that AuriStor has been able
to increase the default window size to 60 packets (82KB) in 2014, 128
packets (176KB) in 2018, and 255 packets (351KB) in 2021.

One of the reasons that filesystems such as Lustre and GPFS can achieve
high throughput is that they support TCP window sizes of 8MB or
larger.  In order for AFS to match their performance Rx needs to
support window sizes on the order of 6000 packets.  The ACK packet's
receiveWindow field has ample room to advertise larger window sizes as
it is an unsigned 32-bit integer.  In 2018 AuriStor removed the
requirement that the maximum window size be limited by the number of
packets that can be represented in the ACK packet's Selective
Acknowledgement (SACK) table.  There is TCP research that describes how
to perform congestion avoidance when the SACK provides limited
visibility into the state of the in-flight packets.  However, it is
always preferred to have access to SACK data for all of the in-flight
packets.
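
The "order of 6000 packets" figure can be checked against the window
sizes quoted earlier, which imply about 1408 bytes of data per packet
(44KB / 32 packets).  A quick sketch, with that per-packet size as the
only assumption and hypothetical helper names:

```python
import math

# Assumption: DATA payload per packet, inferred from "32 packets (44KB)".
PAYLOAD_BYTES = 44 * 1024 // 32


def window_kb(packets, payload=PAYLOAD_BYTES):
    """Window size in KB (1 KB = 1024 bytes) for a given packet count."""
    return packets * payload / 1024


def packets_for(window_bytes, payload=PAYLOAD_BYTES):
    """Smallest packet count whose payload covers window_bytes."""
    return math.ceil(window_bytes / payload)


for n in (32, 64, 128, 255):
    print(f"{n:4d} packets = {window_kb(n):6.1f} KB")
print("8MB window =", packets_for(8 * 1024 * 1024), "packets")
```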

AuriStor is therefore proposing a backward compatible protocol extension
which will permit incrementally growing the ACK packet's SACK table and
address two other design weaknesses in the ACK packet: the inconsistent
use of the 'previousPacket' field which makes it unusable and the lack
of a count for the number of ACK trailer fields.

There are three commits in OpenAFS Gerrit.

"rx: compare RX_ACK_TYPE_ACK as a bit-field"
https://gerrit.openafs.org/#/c/14465/ is a code change that ensures that
OpenAFS Rx will only examine Bit-0 of each SACK table element.  This
permits Bit-1 through Bit-7 of each SACK element to be defined for
future use when the rx_maxWindow is increased above 255 packets.
AuriStor Rx already implements this behavior.
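
The difference between an equality test and a bit-field test can be
sketched as follows.  The real change is in C; this is only an
illustration.  The constant values are assumed (RX_ACK_TYPE_NACK as 0
and RX_ACK_TYPE_ACK as 1), and the helper names and the sample octet
are hypothetical.

```python
RX_ACK_TYPE_NACK = 0  # assumed value
RX_ACK_TYPE_ACK = 1   # assumed value: Bit-0 of a SACK element


def acked_by_equality(element):
    """Whole-octet test: the element must equal RX_ACK_TYPE_ACK exactly."""
    return element == RX_ACK_TYPE_ACK


def acked_by_bit0(element):
    """Bit-field test: only Bit-0 is examined, Bit-1..Bit-7 are ignored."""
    return (element & RX_ACK_TYPE_ACK) != 0


# Hypothetical SACK octet from a future peer: Bit-0 (ACK) and Bit-1 both set.
element = 0b00000011
print(acked_by_equality(element), acked_by_bit0(element))
```

A receiver that used the equality test would misread that octet once
the upper bits carry meaning, which is why the bit-field comparison
needs to be deployed first.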

"doc: rx-spec Update for accuracy with current Rx implementations"
https://gerrit.openafs.org/#/c/14692/2 is an update to Nickolai
Zeldovich's Rx Specification.  I hope it improves the description of the
protocol, corrects a number of misconceptions, and explains how it
should be used.  The Historical Implementation Notes section is
particularly important in the context of ACK packet processing and
possible extensions.

"doc: rx-spec Document the Extended SACK Table protocol extension"
https://gerrit.openafs.org/#/c/14693/2 describes the proposed
EXTENDED-SACK ACK packet protocol extension which defines ACK packet
Flags Bit-3 as EXTENDED-SACK when set in an ACK packet; Bit-3 currently
only has meaning for DATA packets (MORE-PACKETS).  When the
EXTENDED-SACK flag is set the following is true:

  * The previousPacket field must be the largest DATA packet sequence
    number accepted by the peer.  This allows
    (previousPacket - firstPacket + 1) to represent the number of DATA
    packets that should be represented in SACK tables.

  * The SACK table can grow up to 256 octets instead of 255 octets by
    leveraging one of the three unused octets between the SACK and the
    first trailer.

  * The SACK table can represent the ACK/NACK state for up to 2048 DATA
    packets using horizontal striping.

  * The second unused octet between the SACK and the first trailer is
    used for a count of the number of unsigned 32-bit trailer fields.
    This will permit future extensibility.  The current value for this
    field is 4.

  * The third unused octet is a count of the number of additional SACK
    tables which are appended after the final trailer field.  Each SACK
    is variable length and can grow up to 256 octets representing up to
    2048 DATA packets.
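
The counting rules in the list above can be sketched as follows.  The
packet count and the number of tables follow directly from the text.
The bit assignment in locate() is only one plausible reading of
"horizontal striping" (within a full table, packet i uses octet
i mod 256 and bit i div 256, so the first 256 packets stay in Bit-0);
the authoritative layout is the one in gerrit 14693 and may differ.
All function names are hypothetical.

```python
import math

OCTETS_PER_TABLE = 256                    # maximum SACK table length
PACKETS_PER_TABLE = OCTETS_PER_TABLE * 8  # 2048 DATA packets per table


def sack_packet_count(first_packet, previous_packet):
    """DATA packets that should be represented in the SACK tables."""
    return previous_packet - first_packet + 1


def extra_tables_needed(first_packet, previous_packet):
    """Additional SACK tables beyond the first (the third octet's count)."""
    count = sack_packet_count(first_packet, previous_packet)
    return max(0, math.ceil(count / PACKETS_PER_TABLE) - 1)


def locate(index):
    """ASSUMED striping: map a 0-based packet index to (table, octet, bit)."""
    table, within = divmod(index, PACKETS_PER_TABLE)
    bit, octet = divmod(within, OCTETS_PER_TABLE)
    return table, octet, bit


# Hypothetical ACK: firstPacket = 1001, previousPacket = 5096.
print(sack_packet_count(1001, 5096), extra_tables_needed(1001, 5096))
print(locate(0), locate(255), locate(256), locate(2048))
```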

With these changes up to 2048 DATA packets can be represented by an ACK
packet that fits within the minimum IPv4 MTU size and up to 8192 DATA
packets can be represented by an ACK packet that fits within the minimum
IPv6 MTU size.  Larger window sizes can be represented with larger ACK
packets, but 8192 DATA packets is 11MB, which should be more than
sufficient for now.
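
A rough size estimate is consistent with those limits.  The sizes below
are assumptions rather than values taken from the proposal: a 28-octet
Rx header, 18 octets of fixed ACK fields, four 32-bit trailers, minimum
MTUs of 576 octets (IPv4) and 1280 octets (IPv6), no IP options, no
security trailer, and no per-table overhead for the extra SACK tables.
The exact encoding is defined in gerrit 14693.

```python
IP_UDP_HEADERS = {"IPv4": 20 + 8, "IPv6": 40 + 8}  # no options/extensions
MIN_MTU = {"IPv4": 576, "IPv6": 1280}              # assumed meaning of "minimum MTU"

RX_HEADER = 28        # assumed Rx packet header size
ACK_FIXED = 18        # assumed: bufferSpace, maxSkew, firstPacket,
                      # previousPacket, serial, reason, nAcks
FIRST_SACK = 256 + 2  # full table plus the trailer-count and extra-SACK-count octets
TRAILERS = 4 * 4      # four unsigned 32-bit trailer fields
EXTRA_SACK = 256      # assumed size of each additional full SACK table


def ack_datagram_size(tables, ip):
    """Estimated IP datagram size for an ACK carrying `tables` full SACK tables."""
    return (IP_UDP_HEADERS[ip] + RX_HEADER + ACK_FIXED + FIRST_SACK
            + TRAILERS + (tables - 1) * EXTRA_SACK)


def max_packets(ip):
    """Most DATA packets representable without exceeding the minimum MTU."""
    tables = 1
    while ack_datagram_size(tables + 1, ip) <= MIN_MTU[ip]:
        tables += 1
    return tables * 2048


for ip in ("IPv4", "IPv6"):
    print(ip, max_packets(ip), "DATA packets")
```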

Even though it is unlikely that OpenAFS Rx will be able to increase the
default window sizes to benefit from these changes in the near term,
there are still benefits to OpenAFS Rx implementing the EXTENDED-SACK
flag and its associated meanings of previousPacket and the unused
octets.  As documented by gerrit 14692, the prior usage of
previousPacket makes the field unusable as a means of detecting
out-of-sequence ACK packets and of obtaining an accurate view of the
leading edge of the in-flight window that has been received by the
peer.  The trailer and extra SACK counts provide much needed clarity
about the ACK packet size before Path MTU discovery padding.

AuriStor has implemented the EXTENDED-SACK proposal with up to one extra
SACK table or 4096 DATA packets (5.5MB).  With these changes AuriStor
is prepared to ship a default window size of 4096 in our September 2021
release provided that there is review from and consensus with the
OpenAFS community.

Your review and feedback will be appreciated.  AuriStor is prepared to
make changes as needed.

Sincerely,

Jeffrey Altman



