[OpenAFS-devel] read performance 1.5.74 Linux

Jeffrey Altman jaltman@secure-endpoints.com
Tue, 18 May 2010 14:50:43 -0400


This is a cryptographically signed message in MIME format.

--------------ms070404050200050300090903
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On 5/18/2010 12:25 PM, Jeffrey Hutzelman wrote:
> --On Tuesday, May 18, 2010 12:03:46 PM -0400 Jeffrey Altman
> <jaltman@secure-endpoints.com> wrote:
>=20
>> On 5/18/2010 11:16 AM, Jeffrey Hutzelman wrote:
>>> I'm concerned here that this might mean you are lying about window
>>> sizes.
>>
>> I (personally) am not lying about anything.  I really wish that *you*
>> could make a distinction between *me* and the open source code when
>> making comments.
>=20
> Apologies.  Of course I meant that the rx implementation was lying.
> And apparently I also misinterpreted what you wrote, because it sounded=

> to me like you were describing changes you and Derrick made which
> resulted in dropping received in-window data.

Thank you.  I didn't really want to single you out but I do think that
we have to be careful about what we say.  I find that software engineers
all too often when describing interactions among software interfaces use
personal pronouns.  This has the negative consequence of blurring the
lines between the ability to say that the code does not meet its
requirements and that the person that wrote the code is bad in some way.

OpenAFS used to be a very small community of close friends.  We now have
more than 250 contributors over the last decade and more than 50 in the
last year.  Those numbers are continuously increasing as OpenAFS gains
more exposure.

It is critical to our ability to recruit and retain new developers that
we communicate in a highly respectful manner with one another.  Not just
for the sake of the feelings of the people involved in the discussion at
the time but to ensure that lurkers feel comfortable speaking up and
becoming an active part of our community.

I have certainly slipped as well.  One thing working with Google Summer
of Code has taught me is that its a lesson that is easy to learn and a
habit that is hard to break.

>>> The reason some data buffers are allocated in advance is because you
>>> must be prepared to receive any data that can be in flight according =
to
>>> the advertised window size _without blocking_ or at least without
>>> blocking in a way that prevents traffic in another stream from being
>>> received and processed.
>>
>> A packet is made up of multiple data buffers which are themselves
>> packets.  The window size in Rx is not measured in bytes.  It is
>> measured in packets and we have no idea how large the incoming packet
>> might be.  It can be as large as RX_MAX_PACKET_SIZE.  As such, before
>> any receive operation is performed the library must ensure that the fu=
ll
>> number of data buffers has been attached to the packet.
>=20
> Well, it has to have someplace to put the received UDP datagram.  That
> doesn't necessarily mean "attached to the packet", which I gather is th=
e
> part that created the concurrency problem you saw in July 2009.  But
> doing it some other way isn't necessarily easier, since you still need
> to account for the total number of buffers that must be available to
> meet window commitments.

If the Rx window size was tracked by the number of bytes in the window,
then it would be a lot easier to ensure that the right number of buffers
were available for any outstanding connection/call.  As you are aware,
the receiver never wants to be in a position where it has to allocate
memory or reclaim packet buffers at the moment it is supposed to be
pulling data off the socket.

>>> Tearing apart other packets to reclaim buffers
>>> is acceptable, but not if it means you need to wait for a buffer to
>>> drain before you can receive more packets.
>>
>> It is acceptable but it is also extremely inefficient because most
>> packets are not jumbo packets and as such only require a single data
>> buffer beyond the buffer used for the header.
>=20
> Yes.
>=20
>=20
>=20
>>>  Dropping received data on
>>> the floor when it was received within the advertised window is _not_
>>> acceptable; that breaks flow control and exacerbates congestion.
>>
>> Of course it is but I believe that the early developers made a wise
>> choice being causing a kernel panic and being inefficient on the wire.=

>> If you have to choose one, drop the data on the floor and let it be
>> retransmitted.
>=20
> Well, sure.  But the "right" way to be inefficient here is to advertise=

> a smaller window, so that you don't get data that has to be
> retransmitted. Nonetheless, this is a decision that was made ages ago,
> and now that I realize that, there's not much benefit in debating its
> merits.

And here is where the bug that was just fixed comes into play.  When the
window size is increased, there is a request for more packets that is
signaled by setting rxi_NeedMorePackets.  The assumption is that this
event will trigger the allocation of more packet buffers when it is safe
/ efficient to do so.  Unfortunately, the only function that tests to
see if rxi_NeedMorePackets is non-zero, rx_CheckPackets() was never
called except during rx_InitHost().

This wasn't a big problem for userland implementations because they
would just allocate packets as the free packet queue became empty.
However, in the kernel implementations this is not possible.  The #ifdef
KERNEL code in rxi_ReceiveDataPacket() is a last ditch effort to keep
the overall system running when the number of free packets gets too low.
 What the code does is blow away the current call, reclaim the data
buffers, and sets the 'rxi_NeedMorePackets' flag to TRUE.  In the hopes
that once this call terminates that rx_CheckPackets() would notice that
more packets need to be allocated.  Since rx_CheckPackets() was never
called, more packets were never allocated and without performing the
rx_TrimDataBuffers dance the system would eventually panic.

>> The goal is to ensure that we never get into this case which is why if=

>> the rxi_NeedMorePackets global variable is TRUE we must actually go an=
d
>> allocate more packets the next time it is safe to do so.
>>
>> The patch that was committed today does that for the first time in the=

>> history of Rx.
>=20
> Now I'm really going to have to go back and reread things, because I
> examined this fairly closely a couple of months ago when I was working
> out fileserver tuning, and one of the conclusions I came to at the time=

> was that at least in user mode, Rx would always allocate more packets
> when needed, so setting the fileserver's -rxpck parameter should never
> be necessary.

The rxpck parameter should not be required anywhere.  It was added at a
time when no one had time to profile and performance test the
implementation.  When Derrick and I removed the rx_TimeDataBuffers()
dance we tested in userland.  It was only after Hartmut noticed this
problem with his kernel testing that I went back and noticed that packet
counts never increase in the kernel after the initial allocation.  This
problem should now be created and in the future, increases in the window
size should behave appropriately.

As an aside, I find that when I'm working on code that was written years
ago and has been in deployment for decades that I tend to make the
assumption that the code must be working as designed.  It never occurred
to me to check whether or not rx_CheckPackets() was doing the job it was
written for.  rx_CheckPackets() and the rxi_NeedMorePackets variable
were added in AFS 3.6.  It is unfortunate that rx_CheckPackets() was not
actually called back then.  If someone has the RCS data from the commit
that added it, I would be very interested in seeing what the log entry sa=
ys.


Jeffrey Altman


--------------ms070404050200050300090903
Content-Type: application/pkcs7-signature; name="smime.p7s"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="smime.p7s"
Content-Description: S/MIME Cryptographic Signature

MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJeTCC
AxcwggKAoAMCAQICEAMF9RTCGOz151fTpHLih+cwDQYJKoZIhvcNAQEFBQAwYjELMAkGA1UE
BhMCWkExJTAjBgNVBAoTHFRoYXd0ZSBDb25zdWx0aW5nIChQdHkpIEx0ZC4xLDAqBgNVBAMT
I1RoYXd0ZSBQZXJzb25hbCBGcmVlbWFpbCBJc3N1aW5nIENBMB4XDTA5MDgyODA0MDExOVoX
DTEwMDgyODA0MDExOVowczEPMA0GA1UEBBMGQWx0bWFuMRUwEwYDVQQqEwxKZWZmcmV5IEVy
aWMxHDAaBgNVBAMTE0plZmZyZXkgRXJpYyBBbHRtYW4xKzApBgkqhkiG9w0BCQEWHGphbHRt
YW5Ac2VjdXJlLWVuZHBvaW50cy5jb20wggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIB
AQDZNscYIvF6xzGSAfa/QUIqiElyn0EUxL2b86eKiYqe91bj0gLr/MJoErLnb+OmokxqSAH6
y0zlFqSbiFwgNM8m69K6m/6YO+x3+5zBc+u6snwTWMEWygnhx3rQ/lMhoQOgArraL+/k9aWL
kNdaXQKk6EZVW9pfV2A4Lk4DoZGFjY8tJRWWDLlFkYnxDuIEpLYwJpwakv3QHOaq/G8KW0iE
jVhVzPobuZzwD2tuepY/bsClwqxz/gfAEpUvAn/lYTqnoT7RYljZlCIdbrgcG/HSYMxAy1Zp
Yh8Fx+9cqsG8O4nqo26SVfYZvrYhh8m6OqW8Vakdt7vBLCTa/QhIdJ4hAgMBAAGjOTA3MCcG
A1UdEQQgMB6BHGphbHRtYW5Ac2VjdXJlLWVuZHBvaW50cy5jb20wDAYDVR0TAQH/BAIwADAN
BgkqhkiG9w0BAQUFAAOBgQBvbvJNXUJ4atv1CExIe0J38jZqoEUTttkXOfCDT9e3mSmVboOK
ifHDyLZQC4qSsCUfP7vdwAXjKtjak22HbfX2sEKCUgtnOkxRqXMM2V/NW/ESNVQZF0TO7L/Z
cW3icObO9FIZCSmgFMt2Al7VPfMQmaJNlqu9SLmXSwbRFJ5b4zCCAxcwggKAoAMCAQICEAMF
9RTCGOz151fTpHLih+cwDQYJKoZIhvcNAQEFBQAwYjELMAkGA1UEBhMCWkExJTAjBgNVBAoT
HFRoYXd0ZSBDb25zdWx0aW5nIChQdHkpIEx0ZC4xLDAqBgNVBAMTI1RoYXd0ZSBQZXJzb25h
bCBGcmVlbWFpbCBJc3N1aW5nIENBMB4XDTA5MDgyODA0MDExOVoXDTEwMDgyODA0MDExOVow
czEPMA0GA1UEBBMGQWx0bWFuMRUwEwYDVQQqEwxKZWZmcmV5IEVyaWMxHDAaBgNVBAMTE0pl
ZmZyZXkgRXJpYyBBbHRtYW4xKzApBgkqhkiG9w0BCQEWHGphbHRtYW5Ac2VjdXJlLWVuZHBv
aW50cy5jb20wggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDZNscYIvF6xzGSAfa/
QUIqiElyn0EUxL2b86eKiYqe91bj0gLr/MJoErLnb+OmokxqSAH6y0zlFqSbiFwgNM8m69K6
m/6YO+x3+5zBc+u6snwTWMEWygnhx3rQ/lMhoQOgArraL+/k9aWLkNdaXQKk6EZVW9pfV2A4
Lk4DoZGFjY8tJRWWDLlFkYnxDuIEpLYwJpwakv3QHOaq/G8KW0iEjVhVzPobuZzwD2tuepY/
bsClwqxz/gfAEpUvAn/lYTqnoT7RYljZlCIdbrgcG/HSYMxAy1ZpYh8Fx+9cqsG8O4nqo26S
VfYZvrYhh8m6OqW8Vakdt7vBLCTa/QhIdJ4hAgMBAAGjOTA3MCcGA1UdEQQgMB6BHGphbHRt
YW5Ac2VjdXJlLWVuZHBvaW50cy5jb20wDAYDVR0TAQH/BAIwADANBgkqhkiG9w0BAQUFAAOB
gQBvbvJNXUJ4atv1CExIe0J38jZqoEUTttkXOfCDT9e3mSmVboOKifHDyLZQC4qSsCUfP7vd
wAXjKtjak22HbfX2sEKCUgtnOkxRqXMM2V/NW/ESNVQZF0TO7L/ZcW3icObO9FIZCSmgFMt2
Al7VPfMQmaJNlqu9SLmXSwbRFJ5b4zCCAz8wggKooAMCAQICAQ0wDQYJKoZIhvcNAQEFBQAw
gdExCzAJBgNVBAYTAlpBMRUwEwYDVQQIEwxXZXN0ZXJuIENhcGUxEjAQBgNVBAcTCUNhcGUg
VG93bjEaMBgGA1UEChMRVGhhd3RlIENvbnN1bHRpbmcxKDAmBgNVBAsTH0NlcnRpZmljYXRp
b24gU2VydmljZXMgRGl2aXNpb24xJDAiBgNVBAMTG1RoYXd0ZSBQZXJzb25hbCBGcmVlbWFp
bCBDQTErMCkGCSqGSIb3DQEJARYccGVyc29uYWwtZnJlZW1haWxAdGhhd3RlLmNvbTAeFw0w
MzA3MTcwMDAwMDBaFw0xMzA3MTYyMzU5NTlaMGIxCzAJBgNVBAYTAlpBMSUwIwYDVQQKExxU
aGF3dGUgQ29uc3VsdGluZyAoUHR5KSBMdGQuMSwwKgYDVQQDEyNUaGF3dGUgUGVyc29uYWwg
RnJlZW1haWwgSXNzdWluZyBDQTCBnzANBgkqhkiG9w0BAQEFAAOBjQAwgYkCgYEAxKY8VXNV
+065yplaHmjAdQRwnd/p/6Me7L3N9VvyGna9fww6YfK/Uc4B1OVQCjDXAmNaLIkVcI7dyfAr
hVqqP3FWy688Cwfn8R+RNiQqE88r1fOCdz0Dviv+uxg+B79AgAJk16emu59l0cUqVIUPSAR/
p7bRPGEEQB5kGXJgt/sCAwEAAaOBlDCBkTASBgNVHRMBAf8ECDAGAQH/AgEAMEMGA1UdHwQ8
MDowOKA2oDSGMmh0dHA6Ly9jcmwudGhhd3RlLmNvbS9UaGF3dGVQZXJzb25hbEZyZWVtYWls
Q0EuY3JsMAsGA1UdDwQEAwIBBjApBgNVHREEIjAgpB4wHDEaMBgGA1UEAxMRUHJpdmF0ZUxh
YmVsMi0xMzgwDQYJKoZIhvcNAQEFBQADgYEASIzRUIPqCy7MDaNmrGcPf6+svsIXoUOWlJ1/
TCG4+DYfqi2fNi/A9BxQIJNwPP2t4WFiw9k6GX6EsZkbAMUaC4J0niVQlGLH2ydxVyWN3amc
OY6MIE9lX5Xa9/eH1sYITq726jTlEBpbNU1341YheILcIRk13iSx0x1G/11fZU8xggNxMIID
bQIBATB2MGIxCzAJBgNVBAYTAlpBMSUwIwYDVQQKExxUaGF3dGUgQ29uc3VsdGluZyAoUHR5
KSBMdGQuMSwwKgYDVQQDEyNUaGF3dGUgUGVyc29uYWwgRnJlZW1haWwgSXNzdWluZyBDQQIQ
AwX1FMIY7PXnV9OkcuKH5zAJBgUrDgMCGgUAoIIB0DAYBgkqhkiG9w0BCQMxCwYJKoZIhvcN
AQcBMBwGCSqGSIb3DQEJBTEPFw0xMDA1MTgxODUwNDNaMCMGCSqGSIb3DQEJBDEWBBQ9FNGe
swk8hEy8Hmqos/OC8uMizzBfBgkqhkiG9w0BCQ8xUjBQMAsGCWCGSAFlAwQBAjAKBggqhkiG
9w0DBzAOBggqhkiG9w0DAgICAIAwDQYIKoZIhvcNAwICAUAwBwYFKw4DAgcwDQYIKoZIhvcN
AwICASgwgYUGCSsGAQQBgjcQBDF4MHYwYjELMAkGA1UEBhMCWkExJTAjBgNVBAoTHFRoYXd0
ZSBDb25zdWx0aW5nIChQdHkpIEx0ZC4xLDAqBgNVBAMTI1RoYXd0ZSBQZXJzb25hbCBGcmVl
bWFpbCBJc3N1aW5nIENBAhADBfUUwhjs9edX06Ry4ofnMIGHBgsqhkiG9w0BCRACCzF4oHYw
YjELMAkGA1UEBhMCWkExJTAjBgNVBAoTHFRoYXd0ZSBDb25zdWx0aW5nIChQdHkpIEx0ZC4x
LDAqBgNVBAMTI1RoYXd0ZSBQZXJzb25hbCBGcmVlbWFpbCBJc3N1aW5nIENBAhADBfUUwhjs
9edX06Ry4ofnMA0GCSqGSIb3DQEBAQUABIIBADDhDEbRmCFihSW7t/nz0MSL6lZu18SthHD0
TnHwXeHBCbKt2fI/CI4bZs2i/WBkagBIcV8qFszS29tPDH0LWfECmr7ER2CCkTzyC5ygOAmz
LJe9LsNLw8g8ibje63sG5jBXHUv0HfyhvkAWeBsCEw6Yh5Yu17kuzR3PP4Cy/lHUmmJBx48Z
Ml64X+IziHY8zRtxZy4V3QkuOjTqqdO2lMWPSpWSXac9p1Q+4kidzd88xa+Zs+uPSVzL6xXh
uDctlDARl0is4OeESUtVhuKh5tay7pgJCp8utePmcB3NGYLoBaI1ZLCOycj5V6gqfiAGpmYY
WDrtDQXwPUbS5aADPEAAAAAAAAA=
--------------ms070404050200050300090903--