[OpenAFS-devel] .35 sec rx delay bug?
Rainer Toebbicke
rtb@pclella.cern.ch
Mon, 06 Nov 2006 14:44:46 +0100
While I saw this 350 ms delay oddity about a year ago during tests,
I have not been able to reproduce the problem since. At the time I
was convinced that it was caused by ACKs occasionally being lost, or
rather by the "ack every other packet" algorithm.
Lately, however, we have run RX tests again, worried about the
pthreaded stack's performance, which is significantly worse than that
of the LWP one.
. There are a few more places in the protocol that need a "dpf" macro
in order to make the RX trace useful. A lock (...), the current thread
ID in the output and microsecond resolution in rx_debugPrint are a
must for any serious work (see the trace-output sketch after this
list).
. In order to work against performance drops due to high latency one
might be tempted to increase the window sizes. However, the way the
(single) send queue is organized, this causes repeated traversals (in
order to recalculate the timeouts, for example) that start to take
macroscopic amounts of time under locks (see the send-queue sketch
after this list). I worked on this a little, with so far the only
result being more timeouts... ;-)
. The maximum window size is 255 (or 254...) due to the way the ACKs work.
. With bigger windows, and a routed network, the 350 ms window for
ACKs is actually low and the price for retransmits is high. Here it
makes sense to increase the timeout.
. Allocating new packets is done under a lock. As a result, incoming
ACKs get processed late and contribute to keeping the queue size high.
I introduced a "hint" in the call which causes the alloc to release
and re-grab the lock between packets (see the allocation sketch after
this list). That helped quite a lot.
. In the past free packets were queued instead of stacked... something
which is counter-productive for the level-2 cache (for the headers);
see the free-list sketch after this list. With the new allocation
system this might be different, I haven't checked.
. I'm currently trying to understand another puzzling case of ACKs
being received but processed only about a millisecond later. Probably
yet another locking problem.
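
To make the trace point above concrete: a minimal sketch of the kind
of output I mean, with a microsecond timestamp and the calling thread
in front of every line. This is not the actual OpenAFS code; the
rx_trace_dpf name and format are made up for illustration, and the
pthread_self() cast is not portable.

#include <pthread.h>
#include <stdarg.h>
#include <stdio.h>
#include <sys/time.h>

/* Hypothetical trace helper: prefix every line with a microsecond
 * timestamp and the calling thread, so interleaved events from the
 * listener and worker threads can be told apart. */
static void
rx_trace_dpf(const char *fmt, ...)
{
    struct timeval tv;
    va_list ap;

    gettimeofday(&tv, NULL);
    fprintf(stderr, "%ld.%06ld [%lu] ",
            (long)tv.tv_sec, (long)tv.tv_usec,
            (unsigned long)pthread_self());   /* cast is a simplification */

    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);
}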
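
For the window-size item: a sketch of why big windows hurt with a
single send queue. The structures below are illustrative stand-ins,
not the real RX ones; the point is that recalculating retransmit times
(or finding the packet an ACK refers to) walks the whole queue under
the call lock, so the work per event grows with the window size.

#include <pthread.h>
#include <sys/time.h>

/* Illustrative stand-ins for the real structures. */
struct pkt {
    struct pkt *next;
    struct timeval retransmit_at;  /* when to resend if still unacked */
    int acked;
};

struct call {
    pthread_mutex_t lock;
    struct pkt *send_queue;        /* one queue for sent and unsent packets */
};

/* O(window) traversal under the lock for every timeout recalculation. */
static void
recompute_timeouts(struct call *c, struct timeval now, long rtt_usec)
{
    struct pkt *p;

    pthread_mutex_lock(&c->lock);
    for (p = c->send_queue; p != NULL; p = p->next) {
        if (p->acked)
            continue;
        p->retransmit_at = now;
        p->retransmit_at.tv_usec += rtt_usec;
        if (p->retransmit_at.tv_usec >= 1000000) {
            p->retransmit_at.tv_sec += p->retransmit_at.tv_usec / 1000000;
            p->retransmit_at.tv_usec %= 1000000;
        }
    }
    pthread_mutex_unlock(&c->lock);
}

With a 32-packet window this is noise; with windows of several hundred
packets these traversals, repeated per ACK or per retransmit check,
add up to the macroscopic lock hold times mentioned above.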
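
For the packet-allocation item: roughly what the "hint" does, again as
a sketch with made-up names rather than the real allocation path.
Dropping and re-taking the free-packet lock between packets gives the
thread that is processing an incoming ACK a chance to run mid-batch.

#include <pthread.h>
#include <stddef.h>

struct rx_pkt;                        /* opaque here */

extern pthread_mutex_t freepkt_lock;  /* protects the free-packet pool */
extern struct rx_pkt *alloc_one_locked(void);  /* caller holds the lock */

/* Allocate n packets.  With 'yield' set, the lock is released and
 * re-acquired between packets, so a waiter (e.g. the ACK path) is not
 * blocked for the duration of the whole batch. */
static size_t
alloc_batch(struct rx_pkt **out, size_t n, int yield)
{
    size_t i;

    pthread_mutex_lock(&freepkt_lock);
    for (i = 0; i < n; i++) {
        out[i] = alloc_one_locked();
        if (out[i] == NULL)
            break;
        if (yield && i + 1 < n) {
            /* Give waiters a chance at the lock between packets. */
            pthread_mutex_unlock(&freepkt_lock);
            pthread_mutex_lock(&freepkt_lock);
        }
    }
    pthread_mutex_unlock(&freepkt_lock);
    return i;
}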
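
For the stacked-vs-queued free packets: a sketch with hypothetical
names, locking omitted. A LIFO free list hands back the most recently
freed packet, whose header is most likely still in the level-2 cache;
a FIFO queue always hands out the coldest one.

struct rx_pkt_s {
    struct rx_pkt_s *next;
    /* header fields, payload, ... */
};

static struct rx_pkt_s *free_list;  /* LIFO: head = most recently freed */

/* Freeing pushes at the head... */
static void
pkt_free(struct rx_pkt_s *p)
{
    p->next = free_list;
    free_list = p;
}

/* ...and allocating pops the same head, so the header touched most
 * recently (and probably still cached) is reused first.  A FIFO queue
 * (append at the tail, take from the head) returns the coldest packet
 * instead. */
static struct rx_pkt_s *
pkt_alloc(void)
{
    struct rx_pkt_s *p = free_list;

    if (p != NULL)
        free_list = p->next;
    return p;
}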
I manage to fill the GigE interface with about 100-110 MB/s
(megabytes) when machines are on the same switch, more than the 50-60
you see when crossing a router. This is admittedly my own RX
application, not rxperf.
Performance, however, drops dramatically once the sending end has to
do something in addition, such as reading from disk. No matter what
double-buffering tricks you use (a sketch follows below): if you're
slow in producing data the send queue runs empty, whereas if you're
fast it's no better either, with sweet spots depending on 16 / 32 / 48
packet window sizes. Again, I suspect the implementation of a single
send/resend queue degrades once it fills up.
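
By double buffering I mean nothing more elaborate than the sketch
below (not my actual test program; produce() and send_on_call() are
placeholders for reading the disk and writing on the call): one thread
fills a buffer while the other buffer is being sent, and they swap.

#include <pthread.h>
#include <stddef.h>

#define BUFSZ (64 * 1024)

/* Two fixed buffers: the producer fills one while the sender drains
 * the other.  filled[i] tells the sender that buffer i holds len[i]
 * bytes; len == 0 signals end of data. */
static char buf[2][BUFSZ];
static size_t len[2];
static int filled[2];
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

/* Placeholders for "read the disk" and "write on the call". */
extern size_t produce(char *dst, size_t max);         /* 0 on EOF */
extern void send_on_call(const char *src, size_t n);

static void *
producer(void *arg)
{
    int i = 0;

    (void)arg;
    for (;;) {
        size_t n;

        pthread_mutex_lock(&mtx);
        while (filled[i])            /* wait until buffer i is drained */
            pthread_cond_wait(&cv, &mtx);
        pthread_mutex_unlock(&mtx);

        n = produce(buf[i], BUFSZ);  /* fill outside the lock */

        pthread_mutex_lock(&mtx);
        len[i] = n;
        filled[i] = 1;
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&mtx);

        if (n == 0)
            return NULL;             /* EOF: sender sees len == 0 */
        i ^= 1;
    }
}

static void *
sender(void *arg)
{
    int i = 0;

    (void)arg;
    for (;;) {
        size_t n;

        pthread_mutex_lock(&mtx);
        while (!filled[i])
            pthread_cond_wait(&cv, &mtx);
        n = len[i];
        pthread_mutex_unlock(&mtx);

        if (n > 0)
            send_on_call(buf[i], n);

        pthread_mutex_lock(&mtx);
        filled[i] = 0;               /* hand the buffer back */
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&mtx);

        if (n == 0)
            return NULL;
        i ^= 1;
    }
}

Even with this, if produce() is slower than the network the send queue
runs empty, which is exactly the effect described above.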
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics (CERN) - Geneva, Switzerland
Phone: +41 22 767 8985 Fax: +41 22 767 7155