[OpenAFS-devel] .35 sec rx delay bug?

Rainer Toebbicke rtb@pclella.cern.ch
Mon, 06 Nov 2006 14:44:46 +0100


While I saw this 350 ms delay oddity about a year ago during 
tests, I have not been able to reproduce the problem since. At the 
time I was convinced that it was caused by ACKs occasionally being 
lost, in combination with the "ack every other packet" algorithm 
(sketched below).

Lately, however, we have run RX tests again, worried about the pthreaded 
stack's performance, which is significantly worse than the LWP one's. 
A few observations:

. There are a few more places in the protocol that need a "dpf" macro 
in order to make the RX trace useful. A lock (...), the current thread 
ID in the output, and microsecond resolution in rx_debugPrint are a 
must for any serious work; something along the lines of the sketch below.
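
For illustration, a trace macro with those properties might look like
this (hypothetical, not the existing rx_debugPrint; assumes pthreads and
gcc's variadic-macro extension):

    #include <stdio.h>
    #include <pthread.h>
    #include <sys/time.h>

    static pthread_mutex_t trace_lock = PTHREAD_MUTEX_INITIALIZER;

    /* serialized by a lock, stamped with microseconds and the
     * calling thread's ID (pthread_t printed as a number here,
     * which is convenient but not strictly portable) */
    #define dpf(fmt, args...) do {                              \
        struct timeval now;                                     \
        gettimeofday(&now, NULL);                               \
        pthread_mutex_lock(&trace_lock);                        \
        fprintf(stderr, "%ld.%06ld [%lu] " fmt "\n",            \
                (long)now.tv_sec, (long)now.tv_usec,            \
                (unsigned long)pthread_self(), ##args);         \
        pthread_mutex_unlock(&trace_lock);                      \
    } while (0)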

. To counter performance drops due to high latency one might be 
tempted to increase the window sizes. However, the way the (single) 
send queue is organized, this causes repeated traversals (to 
recalculate the timeouts, for example) that start to take macroscopic 
amounts of time under locks; the pattern is sketched below. I worked 
on this a little, with so far the only result being more timeouts...  ;-)
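
Schematically, the pattern that hurts (types and names made up; the
real queue holds rx_packet structures):

    #include <pthread.h>
    #include <sys/time.h>

    struct pkt {
        struct pkt     *next;
        struct timeval  sent_at;
        struct timeval  retrans_at;
    };

    struct call {
        pthread_mutex_t lock;
        struct pkt     *sendq;       /* single send/resend queue */
    };

    /* Every deadline recalculation walks the whole queue while
     * holding the call lock, so the time spent under the lock
     * grows linearly with the window size. */
    static void recompute_timeouts(struct call *c,
                                   const struct timeval *rto)
    {
        struct pkt *p;
        pthread_mutex_lock(&c->lock);
        for (p = c->sendq; p; p = p->next) {
            p->retrans_at.tv_sec  = p->sent_at.tv_sec  + rto->tv_sec;
            p->retrans_at.tv_usec = p->sent_at.tv_usec + rto->tv_usec;
            if (p->retrans_at.tv_usec >= 1000000) {
                p->retrans_at.tv_sec  += 1;
                p->retrans_at.tv_usec -= 1000000;
            }
        }
        pthread_mutex_unlock(&c->lock);
    }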

. The maximum window size is 255 (or 254...) due to the way the ACKs 
are encoded; see the sketch below.
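
From memory, the ack payload looks roughly like this (simplified; check
the RX headers for the real rx_ackPacket):

    #include <sys/types.h>

    /* The single-byte nAcks field caps how many packets one ACK
     * can describe, hence the 255 (or, with the fencepost, 254)
     * limit on the usable window. */
    struct ack_sketch {
        u_short bufferSpace;     /* receiver's packet buffer space */
        u_short maxSkew;
        u_int   firstPacket;     /* first packet covered by acks[] */
        u_int   previousPacket;
        u_int   serial;
        u_char  reason;
        u_char  nAcks;           /* entries in acks[]: at most 255 */
        u_char  acks[255];       /* one ack/nack flag per packet   */
    };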

. With bigger windows, and a routed network, the 350 ms window for 
ACKs is actually low, and the price for retransmits is high. Here it 
makes sense to increase the timeout.
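
For scale (assumed figures): 255 packets of ~1400 bytes is roughly 
350 KB per window, so each spurious timeout can end up retransmitting 
hundreds of kilobytes that the receiver then has to discard.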

. Allocating new packets is done under a lock. As a result, incoming 
ACKs get processed late and contribute to keeping the queue size high. 
I introduced a "hint" in the call which causes the allocator to release 
and re-grab the lock between packets (sketched below). That helped 
quite a lot.
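
The idea, schematically (names hypothetical, locking deliberately
simplified):

    #include <pthread.h>
    #include <stddef.h>

    struct pkt { struct pkt *next; };

    static pthread_mutex_t freeq_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct pkt *freeq;            /* global free-packet list */

    static int alloc_packets(struct pkt **out, int n, int yield_hint)
    {
        int i;
        pthread_mutex_lock(&freeq_lock);
        for (i = 0; i < n && freeq; i++) {
            out[i] = freeq;
            freeq = freeq->next;
            if (yield_hint && i + 1 < n) {
                /* drop and re-grab so the thread processing
                 * incoming ACKs isn't starved behind us */
                pthread_mutex_unlock(&freeq_lock);
                pthread_mutex_lock(&freeq_lock);
            }
        }
        pthread_mutex_unlock(&freeq_lock);
        return i;                /* packets actually allocated */
    }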

. In the past, free packets were queued instead of stacked... something 
that is counter-productive for the level-2 cache (as far as the headers 
are concerned), as illustrated below. With the new allocation system 
this might be different; I haven't checked.
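
The difference, schematically (locking omitted): a stack hands back the
most recently freed packet, whose header is likely still warm in the
cache, while a queue hands back the coldest one.

    #include <stddef.h>

    struct pkt { struct pkt *next; /* header fields, payload... */ };

    static struct pkt *free_stack;

    static void pkt_free(struct pkt *p)
    {
        p->next = free_stack;        /* LIFO: most recent on top */
        free_stack = p;
    }

    static struct pkt *pkt_alloc(void)
    {
        struct pkt *p = free_stack;  /* header likely cache-warm */
        if (p)
            free_stack = p->next;
        return p;
    }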

. I'm currently trying to understand another puzzling case of ACKs 
being received but processed only about a millisecond later. Probably 
yet another locking problem.

I manage to fill the GigE interface at about 100-110 MB/s 
(megabytes) when the machines are on the same switch; it's more like 
the 50-60 you see when crossing a router. This is admittedly my own 
RX application, not rxperf.

Performance, however, drops dramatically once the sending end has to do 
something in addition, such as reading from disk. No matter what 
double-buffering tricks you play (sketched below), if you're slow in 
producing data the send queue runs empty, whereas if you're fast it's 
no better either, with sweet spots depending on window sizes of 
16 / 32 / 48 packets. Again, I suspect the implementation of a single 
send/resend queue degrades once it fills up.
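
For reference, the kind of double buffering I mean (a minimal pthreads
sketch; net_send() is a made-up stand-in for writing into the RX call):

    #include <pthread.h>
    #include <semaphore.h>
    #include <unistd.h>

    #define CHUNK (64 * 1024)

    static char    buf[2][CHUNK];
    static ssize_t len[2];
    static sem_t   empty[2], full[2];
    static int     infd;                 /* file being sent */

    static void *disk_reader(void *arg)  /* producer thread */
    {
        int i = 0;
        (void)arg;
        for (;;) {
            sem_wait(&empty[i]);
            len[i] = read(infd, buf[i], CHUNK);
            sem_post(&full[i]);
            if (len[i] <= 0)
                break;
            i ^= 1;                      /* ping-pong buffers */
        }
        return NULL;
    }

    static void send_loop(void (*net_send)(const char *, size_t))
    {
        int i = 0;
        for (;;) {
            sem_wait(&full[i]);
            if (len[i] <= 0)
                break;
            net_send(buf[i], (size_t)len[i]);
            sem_post(&empty[i]);
            i ^= 1;
        }
    }

    /* setup: sem_init(&empty[i], 0, 1), sem_init(&full[i], 0, 0)
     * for i = 0, 1, then pthread_create() the disk_reader */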

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics (CERN) - Geneva, Switzerland
Phone: +41 22 767 8985       Fax: +41 22 767 7155