[OpenAFS-devel] .35 sec rx delay bug?

Tom Keiser tkeiser@gmail.com
Mon, 6 Nov 2006 10:08:32 -0500


On 11/6/06, Rainer Toebbicke <rtb@pclella.cern.ch> wrote:
> While I saw this 350 ms delay oddity about a year ago during tests, I
> have not been able to reproduce the problem. At the time I was
> convinced that it was caused by ACKs occasionally being lost, or more
> precisely by the "ack every other packet" algorithm.
>
> Lately, however, we've run RX tests again, worried about the pthreaded
> stack's performance, which is significantly worse than the lwp one.
>
> . There are a few more places in the protocol that need a "dpf" macro
> in order to make the RX trace useful. A lock (...), the current thread
> ID in the output and microsecond resolution in rx_debugPrint are a
> must for any serious work.
>
> . To counter performance drops due to high latency one might be
> tempted to increase the window sizes. However, because of the way the
> (single) send queue is organized, this causes repeated traversals (to
> recalculate the timeouts, for example) that start to take macroscopic
> amounts of time under locks. I worked on this a little, so far with
> the only result being more timeouts...  ;-)
>
> . The maximum window size is 255 (or 254...) due to the way the ACKs work.
>
> . With bigger windows and a routed network, the 350 ms window for ACKs
> is actually low, and the price for retransmits is high. Here it makes
> sense to increase the timeout.
>
> . Allocating new packets is done under a lock. As a result, incoming
> ACKs get processed late and contribute to keeping the queue size high.
> I introduced a "hint" in the call which causes the allocator to
> release and re-grab the lock between packets. That helped quite a lot.
>
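
On the tracing point above: a macro that stamps every line with a
microsecond timestamp and the calling thread's id can be as small as
the sketch below.  This is purely hypothetical glue, not the actual
dpf/rx_debugPrint code; note that on a POSIX libc a single fprintf
call to one stream will not interleave with other calls, which also
covers the locking concern.

    #include <stdio.h>
    #include <pthread.h>
    #include <sys/time.h>

    /* Hypothetical trace macro: seconds.microseconds plus thread id.
     * ##__VA_ARGS__ is a GNU extension; the pthread_self() cast
     * assumes pthread_t is an integer or pointer type. */
    #define RX_TRACE(fmt, ...)                                        \
        do {                                                          \
            struct timeval tv_;                                       \
            gettimeofday(&tv_, NULL);                                 \
            fprintf(stderr, "%ld.%06ld [tid %lu] " fmt "\n",          \
                    (long)tv_.tv_sec, (long)tv_.tv_usec,              \
                    (unsigned long)pthread_self(), ##__VA_ARGS__);    \
        } while (0)

Dropping something like RX_TRACE("acked seq %d", seq) at the spots you
mention would then give per-thread, microsecond-resolution traces.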

What version of the code were you using for these experiments?  In
general, acquiring and releasing a mutex for every packet, while
iterating over a list, is going to be very expensive due to all the
added membars and atomic operations.  I'd really need to see your
patch to appreciate what's going on here.
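
To make the cost concrete, the two patterns look roughly like this
(hypothetical names and list layout, not your patch and not the actual
rx packet queue code):

    #include <pthread.h>

    struct pkt { struct pkt *next; /* ... header, payload ... */ };

    /* Pattern 1: cycle the mutex once per packet.  Every iteration
     * pays a lock/unlock pair, i.e. atomic ops plus memory barriers,
     * although the lock hold times stay short. */
    static void
    process_per_packet(pthread_mutex_t *lock, struct pkt **headp,
                       void (*work)(struct pkt *))
    {
        for (;;) {
            struct pkt *p;
            pthread_mutex_lock(lock);
            p = *headp;
            if (p != NULL)
                *headp = p->next;        /* pop a single packet */
            pthread_mutex_unlock(lock);
            if (p == NULL)
                break;
            work(p);                     /* done with the lock dropped */
        }
    }

    /* Pattern 2: detach the whole list under one lock acquisition and
     * process it outside the lock.  One lock/unlock pair in total. */
    static void
    process_batched(pthread_mutex_t *lock, struct pkt **headp,
                    void (*work)(struct pkt *))
    {
        struct pkt *batch, *p;
        pthread_mutex_lock(lock);
        batch = *headp;                  /* splice the list off its anchor */
        *headp = NULL;
        pthread_mutex_unlock(lock);
        for (p = batch; p != NULL; p = p->next)
            work(p);
    }

The second pattern is roughly what the splice-style rx_queue helpers
mentioned below are for.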

Pthreads packet allocation has been lock-less (for the common case)
since 1.3.82 (with further improvements in 1.3.83).  In addition,
there are special per-thread packets for cases where we immediately
throw away the packet (e.g. sending ACKs).  Beyond 1.3.81, there are
very few places in the code where we loop over packet queues.  In
order to reduce stores and cache invalidates, I added several new
rx_queue interfaces to perform operations such as splicing.
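
For reference, the general shape of the lock-less common case is a
per-thread free list that is consulted before the global pool; only
the refill path touches the mutex.  The names and layout below are
hypothetical (and it uses gcc's __thread for brevity where real code
would use a pthread key), so treat it as a sketch of the idea rather
than the actual rx allocator:

    #include <pthread.h>
    #include <stdlib.h>

    struct pkt { struct pkt *next; /* ... */ };

    /* Global pool, protected by a mutex. */
    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct pkt *global_free = NULL;

    /* Per-thread cache: the common-case alloc/free touch only this
     * list, so no lock and no memory barriers. */
    static __thread struct pkt *local_free = NULL;

    static struct pkt *
    pkt_alloc(void)
    {
        struct pkt *p = local_free;
        if (p != NULL) {                 /* fast path, lock-less */
            local_free = p->next;
            return p;
        }
        /* Slow path: refill from the global pool under the lock.  A
         * real allocator would grab a whole batch here, not one. */
        pthread_mutex_lock(&pool_lock);
        p = global_free;
        if (p != NULL)
            global_free = p->next;
        pthread_mutex_unlock(&pool_lock);
        if (p == NULL)
            p = malloc(sizeof(*p));      /* grow the pool as a last resort */
        return p;
    }

    static void
    pkt_free(struct pkt *p)
    {
        p->next = local_free;            /* fast path: push locally */
        local_free = p;
    }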

> . In the past, free packets were queued instead of stacked... something
> which is counter-productive for the level-2 cache (as far as the packet
> headers are concerned). With the new allocation system this might be
> different; I haven't checked.
>

I'm pretty sure I fixed this glaring locality bug, but I'll have to
verify that and get back to you ;)
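
The locality argument, for reference: a LIFO free list hands back the
packet whose header was touched most recently, so it is likely still
in cache, while a FIFO hands back the one freed longest ago, whose
header has almost certainly been evicted.  A minimal sketch of the two
disciplines (hypothetical, not the rx free-list code); note that the
per-thread list in the allocation sketch above is already LIFO:

    struct pkt { struct pkt *next; /* header fields, etc. */ };

    /* LIFO ("stack"): reuse the most recently freed packet first. */
    static struct pkt *free_stack = NULL;

    static void
    stack_free(struct pkt *p)
    {
        p->next = free_stack;
        free_stack = p;
    }

    static struct pkt *
    stack_alloc(void)
    {
        struct pkt *p = free_stack;
        if (p != NULL)
            free_stack = p->next;
        return p;                        /* header likely still warm */
    }

    /* FIFO ("queue"): reuse the packet that has sat idle the longest. */
    static struct pkt *q_head = NULL, **q_tail = &q_head;

    static void
    queue_free(struct pkt *p)
    {
        p->next = NULL;
        *q_tail = p;
        q_tail = &p->next;
    }

    static struct pkt *
    queue_alloc(void)
    {
        struct pkt *p = q_head;
        if (p != NULL) {
            q_head = p->next;
            if (q_head == NULL)          /* queue drained; reset tail */
                q_tail = &q_head;
        }
        return p;                        /* header likely cache-cold */
    }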

> . I'm currently trying to understand another puzzling case of ACKs
> being received but processed only about a millisecond later. Probably
> yet another locking problem.
>
> I manage to fill the GigE interface with about 100-110 MB/s
> (megabytes) when the machines are on the same switch, more than the
> 50-60 you see when crossing a router. This is admittedly my own RX
> application, not rxperf.
>
> Performance however drops dramatically once the sending end has to do
> something in addition, such as reading a disk. No matter what
> double-buffering tricks I try, if you're slow in producing data the
> send queue runs empty, whereas if you're fast it's no better either,
> with sweet spots depending on 16 / 32 / 48 packet window sizes. Again,
> I suspect the implementation of a single send/resend queue degrades
> once it fills up.
>

I've definitely seen the sweet spots you're talking about.  As an
interesting anecdote, on the various machines I've tested, doing
absolutely nothing besides rx_readv/rx_writev in a tight loop has
generally led to spectacularly awful performance (e.g. <10 Mb/s on
gigE).

Regards,

-Tom