[OpenAFS-devel] .35 sec rx delay bug?

Wed, 08 Nov 2006 08:03:56 -0500

In message <4551A0AA.8030305@pclella.cern.ch>,Rainer Toebbicke writes:
>is it naive to consider that if RX only works efficiently with 
>jumbograms enabled, then there is something wrong with the 
>implementation? What would it be that makes packet fragmentation and 
>reassembly so immensely more efficient compared with RX packet 
>handling? Why can TCP fill up a GigE leisurely and RX just gets about 
>half of it sweating a complete CPU?

jumbograms are a win for two reasons (at least in my view).  first,
they get more data into the kernel per syscall.  tcp gets around
this by hiding the mtu of the pipe from you and chopping up whatever
buffer you gave it inside the kernel instead of in userspace.
for those with mtus > 1500, jumbograms are certainly a win.
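
to put a rough number on the per-syscall point, here is a minimal sketch.
it assumes one send syscall per datagram; sends_needed() and the payload
sizes used with it are made up for illustration and are not taken from
the rx source.

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of datagram send syscalls needed to move
 * `total` bytes when each datagram carries at most `payload` bytes.
 * This is plain ceiling division; it ignores headers and retransmits.
 */
size_t sends_needed(size_t total, size_t payload)
{
    if (payload == 0)
        return 0;               /* avoid dividing by zero */
    return (total + payload - 1) / payload;
}
```

with these assumed sizes, moving 1 MiB (1048576 bytes) takes
sends_needed(1048576, 1400) = 749 syscalls, while a datagram holding four
times as much needs sends_needed(1048576, 5600) = 188.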

secondly, jumbograms tend to cut down on the ack processing.  handling
acks is what kills your cpu performance.  each time the rx sender gets
an ack, it goes through the entire transmit list looking for resends
(including packets that are in the future, as i recall).  this is a
significant problem when your (rx) window grows large.  the dual ack
(soft and hard) just makes this worse.
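
a toy model of why the full-list walk hurts as the window grows.  this is
not the rx code: scan_visits() is a hypothetical function that assumes
every ack walks the whole remaining transmit list and then retires a
fixed number of packets (per_ack).  it only counts how many packets get
looked at.

```c
/*
 * Hypothetical model of the sender's ack handling cost.
 * `queued`  : packets sitting on the transmit list when acks start arriving
 * `per_ack` : packets retired by each ack (assumed constant)
 * Returns the total number of list entries visited until the list is empty.
 */
unsigned long scan_visits(unsigned long queued, unsigned long per_ack)
{
    unsigned long visits = 0;

    if (per_ack == 0)
        return 0;               /* nothing is ever retired; not modelled */

    while (queued > 0) {
        visits += queued;       /* each ack walks the entire list */
        queued -= (per_ack < queued) ? per_ack : queued;
    }
    return visits;
}
```

under these assumptions a 32 packet window acked one packet at a time
costs scan_visits(32, 1) = 528 visits, while acking four packets at a
time costs scan_visits(32, 4) = 144.  the work grows roughly with the
square of the window, so fewer acks per window helps a lot.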

the other cpu performance killer seems to be memcpy.
i had some ptraces showing this.  someday i will find them.

>As a side question: anybody got an opinion whether rxi_TrimDataBufs() 
>does a job worth spending a single cycle? Aren't we talking about 
>saving on a couple dozen buffers each 1.5 kilobytes in length - on 
>machines which nowadays count memory in gigabytes? How about 
>allocating a fixed buffer with each packet and leave it with it forever?

i don't know much about rxi_TrimDataBufs().  it looks like it's trying
to get empty rx buffers (due to a "short" read, i guess) into the queue
faster.  that's going to happen anyway.  benchmark it, but i don't
think i have ever seen any traces showing significant time being
spent in this code path.