[OpenAFS-devel] delays and lost contact with fileserver with 1.3.84 and higher

Tom Keiser tkeiser@gmail.com
Sat, 29 Oct 2005 16:06:12 -0400


On 10/29/05, Harald Barth <haba@pdc.kth.se> wrote:
>
> > I've seen those hangs with 1.3.84, 1.3.85, 1.4.0rc1 and rc5 clients on
> > Linux (Kernel 2.6). 1.3.80 and 1.3.82 work fine, so I expect that some
> > change between 1.3.82 and 1.3.84 causes the problems.
>
> I have looked at the diff between 82 and 84, and there are major
> changes in rx which are a bit too big for me to get a hold of (lots of
> queues here and there). I have not found a way to get a grip on all
> the queues and connection flags that are used in rx.
>
> > The fileserver is
> > from transarc:
> > # rxdebug c-hoernchen -version
> > Trying 137.208.3.48 (port 7000):
> > AFS version: Base configuration afs3.4 5.77
>
> That is not very - uhm - recent.
>
> (c-hoernchen: I was not aware that there were other related chipmunks
> beyond Chip and Dale [A-Hörnchen und B-Hörnchen] [piff och puff] :-)
>
> > To track down the problem, I've captured the network traffic between
> > client and server while creating 10 files with 100k each.
>
> Was the capture done on the client or the server?
>
> I've looked and looked and now my eyes are crossed. I have found some
> things:
>
> 1.
>
> openafs-1.3.84-slow.pcap frame 4 has a fetch data with Length 999999999.
>
> 2.
>
> openafs-1.3.82-fast frames 75-78 are a store data for file f_5. This
> seems to be part of call 5474, spanning 2 IP packets with 4 fragments
> each. This shows how it should look.
>
> openafs-1.3.84-slow call 256, frames 84-85, is the corresponding one. But
> where are fragments 3 and 4? They should be in the following frames
> within milliseconds. Then call 256 stalls completely for a long time
> until it is finished in frame 96. I suspect major fishiness in the
> code that assembles and resends rx packets.
>
> I'd like to hear more about the changes to rx that were made between
> 82 and 84. What was the intended outcome?
>

As Jeff pointed out, there were substantial changes to the Rx packet
allocator.  However, these changes are all protected by the
RX_ENABLE_TSFPQ preprocessor macro, which is only defined when
AFS_PTHREAD_ENV is defined, and even then only on some platforms (at
least back when the code was committed, certain platforms were
explicitly building without the thread-local allocator).  The point
I'm trying to make here is that the problems are happening in kernel
rx, where that code is never compiled in.
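For reference, the gating looks roughly like this (a simplified sketch,
not the exact source; the precise platform conditions vary):

    /*
     * Sketch of how the thread-specific free packet queue code is gated.
     * Only user-space pthreaded builds ever define RX_ENABLE_TSFPQ; kernel
     * rx never does, so none of the TSFPQ allocator paths exist there.
     */
    #if defined(AFS_PTHREAD_ENV) && !defined(KERNEL)
    # define RX_ENABLE_TSFPQ
    #endif

    #ifdef RX_ENABLE_TSFPQ
        /* per-thread free packet queues, refilled from the global pool in bulk */
    #else
        /* classic path: every packet alloc/free goes through the global free-queue lock */
    #endif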

However, those patches also made a few optimizations that are not
pthread-specific, mostly lock coarsening.  There were many queue
scanning loops that called queue_Remove and rxi_FreePacket for every
element.  The last patch created a new pair of functions,
rxi_AllocPackets() and rxi_FreePackets(), and the old allocation and
packet-freeing loops were removed in favor of these batched calls.
The shape of that change is sketched below.
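Very roughly (illustrative code only: the queue and count variables are
made up, and the real rxi_FreePackets() signature may differ):

    /* Before: one global-lock round trip per packet being freed. */
    for (queue_Scan(&freeList, p, nxp, rx_packet)) {
        queue_Remove(p);
        rxi_FreePacket(p);        /* takes and drops the free-queue lock each time */
    }

    /* After: hand the whole batch to the allocator in one call,
     * so the free-queue lock is acquired once for the batch. */
    rxi_FreePackets(nPackets, &freeList);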

In order to figure out where the problem is, we'll need to audit the
Rx changes that affect kernel builds.  Luckily, that's a lot less code
to audit.


> > I've also noticed that in versions 1.3.80 and 1.3.82 (those that do not
> > show the delays) each store-data UDP-packet is 5700 bytes and is
> > split into four UDP fragments. However, this is also true for 1.3.84,
> > which already shows the problems. In 1.4.0rcX, the store-data packets
> > seem to be smaller, the UDP packet is only 2896 bytes and comes in two
> > fragments. Is there any specific reason why all those packets are larger
> > than the MTU?
>
> I don't know anything about the change to 1.4.0rcX, but the 4 fragments
> are a "feature" of rx. Has something changed how rx fragments are
> handled in 1.4.0?
>
> There are 2 ways in which rx tries to reduce overhead. They may or
> may not be effective.
>
> 1. It puts more than one rx packet into an IP packet. I think that's
> called a jumbogram. I think that feature is negotiated between client
> and server, and as all my servers have -nojumbo I don't get such
> packets.
>
> 2. It generates IP packets up to 4 times the MTU, according to
> RX_MAX_FRAG in src/rx/rx_globals.h. I usually (when I don't forget it)
> patch that to 1. I think this comes from the times of the Sun SS10 or
> earlier, when it was faster to send ONE IP packet with FOUR fragments
> instead of FOUR unfragmented IP packets. IMHO (*) this is bull today,
> as your throughput is devastated if you combine this scheme with packet
> loss. A packet loss of, say, 10% is multiplied to at least 40% because
> of all the resends and resends of resends. Today's computers are way
> faster at making IP packets than an SS10.
>

You have a good point here, but there is a tradeoff.  Coalescing more
data into a single syscall has performance advantages when the packet
drop rate is low enough.  Of course, it's hard to quantify what "low
enough" means.  Perhaps a good heuristic would be to dynamically scale
fragmenting based upon the packet drop rate?  A rough sketch of both
the loss amplification and such a heuristic follows.
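To make the tradeoff concrete, here is a back-of-the-envelope sketch
(standalone C, nothing from the tree; the function names, thresholds,
and drop-rate figures are made up for illustration):

    #include <stdio.h>
    #include <math.h>

    /*
     * An rx packet split into nfrags IP fragments is lost if any single
     * fragment is lost, so the effective loss rate grows quickly: with
     * p = 0.10 and nfrags = 4, 1 - (1 - p)^4 is already about 34%,
     * before counting the resends of resends mentioned above.
     */
    static double effective_loss(double frag_loss, int nfrags)
    {
        return 1.0 - pow(1.0 - frag_loss, nfrags);
    }

    /*
     * One possible heuristic in the spirit of the suggestion above:
     * back off toward one fragment per packet as the observed drop
     * rate rises.  The thresholds are arbitrary placeholders.
     */
    static int choose_nfrags(double observed_drop_rate)
    {
        if (observed_drop_rate > 0.05)
            return 1;
        if (observed_drop_rate > 0.01)
            return 2;
        return 4;   /* RX_MAX_FRAG-style default on a clean network */
    }

    int main(void)
    {
        printf("10%% fragment loss, 4 fragments -> %.0f%% rx packet loss\n",
               100.0 * effective_loss(0.10, 4));
        printf("suggested fragments at a 10%% drop rate: %d\n",
               choose_nfrags(0.10));
        return 0;
    }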

Regards,

--
Tom Keiser
tkeiser@gmail.com