[OpenAFS-devel] .35 sec rx delay bug?

chas williams - CONTRACTOR chas@cmf.nrl.navy.mil
Sat, 04 Nov 2006 12:08:04 -0500


In message <200611032026.kA3KQpXL002009@ginger.cmf.nrl.navy.mil>,Ken Hornstein writes:
>- 50 Mbyte/sec is only 40% of theoretical ... which I would call
>  "not wonderful".  I can get TCP easily up much higher, and that's mostly
>  out of the box.

rx really isn't that bad.  some testing with rxperf (lwp) shows:

~/openafs/trunk/openafs/src/rx archer.64% ./rxperf client -c send -b 1048576 -T 100 -p 7009 -s 134.207.12.89
send    100 time(s)     1048576 write bytes     1048576 recv bytes:         1397 msec
[600.473013Mb/s]

~/openafs/trunk/openafs/src/rx archer.66% ./rxperf client -c send -b 262144 -T 100 -p 7009 -s 134.207.12.89
send    100 time(s)     1048576 write bytes     1048576 recv bytes:          371 msec
[565.270080Mb/s]

~/openafs/trunk/openafs/src/rx archer.68% ./rxperf client -c send -b 131072 -T 200 -p 7009 -s 134.207.12.89
send    200 time(s)     1048576 write bytes     1048576 recv bytes:          389 msec
[539.113624Mb/s]

~/openafs/trunk/openafs/src/rx archer.70% ./rxperf client -c send -b 65536 -T 400 -p 7009 -s 134.207.12.89
send    400 time(s)     1048576 write bytes     1048576 recv bytes:          432 msec
[485.451851Mb/s]

~/openafs/trunk/openafs/src/rx archer.72% ./rxperf client -c send -b 32768 -T 800 -p 7009 -s 134.207.12.89
send    800 time(s)     1048576 write bytes     1048576 recv bytes:          553 msec
[379.231826Mb/s]

i should mention that i did tune twind to 16, which would tend to
improve the smaller -b cases some.  no, 60% of a 1Gb link isn't great,
but this performance is already much better than anyone ever sees out of
a fileserver.  i swear i had a test that pushed 800Mb/s but i can't find
it.  the primary point, though, is that the rx protocol doesn't handle
lots of short calls very well, and unfortunately that is the bulk of the
cache manager's traffic.  this is likely the main reason that tuning your
chunksize bigger in the cache manager is such a big win.
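
as a rough illustration of why (the constants below are made up for the
sketch, not measurements): once every call carries a fixed setup/teardown
cost on top of the streaming time, effective throughput falls off as the
per-call payload shrinks, which is the shape of the rxperf numbers above.

    #include <stdio.h>

    /* back-of-envelope only: assume a link that streams at ~600 Mb/s
     * once a call is running, plus a fixed per-call cost (setup, final
     * ack exchange).  both numbers are assumptions, not measurements. */
    #define LINK_MBPS      600.0
    #define CALL_OVERHEAD  0.0005   /* 0.5 ms per call, assumed */

    int
    main(void)
    {
        double total_bits = 100.0 * 1048576.0 * 8.0;   /* ~100 MB moved */
        int sizes[] = { 1048576, 262144, 131072, 65536, 32768 };
        size_t i;

        for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
            double calls = total_bits / (sizes[i] * 8.0);
            double secs = total_bits / (LINK_MBPS * 1e6)
                          + calls * CALL_OVERHEAD;
            printf("%8d bytes/call: %6.1f Mb/s effective\n",
                   sizes[i], total_bits / secs / 1e6);
        }
        return 0;
    }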

if you do any i/o benchmarking with memcache you will notice that the
writes always go faster than the reads (counter to just about anyone's
intuition).  writes are consolidated and sent inside a single call,
which is a big win.  some might say read prefetch could help here.
no, it can't.  the filesystem read holds the afs glock too long,
preventing the queued prefetch from running until the reader is about
to ask for that chunk anyway.
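
to make the glock point concrete, here is the shape of the problem as a
sketch (the names are stand-ins, not the real cache manager code):

    #include <pthread.h>

    /* stand-in for the afs glock */
    static pthread_mutex_t glock = PTHREAD_MUTEX_INITIALIZER;

    /* application read path: holds the lock across the whole fetch
     * of chunk n */
    static void
    read_chunk(int n)
    {
        pthread_mutex_lock(&glock);
        /* ... rx call to fetch chunk n, copy data to the user ... */
        pthread_mutex_unlock(&glock);
    }

    /* queued prefetch of chunk n+1: it needs the same lock, so it only
     * starts once read_chunk() drops it -- which is roughly when the
     * reader wants chunk n+1 anyway, so the prefetch overlaps nothing */
    static void
    prefetch_chunk(int n)
    {
        pthread_mutex_lock(&glock);
        /* ... rx call to fetch chunk n ... */
        pthread_mutex_unlock(&glock);
    }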

this version of rxperf was also changed to use readv instead of read
(this is fair since just about anything of import uses readv anyway).
with jumbograms this is a win, since you only send an ack at the end of
jumbogram processing, which can significantly cut down on the number of
acks (it also sends "hard" acks only and avoids "soft" acks).  acks are
a big time waster on high bw*delay networks.  you want to avoid
processing long lists of packets on the sender and spending time in the
receiver sending acks.
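
the receive loop looks roughly like this (rx_Readv()'s prototype is from
memory of src/rx, and NIOVS is just a number i picked for the sketch --
this is not the literal rxperf code):

    #include <sys/uio.h>
    #include <rx/rx.h>

    #define NIOVS 16    /* arbitrary iovec count for this sketch */

    /* drain 'bytes' from a call with rx_Readv() rather than rx_Read().
     * rx_Readv() hands back iovecs pointing at rx's own packet buffers,
     * so one pass can consume a whole jumbogram's worth of packets and
     * the ack work is amortized across all of them. */
    static long
    drain_call(struct rx_call *call, long bytes)
    {
        struct iovec iov[NIOVS];
        long total = 0;

        while (total < bytes) {
            int nio = 0;
            int want = (bytes - total > 65536) ? 65536 : (int)(bytes - total);
            int got = rx_Readv(call, iov, &nio, NIOVS, want);
            if (got <= 0)
                break;              /* call finished or aborted */
            /* iov[0..nio-1] point into rx packet buffers; touch or
             * checksum them here if the data actually matters */
            total += got;
        }
        return total;
    }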

rx could probably benefit from keeping more of its state from call to
call and being less conservative.  the fixed rx packet size is a problem,
but jumbograms are good enough to alleviate it for now.  however, the way
afs uses rx isn't helping.
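
purely as a strawman of what "keeping state" could mean (none of these
fields exist as such in rx today), the idea is per-peer hints that seed
the next call instead of starting over conservatively:

    /* strawman only -- not an existing rx structure */
    struct peer_hint {
        unsigned int   host;         /* peer address */
        unsigned short port;
        int            last_cwind;   /* congestion window at end of last call */
        int            last_twind;   /* transmit window actually reached */
        int            natMTU;       /* discovered packet/jumbogram size */
    };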

>- If the bug happens on fast SMP boxes, that makes me think things will get
>  worse as systems get faster.

don't know.  races are weird.  sometimes it's just a certain speed that
is bad.  faster or slower and the problem disappears until you hit the
next mode.