[OpenAFS] even more namei/inode fileserver performance tests

chas williams chas@cmf.nrl.navy.mil
Sat, 23 Nov 2002 20:31:13 -0500


benchmarking is one of those addictive things.  you really can't stop
once you start -- i just wanted to double-check some of the results
i obtained earlier, so i ran some client -> fileserver tests
over the network:

solaris8 inode fileserver [-L -p 16 -udpSize 1m]

 Version 1.02c       ------Sequential Output------ --Sequential Input- --Random-
 		    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
 Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
 redhat72 [memcache=18m]
                256M  3669  68  7695  33  1413   7  1566  24  1739   2 218.5   2
 solaris7 [memcache=146m]
                256M  2415  41  3054  34  1546  13  1966  17  1895   8 155.0   6

solaris8 namei fileserver [-L -p 16 -udpSize 1m]

 redhat72 [memcache=18m]
                258M  3730  68  7817  34  1418   7  1564  23  1738   2 220.0   2
 solaris7 [memcache=146m]
                256M  2458  43  2927  28  1457  12  1838  17  1769   6 149.9   6

to minimize local caching effects, both machines used memcaches.  based on
the above testing, the memcache size seems to be relatively unimportant.  in
the previous results, going over the loopback to the namei fileserver didn't seem
to perform as well; i believe the above results are 'more correct'.  so it seems the
raw performance is virtually the same between namei and inode for a reasonable
configuration.  i will mention that i still think that file 'meta' operations
will be about 50% slower on namei.  that aspect doesn't concern me -- it might
bother others.

what really bothers me is the truly abysmal performance (excluding the linux
write performance).  the disk drive on the fileserver can easily sustain
20MB/s.  truss'ing the fileserver shows that almost all i/o was 8k in size.
this isn't really a surprise, since the chunksize set by afsd is 2^13, i.e. 8k
chunks (although the default mentioned in afs_chunkops.h is 2^16).  tuning afsd
to a bigger chunksize (15 in this case) yields the results below.
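
as a quick sanity check on those numbers: the chunksize argument is just a
power-of-two shift, so the observed i/o sizes follow directly.  a trivial
sketch (the names here are mine, not from the afs source):

    #include <stdio.h>

    /* afsd's chunksize argument is log2 of the chunk size in bytes:
     * 13 -> 8k (the i/o size seen in truss above), 15 -> 32k,
     * 16 -> 64k (the default mentioned in afs_chunkops.h). */
    int main(void)
    {
        int shift[] = { 13, 14, 15, 16 };
        int i;

        for (i = 0; i < 4; i++)
            printf("chunksize %d -> %d byte chunks\n", shift[i], 1 << shift[i]);
        return 0;
    }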

solaris8 (stock inode fileserver) [-L -p 16 -udpSize 1m]

 Version 1.02c       ------Sequential Output------ --Sequential Input- --Random-
 		    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
 Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
 redhat72 [memcache=18m,chunksize=15]
                256M  3968  70  8966  32  2935  12  3381  50  4310   5  96.7   1
 solaris7 [memcache=146m,chunksize=15]
                256M  4666  56  6339  41  3457  20  4465  39  5447  15  87.3   5

solaris8 (stock namei fileserver) [-L -p 16 -udpSize 1m]

 redhat72 [memcache=18m,chunksize=15]
                256M  3943  70  8914  32  2925  12  3410  50  4341   4 100.3   1
 solaris7 [memcache=146m,chunksize=15]
                256M  4632  55  6407  41  3499  20  4549  40  5603  14  90.1   4

i would call this performance substantially better.  however, trussing
the fileserver shows that the i/o size is 16k instead of 32k (2^15).
now we are running into a limit in the fileserver, AFSV_BUFFERSIZE; the
change amounts to roughly the one-line tweak sketched below, and bumping
it to 32k gives the results that follow.
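
for reference, the change is essentially this -- the original 16k value is
my assumption based on the 16k i/o seen in truss, and the exact file the
define lives in may differ:

    /* somewhere in the fileserver source: */
    #define AFSV_BUFFERSIZE (32 * 1024)   /* was 16k, the i/o size seen above */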

solaris8 namei fileserver (AFSV_BUFFERSIZE=32k) [-L -p 16 -udpSize 1m]

 redhat72 [memcache=18m,chunksize=15]
                256M  3861  68  8909  33  2893  13  3529  53  4549   5  94.9   1
 solaris7 [memcache=146m,chunksize=15]
                256M  4536  55  6361  40  3297  20  4483  40  5494  14  83.3   5

no performance change (slightly worse in fact).  truss shows that i/o's
are about 22k in size.  this seems strange at first, until you find that
the reads and writes are done with writev/readv, RX_MAXIOVECS is 16,
and the biggest rx message is 1412 bytes -- 16 * 1412 = 22592.  you cannot
increase RX_MAXIOVECS since it is generally limited by the operating system.
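
the arithmetic, spelled out (the RX_MAX_DATA name is mine; 1412 is just the
rx message size quoted above):

    #include <stdio.h>

    #define RX_MAXIOVECS 16      /* iovec entries per readv/writev */
    #define RX_MAX_DATA  1412    /* data bytes carried per rx message */

    int main(void)
    {
        /* each iovec entry holds one rx message's worth of data, so a
         * single readv/writev tops out at 16 * 1412 = 22592 bytes --
         * the ~22k i/o size truss shows, well short of the 32k chunk. */
        printf("max bytes per readv/writev: %d\n", RX_MAXIOVECS * RX_MAX_DATA);
        return 0;
    }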

this should provide good performance, with the exception of the
scatter/gather on i/o.  the data is readv'd into a bundle of rx datagrams
(and jumbograms) and sent via sendmsg.  this looks good in theory but
has a number of problems.  scatter/gather may not be implemented in the
underlying layers.  eight 1k requests are not the same as one 8k request
to a drive.  i am quite certain the hme driver on solaris does a msgpullup
on scatter/gather.  and of course, with a fixed-size rx message, you are
limited in your maximum transfer size.
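
for anyone who hasn't looked at this path, here is its general shape.  this
is not the actual fileserver/rx code, just a minimal posix sketch (names
like fetch_and_send and pkt_data are mine, and it assumes a connected udp
socket):

    #include <sys/types.h>
    #include <sys/uio.h>
    #include <sys/socket.h>
    #include <string.h>

    #define NPKTS    16      /* bounded by RX_MAXIOVECS */
    #define PKT_DATA 1412    /* data bytes per rx message */

    /* "readv the file data straight into a batch of packet buffers,
     * then sendmsg the same buffers" */
    ssize_t fetch_and_send(int filefd, int sockfd,
                           char pkt_data[NPKTS][PKT_DATA])
    {
        struct iovec iov[NPKTS];
        struct msghdr msg;
        ssize_t nread;
        int i;

        for (i = 0; i < NPKTS; i++) {
            iov[i].iov_base = pkt_data[i];
            iov[i].iov_len = PKT_DATA;
        }

        /* one readv pulls at most NPKTS * PKT_DATA (~22k) from the file... */
        nread = readv(filefd, iov, NPKTS);
        if (nread <= 0)
            return nread;

        /* ...and sendmsg hands the same scattered buffers to the udp stack,
         * which may end up copying them back together anyway (msgpullup).
         * the real code trims the iovec list to what was actually read and
         * sends per datagram/jumbogram, with rx headers in front. */
        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = iov;
        msg.msg_iovlen = NPKTS;
        return sendmsg(sockfd, &msg, 0);
    }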

some work has been done in this area to solve this problem --
_CITI Technical Report 01-3, Improving File Fetches in AFS_, Charles
J. Antonelli, Kevin Coffman, Jim Rees {cja,kwc,rees}@citi.umich.edu,
http://www.citi.umich.edu/.  indeed, my local group sponsored this
work.  the rx changes made to support the atm channel would probably also
be generally beneficial to the udp-based rx protocol.  since rx already has
some congestion control features now, the biggest change needed would be
raising the rx message size to a value somewhat larger than 1412 bytes.
i believe someone else suggested this at some point in the past on one of
these lists; if i have time i will try to dig up the post.  this would
solve several key issues in improving the performance of the fileserver
(for a single greedy client, of course).

i would appreciate comments on the above from people who are more in the
know about afs internals with regard to the rx protocol.  is there a
reason rx messages can't be bigger?  the jumbojumbogram?

btw (if anyone is still reading), perhaps the default chunksize in
afsd could be changed to 14, since 2^14 = 16k better matches optSize in
the fileserver.

for the truly bored, here is some profiling output from the
fileserver, which shows there is not much left to optimize in the
fileserver code itself:

   %  cumulative    self              self    total
 time   seconds   seconds    calls  ms/call  ms/call name
 33.8       4.55     4.55                            _writev [3]
 19.6       7.19     2.64    31073     0.08     0.08  _so_recvmsg [9]
 15.7       9.30     2.11    45378     0.05     0.05  _so_sendmsg [12]
  7.1      10.26     0.96   269520     0.00     0.00  _time [19]
  3.3      10.70     0.44                            _mcount (2065)
  2.6      11.05     0.35                            oldarc [22]
  2.4      11.38     0.33   186396     0.00     0.00  fc_ecb_encrypt [21]
  1.4      11.57     0.19    31040     0.01     0.10  rxi_ReadPacket [5]


   %  cumulative    self              self    total
 time   seconds   seconds    calls  ms/call  ms/call name
 25.6       3.15     3.15                            _readv [5]
 17.2       5.27     2.12    22223     0.10     0.10  _so_sendmsg [12]
 12.3       6.78     1.51    46493     0.03     0.03  _so_recvmsg [16]
  7.4       7.69     0.91                            _mcount (2065)
  5.3       8.34     0.65   211308     0.00     0.00  _time [20]
  4.1       8.85     0.51    44031     0.01     0.07  rxi_ReceiveAckPacket [4]
  3.2       9.24     0.39                            oldarc [23]
  2.5       9.55     0.31    52592     0.01     0.06  rxi_Start [6]