[OpenAFS] even more namei/inode fileserver performance tests
chas williams
chas@cmf.nrl.navy.mil
Sat, 23 Nov 2002 20:31:13 -0500
benchmarking is one of those addictive things -- you really can't stop
once you start. i just wanted to double-check some of the results i
obtained earlier, so i reran the client -> fileserver tests over the
network:
solaris8 inode fileserver [-L -p 16 -udpSize 1m]
Version 1.02c ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
redhat72 [memcache=18m]
256M 3669 68 7695 33 1413 7 1566 24 1739 2 218.5 2
solaris7 [memcache=146m]
256M 2415 41 3054 34 1546 13 1966 17 1895 8 155.0 6
solaris8 namei fileserver [-L -p 16 -udpSize 1m]
redhat72 [memcache=18m]
256M 3730 68 7817 34 1418 7 1564 23 1738 2 220.0 2
solaris7 [memcache=146m]
256M 2458 43 2927 28 1457 12 1838 17 1769 6 149.9 6
to minimize local caching effects, both machines used memcaches. based on
the above testing, the memcache size seems to be relatively unimportant.
in the previous results, going over the loopback to the namei fileserver
didn't seem to perform as well; i believe the above results are 'more
correct'. so it seems the raw performance is virtually the same between
namei and inode for a reasonable configuration. i will mention that i
still think file 'meta' operations will be about 50% slower on namei.
that aspect doesn't concern me -- it might bother others.
what really bothers me is the truly abysmal performance (excluding the
linux write performance). the disk drive on the fileserver can easily
sustain 20MB/s. truss'ing the fileserver shows that almost all i/o was
8k in size. this isn't really a surprise, since the chunksize set by
afsd is 2^13 (although the default mentioned in afs_chunkops.h is 2^16).
tuning afsd to a bigger chunksize (15 in this case) yields the following:
solaris8 (stock inode fileserver) [-L -p 16 -udpSize 1m]
Version 1.02c ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
redhat72 [memcache=18m,chunksize=15]
256M 3968 70 8966 32 2935 12 3381 50 4310 5 96.7 1
solaris7 [memcache=146m,chunksize=15]
256M 4666 56 6339 41 3457 20 4465 39 5447 15 87.3 5
solaris8 (stock namei fileserver) [-L -p 16 -udpSize 1m]
redhat72 [memcache=18m,chunksize=15]
256M 3943 70 8914 32 2925 12 3410 50 4341 4 100.3 1
solaris7 [memcache=146m,chunksize=15]
256M 4632 55 6407 41 3499 20 4549 40 5603 14 90.1 4
i would call this performance substantially better. however, trussing
the fileserver shows that the i/o size is 16k instead of the expected
32k (2^15). now we are running into a limit in the fileserver,
AFSV_BUFFERSIZE. changing this limit to 32k, we get:
solaris8 namei fileserver (AFSV_BUFFERSIZE=32k) [-L -p 16 -udpSize 1m]
redhat72 [memcache=18m,chunksize=15]
256M 3861 68 8909 33 2893 13 3529 53 4549 5 94.9 1
solaris7 [memcache=146m,chunksize=15]
256M 4536 55 6361 40 3297 20 4483 40 5494 14 83.3 5
no performance change (slightly worse, in fact). truss shows that i/o's
are about 22k in size. this seems strange at first, until you find that
the reads and writes are done with writev/readv, RX_MAXIOVECS is 16,
and the biggest rx message is 1412 bytes -- 16 * 1412 = 22592. you
cannot increase RX_MAXIOVECS since it is generally limited by the
operating system. this should provide good performance except for the
scatter/gather i/o. the data is readv'd into a bundle of rx datagrams
(and jumbograms) and sent via sendmsg. this looks good in theory but
has a number of problems. scatter/gather may not be implemented in the
underlying layers, and 8 1k requests are not the same as 1 8k request
to a drive. i am quite certain the hme driver on solaris does a
msgpullup on scatter/gather. and of course, with a fixed-size rx
message, you are limited in your maximum transfer size.
some work has been done to solve this problem -- _CITI Technical Report
01-3, Improving File Fetches in AFS_, Charles J. Antonelli, Kevin
Coffman, Jim Rees {cja,kwc,rees}@citi.umich.edu,
http://www.citi.umich.edu/. indeed, my local group sponsored this work.
the rx changes made to support the atm channel would probably also be
generally beneficial to the udp-based rx protocol. since rx already has
some congestion control features, the biggest change needed would be
raising the rx message size to a value somewhat larger than 1412. i
believe someone else suggested this at some point in the past on one of
these lists; if i have time i will try to dig up the post. this would
solve several key issues in improving the performance of the fileserver
(for a single greedy client, of course).
i would appreciate comments on the above from people who are more in the
know about afs internals with regard to the rx protocol. is there a
reason rx messages can't be bigger? the jumbojumbogram?
btw (if anyone is still reading), perhaps the default chunksize in afsd
could be changed to 14, since that better matches optSize in the
fileserver.
for the truly bored, here are some code profiles from the fileserver
which show there is nothing left to optimize:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
33.8 4.55 4.55 _writev [3]
19.6 7.19 2.64 31073 0.08 0.08 _so_recvmsg [9]
15.7 9.30 2.11 45378 0.05 0.05 _so_sendmsg [12]
7.1 10.26 0.96 269520 0.00 0.00 _time [19]
3.3 10.70 0.44 _mcount (2065)
2.6 11.05 0.35 oldarc [22]
2.4 11.38 0.33 186396 0.00 0.00 fc_ecb_encrypt [21]
1.4 11.57 0.19 31040 0.01 0.10 rxi_ReadPacket [5]
% cumulative self self total
time seconds seconds calls ms/call ms/call name
25.6 3.15 3.15 _readv [5]
17.2 5.27 2.12 22223 0.10 0.10 _so_sendmsg [12]
12.3 6.78 1.51 46493 0.03 0.03 _so_recvmsg [16]
7.4 7.69 0.91 _mcount (2065)
5.3 8.34 0.65 211308 0.00 0.00 _time [20]
4.1 8.85 0.51 44031 0.01 0.07 rxi_ReceiveAckPacket [4]
3.2 9.24 0.39 oldarc [23]
2.5 9.55 0.31 52592 0.01 0.06 rxi_Start [6]