[OpenAFS-devel] how does fileserver read from disk?

Marcus Watts mdw@umich.edu
Sat, 17 Sep 2005 05:38:52 -0400


Various wrote:
> Hi Chas!
> 
> On 16 Sep 2005, at 14:42, chas williams - CONTRACTOR wrote:
> 
> > In message <6E0E8B0D-4DAF-445C-959C-3E9B212EF35D@e18.physik.tu-muenchen.de>,
> > Roland Kuhn writes:
> >
> >> Why can't this be replaced by read(big segment) -> buffer ->
> >> sendmsg(small segments)?  AFAIK readv() is implemented in terms of
> >> read() in the kernel for almost all filesystems, so it should really
> >> only have the effect of making the disk transfer more efficient.  The
> >> msg headers interspersed with the data have to come from userspace in
> >> any case, right?
> >>
> >
> > no reason you couldn't do this, I suppose.  you would need twice the
> > number of entries in the iovec though.  you would need a special
> > version of rx_AllocWritev() that only allocates packet headers and
> > chops up a buffer you pass in.
> >
> > curious, I rewrote rx_FetchData() to read into a single buffer and
> > then memcpy() into the already allocated rx packets.  this had no
> > impact on performance as far as I could tell (my typical test read
> > was a 16k read split across 12/13 rx packets).  the big problem with
> > iovec is not really iovec, but rather that you only get 1k for each
> > rx packet you process; it's quite a bit of work to handle an rx
> > packet.  (although if your lower-level disk driver didn't support
> > scatter/gather you might see some benefit from this).
> 
> I know already that 16k-reads are non-optimal ;-) What I meant was  
> doing chunksize (1MB in my case) reads. But what I gather from this  
> discussion is that this would really be some work as this read-ahead  
> would have to be managed across several rx jumbograms, wouldn't it?
> 
> Ciao,
>                      Roland
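
(For reference, the single-buffer experiment chas describes above would
look roughly like the sketch below.  struct rx_packet is treated as
opaque, and fill_packets_from_buffer()/fill_packet_payload() are
hypothetical names for illustration, not the real rx API.)

    /* Sketch only: one large disk read, then memcpy() the payload into
     * rx packets whose headers are already set up.  fill_packet_payload()
     * is a made-up helper that would return a pointer to a packet's
     * payload area.
     */
    #include <string.h>
    #include <unistd.h>

    #define PAYLOAD_PER_PACKET 1024      /* roughly 1k of data per rx packet */

    struct rx_packet;                    /* opaque for this sketch */
    extern char *fill_packet_payload(struct rx_packet *p);  /* hypothetical */

    static int
    fill_packets_from_buffer(int fd, struct rx_packet **pkts, int npkts,
                             char *buf, size_t buflen)
    {
        ssize_t got, off = 0;
        int i;

        got = read(fd, buf, buflen);     /* one big read instead of many */
        if (got < 0)
            return -1;

        for (i = 0; i < npkts && off < got; i++) {
            size_t chunk = (size_t)(got - off);
            if (chunk > PAYLOAD_PER_PACKET)
                chunk = PAYLOAD_PER_PACKET;
            memcpy(fill_packet_payload(pkts[i]), buf + off, chunk);
            off += chunk;
        }
        return i;                        /* number of packets filled */
    }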

I'm not surprised you're not seeing a difference from these changes.

There are several potential bottlenecks in this process:

/1/ reading off of disk
/2/ cpu handling - kernel & user mode
/3/ network output

All the work you're doing with iovecs and such mainly affects the cpu
handling.  If you were doing very small disk reads (say, < 4096 bytes),
there would be a win here; the difference between 16k and 1m reads is
much less extreme.
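
A quick way to convince yourself of that is to time plain sequential
reads of a big file at different block sizes, outside the fileserver
entirely.  A standalone sketch (the program name and arguments here are
arbitrary; nothing in it is fileserver code):

    /* Sketch: read a file sequentially at the given block size and
     * report how long it took.  e.g. run it at 16384 and 1048576 and
     * compare.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        size_t bs;
        char *buf;
        int fd;
        struct timeval t0, t1;
        ssize_t n;
        long long total = 0;

        if (argc < 2) {
            fprintf(stderr, "usage: readtime file [blocksize]\n");
            return 1;
        }
        bs = (argc > 2) ? (size_t)atol(argv[2]) : 16384;
        buf = malloc(bs);
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || buf == NULL) {
            perror("setup");
            return 1;
        }

        gettimeofday(&t0, NULL);
        while ((n = read(fd, buf, bs)) > 0)
            total += n;
        gettimeofday(&t1, NULL);

        printf("%lld bytes in %ld usec at blocksize %lu\n", total,
               (long)((t1.tv_sec - t0.tv_sec) * 1000000
                      + (t1.tv_usec - t0.tv_usec)),
               (unsigned long)bs);
        close(fd);
        free(buf);
        return 0;
    }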

Regardless of whether you do small reads or large ones, the net result
is that the system has to schedule disk reads on a regular basis and
can't do anything with the data until it has it.  Once you are doing
block-aligned reads, there's little further win in doing larger reads.
The system probably already has read-ahead logic engaged - it will read
the next block into the buffer cache while your logic is crunching on
the previous one.  That already gives you all the advantages "async
I/O" would have given you.
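
About the only thing the application layer can add on top of that is an
explicit hint.  posix_fadvise() exists on most current systems for this,
but whether it changes anything is up to the kernel, so treat the sketch
below as a hint rather than a guarantee:

    /* Sketch: tell the kernel we intend to read this fd sequentially, so
     * its read-ahead can stay ahead of us.  posix_fadvise() is only a
     * hint; some systems ignore the advice entirely.
     */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>

    static void
    hint_sequential_read(int fd)
    {
    #ifdef POSIX_FADV_SEQUENTIAL
        (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    #endif
    }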

It's difficult to do much to improve the network overhead without
making incompatible changes to things, but this is your most likely
bottleneck.  If you try this with repeated reads from the same file
(so the data doesn't have to come off disk), the network will be the
dominant factor.

If you have a real load, with lots of people accessing parts of your
filesystem, the next most serious bottleneck after the network is your
disk.  With more than one person accessing your fileserver, seek time
is probably the dominant factor.  The disk driver is probably tuned for
minimum response time rather than maximum throughput, so if somebody
else requests 1k while you are in the midst of your 1mb transfer,
chances are there will be a head seek to satisfy their request at the
expense of yours.  Also, there may not be much you can do from the
application layer to alter this behavior.

You'll probably have the best luck improving things if you can come up
with a good way to instrument the existing code and figure out where
its bottlenecks actually are, and if you run experiments to find out
what maximum theoretical performance you should expect from each piece
and under what circumstances.
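
As a crude starting point for that kind of instrumentation, per-phase
timers around the disk read and the rx send, dumped every so often,
will usually show which of /1/-/3/ above dominates.  This is only a
sketch - where exactly it would hook into the fetch path is left open:

    /* Sketch: crude per-phase timing.  Call phase_start()/phase_stop()
     * around the disk read and the rx send in the fetch path, then call
     * phase_report() now and then to dump the totals.
     */
    #include <stdio.h>
    #include <sys/time.h>

    enum { PH_DISK, PH_NET, PH_MAX };

    static long long phase_usec[PH_MAX];
    static struct timeval phase_t0[PH_MAX];

    static void
    phase_start(int ph)
    {
        gettimeofday(&phase_t0[ph], NULL);
    }

    static void
    phase_stop(int ph)
    {
        struct timeval t1;

        gettimeofday(&t1, NULL);
        phase_usec[ph] += (t1.tv_sec - phase_t0[ph].tv_sec) * 1000000LL
                        + (t1.tv_usec - phase_t0[ph].tv_usec);
    }

    static void
    phase_report(void)
    {
        printf("disk %lld usec, net %lld usec\n",
               phase_usec[PH_DISK], phase_usec[PH_NET]);
    }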

				-Marcus Watts
				UM ITCS Umich Systems Group