[OpenAFS-devel] how does fileserver read from disk?

Marcus Watts mdw@umich.edu
Tue, 20 Sep 2005 01:55:15 -0400


> From: Tom Keiser <tkeiser@gmail.com>
> Subject: Re: [OpenAFS-devel] how does fileserver read from disk?
> Date: Sat, 17 Sep 2005 19:11:38 -0400
> 
> On 9/17/05, Marcus Watts <mdw@umich.edu> wrote:
> > Various wrote:
> > > Hi Chas!
> > >
> > > On 16 Sep 2005, at 14:42, chas williams - CONTRACTOR wrote:
> > >
> > > > In message <6E0E8B0D-4DAF-445C-959C-3E9B212EF35D@e18.physik.tu-
> > > > muenchen.de>, Roland Kuhn writes:
> > > >
> > > >> Why can't this be replaced by read(big segment)->buffer->sendmsg
> > > >> (small
> > > >> segments). AFAIK readv() is implemented in terms of read() in the
> > > >> kernel for almost all filesystems, so it should really only have the
> > > >> effect of making the disk transfer more efficient. The msg headers
> > > >> interspersed with the data have to come from userspace in any case,
> > > >> right?
> > > >>
> > > >
> > > > No reason you couldn't do this, I suppose.  You would need twice the
> > > > number of entries in the iovec, though, and you would need a special
> > > > version of rx_AllocWritev() that only allocates packet headers and
> > > > chops up a buffer you pass in.
> > > >
> > > > Curious, I rewrote rx_FetchData() to read into a single buffer and
> > > > then memcpy() into the already allocated rx packets.  This had no
> > > > impact on performance as far as I could tell (my typical test read
> > > > was a 16k read split across 12/13 rx packets).  The big problem with
> > > > iovec is not really iovec, but rather that you only get 1k for each
> > > > rx packet you process; it's quite a bit of work to handle an rx
> > > > packet.  (Although if your lower-level disk driver didn't support
> > > > scatter/gather you might see some benefit from this.)
> > >
> > > I know already that 16k-reads are non-optimal ;-) What I meant was
> > > doing chunksize (1MB in my case) reads. But what I gather from this
> > > discussion is that this would really be some work as this read-ahead
> > > would have to be managed across several rx jumbograms, wouldn't it?
> > >
> > > Ciao,
> > >                      Roland
> >
> > I'm not surprised you're not seeing a difference changing things.
> >
> > There are several potential bottlenecks in this process:
> >
> > /1/ reading off of disk
> > /2/ cpu handling - kernel & user mode
> > /3/ network output
> >
> > All the work you're doing with iovecs and such is mainly manipulating
> > the cpu handling.  If you were doing very small disk reads (say, < 4096
> > bytes), there's a win here.  The difference between 16k and 1m is much
> > less extreme.
> >
> 
> Syscall overhead on many of our supported platforms is still the
> dominant player.
> For instance, statistical profiling on Solaris points to about 75% of
> fileserver time being spent in syscalls.  Mode switches are very
> expensive operations.  I'd have to say aggregating more data into each
> syscall will have to be one of the main goals of any fileserver
> improvement project.

I don't think you have enough information to conclude that using
aio will improve anything.  Mode switches are expensive, but that's
not necessarily the most expensive part of whatever's happening.
Validating parameters, managing address space, handling device interrupts
and doing kernel/user space memory copies are also expensive, and some of
that is unavoidable.  You also need to check your results on several
different platforms.
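
For concreteness, the read-into-one-big-buffer-and-memcpy() experiment
chas describes above comes down to something like the sketch below.  The
packet layout is a simplified stand-in here, not the real struct
rx_packet, and error handling is omitted:

    #include <sys/types.h>
    #include <string.h>
    #include <unistd.h>

    #define RX_PAYLOAD 1024           /* ~1k of file data per rx packet */

    struct pkt_stub {                 /* stand-in for the real rx packet */
        char   header[28];            /* rx header, filled in elsewhere  */
        char   payload[RX_PAYLOAD];
        size_t length;
    };

    /*
     * One large, block-aligned read() syscall, then chop the chunk up
     * into the pre-allocated packets.  The readv() approach trades the
     * extra user-space memcpy() for one iovec entry per packet.
     */
    ssize_t
    fill_packets_from_fd(int fd, struct pkt_stub *pkts, int npkts,
                         char *chunkbuf, size_t chunklen)
    {
        ssize_t nread = read(fd, chunkbuf, chunklen);
        ssize_t off = 0;
        int i;

        if (nread <= 0)
            return nread;

        for (i = 0; i < npkts && off < nread; i++) {
            size_t n = (size_t)(nread - off);
            if (n > RX_PAYLOAD)
                n = RX_PAYLOAD;
            memcpy(pkts[i].payload, chunkbuf + off, n);
            pkts[i].length = n;
            off += (ssize_t)n;
        }
        return off;
    }

Whether the extra copy matters depends on how much of the per-packet cost
is the copy versus the rest of the rx bookkeeping, which is consistent
with chas seeing no difference on a 16k read.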

> 
> 
> > Regardless of whether you do small reads or big, the net result of all
> > this is that the system somehow has to schedule disk reads on a regular
> > basis, and can't do anything with the data until it has it.
> > Once you are doing block aligned reads, there's little further win
> > to doing larger reads.  The system probably already has read-ahead
> > logic engaged - it will read the next block into the buffer cache while
> > your logic is crunching on the previous one.  That's already giving you
> > all the advantages "async I/O" would have given you.
> >
> 
> Read-ahead is hardly equivalent to true async i/o.  The fileserver
> performs synchronous reads.  This means we can only parallelize disk
> i/o transactions _across_ several RPC calls, and only where the kernel
> happens to guess correctly with a read-ahead.  However, the kernel has
> far less information at its disposal regarding future i/o patterns
> than does the fileserver itself, so the read-ahead decisions it makes
> are far from optimal.  Async i/o gives the kernel i/o scheduler (and
> the SCSI TCQ scheduler) more independent ops to deal with.  This added
> level of asynchrony can dramatically improve performance by allowing
> the lower levels to do elevator seeks more optimally.  After all, I
> think the real goal here is to improve throughput, not necessarily
> latency.  Obviously, this would be a bigger win for storedata_rxstyle,
> where there's less of a head-of-line blocking effect, and thus the
> order of iop completion would not negatively affect QoS.
> 
> Let's face it, the stock configuration fileserver has around a dozen
> worker threads.  With synchronous i/o that's nowhere near enough
> independent i/o ops to keep even one moderately fast SCSI disk's TCQ
> filled to the point where it can do appropriate seek optimization.
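
(For what it's worth, the sort of batching being argued for here would
look roughly like the sketch below, using the POSIX aio interfaces.  How
well this actually behaves depends entirely on the platform's aio
implementation; the chunk size and count are arbitrary, and error and
cleanup handling are omitted.)

    #include <sys/types.h>
    #include <aio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NCHUNKS   8
    #define CHUNKSIZE (64 * 1024)

    /*
     * Queue several reads at once so the lower layers (elevator, SCSI
     * TCQ) have more than one outstanding op to reorder.  The aiocbs
     * must stay valid until completion; results are collected later
     * with aio_error()/aio_return().
     */
    int
    queue_chunk_reads(int fd, off_t start)
    {
        static struct aiocb cbs[NCHUNKS];
        struct aiocb *list[NCHUNKS];
        int i;

        for (i = 0; i < NCHUNKS; i++) {
            memset(&cbs[i], 0, sizeof(cbs[i]));
            cbs[i].aio_fildes = fd;
            cbs[i].aio_buf    = malloc(CHUNKSIZE);
            cbs[i].aio_nbytes = CHUNKSIZE;
            cbs[i].aio_offset = start + (off_t)i * CHUNKSIZE;
            cbs[i].aio_lio_opcode = LIO_READ;
            list[i] = &cbs[i];
        }
        /* LIO_NOWAIT: submit the whole batch and return immediately. */
        return lio_listio(LIO_NOWAIT, list, NCHUNKS, NULL);
    }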

On some if not most systems, aio is done using pthreads.
The syscall overhead you fear is still there; it's just hidden.

On Linux, the documentation sucks, which is hardly confidence-inspiring.

On Solaris, the implementation of aio depends on which Solaris rev
and also on the filesystem type.

Aio doesn't handle file opens, creates or unlinks, which are liable
to be particularly expensive operations in terms of filesystem overhead.
Generally these operations involve scattered disk I/O, and most
filesystems take extra pains for reliability at the expense of performance.
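
To illustrate the pthreads point above: a library-level aio can be
nothing more than a worker thread issuing the same blocking pread() the
caller would have issued itself, along the lines of this (much
simplified) sketch:

    #include <sys/types.h>
    #include <pthread.h>
    #include <unistd.h>

    struct emulated_aio {
        int       fd;
        void     *buf;
        size_t    nbytes;
        off_t     offset;
        ssize_t   result;
        pthread_t thread;
    };

    /*
     * The "async" read is just a plain blocking pread() in another
     * thread; the mode switch and the copy still happen, but they are
     * hidden from the submitting thread.
     */
    static void *
    emulated_aio_worker(void *arg)
    {
        struct emulated_aio *op = arg;
        op->result = pread(op->fd, op->buf, op->nbytes, op->offset);
        return NULL;
    }

    int
    emulated_aio_read(struct emulated_aio *op)
    {
        return pthread_create(&op->thread, NULL, emulated_aio_worker, op);
    }

    ssize_t
    emulated_aio_wait(struct emulated_aio *op)
    {
        pthread_join(op->thread, NULL);
        return op->result;
    }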

> 
> > It's difficult to do much to improve the network overhead
> > without making incompatible changes to things.  This is your
> > most likely bottleneck, though.  If you try this with repeated reads
> > from the same file (so it doesn't have to go out to disk), this
> > will be the dominant factor.
> >
> 
> Yes, network syscalls are taking up over 50% of the fileserver kernel
> time in such cases, but the readv()/writev() syscalls are taking up a
> nontrivial amount of time too.  The mode switch and associated icache
> flush are big hits to performance.  We need to aggregate as much data
> as possible into each mode switch if we ever hope to substantially
> improve throughput.

Too bad there isn't any portable way to send or receive multiple messages
at once.
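
The closest you get portably is gathering the pieces of a *single*
datagram: one sendmsg() (or writev() on a connected socket) can pull
several packet headers and payloads out of userspace in one mode switch,
but it still puts only one message on the wire.  A rough sketch, with
the rx packet contents again simplified to caller-supplied iovecs:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <string.h>

    #define MAXPKTS 16

    /*
     * Gather the header/payload pairs of the packets making up one
     * jumbogram into a single sendmsg() call: one syscall, one datagram.
     */
    ssize_t
    send_jumbogram(int sock, const struct sockaddr *peer, socklen_t peerlen,
                   struct iovec *hdrs, struct iovec *payloads, int npkts)
    {
        struct iovec iov[2 * MAXPKTS];
        struct msghdr msg;
        int i;

        if (npkts > MAXPKTS)
            return -1;

        for (i = 0; i < npkts; i++) {
            iov[2 * i]     = hdrs[i];      /* per-packet rx header */
            iov[2 * i + 1] = payloads[i];  /* ~1k of file data     */
        }

        memset(&msg, 0, sizeof(msg));
        msg.msg_name    = (void *)peer;
        msg.msg_namelen = peerlen;
        msg.msg_iov     = iov;
        msg.msg_iovlen  = npkts * 2;

        return sendmsg(sock, &msg, 0);
    }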

> 
> > If you have a real load, with lots of people accessing parts of your
> > filesystem, the next most serious bottleneck after the network is
> > your disk.  If you have more than one person accessing your
> > fileserver, seek time is probably a dominant factor.
> > The disk driver is probably tuned for minimum response time
> > rather than maximum throughput.  So if somebody else requests
> > 1k while you are in the midst of your 1mb transfer, chances
> > are there will be a head seek to satisfy their request at the
> > expense of yours.  Also, there may not be much you can do from
> > the application layer to alter this behavior.
> >
> 
> Which is why async read-ahead under the control of userspace would
> dramatically improve QoS.  Since we have a 1:1 call-to-thread mapping,
> we need to do as much in the background as possible.  The best way to
> optimize this is to give the kernel sufficient information about
> expected future i/o patterns.  As it stands right now, the posix async
> interfaces are one of the best ways to do that.

The posix async interface is a library interface standard, not a kernel
interface.  I believe you're making far too many assumptions regarding
its implementation.
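
(For completeness, and assuming an implementation where aio_read() really
does queue work with the kernel rather than with a helper thread, the
read-ahead pattern Tom is describing would look roughly like this; the
names and the double-buffering scheme are invented for the sketch:)

    #include <sys/types.h>
    #include <aio.h>
    #include <string.h>

    #define CHUNKSIZE (64 * 1024)

    static char bufs[2][CHUNKSIZE];

    static void
    start_read(struct aiocb *cb, int fd, char *buf, off_t off, size_t len)
    {
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = fd;
        cb->aio_buf    = buf;
        cb->aio_nbytes = len;
        cb->aio_offset = off;
        aio_read(cb);                    /* error handling omitted */
    }

    /*
     * Double-buffered read-ahead: the read for chunk N+1 is queued with
     * the kernel before chunk N is handed to the network code, so the
     * disk and the wire stay busy at the same time.
     */
    void
    fetch_file(int fd, off_t filelen,
               void (*send_chunk)(const char *buf, size_t len))
    {
        struct aiocb cbs[2];
        const struct aiocb *waitlist[1];
        off_t off;
        int cur = 0;

        if (filelen <= 0)
            return;

        start_read(&cbs[cur], fd, bufs[cur], 0,
                   filelen < CHUNKSIZE ? (size_t)filelen : CHUNKSIZE);

        for (off = 0; off < filelen; off += CHUNKSIZE) {
            off_t next = off + CHUNKSIZE;

            /* queue the read-ahead for the next chunk before blocking */
            if (next < filelen)
                start_read(&cbs[cur ^ 1], fd, bufs[cur ^ 1], next,
                           (filelen - next) < CHUNKSIZE
                               ? (size_t)(filelen - next) : CHUNKSIZE);

            /* wait for the current chunk, then push it out over rx */
            waitlist[0] = &cbs[cur];
            aio_suspend(waitlist, 1, NULL);
            send_chunk(bufs[cur], (size_t)aio_return(&cbs[cur]));

            cur ^= 1;
        }
    }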

> 
> Regards,
> 
> --
> Tom Keiser
> tkeiser@gmail.com
> _______________________________________________
> OpenAFS-devel mailing list
> OpenAFS-devel@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-devel
>