[OpenAFS-devel] how does fileserver read from disk?

Tom Keiser tkeiser@gmail.com
Sat, 17 Sep 2005 19:11:38 -0400


On 9/17/05, Marcus Watts <mdw@umich.edu> wrote:
> Various wrote:
> > Hi Chas!
> >
> > On 16 Sep 2005, at 14:42, chas williams - CONTRACTOR wrote:
> >
> > > In message <6E0E8B0D-4DAF-445C-959C-3E9B212EF35D@e18.physik.tu-
> > > muenchen.de>, Roland Kuhn writes:
> > >
> > >> Why can't this be replaced by read(big segment)->buffer->
> > >> sendmsg(small segments). AFAIK readv() is implemented in terms of
> > >> read() in the kernel for almost all filesystems, so it should really
> > >> only have the effect of making the disk transfer more efficient. The
> > >> msg headers interspersed with the data have to come from userspace
> > >> in any case, right?
> > >>
> > >
> > > no reason you couldn't do this i suppose.  you would need twice the
> > > number of entries in the iovec though.  you would need a special
> > > version of rx_AllocWritev() that only allocates packet headers and
> > > chops up a buffer you pass in.
> > >
> > > curious, i rewrote rx_FetchData() to read into a single buffer and
> > > then memcpy() into the already allocated rx packets.  this had no
> > > impact on performance as far as i could tell (my typical test read
> > > was a 16k read split across 12/13 rx packets).  the big problem with
> > > iovec is not iovec really but rather that you only get 1k for each
> > > rx packet you process.  it's quite a bit of work to handle an rx
> > > packet.  (although if your lower level disk driver didn't support
> > > scatter/gather you might see some benefit from this.)
> >
> > I know already that 16k-reads are non-optimal ;-) What I meant was
> > doing chunksize (1MB in my case) reads. But what I gather from this
> > discussion is that this would really be some work as this read-ahead
> > would have to be managed across several rx jumbograms, wouldn't it?
> >
> > Ciao,
> >                      Roland
>
> I'm not surprised you're not seeing a difference changing things.
>
> There are several potential bottlenecks in this process:
>
> /1/ reading off of disk
> /2/ cpu handling - kernel & user mode
> /3/ network output
>
> All the work you're doing with iovecs and such is mainly manipulating
> the cpu handling.  If you were doing very small disk reads (say, < 4096
> bytes), there's a win here.  The difference between 16k and 1m is much
> less extreme.
>

Syscall overhead on many of our supported platforms is still the
dominant cost.  For instance, statistical profiling on Solaris points
to about 75% of fileserver time being spent in syscalls.  Mode
switches are very expensive operations.  I'd have to say aggregating
more data into each syscall will have to be one of the main goals of
any fileserver improvement project.
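
To make the aggregation point concrete, here's a rough sketch (not
fileserver code; send_batch, the packet count, and the sizes are all
invented for illustration) of pushing a batch of 1k payloads through
one writev() call instead of one write() per packet:

#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define NPKTS   16
#define PKTSIZE 1024

/* Send NPKTS 1k payloads with one mode switch instead of NPKTS of them. */
static ssize_t
send_batch(int fd, char bufs[NPKTS][PKTSIZE])
{
    struct iovec iov[NPKTS];
    int i;

    for (i = 0; i < NPKTS; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = PKTSIZE;
    }
    return writev(fd, iov, NPKTS);
}

The same trick applies on the inbound side with readv(): one mode
switch buys you the whole batch.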


> Regardless of whether you do small reads or big, the net result of all
> this is that the system somehow has to schedule disk reads on a regular
> basis, and can't do anything with the data until it has it.
> Once you are doing block aligned reads, there's little further win
> to doing larger reads.  The system probably already has read-ahead
> logic engaged - it will read the next block into the buffer cache while
> your logic is crunching on the previous one.  That's already giving you
> all the advantages "async I/O" would have given you.
>

Read-ahead is hardly equivalent to true async i/o.  The fileserver
performs synchronous reads.  This means we can only parallelize disk
i/o transactions _across_ several RPC calls, and only where the kernel
happens to guess correctly with a read-ahead.  However, the kernel has
far less information at its disposal regarding future i/o patterns
than the fileserver itself does.  Thus, the read-ahead decisions it
makes are far from optimal.  Async i/o gives the kernel i/o scheduler
(and the SCSI TCQ scheduler) more independent ops to work with.  This
added level of asynchrony can dramatically improve performance by
allowing the lower levels to schedule elevator seeks more optimally.
After all, I think the real goal here is to improve throughput, not
necessarily latency.  Obviously, this would be a bigger win for
storedata_rxstyle, where there's less of a head-of-line blocking
effect, and thus the order of iop completion would not negatively
affect QoS.

Let's face it, the stock configuration fileserver has around a dozen
worker threads.  With synchronous i/o that's nowhere near enough
independent i/o ops to keep even one moderately fast SCSI disk's TCQ
filled to the point where it can do appropriate seek optimization.
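
As a strawman, something along these lines (names and sizes invented,
error handling mostly omitted) would let a single worker thread keep
several reads in flight at once using the POSIX aio calls:

#include <aio.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#define NREQ  8
#define CHUNK (64 * 1024)

/* Queue NREQ overlapping reads, then reap them in order. */
static int
read_chunks(int fd, off_t base)
{
    struct aiocb cbs[NREQ];
    char *bufs[NREQ];
    int i;

    memset(cbs, 0, sizeof(cbs));
    for (i = 0; i < NREQ; i++) {
        bufs[i] = malloc(CHUNK);
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf    = bufs[i];
        cbs[i].aio_nbytes = CHUNK;
        cbs[i].aio_offset = base + (off_t)i * CHUNK;
        if (aio_read(&cbs[i]) < 0)
            return -1;
    }
    /* All NREQ ops are now visible to the i/o scheduler at once. */
    for (i = 0; i < NREQ; i++) {
        const struct aiocb *one[1] = { &cbs[i] };
        while (aio_error(&cbs[i]) == EINPROGRESS)
            aio_suspend(one, 1, NULL);
        /* aio_return() gives the byte count; ship the buffer here. */
        (void)aio_return(&cbs[i]);
        free(bufs[i]);
    }
    return 0;
}

Even eight outstanding requests per call is a big step up from the
one synchronous read at a time we issue today.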

> It's difficult to do much to improve the network overhead,
> without making incompatible changes to things.  This is your
> most likely bottleneck though.  If you try this with repeated reads
> from the same file (so it doesn't have to go out to disk), this
> will be the dominant factor.
>

Yes, network syscalls account for over 50% of the fileserver's kernel
time in such cases, but the readv()/writev() syscalls take up a
nontrivial amount of time too.  The mode switch and associated icache
flush are big hits to performance.  We need to aggregate as much data
as possible into each mode switch if we ever hope to substantially
improve throughput.
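
Roughly speaking (this is not the real rx wire format; the header and
fragment layout below are invented), the goal is to hand a whole
jumbogram's worth of data to sendmsg() in one shot:

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#define MAXFRAGS 16

/* One sendmsg() for a header plus up to MAXFRAGS payload fragments. */
static ssize_t
send_jumbogram(int sock, const struct sockaddr *peer, socklen_t peerlen,
               void *hdr, size_t hdrlen,
               char **frags, size_t *fraglens, int nfrags)
{
    struct iovec iov[1 + MAXFRAGS];
    struct msghdr msg;
    int i;

    iov[0].iov_base = hdr;
    iov[0].iov_len  = hdrlen;
    for (i = 0; i < nfrags && i < MAXFRAGS; i++) {
        iov[i + 1].iov_base = frags[i];
        iov[i + 1].iov_len  = fraglens[i];
    }

    memset(&msg, 0, sizeof(msg));
    msg.msg_name    = (void *)peer;
    msg.msg_namelen = peerlen;
    msg.msg_iov     = iov;
    msg.msg_iovlen  = i + 1;

    return sendmsg(sock, &msg, 0);
}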

> If you have a real load, with lots of people accessing parts of your
> filesystem, the next most serious bottleneck after network is
> your disk.  If you have more than one person accessing your
> fileserver, seek time is probably a dominant factor.
> The disk driver is probably tuned for minimum response time
> rather than maximum throughput.  So if somebody else requests
> 1k while you are in the midst of your 1mb transfer, chances
> are there will be a head seek to satisfy their request at the
> expense of yours.  Also, there may not be much you can do from
> the application layer to alter this behavior.
>

Which is why async read-ahead under the control of userspace would
dramatically improve QoS.  Since we have a 1:1 call-to-thread mapping,
we need to do as much in the background as possible.  The best way to
optimize this is to give the kernel sufficient information about
expected future i/o patterns.  As it stands right now, the POSIX async
i/o interfaces are one of the best ways to do that.
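
For instance (the chunk size, request count, and helper below are
invented; the buffers are assumed to be preallocated), lio_listio()
lets us queue a whole batch of upcoming reads in a single call:

#include <aio.h>
#include <string.h>
#include <sys/types.h>

#define NAHEAD 4
#define CHUNK  (256 * 1024)

/* Tell the kernel exactly what we'll want next, instead of hoping its
 * read-ahead heuristics guess right. */
static int
queue_readahead(int fd, off_t next, struct aiocb cbs[NAHEAD],
                char *bufs[NAHEAD])
{
    struct aiocb *list[NAHEAD];
    int i;

    memset(cbs, 0, sizeof(struct aiocb) * NAHEAD);
    for (i = 0; i < NAHEAD; i++) {
        cbs[i].aio_fildes     = fd;
        cbs[i].aio_buf        = bufs[i];
        cbs[i].aio_nbytes     = CHUNK;
        cbs[i].aio_offset     = next + (off_t)i * CHUNK;
        cbs[i].aio_lio_opcode = LIO_READ;
        list[i] = &cbs[i];
    }
    /* LIO_NOWAIT: all NAHEAD reads are queued at once and complete in
     * the background while the worker keeps sending previous chunks. */
    return lio_listio(LIO_NOWAIT, list, NAHEAD, NULL);
}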

Regards,

-- 
Tom Keiser
tkeiser@gmail.com