[OpenAFS-devel] how does fileserver read from disk?

Tom Keiser tkeiser@gmail.com
Tue, 20 Sep 2005 06:59:20 -0400


On 9/20/05, Marcus Watts <mdw@umich.edu> wrote:
> >
> > On 9/17/05, Marcus Watts <mdw@umich.edu> wrote:
> > > Various wrote:
> > > > Hi Chas!
> > > >
> > > > On 16 Sep 2005, at 14:42, chas williams - CONTRACTOR wrote:
> > > >
> > > > > In message <6E0E8B0D-4DAF-445C-959C-3E9B212EF35D@e18.physik.tu-
> > > > > muenchen.de>,Roland Kuhn writes:
> > > > >
> > > > >> Why can't this be replaced by read(big segment)->buffer->sendmsg
> > > > >> (small
> > > > >> segments). AFAIK readv() is implemented in terms of read() in the
> > > > >> kernel for almost all filesystems, so it should really only have the
> > > > >> effect of making the disk transfer more efficient. The msg headers
> > > > >> interspersed with the data have to come from userspace in any case,
> > > > >> right?
> > > > >>
> > > > >
> > > > > no reason you couldnt do this i suppose.  you would need twice the
> > > > > number of entries in the iovec though.  you would need a special
> > > > > version
> > > > > of rx_AllocWritev() that only allocated packet headers and chops up a
> > > > > buffer you pass in.
> > > > >
> > > > > curious, i rewrote rx_FetchData() to read into a single buffer and
> > > > > then
> > > > > memcpy() into the already allocated rx packets.  this had no impact on
> > > > > performance as far as i could tell (my typical test read was a 16k
> > > > > read
> > > > > split across 12/13 rx packets).  the big problem with iovec is not
> > > > > iovec
> > > > > really but rather that you only get 1k for each rx packet you process.
> > > > > it's quite a bit of work to handle an rx packet.  (although if your
> > > > > lower
> > > > > level disk driver didnt support scatter/gather you might see some
> > > > > benefit from this).
> > > >
> > > > I know already that 16k-reads are non-optimal ;-) What I meant was
> > > > doing chunksize (1MB in my case) reads. But what I gather from this
> > > > discussion is that this would really be some work as this read-ahead
> > > > would have to be managed across several rx jumbograms, wouldn't it?
> > > >
> > > > Ciao,
> > > >                      Roland
> > >
> > > I'm not surprised you're not seeing a difference changing things.
> > >
> > > There are several potential bottlenecks in this process:
> > >
> > > /1/ reading off of disk
> > > /2/ cpu handling - kernel & user mode
> > > /3/ network output
> > >
> > > All the work you're doing with iovecs and such is mainly manipulating
> > > the cpu handling.  If you were doing very small disk reads (say,
> > > < 4096 bytes), there's a win here.  The difference between 16k and
> > > 1m is much less extreme.
> > >
> >
> > Syscall overhead on many of our supported platforms is still the
> > dominant player.
> > For instance, statistical profiling on Solaris points to about 75% of
> > fileserver time being spent in syscalls.  Mode switches are very
> > expensive operations.  I'd have to say aggregating more data into each
> > syscall will have to be one of the main goals of any fileserver
> > improvement project.
>
> I don't think you have enough information to conclude that using
> aio will improve anything.  Mode switches are expensive, but that's
> not necessarily the most expensive part of whatever's happening.
> Validating parameters, managing address space, handling device interrupts
> and doing kernel/user space memory copies are also expensive, and some of
> that is unavoidable.  You also need to check your results on several
> different platforms.
>

I have data from several platforms.  I'm using Solaris as a canonical example.

Obviously, mode switches are just part of the CPU time usage.
However, the other time sinks you mentioned are strongly correlated
with the number of IO-related syscalls being processed.  So, I'm not
sure what your point is.

Forget the CPU time for a minute.  My argument for aio is orthogonal
to cpu utilization: AIO is about improving disk utilization, not
about reducing CPU utilization.  Not to mention, it's a fundamental
design principle that when it comes to disk i/o scheduling, wasting
some CPU time is acceptable, since the ratio of instructions retired
per completed disk IO op is so large.

Fundamentally, we're dealing with a disk throughput optimization
problem.  Disk IO schedulers are designed to balance the tradeoff
between QoS of each individual IO transaction, and seek optimization
over the pool of outstanding transactions at any given point in time.
Obviously, maintaining a large queue of independent IO transactions at
all times is essential to making the tradeoff result in an efficient
outcome.  The current fileserver implementation cannot do this.

If we can improve CPU utilization, that's great.  But, the bigger
problem right now is that we're not utilizing our disks efficiently.

Since I'm not sure why you believe parallelizing i/o will not help,
let me concisely reiterate why you are wrong (a code sketch follows
the list):

1) the fileserver uses sync i/o over about a dozen worker threads (by default)
2) each rx worker thread can only have one outstanding i/o at a time
3) a single scsi disk can handle an order of magnitude more outstanding
atomic IO ops at a time in its TCQ than the fileserver can presently provide
4) fileservers generally have more than one disk
5) despite what you say below, many platforms have very robust aio
implementations that do not use pthreads, and in some cases do not use
kernel threads either, and are instead purely event-driven
6) even for those platforms that are stuck in the early 90's using
pthreads to emulate kernel aio, there is a distinct advantage:
dedicating a thread pool to aio greatly reduces the strain on icache
and stack-related dcache for these threads compared to turning them
into full-blown rx worker threads (not to mention the rx lock
contention benefits, the ability to have multiple outstanding IOs for
a single rpc call, less register window thrashing on SPARC, etc.)
7) as I've pointed out before, the fileserver is in a much better
position to do optimal read-aheads than an adaptive algorithm in the
kernel i/o scheduler
8) fileserver-controlled read-aheads either need aio, or you have to
implement your own equivalent of junky userspace aio with a thread pool
9) on robust platforms, aio along with a redesign of the rx server api
would allow the number of threads to come close to the number of cpus,
which would dramatically reduce context switch rates
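
To make point 5 concrete, here's a minimal sketch of handing a whole
batch of independent reads to the kernel at once through the posix
lio_listio() interface.  (The fd, offsets, and queue depth are
placeholders, not fileserver code; a real version would build the
list from pending RPC calls and harvest completions with
aio_suspend()/aio_error()/aio_return().)

#include <aio.h>
#include <stdlib.h>
#include <string.h>

#define NREQS 32     /* queue depth we want the disk's TCQ to see */
#define BLKSZ 65536

int issue_batch(int fd)
{
    static struct aiocb cbs[NREQS];
    struct aiocb *list[NREQS];
    int i;

    for (i = 0; i < NREQS; i++) {
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_fildes     = fd;
        cbs[i].aio_buf        = malloc(BLKSZ);
        cbs[i].aio_nbytes     = BLKSZ;
        cbs[i].aio_offset     = (off_t)i * BLKSZ;
        cbs[i].aio_lio_opcode = LIO_READ;
        list[i] = &cbs[i];
    }
    /* LIO_NOWAIT: all NREQS transactions become visible to the
     * kernel i/o scheduler immediately, instead of dribbling in
     * one per worker thread. */
    return lio_listio(LIO_NOWAIT, list, NREQS, NULL);
}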

I'm not a steadfast supporter of the posix async spec.  But, code
reuse beats writing our own pthreaded i/o pool.  Plus, it would let us
leverage the more advanced kernel aio implementations available on
some platforms.

In case you still don't believe me, here's a quick intuitionistic proof:

If you have a sequence of IO transactions, execution time will be
dominated by seek time.  However, if you perform dependency analysis
on the sequence, issue all of the independent IO transactions in
parallel, and continue to do this as transactions retire, seek time
per transaction drops, since the seek costs are amortized over all
the transactions serviced during each elevator sweep.
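
To put rough numbers on it (illustrative figures, not measurements):
assume an 8ms average random seek and 0.2ms of transfer time per op.
With one outstanding op at a time, every op pays a full seek, so the
disk retires roughly 1/8.2ms, about 120 ops/sec.  Give the elevator
32 outstanding ops to sort, and if sorting cuts the average seek to
2ms, the same disk retires roughly 1/2.2ms, about 450 ops/sec.
Nothing about the disk changed; only the queue depth did.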

The current fileserver design simply can't keep disks busy.  Sure,
they may appear "busy" in tools such as iostat.  But, if you examine
the output more closely, you'll quickly realize that there aren't
enough concurrent transactions in flight at any one time to make the
disks busy AND efficient.  Utilization can be high, but it is NOT
efficient utilization.
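
(Concretely: on Solaris, iostat -x will happily report %b near 100
while the actv column, the average number of transactions actively
being serviced by the device, sits near 1 and wait stays near 0.
That's a saturated-looking disk doing serialized, seek-bound work,
not a deep queue being elevator-sorted.)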

> >
> >
> > > Regardless of whether you do small reads or big, the net result of all
> > > this is that the system somehow has to schedule disk reads on a regular
> > > basis, and can't do anything with the data until it has it.
> > > Once you are doing block aligned reads, there's little further win
> > > to doing larger reads.  The system probably already has read-ahead
> > > logic engaged - it will read the next block into the buffer cache while
> > > your logic is crunching on the previous one.  That's already giving you
> > > all the advantages "async I/O" would have given you.
> > >
> >
> > Read-ahead is hardly equivalent to true async i/o.  The fileserver
> > performs synchronous reads.  This means we can only parallelize disk
> > i/o transactions _across_ several RPC calls, and where the kernel
> > happens to guess correctly with a read-ahead.  However, the kernel has
> > far less information at its disposal regarding future i/o patterns
> > than does the fileserver itself.  Thus, the read-ahead decisions made
> > are far from optimal.  Async i/o gives the kernel i/o scheduler (and
> > SCSI TCQ scheduler) more atomic ops to deal with.  This added level of
> > asynchrony can dramatically improve performance by allowing the
> > lower levels to elevator seek more optimally.  After all, I think the
> > real goal here is to improve throughput, not necessarily latency.
> > Obviously, this would be a bigger win for storedata_rxstyle, where
> > there's less of a head-of-line blocking effect, and thus order of iop
> > completion would not negatively affect QoS.
> >
> > Let's face it, the stock configuration fileserver has around a dozen
> > worker threads.  With synchronous i/o that's nowhere near enough
> > independent i/o ops to keep even one moderately fast SCSI disk's TCQ
> > filled to the point where it can do appropriate seek optimization.
>
> On some if not most systems, aio is done using pthreads.
> The syscall overhead you fear is still there, it's just hidden.
>

For many platforms, this hasn't been true in a long time.  Yes, many
platforms fall back on a pthreads implementation when a specific
filesystem doesn't support kernel async io, or when the flags passed
to open aren't supported.

But more fundamentally, who cares that it's backed by pthreads in some
cases?  Your assertion that this leads to the same performance
problems is patently false.  For the purposes of this argument, I
don't care about CPU utilization.  Faster CPUs are cheap, whereas
storage draws a lot of power, and costs a considerable amount.  As
I've stated many times, disk i/o subsystems are designed to deal with
high degrees of asynchrony.  They simply can't perform well at low
levels of concurrency; physics and disk geometry don't allow it.

Thus, having a large pool of userspace threads performing blocking i/o
on behalf of a small collection of concurrently executing RPC calls
beats the heck out of the current i/o model.  Sure, it will use the
CPUs in a suboptimal manner.  But, AFS is a filesystem -- the goal
should be to increase disk throughput, not to keep CPU utilization low
on the fileserver.
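
As a strawman, even the "junky userspace aio" from point 8 above is
easy to sketch (hypothetical names and types, minus completion
signaling, error handling, and shutdown):

#include <sys/types.h>
#include <pthread.h>
#include <unistd.h>

/* one queued disk transaction; completion would wake the rx call
 * waiting on it */
struct io_req {
    struct io_req *next;
    int            fd;
    void          *buf;
    size_t         len;
    off_t          off;
    ssize_t        result;
};

static struct io_req  *io_queue;
static pthread_mutex_t io_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  io_cv   = PTHREAD_COND_INITIALIZER;

/* each of N pool threads runs this loop; with N much larger than
 * the number of rx worker threads, the kernel always has a deep
 * queue of blocking preads to elevator-sort */
static void *io_worker(void *arg)
{
    for (;;) {
        struct io_req *r;

        pthread_mutex_lock(&io_lock);
        while (io_queue == NULL)
            pthread_cond_wait(&io_cv, &io_lock);
        r = io_queue;
        io_queue = r->next;
        pthread_mutex_unlock(&io_lock);

        r->result = pread(r->fd, r->buf, r->len, r->off);
        /* ...mark r complete and signal the waiting rpc... */
    }
    return NULL;
}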

> On linux, the documentation sucks, which is hardly confidence
> inspiring.
>

Well, Linux documentation is bad in general.

SGI donated a fairly robust aio implementation for 2.4 a long time
ago.  It used a special syscall, and then had threads wait for i/o
completion.  The 2.6 aio implementation was developed as part of the
LSE, and it is fully event-driven for unbuffered i/o.  With patches
contributed by IBM LTC, that support is extended to buffered i/o as
well.
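
For reference, the 2.6 native interface looks roughly like this (a
minimal sketch using the libaio wrappers; as noted above, without the
buffered-i/o patches the fully asynchronous path requires O_DIRECT
file descriptors):

#include <sys/types.h>
#include <libaio.h>

/* submit one read through the native aio syscalls
 * (io_setup/io_submit/io_getevents) and wait for completion */
int native_aio_read(int fd, void *buf, size_t len, off_t off)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;

    if (io_setup(8, &ctx) < 0)           /* room for 8 in-flight iocbs */
        return -1;
    io_prep_pread(&cb, fd, buf, len, off);
    if (io_submit(ctx, 1, cbs) != 1) {   /* returns #iocbs accepted */
        io_destroy(ctx);
        return -1;
    }
    io_getevents(ctx, 1, 1, &ev, NULL);  /* block for one completion */
    io_destroy(ctx);
    return (int)ev.res;                  /* bytes read, or negative errno */
}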

> On solaris, the implementation of aio depends on which solaris rev
> and also on the filesystem type.
>

True, but what does this have to do with aio being useful?  That just
points to a software engineering issue, not a fundamental problem.


> Aio doesn't handle file opens, creates or unlinks, which are liable
> to be particularly expensive operations in terms of filesystem overhead.
> Generally these operations involve scattered disk I/O, and most
> filesystems take extra pains for reliability at the expense of performance.
>

Indeed.  But how does the premise that aio cannot solve every corner
case i/o problem lead to the conclusion that aio is not useful?  This
thread is about optimizing reading and writing of data files, not
dealing with corner case metadata operations.  Optimizing metadata ops
is an entirely orthogonal topic, and unfortunately pthreads is the
best answer to that problem for the moment.

> >
> > > It's difficult to do much to improve the network overhead,
> > > without making incompatible changes to things.  This is your
> > > most likely bottleneck though.  If you try this with repeated reads
> > > from the same file (so it doesn't have to go out to disk), this
> > > will be the dominant factor.
> > >
> >
> > Yes, network syscalls are taking up over 50% of the fileserver kernel
> > time in such cases, but the readv()/writev() syscalls are taking up a
> > nontrivial amount of time too.  The mode switch and associated icache
> > flush are big hits to performance.  We need to aggregate as much data
> > as possible into each mode switch if we ever hope to substantially
> > improve throughput.
>
> Too bad there isn't any portable way to send or receive multiple messages
> at once.
>

I'm aware of at least one OS that's working on a new syscall to
mitigate this bottleneck.
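
For illustration, a batched send interface would look something like
the sketch below (the shape shown is Linux's sendmmsg(), which
postdates this thread; the packet-building details are hypothetical,
and a connected UDP socket is assumed):

#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>

#define NPKT 16

/* push NPKT rx packets (header iovec + payload iovec each) to the
 * wire in a single mode switch instead of NPKT sendmsg() calls */
static int send_batch(int sock, struct iovec hdrs[], struct iovec data[])
{
    struct mmsghdr msgs[NPKT];
    struct iovec   iov[NPKT][2];
    int i;

    memset(msgs, 0, sizeof(msgs));
    for (i = 0; i < NPKT; i++) {
        iov[i][0] = hdrs[i];      /* rx packet header */
        iov[i][1] = data[i];      /* 1k data chunk */
        msgs[i].msg_hdr.msg_iov    = iov[i];
        msgs[i].msg_hdr.msg_iovlen = 2;
    }
    return sendmmsg(sock, msgs, NPKT, 0);   /* #messages sent, or -1 */
}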

> >
> > > If you have a real load, with lots of people accessing parts of your
> > > filesystem, the next most serious bottleneck after network is
> > > your disk.  If you have more than one person accessing your
> > > fileserver, seek time is probably a dominant factor.
> > > The disk driver is probably better at minimum response time
> > > rather than maximum throughput.  So if somebody else requests
> > > 1k while you are in the midst of your 1mb transfer, chances are
> > > there will be a head seek to satisfy their request at the
> > > expense of yours.  Also, there may not be much you can do from
> > > the application layer to alter this behavior.
> > >
> >
> > Which is why async read-ahead under the control of userspace would
> > dramatically improve QoS.  Since we have a 1:1 call to thread mapping,
> > we need to do as much in the background as possible.  The best way to
> > optimize this is to give the kernel sufficient information about
> > expected future i/o patterns.  As it stands right now, the posix async
> > interfaces are one of the best ways to do that.
>
> The posix async interface is a library interface standard, not a kernel
> interface.  I believe you're making far too many assumptions regarding
> its implementation.
>

Huh? What are you reading in that paragraph that conflates userspace
posix compliance libraries with kernel apis?  What do the
implementation details of aio on arbitrary platform X have to do with
my claim that in order for the kernel to make optimal i/o scheduling
decisions, it needs to have many more in-flight IO transactions?

For argument's sake, assume that platform X emulates aio with pthreads.
Even the most simplistic implementation gives the kernel orders of
magnitude more concurrent atomic IO transactions to schedule.  The
only difference between pthread emulation and a real kernel aio layer
is the amount of overhead.  Even with such a sub-par aio
implementation, we're still able to give the kernel orders of
magnitude more atomic IO ops to play with.  And, we can do this all
without scaling the number of rx threads, and thus we simultaneously
lift the limitation of one pending i/o per rpc call.  Obviously, this
improves the kernel io scheduler's ability to optimize seeks.

Furthermore, your argument regarding two independent IOs from
userspace reducing each other's QoS is totally mitigated by
read-ahead.  So long as you have adequate buffering somewhere between
the disk and network interface, this will have absolutely no effect on
QoS, assuming the disk's bandwidth can sustain both rpcs at wire
speed.  And, as I've mentioned numerous times, this type of read-ahead
is best handled by the fileserver itself, since it knows the expected
size of all running io transfers a priori.  As it stands right now,
the only form of read-ahead we have happens past a mode-switch boundary,
and is subject to predictive algorithms outside of our control.
That's far from optimal.
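
A cheap half-step that doesn't require full aio, assuming the
platform provides posix_fadvise() (illustrative, not fileserver
code): the fileserver knows the full extent of a fetch the moment the
RPC arrives, and can simply tell the kernel so.

#include <sys/types.h>
#include <fcntl.h>

/* hand the kernel our a priori knowledge of the transfer instead of
 * hoping its heuristic read-ahead guesses right */
static void hint_fetch(int fd, off_t off, off_t len)
{
    posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED);   /* start read-ahead */
    posix_fadvise(fd, off, len, POSIX_FADV_SEQUENTIAL); /* widen ra window */
}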

The operating systems I deal with on a daily basis have entire kernel
subsystems dedicated to aio, aio-specific system calls, and posix
compliance libraries wrapping the syscalls. The days of aio being a
joke are over (well, except for sockets...aio support for sockets is
still a tad rough even on the better commercial unices).

Any way you slice it, increasing i/o parallelism is the only way to
make disks busy AND efficient.  In the worst case aio implementation,
you're simply looking at a configuration where the number of threads
is orders of magnitude higher than the number of cpus.  Sure, this is
unwanted, but on those sub-par platforms you're either going to
increase parallelism this way, or by increasing the number of rx
worker threads.  And, it's pretty obvious that a bunch of dedicated
io worker threads is going to be faster, for reasons I mentioned
above.  Not to mention, it's also a more flexible i/o architecture.

--
Tom Keiser
tkeiser@gmail.com