[OpenAFS-devel] how does fileserver read from disk?

Marcus Watts mdw@umich.edu
Tue, 20 Sep 2005 20:16:29 -0400


tkeiser@gmail.com writes:
...
> I'm not a steadfast supporter of the posix async spec.  But, code
> reuse beats writing our own pthreaded i/o pool.  Plus, it would let us
> leverage the more advanced kernel aio implementations available on
> some platforms.
...

I'm not inspired by the aio design either.  I like code reuse.
But I fear you have underestimated the design problem, and
overestimated the gains.  More on this in just a bit...

> Huh? What are you reading in that paragraph that conflates userspace
> posix compliance libraries with kernel apis?  What do the
> implementation details of aio on arbitrary platform X have to do with
> my claim that in order for the kernel to make optimal i/o scheduling
> decisions, it needs to have many more in-flight IO transactions?
> 
> For arguments sake, assume that platform X emulates aio with pthreads.
>  Even the most simplistic implementation gives the kernel orders of
> magnitude more concurrent atomic IO transactions to schedule.  The
> only difference between pthread emulation and a real kernel aio layer
> is the amount of overhead.  Even with such a sub-par aio
> implementation, we're still able to give the kernel orders of
> magnitude more atomic IO ops to play with.  And, we can do this all
> without scaling the number of rx threads, and thus we simultaneously
> lift the limitation of one pending i/o per rpc call.  Obviously, this
> improves the kernel io scheduler's ability to optimize seeks.

There are too many unknowns here for me to be at all sure
what you're talking about.  Unknowns include the kernel/lib aio
implementation and the fileserver's rx/io request handling.  Several
of your claimed goals seem to contradict each other, leaving
me very unsure as to what you are actually proposing.  So, rather
than argue vague generalities, let me propose several specific
cases:

case 1.  rx thread using pthreads does read.  This is what
		we have today.
case 2.  rx thread makes aio_read request, stupid aio implementation
		hands request to i/o worker pool, which assigns
		a free thread to issue the read for the queued request.
		Upon completion, aio thread wakes up rx thread,
		and then looks for more work in the i/o pool.
		Note the "waking" problem.
case 3.  rx thread makes aio_read request to smart aio implementation.
		Aio implementation hands request to kernel.  Kernel
		processes request and returns control to rx thread
		upon completion.  Same "waking" problem as 2.
case 4.  rx thread wants to do i/o.  It hands request to another
		thread which accumulates several requests and
		passes them to the kernel using lio_listio,LIO_WAIT.
		kernel uses super-efficient magic to do the i/o
		and returns control to fileserver thread when
		all requests are complete.
case 5.  rx thread wants to do i/o.  It hands request to another
		thread which accumulates several requests and
		passes them to the kernel using lio_listio,LIO_NOWAIT.
		kernel uses super-efficient magic to do the i/o.
		The waking problem is here as well: as each i/o
		completes, the kernel sends a notification,
		which somehow wakes the appropriate thread.

There are certainly plenty more models; if you have a better alternative
or especially if you have a particular alternative in mind, you should
certainly describe it.  If you are assuming alternative X and I am
assuming alternative Y, the resulting disagreement isn't interesting.

Case 1 is the "direct" pthreaded i/o pool implementation.
The queue of rx requests that have not yet been assigned a
thread is the waiting i/o pool; each rx worker thread is an i/o
processing thread.  A trivial way to increase i/o parallelism
would be to increase the # of rx worker threads.  Since this is
the code we have today, this approach maximizes code reuse.

Case 2 is going to be slightly less efficient than case 1, because
of the need to hand off requests between threads.  A good name
for this is the "indirect" pthreaded i/o pool implementation.
Actual runtime performance vs. case 1 is going to depend on the number
of rx threads and i/o worker threads.  Note this approach requires twice
as many active threads total as case 1 to service the same # of requests.
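
To make the hand-off cost concrete, here is a minimal sketch of what a
case 2 style i/o pool has to do per request.  The queue and all the
names are made up for illustration; none of this is existing
fileserver code:

#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* hypothetical hand-off queue for case 2: the rx thread enqueues a
 * request and sleeps; an i/o worker dequeues it, does the pread(),
 * and wakes the rx thread back up.  That's two wakeups and two
 * thread switches per request that case 1 never pays. */
struct io_req {
    struct io_req *next;
    int fd;
    void *buf;
    size_t len;
    off_t off;
    ssize_t result;
    int done;
    pthread_cond_t done_cv;
};

static struct io_req *io_queue;
static pthread_mutex_t io_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t io_work = PTHREAD_COND_INITIALIZER;

ssize_t
io_submit_and_wait(struct io_req *req)         /* runs in the rx thread */
{
    pthread_cond_init(&req->done_cv, NULL);
    pthread_mutex_lock(&io_lock);
    req->done = 0;
    req->next = io_queue;
    io_queue = req;
    pthread_cond_signal(&io_work);             /* wake an i/o worker */
    while (!req->done)
        pthread_cond_wait(&req->done_cv, &io_lock);  /* rx thread sleeps */
    pthread_mutex_unlock(&io_lock);
    return req->result;
}

void *
io_worker(void *arg)                           /* one per pool thread */
{
    for (;;) {
        struct io_req *req;

        pthread_mutex_lock(&io_lock);
        while (io_queue == NULL)
            pthread_cond_wait(&io_work, &io_lock);
        req = io_queue;
        io_queue = req->next;
        pthread_mutex_unlock(&io_lock);

        req->result = pread(req->fd, req->buf, req->len, req->off);

        pthread_mutex_lock(&io_lock);
        req->done = 1;
        pthread_cond_signal(&req->done_cv);    /* wake the rx thread */
        pthread_mutex_unlock(&io_lock);
    }
    return NULL;
}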

Case 3 is going to be nearly identical to case 1.  Substituting an
aio request for a read request on a 1-1 basis just means we've
replaced read with a more complicated way to do the identical work.
All of the opportunities for i/o parallelism that exist with aio
here also exist for read.
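
To illustrate the point, here is roughly what case 3 looks like from
inside one rx thread, using only the standard <aio.h> calls (a sketch,
not proposed code):

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* case 3, as seen by one rx thread: issue aio_read, then block until
 * the request completes.  Functionally this is just a more elaborate
 * pread(). */
static ssize_t
rx_read(int fd, void *buf, size_t len, off_t off)
{
    struct aiocb cb;
    const struct aiocb *list[1];

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;

    if (aio_read(&cb) != 0)
        return -1;

    list[0] = &cb;
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);    /* the rx thread just sleeps here */

    return aio_return(&cb);            /* bytes read, or -1 */
}

The rx thread is still blocked for the full duration of the i/o, so
unless something else is issuing more i/o concurrently this buys
nothing over pread().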

Case 4 is certainly more complicated.  The need here to accumulate
requests and issue them via lio_listio presents an interesting
metering problem.  Delaying requests until other requests complete
will reduce icache contention to a minimum, at the expense of response
time.
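
For reference, the batching step in case 4 boils down to something
like this, assuming the accumulating thread has already filled in the
aiocbs (how long it sits on requests while filling the batch is the
metering problem):

#include <aio.h>

/* case 4: issue an accumulated batch of reads and block until the
 * whole batch is done. */
int
issue_batch(struct aiocb **cbs, int n)
{
    int i;

    for (i = 0; i < n; i++)
        cbs[i]->aio_lio_opcode = LIO_READ;

    /* LIO_WAIT: lio_listio returns only when every request is done */
    return lio_listio(LIO_WAIT, cbs, n, NULL);
}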

Case 5 here has the metering problem of case 4, plus a bad
case of the "waking" problem.

All but cases 1 and (mostly) 4 have the "waking" problem.  At the
completion of an io request the kernel needs to notify the user process
it completed, which is essentially an "up" call from the kernel.  The
kernel supplies information on which request completed, then somehow
returns control to the fileserver which wakes the appropriate thread
and continues execution.  At the least this requires some parameter
copying and at least one context switch.  The overhead could be a lot
worse, if the user process then has to wake and schedule threads.  All
of this is badly documented on linux, and not much better described on
solaris.  However, the alternatives seem to include SIGEV_THREAD,
SIGEV_SIGNAL, and SIGEV_THREAD_ID.  SIGEV_THREAD is a callback from
another thread.  Not clear if that thread is created for the purpose or
recycled, but a thread context switch seems inevitable.  SIGEV_SIGNAL
sends a global signal to the process, so it would require a separate
signal listener process to do sigwait, followed by logic to somehow
retrieve the io completion and wake a thread.  SIGEV_THREAD_ID does not
appear to be universally available, which is a shame, since it seems to
require the least thread switching.  This whole area seems messy.
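
For what it's worth, the SIGEV_THREAD variant looks roughly like this;
the completion function runs in some thread the aio library supplies,
which then still has to find and wake the rx thread that issued the
request (that lookup is hand-waved here):

#include <aio.h>
#include <signal.h>

/* SIGEV_THREAD flavor of the "waking" problem: the aio library runs
 * read_done() in some other thread when the i/o completes; that
 * thread then has to locate and wake whoever submitted the request,
 * so there's at least one extra thread switch per i/o. */
static void
read_done(union sigval sv)
{
    struct aiocb *cb = sv.sival_ptr;

    (void) aio_return(cb);
    /* ... look up the waiting rx call from cb and wake it ... */
}

int
submit_read(struct aiocb *cb)
{
    cb->aio_sigevent.sigev_notify = SIGEV_THREAD;
    cb->aio_sigevent.sigev_notify_function = read_done;
    cb->aio_sigevent.sigev_notify_attributes = NULL;
    cb->aio_sigevent.sigev_value.sival_ptr = cb;

    return aio_read(cb);
}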

All of the above possibilities assume that the basic rx structure of
the fileserver remains intact.  This means that for the life of each
active rpc call, an rx thread is bound to that call.  This is not an
absolute requirement of rx per se, merely an attractive property of the
server interface that rxgen generates.  You could instead provide your own
rx_listenerproc and have a smaller # of threads manage requests using
something besides the local stack for per-request temporary store.  You
won't get much afs code reuse this way, but if you have a basic
objection to using rx threads directly as the pthreaded i/o pool
(i.e., case 1 above), this may be what you really want.

> 
> Furthermore, your argument regarding two independent IOs from
> userspace reducing each other's QoS is totally mitigated by
> read-ahead.  So long as you have adequate buffering somewhere between
> the disk and network interface, this will have absolutely no effect on
> QoS, assuming the disk's bandwidth can sustain both rpcs at wire
> speed.  And, as I've mentioned numerous times, this type of read-ahead
> is best handled by the fileserver itself, since it knows the expected
> size of all running io transfers a priori.  As it stands right now,
> the only form of read-ahead we have is past a mode-switch boundary,
> and is subject to predictive algorithms outside of our control.=20
> That's far from optimal.

I don't buy this.  If you've got two people fighting over the same disk,
you have arm contention.  Read-ahead and buffering may reduce this, but
it can't eliminate it.  The best you can hope for is to reduce it below
your network bandwidth, which is fortunately probably achievable for
common hardware situations.  Since you actually have N people, not just
two, and average file size could be quite small, interference between
people may be more important to solve than optimizing
read-ahead.

The fileserver doesn't know enough to help with read-ahead.  All you
know there is that random N sized chunk requests come from various
cache managers, sometimes sequentially.  The place where optimal
read-ahead knowledge lives is on the client side in the user's
application.  Good luck getting that knowledge.

> 
> The operating systems I deal with on a daily basis have entire kernel
> subsystems dedicated to aio, aio-specific system calls, and posix
> compliance libraries wrapping the syscalls. The days of aio being a
> joke are over (well, except for sockets...aio support for sockets is
> still a tad rough even on the better commercial unices).

Bully for you.  We've got a slightly more cost sensitive environment,
so we're busy retiring the last of our rapidly aging aix and solaris
machines.

> 
> Any way you slice it, increasing i/o parallelism is the only way to
> make disks busy AND efficient.  In the worst case aio implementation,
> you're simply looking at a configuration where the number of threads is
> orders of magnitude higher than the number of cpus.  Sure, this is
> unwanted, but on those sub-par platforms you're either going to
> increase parallelism this way, or by increasing the number of rx
> worker threads.  And, it's pretty obvious that a bunch of dedicated io
> worker threads is going to be faster for reasons I mentioned above.
> Not to mention, it's also a more flexible i/o architecture.

The proof of the pudding is in the eating.  So do you have a working
implementation yet?  How well does it work?

> 
> --
> Tom Keiser
> tkeiser@gmail.com
> 

				-Marcus Watts
				UM ITCS Umich Systems Group