[OpenAFS-devel] how does fileserver read from disk?

Tom Keiser tkeiser@gmail.com
Wed, 14 Sep 2005 17:05:19 -0400


On 9/14/05, Roland Kuhn <rkuhn@e18.physik.tu-muenchen.de> wrote:
> Hi Tom!
>
> On 14 Sep 2005, at 11:15, Tom Keiser wrote:
>
> > On 9/14/05, Roland Kuhn <rkuhn@e18.physik.tu-muenchen.de> wrote:
> >
> >> Dear experts!
> >>
> >> Having just strace'd the fileserver (non-LWP, single-threaded) on
> >> Linux, I noticed that the data are read from disk using readv in
> >> packets of 1396 bytes, 16 kB per syscall. In the face of chunksize=1MB
> >> from the client side this does not seem terribly efficient to me, but
> >> of course I see the benefit of reading chunks which can readily be
> >> transferred. If my interpretation is wrong or this is an artifact of
> >> not using tviced, please say so (if possible with a short reference
> >> to the source), otherwise it would be nice to know why the fileserver
> >> cannot read(fd, buf, 1048576) as that would give at least one order
> >> of magnitude better performance from the RAID and (journalled)
> >> filesystem.
> >>
> >>
> >
> > This is an artifact of the bad decisions that were made when
> > implementing the rx jumbogram protocol many years ago.  Unfortunately,
> > jumbogram extension headers are interspersed between each data
> > continuation vector.  Thus, we need a separate system iovec for each
> > rx packet continuation buffer.  The end result is storedata_rxstyle
> > and fetchdata_rxstyle end up doing two vector io syscalls
> > (recvmsg+writev or readv+sendmsg) per ~16KB of data.  The jumbogram
> > protocol needs to be replaced.
>
> Thanks for the explanation. Wouldn't it be possible to keep the
> network protocol (including the sendmsg) as it is, but still to read
> bigger chunks? The outgoing messages are constructed using iovecs
> anyway, so why not intersperse the extension headers at sendmsg time?
>
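
To make the quoted point concrete, here is a rough sketch of why the
fetch path ends up doing ~16KB reads (the buffer count, sizes, and
names below are illustrative assumptions, not the actual
fetchdata_rxstyle code):

/* Illustrative sketch only -- not the actual fetchdata_rxstyle code.
 * Each rx packet carries ~1396-byte continuation buffers, and the
 * jumbogram extension headers sit between them on the wire, so the
 * disk read gets split into one iovec per continuation buffer.
 * NBUFS and the buffer layout are assumptions for illustration. */
#include <sys/uio.h>
#include <unistd.h>

#define NBUFS        12    /* continuation buffers per jumbogram (example) */
#define CONT_BUFSIZE 1396  /* rx continuation buffer payload size */

static ssize_t
read_one_jumbogram(int fd, char bufs[NBUFS][CONT_BUFSIZE])
{
    struct iovec iov[NBUFS];
    int i;

    for (i = 0; i < NBUFS; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len  = CONT_BUFSIZE;
    }
    /* ~16KB per syscall, which is exactly what strace shows */
    return readv(fd, iov, NBUFS);
}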

There are some "workarounds" to this problem.  First, we could abandon
the current zero-copy semantics and just do very large reads and
writes to the disk, and then do memcpy's in userspace.  For fast
machines, this will almost certainly beat the current algorithm for
raw throughput.  But, it's certainly not what I'd call an elegant
solution.
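
A minimal sketch of what that first workaround could look like (the
chunk size, buffer layout, and function name are assumptions for
illustration, not code from the tree):

/* Sketch of the non-zero-copy workaround: one large read from disk,
 * then memcpy into per-packet continuation buffers in userspace.
 * CHUNK_SIZE, CONT_BUFSIZE, and the buffer layout are illustrative. */
#include <string.h>
#include <unistd.h>

#define CHUNK_SIZE   (1 << 20)   /* read 1MB at a time from disk */
#define CONT_BUFSIZE 1396        /* rx continuation buffer payload */

static ssize_t
fill_packets_from_disk(int fd, char *chunk, char **pktbufs, int npkts)
{
    ssize_t nread = read(fd, chunk, CHUNK_SIZE);
    ssize_t off = 0;
    int i;

    for (i = 0; i < npkts && off < nread; i++) {
        size_t n = (size_t)(nread - off) < CONT_BUFSIZE
                       ? (size_t)(nread - off) : CONT_BUFSIZE;
        memcpy(pktbufs[i], chunk + off, n);   /* the extra copy we pay for */
        off += n;
    }
    return nread;
}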

Second, we could use iovecs for the extension headers.  Unfortunately,
most OS's limit us to 16 iovecs, so this would cut our max jumbogram
size nearly in half.
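
To see where that limit bites, here's a sketch of building such an
interleaved iovec list (the header struct and sizes are made up for
illustration): with one header iovec in front of every data iovec,
16 slots only cover 8 continuation buffers.

/* Sketch only: interleaving one header iovec per continuation buffer.
 * With an iovec limit around 16, only ~8 data buffers fit per
 * sendmsg, versus ~16 when the headers are copied into the buffers. */
#include <sys/uio.h>

#define MAX_IOVECS 16

struct jumbo_hdr { char bytes[4]; };   /* stand-in for the rx extension header */

static int
build_iovecs(struct iovec *iov, struct jumbo_hdr *hdrs,
             char **bufs, size_t buflen, int nbufs)
{
    int i, n = 0;

    for (i = 0; i < nbufs && n + 2 <= MAX_IOVECS; i++) {
        iov[n].iov_base = &hdrs[i];          /* extension header ... */
        iov[n].iov_len  = sizeof(hdrs[i]);
        n++;
        iov[n].iov_base = bufs[i];           /* ... then its data buffer */
        iov[n].iov_len  = buflen;
        n++;
    }
    return n;   /* caps out at 16 iovecs = 8 data buffers */
}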

There is a third alternative, however: using POSIX async I/O's
lio_listio() method to perform read-ahead / async write-behind.  For
storedata_rxstyle, we could queue as much i/o as
possible, and only block on disk i/o once all the data is queued in
the kernel (or when the async queue fills).  Implementing
fetchdata_rxstyle would be more involved, as we would probably want to
implement some form of adaptive read-ahead scheduler.
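
A rough sketch of the lio_listio() idea for the store path
(simplified: it blocks on the whole batch rather than doing true
write-behind, and all names and sizes are illustrative, not from the
tree):

/* Sketch of write-behind with POSIX AIO's lio_listio(): queue one
 * aiocb per buffer, then wait.  A real implementation would submit
 * with LIO_NOWAIT and only block when the async queue fills. */
#include <aio.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>

static int
queue_writes(int fd, char **bufs, size_t buflen, off_t start, int nbufs)
{
    struct aiocb *cbs = calloc(nbufs, sizeof(*cbs));
    struct aiocb **list = calloc(nbufs, sizeof(*list));
    int i, rc;

    if (!cbs || !list) {
        free(cbs);
        free(list);
        return -1;
    }

    for (i = 0; i < nbufs; i++) {
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf    = bufs[i];
        cbs[i].aio_nbytes = buflen;
        cbs[i].aio_offset = start + (off_t)i * buflen;
        cbs[i].aio_lio_opcode = LIO_WRITE;
        cbs[i].aio_sigevent.sigev_notify = SIGEV_NONE;
        list[i] = &cbs[i];
    }

    /* LIO_WAIT blocks until every write in the list completes */
    rc = lio_listio(LIO_WAIT, list, nbufs, NULL);

    free(list);
    free(cbs);
    return rc;
}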

--
Tom Keiser
tkeiser@gmail.com