[OpenAFS-devel] how does fileserver read from disk?

Harald Barth haba@pdc.kth.se
Thu, 15 Sep 2005 12:18:42 +0200 (MEST)


If I ignore all the headers for a moment (*) and want to send 12000
bytes, that boils down to:

|-------------------------------------------------------| 1 rx jumbogram
|---------------------------|---------------------------| 2 UDP packets 
|------|------|------|------|------|------|------|------| 8 fragments of 1500 bytes (MTU)

If we instead have a 1-to-1 mapping between rx and UDP:

|---------------------------|---------------------------| 2 rx packets  
|---------------------------|---------------------------| 2 UDP packets 
|------|------|------|------|------|------|------|------| 8 fragments of 1500 bytes (MTU)

And with a 1-to-1 mapping between UDP and MTU:

|-------------------------------------------------------| 1 rx jumbogram
|------|------|------|------|------|------|------|------| 8 UDP packets 
|------|------|------|------|------|------|------|------| 8 fragments of 1500 bytes (MTU)

Or both:

|------|------|------|------|------|------|------|------| 8 rx packets  
|------|------|------|------|------|------|------|------| 8 UDP packets 
|------|------|------|------|------|------|------|------| 8 fragments of 1500 bytes (MTU)

Right?
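
Just to make the arithmetic behind the pictures explicit, a little
back-of-the-envelope calculation (same simplification: headers
ignored, all numbers are napkin numbers, not constants from the tree):

#include <stdio.h>

/* 12000 bytes of payload, MTU 1500, and in the current scheme one UDP
 * packet carries 4 MTU worth of data. */
enum { PAYLOAD = 12000, MTU = 1500, BIG_UDP = 4 * MTU };

static int pieces(int total, int size)
{
    return (total + size - 1) / size;           /* round up */
}

int main(void)
{
    printf("big UDP packets: %d UDP, %d fragments on the wire\n",
           pieces(PAYLOAD, BIG_UDP), pieces(PAYLOAD, MTU));
    printf("1-to-1 UDP/MTU : %d UDP, %d fragments on the wire\n",
           pieces(PAYLOAD, MTU), pieces(PAYLOAD, MTU));
    return 0;
}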

I think we have to weigh the gains and losses of each of these
alternatives. Without jumbograms you don't need to care about the
special case of continuation headers. Without fragments you don't need
to resend 4 times as much data when you drop one ethernet frame
(losing one 1500-byte fragment means resending the whole UDP packet,
i.e. all 4 fragments). And then there is the impact on how the data is
read from the disk, which I know very little about.

> There are some "workarounds" to this problem.  First, we could abandon
> the current zero-copy semantics and just do very large reads and
> writes to the disk, and then do memcpy's in userspace.  For fast
> machines, this will almost certainly beat the current algorithm for
> raw throughput.  But, it's certainly not what I'd call an elegant
> solution.

Yes, the data would go diskIO->kernel->userspace->kernel->net.
On the diskIO side it would be in big chunks; on the net side
it would be in MTU or MTU*4 sized chunks. Bad?
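
Very roughly I imagine that path would look something like this (just
a sketch, CHUNK/NETSIZE and the function name are invented, and the
real code would hand each piece to an rx packet instead of a socket):

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK   (256 * 1024)        /* one big disk read          */
#define NETSIZE 1500                /* MTU-ish piece for the net  */

static ssize_t push_chunk(int diskfd, int sockfd,
                          const struct sockaddr *peer, socklen_t peerlen)
{
    static char buf[CHUNK];
    char pkt[NETSIZE];
    ssize_t got, off;

    got = read(diskfd, buf, sizeof(buf));       /* disk -> userspace */
    if (got <= 0)
        return got;

    for (off = 0; off < got; off += NETSIZE) {
        size_t len = got - off < NETSIZE ? (size_t)(got - off) : NETSIZE;

        memcpy(pkt, buf + off, len);            /* the extra copy    */
        if (sendto(sockfd, pkt, len, 0,         /* userspace -> kernel -> net */
                   peer, peerlen) < 0)
            return -1;
    }
    return got;
}

The big reads keep the disk happy, and the memcpy per MTU-sized piece
is the price for giving up zero-copy.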

> Second, we could use iovecs for the extension headers.  Unfortunately,
> most OS's limit us to 16 iovecs, so this would cut our max jumbogram
> size nearly in half.

What impact would that have? Measurements? Speculations? If half the
jumbogram size does not kill us, it sounds like an alternative worth
testing.
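
Something like this is what I picture for the iovec variant (only a
sketch; struct ext_hdr and MAX_IOV are invented and the real jumbogram
layout surely has more to it):

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* One iovec per extension header plus one per data chunk, handed to a
 * single sendmsg().  With an OS limit of 16 iovecs that is at most 8
 * header+data pairs per jumbogram -- hence "nearly half". */
#define MAX_IOV 16

struct ext_hdr { unsigned char bytes[4]; };     /* stand-in header */

static ssize_t send_jumbo(int sockfd, struct ext_hdr *hdrs,
                          char **data, size_t *lens, int npkts)
{
    struct iovec iov[MAX_IOV];
    struct msghdr msg;
    int i, n = 0;

    if (npkts > MAX_IOV / 2)        /* 2 iovecs per packet: header + data */
        npkts = MAX_IOV / 2;

    for (i = 0; i < npkts; i++) {
        iov[n].iov_base = &hdrs[i];
        iov[n++].iov_len = sizeof(hdrs[i]);
        iov[n].iov_base = data[i];
        iov[n++].iov_len = lens[i];
    }

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = iov;
    msg.msg_iovlen = n;
    return sendmsg(sockfd, &msg, 0);   /* kernel gathers, no copy here */
}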

> There is a third alternative, however: using posix async io's
> lio_listio() method to perform read-ahead / async write-behind.  For
> storedata_rxstyle, we could queue as much i/o as
> possible, and only block on disk i/o once all the data is queued in
> the kernel (or when the async queue fills).  Implementing
> fetchdata_rxstyle would be more involved, as we would probably want to
> implement some form of adaptive read-ahead scheduler.

A read-ahead scheduler does not sound very inviting.
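Queuing the reads themselves would not be much code, though; something
along these lines would queue a batch with lio_listio() (untested
sketch, NREAD and RDSIZE are invented, on Linux this needs -lrt):

#include <aio.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>

#define NREAD  8                    /* how far ahead we read */
#define RDSIZE (64 * 1024)          /* size of each read     */

static int queue_readahead(int fd, off_t offset,
                           struct aiocb cbs[NREAD], char bufs[NREAD][RDSIZE])
{
    struct aiocb *list[NREAD];
    int i;

    for (i = 0; i < NREAD; i++) {
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_fildes     = fd;
        cbs[i].aio_buf        = bufs[i];
        cbs[i].aio_nbytes     = RDSIZE;
        cbs[i].aio_offset     = offset + (off_t)i * RDSIZE;
        cbs[i].aio_lio_opcode = LIO_READ;
        cbs[i].aio_sigevent.sigev_notify = SIGEV_NONE; /* no signal */
        list[i] = &cbs[i];
    }
    /* LIO_NOWAIT: all reads get queued, the call returns at once */
    return lio_listio(LIO_NOWAIT, list, NREAD, NULL);
}

static ssize_t wait_for(struct aiocb *cb)
{
    const struct aiocb *one[1] = { cb };

    /* block only when we actually need a buffer that has not arrived */
    while (aio_error(cb) == EINPROGRESS)
        aio_suspend(one, 1, NULL);
    return aio_return(cb);
}

The queueing is the easy part; deciding how far ahead to read for
which call is where the "adaptive scheduler" pain would be.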

Harald.

(*) I should really recalculate that example starting from a typical payload,
    say one fetchdata block, but that would not have fit on the virtual
    napkin in my head.