[OpenAFS-devel] Faster reads on Linux

Simon Wilkinson sxw@inf.ed.ac.uk
Wed, 15 Jul 2009 00:10:25 +0100


I've been doing some work recently on improving the read performance of
the Linux client. We're in the process of moving an RPM delivery
system away from using squid caches to using an httpd backed by the
AFS disk cache - in doing so, we noticed how poorly this solution
performed compared to squid. The work I've done has more than
doubled performance in this application, and shows a substantial
benefit in many other cases - see http://homepages.inf.ed.ac.uk/sxw/2d-read.png
for iozone's view.

The patch series at /afs/inf.ed.ac.uk/user/s/sxw/Public/faster-reads/  
(against 1.4.11) is presented for comment. Essentially this series  
splits the changes into four distinct chunks:

Firstly, minimise calls to crref(), and give cache hits a fast path at
the start of readpages. We take advantage of the fact that reads will
only occur in page-sized chunks and that, providing
chunksize > pagesize, a read will never cross a chunk boundary, to
significantly reduce the amount of work that we need to do in
preparing for a read.
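
To make the shape of this concrete, here's a minimal sketch (not the
actual patch - all of the helpers are hypothetical stand-ins for the
real internals). The point is that the hit/miss decision is made once,
up front, and crref() and the rest of the setup only happen on the
miss path:

#include <linux/fs.h>
#include <linux/pagemap.h>

/* Hypothetical helpers standing in for the patch's internals. */
extern int afs_chunk_in_cache(struct inode *ip, pgoff_t index);
extern int afs_fill_page_from_cache(struct file *fp, struct page *pp);
extern int afs_readpage_slowpath(struct file *fp, struct page *pp);

static int afs_linux_readpage(struct file *fp, struct page *pp)
{
    struct inode *ip = fp->f_dentry->d_inode;

    /* Reads arrive in page-sized units, and chunksize > pagesize
     * guarantees this page lies entirely within one chunk, so a
     * single lookup answers the hit/miss question. */
    if (afs_chunk_in_cache(ip, pp->index))
        return afs_fill_page_from_cache(fp, pp);

    /* Miss: fall back to the full path, which is where crref()
     * and the rest of the read setup now live. */
    return afs_readpage_slowpath(fp, pp);
}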

Secondly, change from kmap()ing our page and using ->read() to load it,
to using the backing filesystem's own readpage() call. Because we
can't swap pages between filesystems, this requires us to create a
page to read the cached data into, and then to perform a page copy
between the backing cache's page and the AFS page. Nonetheless, this
is significantly faster than the memory management tricks we were
playing with read().
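
The mechanics look roughly like the following sketch, which uses the
standard kernel page cache calls with error handling trimmed; the
function name and structure are illustrative rather than lifted from
the patch:

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/highmem.h>

/* Illustrative: fill 'afspage' from page 'index' of the open cache
 * file 'cachefp' on the backing filesystem. */
static int afs_read_from_backing(struct file *cachefp,
                                 struct page *afspage, pgoff_t index)
{
    struct address_space *cachemapping = cachefp->f_mapping;
    struct page *cachepage;
    int code = 0;

    /* We can't swap pages between filesystems, so get (or create)
     * a page in the cache file's own mapping... */
    cachepage = find_or_create_page(cachemapping, index, GFP_KERNEL);
    if (!cachepage)
        return -ENOMEM;

    if (PageUptodate(cachepage)) {
        unlock_page(cachepage);
    } else {
        /* ...let the backing filesystem fill it... */
        code = cachemapping->a_ops->readpage(cachefp, cachepage);
        if (!code) {
            wait_on_page_locked(cachepage);
            if (!PageUptodate(cachepage))
                code = -EIO;
        }
    }

    /* ...and copy it into the AFS page. */
    if (!code) {
        copy_highpage(afspage, cachepage);
        SetPageUptodate(afspage);
    }

    page_cache_release(cachepage);
    return code;
}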

Thirdly, introduce readahead. AFS has typically had readahead disabled
(as we use prefetch to get the next chunk from the fileserver
anyway). However, disabling readahead means that there is no
opportunity to prefill the page cache with the 'next' page that the
process wants. The process will therefore block whilst that page is
fetched in from disk, which seriously degrades performance in
situations where you are streaming data off disk and out onto the
network. This patch enables readahead, but does so in the foreground.
This means that the calling process blocks not only whilst its
required data is read, but whilst the whole readahead chunk is pulled
from disk. Obviously not ideal - but it still manages to be faster
than the vanilla cache manager when streaming files.
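
In sketch form, foreground readahead is just a synchronous loop over
the next few pages of the cache file (names illustrative;
read_mapping_page() blocks until each page is up to date, which is
exactly the blocking the final patch removes):

#include <linux/fs.h>
#include <linux/pagemap.h>

/* Illustrative: synchronously pull the next 'npages' pages of the
 * open cache file 'cachefp' into the page cache, so that the
 * following readpage calls hit memory rather than disk. */
static void afs_foreground_readahead(struct file *cachefp,
                                     pgoff_t start, unsigned int npages)
{
    struct address_space *mapping = cachefp->f_mapping;
    unsigned int i;

    for (i = 0; i < npages; i++) {
        struct page *pp = read_mapping_page(mapping, start + i, cachefp);

        if (IS_ERR(pp))
            break;
        /* We only wanted the page resident; drop our reference. */
        page_cache_release(pp);
    }
}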

Finally, do the readahead in the background. This actually ends up
being harder than it looks, as Linux won't share the necessary
information with us. Whilst very new kernels do provide the means to
watch for page unlocks, we can't use it, and nor can we create our own
worker queues in which to do the waiting. So, we take a two-pronged
approach. The AFS module gains a new kernel thread, which examines all
of the pages which are being read from disk. As each page becomes
unlocked (indicating that its read has completed), the thread
transfers it into a queue for a separate worker task. This worker task
lives in the kernel's work queue, and takes care of copying the data
from the page that has just been read into the AFS page, and then
unlocking the AFS page (to mark it as ready for use). We do this in
the work queue, as it means we can be copying one page per processor
on the system, giving us some kind of parallelism. All of this is
implemented using 'real' Linux locks, so we don't serialise around the
GLOCK at all.
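
The sketch below shows the shape of that pipeline - illustrative names
and simplified list handling, not the patch's actual structures. The
kernel thread polls the pending list for cache pages whose IO has
completed, and hands each one to the shared kernel work queue, where
the copy and the final unlock of the AFS page happen:

#include <linux/kernel.h>
#include <linux/kthread.h>
#include <linux/workqueue.h>
#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/slab.h>
#include <linux/sched.h>

struct afs_pagecopy_task {
    struct list_head link;      /* on the poller's pending list */
    struct page *cachepage;     /* page being read by the backing fs */
    struct page *afspage;       /* locked AFS page awaiting its data */
    struct work_struct work;    /* the copy stage */
};

static LIST_HEAD(afs_pending_reads);
static DEFINE_SPINLOCK(afs_pending_lock);

/* Stage 2: runs in the kernel's work queue, one item per page, so
 * copies can proceed on every processor in parallel. */
static void afs_pagecopy_worker(struct work_struct *work)
{
    struct afs_pagecopy_task *task =
        container_of(work, struct afs_pagecopy_task, work);

    copy_highpage(task->afspage, task->cachepage);
    SetPageUptodate(task->afspage);
    unlock_page(task->afspage);     /* AFS page now ready for use */

    page_cache_release(task->cachepage);
    kfree(task);
}

/* Stage 1: the new AFS kernel thread, started with something like
 * kthread_run(afs_pagecopy_thread, NULL, "afs_pagecopy"). */
static int afs_pagecopy_thread(void *unused)
{
    while (!kthread_should_stop()) {
        struct afs_pagecopy_task *task, *tmp;
        LIST_HEAD(ready);

        /* Move each page whose read has completed (page unlocked)
         * onto a private list, then dispatch without the lock held. */
        spin_lock(&afs_pending_lock);
        list_for_each_entry_safe(task, tmp, &afs_pending_reads, link) {
            if (!PageLocked(task->cachepage))
                list_move_tail(&task->link, &ready);
        }
        spin_unlock(&afs_pending_lock);

        list_for_each_entry_safe(task, tmp, &ready, link) {
            list_del(&task->link);
            INIT_WORK(&task->work, afs_pagecopy_worker);
            schedule_work(&task->work);
        }

        schedule_timeout_interruptible(1);
    }
    return 0;
}

Note that only the pending list needs its own (spin)lock; nothing here
touches the GLOCK.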

Comments, questions, flames?

S.