[OpenAFS] openafs on Fedora 12?

Simon Wilkinson sxw@inf.ed.ac.uk
Fri, 11 Dec 2009 09:39:57 +0000

> BTW: on decent machines an individual 1 GiB write does not make the  
> user wait: on write the data is first copied into the AFS file's  
> mapping, later into the cache file's mapping (the former step can be  
> avoided by writing into the chunk files directly).  On reads the  
> reader is woken up on every RX packet, ensuring streaming to the  
> user. Here again, the double copy can be avoided.

What happens here depends on the VM model of the machine, and how we  
interact with it. But on Linux, at least, this isn't strictly true.  
Here's how things work in 1.4:

There are two different codepaths: one for writes from the write()  
syscall, and another invoked when a page that is mmap'd gets written  
to. With write(), what we currently do is prepare a page for the  
kernel - the kernel then copies the buffer passed by the user into  
that page, and lets us know when it has completed. We then take the  
data from that page and do a write() of it against the backing store,  
before returning control to the user, who has had to wait whilst all  
this occurs. In the background, the pdflush process then takes care  
of writing this data out to disk.
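The double copy on the write() path can be sketched in userspace C -
this is an analogy only, with entirely hypothetical names, not the
actual cache manager code: the data is copied once into a "page"
standing in for the AFS mapping, and then written out a second time to
the backing cache file before the caller gets control back.

```c
#include <string.h>
#include <unistd.h>

#define PAGE_SZ 4096

/* Userspace analogy of the 1.4 write() path (illustrative names):
 * copy 1 puts the user's data into a page of the AFS mapping, and
 * copy 2 writes that page against the backing store.  The caller
 * waits for both before write() returns. */
ssize_t afs_like_write(int cache_fd, const char *userbuf, size_t len)
{
    char page[PAGE_SZ];                 /* stands in for the AFS page */

    if (len > PAGE_SZ)
        len = PAGE_SZ;
    memcpy(page, userbuf, len);         /* copy 1: user -> AFS mapping */
    return write(cache_fd, page, len);  /* copy 2: page -> backing store */
}
```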

With mmap, things are a little different. pdflush is in charge of our  
writing and, at intervals, will call our writepage() operation on  
pages that the user has dirtied. This all happens completely behind  
the scenes. We then write the dirty AFS page out into the backing  
store (by using that store's write command), and it's scheduled for  
another background flush.
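The mmap path can be seen from userspace too. In this sketch (again an
analogy, not cache manager code), a file is dirtied through a shared
mapping and msync() forces the same writeback that pdflush would
otherwise drive through the filesystem's writepage() at its leisure:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Dirty a page through an mmap'd region, then force writeback.
 * Normally the flush happens behind the scenes when the flusher
 * calls writepage(); MS_SYNC makes it happen synchronously so the
 * effect is observable.  Returns 0 on success, -1 on error. */
int dirty_and_flush(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, 4096) < 0) {
        close(fd);
        return -1;
    }
    char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    strcpy(map, "dirtied via mmap");   /* the page is now dirty */
    int rc = msync(map, 4096, MS_SYNC);/* synchronous writepage path */
    munmap(map, 4096);
    close(fd);
    return rc;
}
```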

In 1.5 this is streamlined a little by only working at the page level,  
which avoids some context swaps, and copies. As I noted in an earlier  
email, we also do more in the background in order to get control back  
to the user more quickly. One further possible optimisation is to  
stop doing the write to the backing cache from the write() syscall.  
write is supposed to do is to copy the data from the user into the  
filesystem's mapping, and mark the page dirty. It should then be up to  
the pdflush process to move this out to the backing store - I intend  
to revisit this at some point, but my previous attempts have resulted  
in a cache manager that is very prone to deadlocks.
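The deferred scheme described above can be sketched as follows - all
names here are hypothetical, and this is userspace C illustrating the
idea, not the (deadlock-prone) kernel implementation: write() only
copies into the mapping and marks the page dirty, and the flusher
moves dirty pages to the backing store later, in the background.

```c
#include <string.h>
#include <unistd.h>

#define PAGE_SZ 4096

/* Illustrative page with a dirty flag, standing in for a page in the
 * filesystem's mapping. */
struct cached_page {
    char data[PAGE_SZ];
    int  dirty;
};

/* Fast path: copy the user's data in, mark the page dirty, return.
 * No backing-store I/O happens here, so the caller never waits. */
void deferred_write(struct cached_page *pg, const char *buf, size_t len)
{
    if (len > PAGE_SZ)
        len = PAGE_SZ;
    memcpy(pg->data, buf, len);
    pg->dirty = 1;
}

/* Later, from the background flusher (pdflush's role): push the page
 * out to the backing store if it is dirty. */
ssize_t flush_if_dirty(int backing_fd, struct cached_page *pg)
{
    if (!pg->dirty)
        return 0;
    pg->dirty = 0;
    return write(backing_fd, pg->data, PAGE_SZ);
}
```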

As you note, our Linux implementation creates two copies of the data -  
one in AFS's mapping, the other in the backing files. However, we  
cannot easily get rid of this duplication - there's no simple  
mechanism of bypassing the VM and 'writing into the chunk files  
directly'. Using direct-IO would be a possibility, but we'd need to  
handle doing this in the background, otherwise the user would end up  
having to wait until chunk files actually made it to the disk, and it  
would limit the range of filesystems we can use as a backing cache.
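Both costs of direct-IO show up even in a userspace sketch (hedged:
illustrative names, not OpenAFS code): O_DIRECT demands block-aligned
buffers and lengths, the write does not return until the data reaches
the disk, and some filesystems simply refuse O_DIRECT at all - the
portability limit mentioned above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write one block to a chunk file with O_DIRECT, bypassing the page
 * cache.  Returns the number of bytes written, 0 if the filesystem
 * does not support O_DIRECT (EINVAL), or -1 on other errors. */
ssize_t direct_write(const char *path, const char *msg)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0)
        return errno == EINVAL ? 0 : -1; /* fs refuses O_DIRECT */

    /* Direct I/O requires block-aligned buffers, offsets and sizes. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) {
        close(fd);
        return -1;
    }
    memset(buf, 0, 4096);
    size_t len = strlen(msg);
    memcpy(buf, msg, len < 4096 ? len : 4096);

    ssize_t n = write(fd, buf, 4096);    /* waits until it hits the disk */
    if (n < 0 && errno == EINVAL)
        n = 0;                           /* O_DIRECT unsupported here */
    free(buf);
    close(fd);
    return n;
}
```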