[OpenAFS] openafs on Fedora 12?
Fri, 11 Dec 2009 09:39:57 +0000
> BTW: on decent machines an individual 1 GiB write does not make the
> user wait: on write the data is first copied into the AFS file's
> mapping, later into the cache file's mapping (the former step can be
> avoided by writing into the chunk files directly). On reads the
> reader is woken up on every RX packet, ensuring streaming to the
> user. Here again, the double copy can be avoided.
What happens here depends on the VM model of the machine, and how we
interact with it. But on Linux, at least, this isn't strictly true.
Here's how things work in 1.4:
There are two different codepaths: one for writes from the write()
syscall, and another invoked when a page that is mmap'd gets written
to. With write(), what we currently do is prepare a page for the
kernel; the kernel then takes care of copying the buffer passed by
the user into that page, and lets us know when it has completed. We
then take the data from that page and do a write() of it against the
backing store. Only then do we return control to the user, who has
had to wait whilst all this occurs. In the background, the pdflush
process then takes care of outputting this data to disk.
With mmap, things are a little different. pdflush is in charge of our
writing and, at intervals, will call our writepage() operation on
pages that the user has dirtied. This all happens completely behind
the scenes. We then write the AFS dirty page out into the backing
store (by using that store's write command), and it's scheduled for
another background flush.
In 1.5 this is streamlined a little by working only at the page
level, which avoids some context switches and copies. As I noted in
an earlier email, we also do more in the background in order to get
control back to the user more quickly. One further optimisation is
that we shouldn't be
doing the write to the backing cache from the write() syscall. All
write is supposed to do is to copy the data from the user into the
filesystem's mapping, and mark the page dirty. It should then be up to
the pdflush process to move this out to the backing store - I intend
to revisit this at some point, but my previous attempts have resulted
in a cache manager that is very prone to deadlocks.
As you note, our Linux implementation creates two copies of the data -
one in AFS's mapping, the other in the backing files. However, we
cannot easily get rid of this duplication - there's no simple
mechanism for bypassing the VM and 'writing into the chunk files
directly'. Using direct-IO would be a possibility, but we'd need to
handle it in the background, otherwise the user would end up
having to wait until chunk files actually made it to the disk, and it
would limit the range of filesystems we can use as a backing cache.