[OpenAFS-devel] Re: Linux readpage handler

Andrew Deason adeason@sinenomine.net
Thu, 26 May 2011 11:46:55 -0500


On Thu, 26 May 2011 14:38:59 +0100
Simon Wilkinson <sxw@inf.ed.ac.uk> wrote:

> When we write_begin on ext3 we start a journal, which isn't completed
> until write_end is called. So, if we page fault whilst we are copying
> between userspace and kernel, we will re-enter the journal, and see
> the assert you see. However, the kernel should prevent this page fault
> from ever occurring, as it can cause deadlocks (the page fault may
> result in memory pressure which causes pages to be flushed, but you're
> already in a filesystem, and you then deadlock). So, write() ensures
> that all user pages required for the copy are in memory before calling
> write_begin, and then actually disables pagefaults during the duration
> of the copy.
> 
> I suspect that the reason why you can't reproduce this on your test
> system, but are seeing it in the wild, is that 2.6.9 has some, but not
> all, of this logic, and so when testing you're seeing pagefaults
> occurring before begin_write (prepare_write, on something that old) is
> called, but on the "real" system, memory pressure is causing a race
> whereby a page that has been swapped in is being swapped out again
> before it can be used.

Okay yeah, that makes sense. Vanilla 2.6.9 basically has:

fault_in_pages_readable(buf, bytes);
page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
status = a_ops->prepare_write(file, page, offset, offset+bytes);
filemap_copy_from_user(page, offset, buf, bytes);

I was looking at a backtrace in my tests, but it's not always easy to
see exactly where I'm being called from, since that just records the
last N functions you've been in, if I understand it correctly. (I trust
the panic traces more, though.)

So (in a hypothetical panic in vanilla 2.6.9) the page fault happens
before prepare_write, but the pages are evicted from memory again
before filemap_copy_from_user is called, presumably. It doesn't seem like
it's possible for this to be our fault, then, unless we somehow screwed
up something during the pre-prepare_write fault?
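
To make that window concrete, here is a rough userspace model of the
sequence above. It is not kernel code: struct user_page, model_write and
the "journal_open" flag are all made up for illustration, and the only
thing it tries to show is that an eviction between the fault-in and the
copy forces a page fault while the journal is held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one user page. 'resident' models whether
 * touching the page would take a page fault. */
struct user_page {
    bool resident;
    char data[16];
};

/* Model of one write. Returns the number of page faults taken while
 * the (modelled) journal is open: 0 in the good case, 1 in the race. */
static int model_write(struct user_page *up, char *dst, size_t n,
                       bool evict_in_window)
{
    int faults_in_journal = 0;
    bool journal_open;

    assert(n <= sizeof up->data);

    up->resident = true;          /* like fault_in_pages_readable() */

    if (evict_in_window)
        up->resident = false;     /* memory pressure wins the race */

    journal_open = true;          /* like prepare_write() on ext3 */

    if (!up->resident) {          /* the copy must fault the page in */
        if (journal_open)
            faults_in_journal++;  /* the re-entry we want to avoid */
        up->resident = true;
    }
    memcpy(dst, up->data, n);     /* like filemap_copy_from_user() */

    journal_open = false;         /* the matching "end of write" step */
    return faults_in_journal;
}
```

With evict_in_window false the model reports no faults inside the
journal; with it true it reports one, which is the case that would trip
the assert.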

It's not immediately clear to me how even modern Linux handles this,
though. Say, for example, a callback break comes in between those calls
and invalidates the pages; would the call to truncate_inode_pages or
whatever block until the write finishes (from some Linux lock), or would
filemap_copy_from_user (or whatever the modern analogue is) return an
error that causes the operation to be retried or something?
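
To illustrate the second possibility (the copy reporting failure and the
caller retrying), here is a minimal userspace sketch of a
retry-on-short-copy loop. This only shows the shape of the pattern; the
names (struct user_buf, model_copy_atomic, model_write_retry) are
hypothetical, and it is not a claim about what any given kernel version
actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user buffer; 'resident' is false when touching it
 * would require a page fault. */
struct user_buf {
    bool resident;
    const char *data;
};

/* Models a copy done with page faults disabled: if the source is not
 * resident, nothing is copied and 0 is returned instead of faulting. */
static size_t model_copy_atomic(char *dst, const struct user_buf *ub,
                                size_t n)
{
    if (!ub->resident)
        return 0;
    memcpy(dst, ub->data, n);
    return n;
}

/* Retry loop. 'evictions' is how many times the page gets evicted in
 * the window before the copy. Returns the number of attempts made. */
static int model_write_retry(char *dst, struct user_buf *ub, size_t n,
                             int evictions)
{
    int attempts = 0;
    size_t copied;

    do {
        attempts++;
        ub->resident = true;       /* fault in, outside the journal */
        if (evictions > 0) {       /* page evicted again in the window */
            evictions--;
            ub->resident = false;
        }
        /* the "begin write" step would go here */
        copied = model_copy_atomic(dst, ub, n);
        /* the "end write" step would go here, told about 'copied' */
    } while (copied == 0);

    return attempts;
}
```

In this model a page fault never happens between the begin and end
steps; an eviction in the window just costs one more trip around the
loop.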

-- 
Andrew Deason
adeason@sinenomine.net