[OpenAFS-devel] Linux readpage handler

Thu, 26 May 2011 14:38:59 +0100

On 25 May 2011, at 22:52, Andrew Deason wrote:

Firstly, I haven't looked specifically at the versions you are running -
your Linux kernel is sufficiently ancient that it isn't in the kernel
git repo, and I don't have my linux-prehistory tree on my laptop. So
what follows is how things work in recent kernels. There have been
significant changes here since 2.6.9.

> So, my question here is what is supposed to happen? Is
> current->journal_info supposed to have the journal transaction of the
> current process (in which case I assume the readpage handler is not
> allowed to start write transactions, but I can't find this warned
> against anywhere), or is something supposed to reset the current task's
> journal_info or otherwise somehow guard against this?

The way that jbd is currently implemented, a thread cannot have two
journals open at the same time - you can't call journal_start() on a
different fs when you already have a journal started. If you are on the
_same_ fs, then you can get away with this, as you just get a reference
to the current handle, rather than an error.
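
As a rough illustration of that rule, here is a toy userspace model. It
is not the jbd code: every toy_* name is made up for this sketch, the
single global stands in for the per-task current->journal_info, and
where the real code would assert, the sketch returns NULL so the case
can be checked.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical, not the kernel's jbd API. */
struct toy_journal {
    const char *name;
};

struct toy_handle {
    struct toy_journal *journal; /* journal this handle belongs to */
    int refs;                    /* nesting depth on that journal */
};

/* Stands in for current->journal_info. It is per task in the kernel;
 * a plain global is enough for this single-threaded sketch. */
static struct toy_handle *toy_current_handle;

/* Start (or re-enter) a journal. Returns NULL in the case where the
 * thread already has a handle open on a different journal. */
static struct toy_handle *toy_journal_start(struct toy_journal *j,
                                            struct toy_handle *storage)
{
    if (toy_current_handle != NULL) {
        if (toy_current_handle->journal != j)
            return NULL;            /* different fs: not allowed */
        toy_current_handle->refs++; /* same fs: just another reference */
        return toy_current_handle;
    }
    storage->journal = j;
    storage->refs = 1;
    toy_current_handle = storage;
    return storage;
}

/* Drop one reference; the slot is cleared when the last one goes. */
static void toy_journal_stop(struct toy_handle *h)
{
    assert(h == toy_current_handle && h->refs > 0);
    if (--h->refs == 0)
        toy_current_handle = NULL;
}
```

Starting the same toy journal twice hands back the same handle with
refs == 2; starting a second journal while the first is open gives NULL.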

When we write_begin on ext3 we start a journal, which isn't completed
until write_end is called. So, if we page fault whilst we are copying
between userspace and kernel, we will re-enter the journal, and hit the
assert you are seeing. However, the kernel should prevent this page
fault from ever occurring, as it can cause deadlocks (the page fault may
result in memory pressure which causes pages to be flushed, but you're
already in a filesystem, and you then deadlock). So, write() ensures
that all user pages required for the copy are in memory before calling
write_begin, and then actually disables pagefaults for the duration of
the copy.
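
The ordering can be sketched with another toy userspace model, loosely
following the shape of the loop in generic_perform_write() in recent
kernels. None of this is kernel code: the toy_* names, the resident
flag and the evictions_left knob (which throws the page out again right
after the prefault, to imitate memory pressure) are all assumptions
made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: every name here is hypothetical, not a kernel API. */
struct toy_user_buf {
    const char *data;
    size_t len;
    bool resident;      /* false means touching it would page fault */
    int evictions_left; /* times to evict the page after a prefault */
};

static bool toy_journal_open;     /* true between write_begin and write_end */
static int toy_faults_in_journal; /* faults taken while the journal was open */

/* An ordinary fault: may do I/O, so it must not happen inside the journal. */
static void toy_fault_in(struct toy_user_buf *u)
{
    if (toy_journal_open)
        toy_faults_in_journal++;
    u->resident = true;
}

/* Copy with page faults disabled: comes up short instead of faulting.
 * The caller must make sure dst has room for u->len bytes. */
static size_t toy_copy_atomic(char *dst, const struct toy_user_buf *u)
{
    if (!u->resident)
        return 0;
    memcpy(dst, u->data, u->len);
    return u->len;
}

static size_t toy_write(char *page, struct toy_user_buf *u)
{
    size_t copied;

    do {
        toy_fault_in(u);                   /* 1. prefault, journal not open */
        if (u->evictions_left > 0) {       /* simulated memory pressure */
            u->evictions_left--;
            u->resident = false;
        }
        toy_journal_open = true;           /* 2. write_begin starts journal */
        copied = toy_copy_atomic(page, u); /* 3. copy with faults disabled */
        toy_journal_open = false;          /* 4. write_end closes journal */
    } while (copied == 0);                 /* short copy: go round again */

    return copied;
}
```

Even when the page is evicted once between the prefault and the copy,
the copy just comes up short and the loop retries from the prefault, so
no fault is ever taken while the toy journal is open.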

I suspect that the reason why you can't reproduce this on your test
system, but are seeing it in the wild, is that 2.6.9 has some, but not
all, of this logic, and so when testing you're seeing pagefaults
occurring before write_begin (prepare_write, on something that old) is
called, but on the "real" system, memory pressure is causing a race
whereby a page that has been swapped in is being swapped out again
before it can be used.

Hope that's of some use!

Cheers,

Simon.