[OpenAFS-devel] Re: [OpenAFS] namei interface lockf buggy on Solaris (and probably HP-UX and AIX)

Tom Keiser tkeiser@gmail.com
Thu, 14 Sep 2006 17:49:05 -0400


On 9/14/06, Rainer Toebbicke <rtb@pclella.cern.ch> wrote:
> Ken Hornstein wrote:
>
> >
> > Whew, you ain't kidding.  When I looked at that, I believe a lot of it
> > was the link table.  I have been idly thinking about simply removing
> > most of those fsync() calls, or collapsing a whole bunch of them ... it
> > would probably speed up operations like volume clones a whole lot.  A
> > few thought experiments made me think that perhaps the consequences of
> an incorrect link count aren't so catastrophic that the salvager
> couldn't easily recover from it ... but AFS has fooled me before, so
> > I'm not convinced of that yet :-)
> >
>
> We have been running for several years with syncing gradually
> reduced; by now all syncs are batched and done as a precaution in a
> separate thread, every now and then.
>
> All vos operations speed up by a couple of hundred times on big
> volumes; life would be impossible without that (we have volumes of
> 1 million files). Furthermore, we had a performance requirement for
> creating a given number of directories, each with one file, per
> second, which was not reachable with all those syncs.
>
>  From what I understood of the namei code, the link table sync does
> not eliminate the need for a salvager in case of a crash; there is
> always a window. And even with a sync, what about a power cut with
> disk caches, RAID systems, and the like?
>

Exactly!  Doing the fsync as soon as possible is simply a hack to
reduce the likelihood of a failure mode we cannot eliminate.  There is
no notion of atomic transactions in namei; link count updates and
other metadata updates are effectively unrelated.  So, lacking a
metadata journal to help us recover, we just punt and try to make this
possible source of inconsistency as unlikely as possible.
Essentially, the fsync barrier just reduces the number of vnodes that
are simultaneously in an inconsistent state from N to 1.
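
To make that concrete, here is a minimal sketch of the pattern (purely
hypothetical code, not the actual namei implementation): each link
count write is immediately followed by an fsync, so only the entry
being written is ever in flight.

/* Hypothetical sketch of the per-vnode "fsync barrier" pattern
 * described above -- illustrative only, not the namei code. */
#include <unistd.h>
#include <sys/types.h>

static int
set_link_count(int lt_fd, off_t offset, unsigned short count)
{
    /* Update a single link table entry... */
    if (pwrite(lt_fd, &count, sizeof(count), offset)
        != (ssize_t) sizeof(count))
        return -1;

    /* ...and immediately force it to media before the next vnode is
     * touched, so at most this one entry is "in flight". */
    if (fsync(lt_fd) < 0)
        return -1;

    return 0;
}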

When doing bulk link count updates on a volume group (e.g. a clone),
this methodology seems rather suspect.  Clearly, the fsync does not
eliminate the possibility of metadata inconsistencies; it only reduces
the number of vnodes that are likely to be inconsistent.  Once you
peer down through enough levels of the storage subsystem, your atomic
unit of data update becomes rather large (e.g. block size, raid5
stripe width, disk cache line size, etc.).  So what, exactly, is the
point of doing one vnode at a time?  Every time we update a single
link count we put a much larger quantum of data at risk (unless we're
on one of the small handful of filesystems that support data
journalling and sufficiently complex checksumming schemes to deal
with such phenomena as writes to the wrong location).

Since we cannot eliminate the non-atomic nature of our metadata
updates, and we can't easily tack on a metadata journal, I think there
is an argument worth considering that doing batch updates during
volume operations actually reduces our risk of data corruption (by
reducing the total number of physical writes to media).
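
For comparison, here is a rough sketch of the batched alternative
(hypothetical names such as linktable_batch are made up for
illustration): link count writes accumulate, and a single fsync is
issued when the volume operation completes, so a bulk operation costs
one barrier instead of one per vnode.

/* Hypothetical sketch of batching link count updates during a bulk
 * volume operation (e.g. a clone) and syncing once at the end.
 * Illustrative only -- not the OpenAFS code. */
#include <unistd.h>
#include <sys/types.h>

struct linktable_batch {
    int fd;       /* link table file descriptor for this volume group */
    int dirty;    /* nonzero once at least one entry has been written */
};

static int
batch_set_link_count(struct linktable_batch *b, off_t offset,
                     unsigned short count)
{
    if (pwrite(b->fd, &count, sizeof(count), offset)
        != (ssize_t) sizeof(count))
        return -1;
    b->dirty = 1;         /* a sync is now owed */
    return 0;             /* no fsync here */
}

static int
batch_commit(struct linktable_batch *b)
{
    /* One physical barrier for the whole operation. */
    if (b->dirty && fsync(b->fd) < 0)
        return -1;
    b->dirty = 0;
    return 0;
}

A crash between batch_set_link_count and batch_commit still leaves
inconsistent entries, but no more fundamentally than the per-vnode
case does; the salvager remains the backstop either way.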

Regards,

-Tom