[OpenAFS-devel] to fsync() or not to fsync()

Thu, 14 Sep 2006 13:35:37 -0400

On 9/14/06, Robert Banz <banz@umbc.edu> wrote:
>
> Unless you're somehow just "making the bits go faster", performance
> increases typically go hand in hand with some sort of risk that your
> transactions *might* not make it to disk in a "power off" situation*
>

Well, a properly designed system incorporating hierarchical
checksumming, metadata journalling, and the ability to handle hundreds
of in-flight transactions without needing hundreds of threads has the
potential to give you both the performance and the safety.  ZFS is a
good example of this technique.

Also, not to be too pedantic, but no matter how well-designed your
system is, there will always be transactions which don't make it to
disk.  The key is making sure that no part of the distributed system
is fooled into thinking a transaction completed, unless it's possible
for the downed node to recover the transaction and commit it later.
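
To make that rule concrete, here is a minimal sketch (mine, not
anything from the fileserver source) of "don't acknowledge until it's
recoverable".  The function name durable_append and the single
append-only log file are assumptions for illustration only; the point
is just that success is reported to the caller strictly after
fsync() has returned.

```python
import os


def durable_append(path, record):
    """Append `record` (bytes) to `path`; return True only after fsync().

    Hypothetical helper for illustration. If the process or machine dies
    before fsync() returns, the caller never sees True, so nobody is
    fooled into thinking the transaction completed.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(record)
        while view:
            # os.write() may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Data may still be sitting in the page cache here. Ask the OS to
        # push it to stable storage before we report success.
        os.fsync(fd)
    finally:
        os.close(fd)
    return True
```

A caller would send its "transaction committed" reply only after
durable_append() returns True.  How much fsync() really guarantees
still depends on the filesystem and on the disk's write cache.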

> * disk gets unplugged, machine panics, blahblah
>
> ...which is a "risk" almost any filesystem or application takes into
> consideration, and allows the filesystem user to determine when it's
> "really necessary" to wait to go forward until data is committed to
> firm storage, or not.  Good or bad, the fileserver is assuming that's
> what you want to do all of the time in the CopyOnWrite and
> StoreData_RXStyle (not to mention the volume structure management
> code in namei_ops, etc.).  I guess it's that since we don't have a
> "channel" to forward along real fsync() messages that we assume that
> it's what you want to do all the time, or at the time the code was
> written it was assumed horrible things were going to happen all of
> the time... cleaning lady unplugs the direct attached SCSI disk,
> cosmic ray causes a kernel panic, fsck can't reconstruct the
> filesystem to save its life...  so making sure every transaction was
> written to disk was probably a good idea.  Nowadays with the cleaning
> lady banned from the datacenter unless escorted, multipathing fibre
> links to disk storage, filesystems that go beyond even metadata
> logging to preserve structure (like zfs), the cost/benefit of

Speaking of ZFS, fsync on ZFS is a serious performance issue.  With
ZFS, an fsync results in an update to the entire fs checksum tree up
to, and including, the root checksum.  Needless to say, this is a very
high-latency operation with potential for quite a few disk seeks.
Generally, ZFS tries to do hierarchical checksum updates on the order
of every ten seconds.  Clone operations are just plain ugly on ZFS.
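
The usual way to soften a per-operation sync cost is to batch: let
several writes share one fsync().  The sketch below is only an
illustration of that general idea at the application level, not a
description of how ZFS schedules its checksum updates.  The class name
GroupCommitLog and the batch_size parameter are made up for the
example.

```python
import os


class GroupCommitLog:
    """Append-only log that issues one fsync() per batch of records.

    Hypothetical sketch: records written since the last flush() are NOT
    yet durable, so a caller must not acknowledge them until flush()
    has returned.
    """

    def __init__(self, path, batch_size=8):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.batch_size = batch_size
        self.pending = 0      # records written but not yet synced
        self.sync_count = 0   # number of fsync() calls issued so far

    def append(self, record):
        view = memoryview(record)
        while view:
            # Handle short writes by looping until the record is written.
            written = os.write(self.fd, view)
            view = view[written:]
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self):
        # One fsync() covers every record appended since the last flush.
        if self.pending:
            os.fsync(self.fd)
            self.sync_count += 1
            self.pending = 0

    def close(self):
        self.flush()
        os.close(self.fd)
```

The trade-off is exactly the one discussed above: fewer syncs means
less latency per write, but a wider window of records that would be
lost in a crash, so acknowledgements have to wait for the flush.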

-Tom