[OpenAFS] Re: [OpenAFS-devel] 1.6 and post-1.6 OpenAFS branch management and schedule

Fri, 18 Jun 2010 16:40:49 -0400

Rainer, I don't mean to pick on you, but I see this probabilistic
argument all too frequently; I think it requires a response.  I'll
readily concede that over the short term, while we are triaging bugs,
a probabilistic argument, such as the one you make below, is
perfectly reasonable.  After all, on short time scales we can't solve
all of the world's problems--our primary goal is to mitigate risk as
best we can.

However, I think the purpose of this thread is strategic: to determine
what should be in the code several releases from now.  In my opinion,
over these longer time scales, probabilistic arguments with regard to
risk are not acceptable, as we have sufficient time to solve these
problems in a manner that is both deterministic and provably correct.

The fact that you haven't encountered any serious problems (regardless
of the time span or number of machines) while running fast-restart or
bitmap-later in your environment does absolutely nothing to disprove
the existence of truly insidious failure modes.  Furthermore, without
any good means of comparing state space coverage, we can't even begin
to infer a probability of failure at other sites from that experience.

Further comments inline...

On Fri, Jun 18, 2010 at 5:47 AM, Rainer Toebbicke <rtb@pclella.cern.ch> wrote:
> Jeffrey Hutzelman schrieb:
>
>>
>> Really, I consider enable-fast-restart to be extremely dangerous.
>> It should have gone away long ago.
>>
>> I realize some people believe that speed is more important than not losing
>> data, but I don't agree, and I don't think it's an appropriate position for
>> a filesystem to take.  Not losing your data is pretty much the defining
>> difference between filesystems you can lose and filesystems from which you
>> should run away screaming as fast as you can.  I do not want people to run
>> away screaming from OpenAFS, at any speed.
>>
>
> I beg to disagree: the Volume/Vnode back-end has by no means the same
> problems that a file system might have. Damages there will never wildly
> destroy random items on disk, as you would have to be afraid using in a file
> system. At least in namei, damages in a volume are entirely contained

All this implies is that each volume group is, in effect, its own little
failure domain.  Each of those failure domains is individually capable
of being inconsistent following the crash.  Moreover, each is
individually capable of further corrupting itself, should it come
online without an internal consistency check.  I suppose one
conclusion you could draw from this is that, due to failure isolation,
infrequently modified volume groups are less likely to become further
corrupted than frequently modified volume groups...

> therein, files themselves are at the worst entirely replaced by others,
> they're never corrupted partly other than being half-written or such. Of

I beg to differ on this point: the fact that multiple vnodes may end
up pointing to the same namei backing store in no way implies whole-file
replacement.  Partial corruption is an absolutely plausible
failure mode.  You can easily end up in this situation:

* a couple of chunks get flushed over top of what used to be (and
still is) another vnode's backing store
* we create/delete a directory entry over top of what used to be
another vnode's backing store
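
To make the first case concrete, here is a toy Python model.  It is not
OpenAFS code: BackingStore, Vnode, and the 4-byte CHUNK size are all
invented for illustration.  It only shows that when two vnodes point at
one backing store, flushing a single chunk through one of them leaves
the other partially overwritten rather than wholly replaced.

```python
# Toy model only: none of these names come from OpenAFS.
CHUNK = 4  # hypothetical chunk size in bytes


class BackingStore:
    """Stands in for a single namei backing file."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)


class Vnode:
    """A vnode that reads and writes through whatever store it points at."""

    def __init__(self, store: BackingStore):
        self.store = store

    def write_chunk(self, index: int, payload: bytes) -> None:
        # Overwrite exactly one chunk-sized region of the backing store.
        start = index * CHUNK
        self.store.data[start:start + len(payload)] = payload

    def read(self) -> bytes:
        return bytes(self.store.data)


shared = BackingStore(b"AAAABBBBCCCC")
old_vnode = Vnode(shared)  # still believes the store is its own
new_vnode = Vnode(shared)  # wrongly points at the same store after the crash

new_vnode.write_chunk(1, b"XXXX")  # only one chunk gets flushed
result = old_vnode.read()
print(result)  # the first and last chunks of the old file survive
```

The old vnode's first and last chunks are untouched while the middle
one is gone, which is the partial-corruption case described above.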

> course files on disk can become unfindable or directories can have bogus
> entries.
>

In general, you're making a probabilistic, and thus trivially
disprovable, argument.  While I'll readily concede that this sort of
probabilistic argument is the foundation of risk analysis for most
sites, I do not agree that OpenAFS, as a code vendor, should be in the
business of supplying code whose correctness guarantees are wholly
non-deterministic and probabilistic in nature.  When you re-attach a
volume following a crash without performing internal consistency
checks, you're introducing non-determinism--because we don't have any
form of journal--into the distributed system.

From the point of view of the administrator, I'll grant that a volume
may just be a collection of vnodes.  However, from the user's point of
view, there is a complex semantic relationship between those files
(and possibly to other wholly-unrelated entities within a distributed
system).  Breaking that structure in any way introduces
non-determinism whereby rolling back to a sync point becomes Hard once
you allow production operations to proceed from that point of
inconsistency.

Even if you're disciplined enough to schedule salvages for down
periods, within a reasonable time-frame following the crash (thus
attempting to mitigate Jeff's argument with regard to metadata
corruption going unnoticed for long enough that backups expire),
you've still introduced non-determinism into the distributed system,
thus permitting corrupting ripple-effects.  Reverting once trouble is
uncovered is an extremely painful process.  Reconstructing exactly
what should be restored in order to make the distributed system
consistent again, as I discussed above, requires deep understanding of
the applications involved and the semantic relationships between them.
Let's face it--most people just punt on this problem.

While completely eliminating the non-determinism introduced by the
crash is out of scope for this discussion, we can strive to minimize
it by checking internal consistency before attempting to service any
RPCs...
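
As a rough sketch of what "check before serving" means, the toy Python
below gates RPC handling on a consistency check at attach time.  Every
name here (Volume, is_consistent, attach_after_crash, handle_rpc) is
hypothetical, and the check itself (no dangling directory entries) is a
deliberately tiny stand-in for what a real salvage pass would verify.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    """Hypothetical volume: a directory plus the vnodes that really exist."""
    name: str
    dir_entries: dict  # file name -> vnode number
    vnodes: set        # vnode numbers actually present on disk
    online: bool = False


def is_consistent(vol: Volume) -> bool:
    """Toy check: every directory entry must reference an existing vnode."""
    return all(v in vol.vnodes for v in vol.dir_entries.values())


def attach_after_crash(vol: Volume) -> bool:
    """Bring the volume online only if the consistency check passes."""
    vol.online = is_consistent(vol)
    return vol.online


def handle_rpc(vol: Volume, filename: str) -> int:
    """Refuse to serve requests against a volume that is not online."""
    if not vol.online:
        raise RuntimeError(f"{vol.name} is offline pending a consistency check")
    return vol.dir_entries[filename]


good = Volume("vol.good", {"a": 1}, {1})
bad = Volume("vol.bad", {"a": 1, "b": 7}, {1})  # entry "b" dangles

statuses = [attach_after_crash(good), attach_after_crash(bad)]
print(statuses)
print(handle_rpc(good, "a"))
```

A fast-restart style attach would amount to setting online to True
without calling is_consistent, so requests against vol.bad would be
served from a state that is already known to be broken.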

[snip]
>
> For us, the delta does not justify keeping the service down for several
> hours after a crash. Make that delta proportionally bigger by fixing the
> other issues and I revise my statement.
>

Ok.  That's a perfectly fair rationale.  What I still don't understand
is why people think _OpenAFS_ should strive to ship (and thus
implicitly endorse) code that introduces such non-determinism
(especially given that, as Andrew pointed out, under DAFS enabling
fast-restart semantics will quite literally involve a 1-line
out-of-tree, unsupported change)...

Regards,

-Tom