[OpenAFS-devel] Re: Breaking callbacks on unlink

Russ Allbery rra@stanford.edu
Wed, 25 Jan 2012 17:40:36 -0800


Andrew Deason <adeason@sinenomine.net> writes:
> Russ Allbery <rra@stanford.edu> wrote:

>> For example, look at the idledead problems that have delayed 1.6.1 and
>> that have caused serious production outages for some sites, such as
>> mine.

> You mean the idledead code that existed since before 1.4.11, which afaik
> was running fine for you?

No, I think there's a longer history, and I think the situation is more
complex than that.

1.4.11 AFS file servers and 1.4.11 clients were working for us most of the
time against a cell that had twice as many file servers and half as much
storage.  And by "most of the time" I mean that they only melted down every
three months or so, which was still frequently enough that we automated a
way of shooting the file server in the head, because it would reach a state
from which it would never recover and never resume responding to clients.
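
(For the curious, the watchdog amounted to something like the sketch
below.  This is illustrative Python, not the actual script we ran; the
hostname, probe command, and timeout are all placeholders.)

    #!/usr/bin/env python
    """Illustrative watchdog: bounce the fileserver when it stops answering.

    A sketch only -- hostname, probe, and timeout are placeholders.
    """
    import subprocess
    import sys

    SERVER = "afs1.example.com"   # placeholder fileserver hostname
    PROBE_TIMEOUT = 60            # seconds of silence before we call it wedged

    def fileserver_responds():
        """Poke the fileserver's Rx port; a wedged server never answers."""
        try:
            subprocess.run(
                ["rxdebug", SERVER, "7000", "-version"],
                timeout=PROBE_TIMEOUT, check=True,
                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            return True
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            return False

    if __name__ == "__main__":
        if not fileserver_responds():
            # Shooting it in the head: bos restart bounces the fs instance.
            subprocess.run(["bos", "restart", SERVER, "fs", "-localauth"])
            sys.exit(1)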

The situation got drastically worse once we reduced the number of servers
by half in a storage consolidation (at the same time as we upgraded the
servers to Debian 6.0 and, IIRC, 1.4.12, but I'm increasingly suspecting
that the version change was a red herring).  However, that's just a
variation on a long-term pattern; we've been having ongoing issues like
this in one form or another, for one reason or another, for at least five
years.  You may remember (I know Derrick does) the discussion about how to
keep one client from tying up all the file server threads by implementing
per-host thread quotas.  That was yet another round of attempts to fix
many of the same basic symptoms.

It used to be that the basic symptom was that the file server melted down,
stopped responding to clients, and had to be forcibly restarted.
Recently, this has shifted to the client melting down, with every process
attempting to access AFS going into uninterruptible disk wait for up to
half an hour until the client recovers.  And occasionally the server
still melts down too.
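
(On Linux the client-side symptom is easy to see directly: each stuck
process shows up in state "D", uninterruptible disk wait.  A quick sketch
of how you might list them, illustrative only:)

    #!/usr/bin/env python
    """List processes in uninterruptible disk wait (state D) -- the
    symptom described above.  Linux-specific sketch reading /proc."""
    import os

    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/stat" % pid) as f:
                data = f.read()
        except IOError:
            continue  # process exited while we were scanning
        # comm is parenthesized and may contain spaces; parse around it.
        comm = data[data.index("(") + 1:data.rindex(")")]
        state = data[data.rindex(")") + 2]
        if state == "D":
            print("%s %s" % (pid, comm))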

For the past two years, those problems have been severe enough to cause
serious production issues with our AFS clients, which in turn has forced
us to migrate data that we were previously able to serve out of AFS onto
local disk to get acceptable performance and uptime guarantees.

A version of 1.4.14 patched to address the other client deadlock issues
present in 1.4 gets us back to only occasional meltdowns.  We currently
tolerate those by putting all our web servers behind a hardware load
balancer, so that each time AFS deadlocks on a web server we can drop it
from the production pool and wait out the 10-15 minutes it takes to
recover.
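
(The load-balancer check itself is nothing fancy; conceptually it's just
a stat of an AFS path with a timeout, along the lines of the sketch
below.  The path and timeout are placeholders, and the stat has to run
in a child process because a hung AFS access sits in uninterruptible
disk wait and can't simply be abandoned by a thread.)

    #!/usr/bin/env python
    """Illustrative load-balancer health check for an AFS-backed web
    server.  Sketch only: the path and timeout are placeholders."""
    import multiprocessing
    import os
    import sys

    AFS_PATH = "/afs/example.com/www/healthcheck"  # placeholder test path
    TIMEOUT = 5  # seconds: far below the 10-15 minutes a deadlock lasts

    def probe(path):
        os.stat(path)  # blocks in disk wait if the client is deadlocked

    if __name__ == "__main__":
        child = multiprocessing.Process(target=probe, args=(AFS_PATH,))
        child.start()
        child.join(TIMEOUT)
        if child.is_alive():
            # terminate() can't kill a process stuck in disk wait, but we
            # have our answer: report unhealthy so the node gets dropped.
            child.terminate()
            sys.exit(1)
        sys.exit(child.exitcode or 0)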

Our upper management at this point believes AFS is an unstable and risky
technology that cannot provide acceptable reliability.  I have a lot
invested in this community, but there's a limit to how long I'm willing to
stake my personal credibility on staying with AFS.

Now, this is obviously not all the fault of idledead.  Part of the
problem, indeed, is exactly that there are so many interactions and so
many contributing factors that no one knows *what* the problem is, and
each time we go around on it, we end up with an inconclusive result and a
new acceptable normal, with workarounds, that's a little worse than the
last one.  idledead is just the latest in a long series of problems with
file server thread allocation, locking, deadlocks between the client and
server, deadlocks within the client alone, and so forth that we've been
struggling with for years.  But idledead appears at the moment to be a
contributing factor.

> With this particular issue, again, there are two irreconcilable desired
> behaviors:

>  - when accessing a legacy/misbehaving fileserver, yield an error after
>    N seconds of no progress
>  
>  - when accessing a legacy/misbehaving fileserver, hang forever in the
>    face of no progress

> I believe/assume what is being considered "right" is the latter option.
> But to tell me that it is the universal "right" option is arrogant and
> inconsiderate of the differences in different site circumstances.
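
Just to make those two options concrete: the difference comes down to
whether the receive path enforces an idle timeout.  Roughly (illustrative
Python, not the actual Rx or cache manager code):

    import socket

    IDLE_DEAD_SECONDS = 60  # placeholder; None means "hang forever"

    def read_reply(sock, idle_timeout=IDLE_DEAD_SECONDS):
        """Wait for reply data from the fileserver.

        With idle_timeout=None this is the second behavior: block
        indefinitely against a misbehaving server.  With a number, it
        is the first: give up after N seconds of no progress and
        return an error to the application.
        """
        sock.settimeout(idle_timeout)  # None disables the timeout
        try:
            return sock.recv(4096)
        except socket.timeout:
            raise IOError("idle dead: no progress from fileserver")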

You know, I could dive further into this, but actually that would miss
the core of why I have the reaction I do.  Let me take this up a few
levels instead.

The problem I have is that AFS is not working, in ways that are causing
some pretty serious problems.  We've been going around and around, for
way, way too long at this point, on a set of performance problems that
all look very similar to the user.

The software needs to work first.  If it's working, I have a pretty high
tolerance for options and configurable behavior (hell, look at pam-krb5).
When it doesn't work, that tolerance drops considerably, particularly when
diagnosing why it doesn't work means iterating through various options
that turn out to be annoyingly unrelated to the actual failure of the
file system.  And, to take that a step further, I'm
getting pretty sick and tired of having to assemble a fragile and
ever-changing set of random flags, options, and tuning parameters to
achieve the basic goal of having working software.

I need to be able to install AFS and have it work.  If I can't do that, I
will install something other than AFS.  It really is that simple.

This is why I have a bit of a knee-jerk reaction to a proposal that adds
yet another optional feature and more complexity to the protocol
interactions and to the core fileserver and client interaction.

I have not spent anywhere near as much time inside the code as you have,
and I *do* agree that my theory that idledead is a significant
contributing factor is just that, a theory, and I could be wrong.  Maybe
the current round of performance problems stems from some other cause.
But there are some instincts that I've learned to trust around software,
and one of them is that the simpler code is, the more robust it is.
Another is that options are hard to test, and a third is that complexity
kills reliability.  When I'm concerned about reliability and I hear
proposals to increase complexity, my reaction is pretty hostile, because
I'm already frustrated.

OpenAFS, as presently constructed, through only some fault of the OpenAFS
development community (most of the problem was pre-existing, and much of
it dated from the pthreads integration inside Transarc/IBM), fails most of
the markers that I would look for in evaluating code for robustness.  I
believe it's getting better, but it's a *long* way from good.  I realize
that you're hearing from other people who are advocating for flexibility
and configurability.  Think of me as a passionate advocate for robustness,
because that's *my* squeaky wheel right now.

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>