[OpenAFS-devel] Re: Breaking callbacks on unlink

Russ Allbery rra@stanford.edu
Thu, 26 Jan 2012 10:51:14 -0800


Andrew Deason <adeason@sinenomine.net> writes:

> If you want my opinion on what the _reason_ is, it's just that your high
> rate of pag generation and high rate of writes is more than the
> fileserver can handle, which is why I only ever see this stuff come from
> you (at least, to this degree).

But, of course, it's not only me.  There are at least three sites that I
know of that are seriously impacted by these sorts of reliability issues
under load, and it's worth remembering that we only hear from a small
percentage of sites.

> or, something in that area. I've mentioned a few times a few different
> changes that I think can alleviate some of this, but... I never heard
> anything back about them, so it didn't seem like you were interested or
> that it wasn't that important.

That's an unfortunate interpretation of delays in implementing
configuration changes, and I'm sorry you got that impression.  A better
conclusion to draw is that it takes quite a bit of time to implement file
server configuration changes in a large environment with a zero scheduled
downtime requirement.

We're currently still in the process of implementing the last round of
suggestions.  Each time you give us something new to try, it takes us
several weeks to implement it, and we can't tell whether the problem has
improved until after we do that and then observe behavior for several more
weeks.  That's one of the problems with trying to resolve these sorts of
site-wide reliability issues.

Part of the problem here is that I'm not really supposed to be spending my
time on trying to shepherd these problems through to resolution, because
I'm not supposed to be primary on our production AFS cell.  Stanford wants
me to be doing other things and delegating that to other people, but they
may not know what they need to be communicating to ensure that you
understand what the situation is on our end.

> Yes, I can understand that, and I can understand why that is what you're
> advocating. But when I see you talk on this list, I see you as a
> gatekeeper, and so when I see objection to runtime options, I see that
> as something that will become "openafs.org policy" or something if I
> don't object.

And indeed you're not wrong about the crossover between these opinions and
the obligation I feel as a gatekeeper to advocate for design principles
that I believe will make OpenAFS better.  I believe that reliability and
robustness are more important than configurable flexibility, and that
ensuring reliability is the top (but not exclusive) priority for OpenAFS.
I think this is a key property of a file system; nothing else you do
matters if it's not reliable.

As long as I'm a gatekeeper, I feel like part of that job is to advocate
for a direction and a set of guiding principles that continues the general
improvement in the overall robustness of the OpenAFS code and prioritizes
that appropriately against other types of changes.  (idledead is, of
course, particularly challenging since it's a real problem in all
directions, and was originally added to address a *different* robustness
problem.)

However, because AFS is seen as less and less strategic at Stanford, in
part because of ongoing reliability issues but more because the usage
patterns of file systems have changed and OpenAFS is not currently keeping
up, the amount of time that I have available to be a gatekeeper has
diminished considerably.  If my remaining contribution in terms of trying
to advocate for the sort of project I think OpenAFS should be is out of
step with the community, then I can resign.

> If you want to try to say that we shouldn't add anything more until
> these problems are solved, then... well, I don't think you're trying to
> advocate that everyone should stop what they're doing just to help you
> :) but that at least puts a kind of limit on things.

There is, of course, an inherent conflict of interest in any gatekeeper
position in that one is not going to care enough about AFS to become a
gatekeeper unless one is actually using it, and the problems that one is
personally running into are going to, shall we say, come readily to mind.
Part of my job as gatekeeper (and, more to the point, elder) is, somewhat
inherently, to advocate for Stanford's issues and concerns and hope that
those issues and concerns are at least somewhat representative of a class
of users of OpenAFS.

I don't believe that the situation I'm describing is unique to Stanford,
and I have done a reality check with others and think they would have told
me if it was.  But, again, if I'm out of step with the priorities of the
community, I can certainly find something else to do with the small amount
of time I'm currently able to spend on OpenAFS.

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>