[OpenAFS-devel] warnings fix

Jeffrey Altman jaltman@secure-endpoints.com
Fri, 17 Jul 2009 09:10:38 -0400


Felix Frank wrote:
> Granted. Speaking only for myself, I would have liked to see a model
> that left the stable branch as just that, but allowed for additional
> branches off stable (such as, say, rxk5, rxosd etc.) that would have
> allowed
> a) interested sites to perform testing if they choose to and
> b) the community to observe and guide the evolution of such development
> endeavors

One of the primary reasons we started the work to switch to "git" in Feb
2008 was to permit sites to do so.  With git you can publish your own
tree and make it available to others to review.  We simply could not do
this with cvs.  The security model simply did not permit it.  We had to
know we could trust developers before we could give them commit
access.

That being said, there has been an rxk5 development branch in the
cvs repository for many years.  There was also an rxtcp development
branch.  A disconnected development branch.  Etc.

However, they cannot be branches off of stable.  We have said this many
times.  "stable" is for bug fixes and new platform support only.  No new
development work is going to end up on the "stable" branch.  All that
putting this work on the "maintenance" branch does is provide further
incentive for sites to avoid using the "development" or "features" branch.

The point that the gatekeepers have made over and over again is that
the maintenance branch is expected by sites that use it to be stable.
It should not change from release to release except for fixing known
bugs or supporting new kernel versions.  The vast majority of sites
do not want to be handed new "feature" code when they install a security
update.

> That being said, there has never been any indication on your part that
> such an approach might be viable, so I guess it's the contributors'
> fault to let things drag along for too long, after all.

With all due respect I believe that you have an incomplete view of
the situation.

> What I would like to reinforce, however, is that the large gap between
> stable and devel and devel's "testing only for all platforms except
> Windows" status has raised a certain barrier between bigger-scale
> contrib-projects and upstream development. I don't believe the blame for
> that can be put solely on the contributing parties.

The large gap between the "maintenance" and "features" releases comes
from the fact that (except on Windows which doesn't have a choice) there
simply are no organizations that are willing to run their storage
systems on the "features" release.  Without that there is not sufficient
testing in real world conditions.

When a new pile of code that was developed in-house gets delivered to
OpenAFS, the gatekeepers are forced to make a hard choice: do we throw
the code into a new branch where it will rot, or put it onto the
"features" branch and force sites that are tracking that branch to use
it?  (Assuming of course there are any.)  Because we do not want code
to rot, the code often ends up on the "features" branch.  Pthreaded ubik
is on the features branch even though it doesn't work correctly, and by
"doesn't work correctly" I mean it will destroy your data.  It's there
so it doesn't bit-rot, and we feel it is OK for it to remain only
because we do not compile anything that uses it.

Another example is the Demand Attach File Server.  This "feature" was
pulled into the openafs-devel-1_5_x branch.  Its changes were major.  For more
than a year it was impossible to use the "features" branch for anything
other than the clients because the DAFS modifications broke the ability
to create new volumes among other things.  Any site that tried to use it
discovered the problem and quickly stopped.

The development organization that developed the code wasn't using the
openafs-devel-1_5_x branch themselves and so it wasn't a high priority
for them to fix the issue.  Their clients used modified versions of the
maintenance branch including their new feature of choice.

After more than a year most of the issues that were introduced are
fixed, some by the gatekeepers and some by the original contributors,
but others still remain.  In fact, it is the doubts regarding the
stability of these changes in all configurations that is blocking the
issuance of the next "maintenance" release series.

Yesterday Tom Keiser complained that the total number of commits is not
indicative of actual coding.  What it does not indicate is how much code
was contributed by each entity.  The reason that the gatekeeper commit
numbers are so much higher than anyone else's is because when push comes
to shove and there is a bug that affects an end user, it is the
gatekeepers that are the ones that have to go and fix it.

As gatekeepers, our responsibility is to the entire community of
OpenAFS users not just those that work for our organization or those of
our paid support clients.  If there is a bug in the code no matter how
minor, we will fix it.  If the bug report came with a patch to review
all the better but in 95% of the cases, all we have is a report of a
problem that must be researched and fixed.

If someone pays you to develop some code, once that code is accepted
into the OpenAFS tree and distributed to end users it becomes the
responsibility of the gatekeepers because when your contract ends or you
decide to move on, we still have thousands of deployed cells out there
with millions of end users.   When the code breaks, OpenAFS gets a bad
reputation which is really hard to overcome.

The blame doesn't sit on the contributing parties.  It doesn't sit on
the gatekeepers.  It doesn't sit on the end users.  As long as the only
resources that are available to perform testing and fix bugs on behalf
of the general community in a proactive manner are severely constrained
we are going to have these issues.

The following analogy is quite applicable.  In the United States, it is
frequently the case that buildings, bridges, roads, etc. are built using
public funds.  In many cases, the funding allocation provides enough
money to cover the capital expense of the new construction.  However,
it very rarely provides any ongoing maintenance allocation.  If you
build me a $40 million building that I want but fail to provide
me any resources to maintain it, yet assign me the responsibility to do
so, I am in a bind.  So goes the story of how a major campus received a
new field house only to find that in order to afford the resulting
upkeep after it was built they had to stop funding the sports teams that
were intended to use it.

I would love it if every contribution accepted into OpenAFS came with
a promise that any bugs that were determined to be caused
by that code would be fixed by the contributing party without charge and
in a timely manner for a period of two years.  However, that is
unrealistic.

The real situation is that large contributions of code result in
increased burdens on the gatekeepers with no additional resources being
provided to cover the additional overhead expenses.

Not to single you out, because this applies to so many of the developers
that contribute to OpenAFS, but you are working on your project because
you are being paid to do so.  For some number of hours per week or
period of months, this is your job that you are paid to do.  Your
employer is not providing any resources to support the overhead of
managing the OpenAFS project, performing the code/architecture/protocol
reviews, performing the on-going testing, code integration, release
management and community support.

There is no blame to go around when there are simply no resources
available to do anything else.  I appreciate the need for individuals to
vent but when individual developers (in general) put the blame on the
existing leadership it is really insulting.

All of the existing gatekeepers would be happy to resign provided that
someone was actually willing to take on all of the responsibilities that
we bear.  What we are not willing to do is to take unnecessary risks
with the hard earned reputation that OpenAFS has developed over the last
five years as being a stable, enterprise-ready product that is worth
trusting with your data for the next decade or more.

> No hard feelings

None taken.

Jeffrey Altman