[OpenAFS] Puzzler: lack of access to AFS files
Thu, 13 Dec 2007 14:22:54 -0500
On Dec 12, 2007, at 11:06 PM, Jeffrey Altman wrote:
> ...Stupid things like re-using objects that were recently accessed
> [because] the queues did not track objects in the order of most recent use.
> [Being] forced to read data or directory entries from the file server that was
> just written by the client because data buffer version numbers weren't
> incremented when merging the updated status data received as a
> result of
> the write or the failure to locally update the directory entries when
> possible. Re-issuing FetchStatus calls on .readonly volumes
> because the volume callback expirations were not tracked by each
> in the volume. Some of the changes result in improved performance of
> the client when measured by throughput. Other changes reduced the
> time required by the client but most of all, the improvements have
> reduced network traffic and load on the file servers.
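As a rough illustration of the first item in the quoted list, here is a minimal sketch of a cache whose queue does track entries in order of most recent use, so that a recently accessed object is never the one recycled. This is not the OpenAFS cache manager's actual data structure; the class and method names are hypothetical, and Python's OrderedDict stands in for the real queues.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest use first, newest use last

    def get(self, key):
        if key not in self.entries:
            return None
        # Record the access so this entry is the last candidate for reuse.
        self.entries.move_to_end(key)
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            # Recycle the entry that has gone longest without being used.
            self.entries.popitem(last=False)
```

If get() skipped the move_to_end() call, the queue would reflect insertion order only, and an object touched a moment ago could be the next one recycled.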
Putting on my old software guy hat for a moment . . .
With help from Dan Hyde, I've made a few brief trips through the AFS
source code, though mostly on the server side. There are some
wonderful things in there, but there are also several categories of
dog. I'm getting incredibly itchy to dive into the vos copy/move/
shadow code and refactor it. When finishing up the shadow work we saw
lots of opportunities for improvement, but we didn't make those
changes because we want the code to be accepted. Doing an unsolicited
mass rewrite is no way to get code accepted. So we're going slow and
careful. Once we've convinced the Elders of our general competence,
then we might refactor that code.
> Some of the changes have unfortunately triggered bugs in the file
> [server] that in turn have to be fixed.
Sooo true, and not just in client-server relations. During the shadow
work, Dan found a condition where an interrupted volume operation
would cause the original to be deleted. But once one started using
shadows, it was possible to start making a shadow, interrupt, and the
cleanup would blow away the original rather than cleaning up the
incomplete shadow. Oops. Yes, the bug is fixed in production AFS and
in the 1.5 line. But it's been there latent for years.
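To make the bug class concrete, here is a minimal sketch of the safe pattern: the cleanup path remembers which volume the operation itself created and removes only that one. This is a toy in-memory model, not the vos or volserver code; VolumeStore, make_shadow, Interrupted, and the fail_after parameter are all hypothetical names invented for the illustration.

```python
class Interrupted(Exception):
    """Stands in for an operation being cut off partway through."""


class VolumeStore:
    """Toy in-memory stand-in for a set of volumes (hypothetical)."""

    def __init__(self):
        self.volumes = {}  # volume name -> list of data blocks

    def create(self, name, blocks):
        self.volumes[name] = list(blocks)

    def delete(self, name):
        self.volumes.pop(name, None)


def make_shadow(store, source, shadow, fail_after=None):
    """Copy source to shadow; if interrupted, remove only the partial shadow."""
    if shadow in store.volumes:
        raise ValueError("shadow already exists; refusing to overwrite it")
    created = None  # the only volume this operation may clean up
    try:
        store.create(shadow, [])
        created = shadow
        for i, block in enumerate(store.volumes[source]):
            if fail_after is not None and i >= fail_after:
                raise Interrupted(f"interrupted after {i} blocks")
            store.volumes[shadow].append(block)
    except Interrupted:
        # The bug class described above amounts to deleting `source` here.
        if created is not None:
            store.delete(created)
        raise
```

The point of the `created` variable is that cleanup never has to guess which volume is the disposable one. It acts only on what this invocation made.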
Similarly, we're convinced some of the issues that we have been
working on lately are present because recent fixes to the locking code
uncovered bugs that had been there since Transarc days. Sort of a
microcosm of what FreeBSD went through in removing the Giant lock
so multiprocessing really worked right. Once we figured out
what the problem class was, Dan spent a great deal of time poring
over other code sections that might have similar issues. He verified
that some code is clean, and found and fixed problems in other
sections. Did we get everything? Good question.
All of which is a roundabout way of saying that as active work on AFS
keeps ramping up, we'll keep finding, fixing, and unfortunately
revealing bugs. Some of these would be best fixed by major refactoring
of the code. We will not attempt that regional refactoring until we
have a solid enough understanding of the code as a whole and we've
convinced the Elders that we're competent to do it. Why? Because
reading a four-line bug fix is easy; verifying it doesn't break
anything is easy. Reading 2,000 lines of replacement code is damned
hard. Writing it is hard. Verifying that you're introducing fewer new
bugs than you're fixing is even harder. So start small.
At this point, the topic has drifted pretty far from the original.
I'll write a separate note on other things relating to this.