[OpenAFS] Puzzler: lack of access to AFS files

Thu, 13 Dec 2007 14:22:54 -0500

On Dec 12, 2007, at 11:06 PM, Jeffrey Altman wrote:

> ...Stupid things like re-using objects that were recently accessed  
> because
> the queues did not track objects in the order of most recent use.   
> Being
> forced to read data or directory entries from the file server that was
> just written by the client because data buffer version numbers weren't
> incremented when merging the updated status data received as a  
> result of
> the write or the failure to locally update the directory entries when
> possible.  Re-issuing FetchStatus calls on .readonly volumes  
> prematurely
> because the volume callback expirations were not tracked by each  
> object
> in the volume.  Some of the changes result in improved performance of
> the client when measured by throughput.   Other changes reduced the  
> CPU
> time required by the client but most of all, the improvements have
> reduced network traffic and load on the file servers.

Putting on my old software guy hat for a moment . . .

With help from Dan Hyde, I've made a few brief trips through the afs  
source code, though mostly on the server side. There are some  
wonderful things in there, but there are also several categories of  
dog. I'm getting incredibly itchy to dive into the vos copy/move/ 
shadow code and refactor it. When finishing up the shadow work we saw  
lots of opportunities for improvement, but we didn't make those  
changes because we want the code to be accepted. Doing an unsolicited  
mass rewrite is no way to get code accepted. So we're going slow and  
careful. Once we've convinced the Elders of our general competence,  
then we might refactor that code.

> Some of the changes have unfortunately triggered bug in the file  
> servers
> that in turn have to be fixed.

Sooo true, and not just in client-server relations. During the shadow  
work, Dan found a condition where an interrupted volume operation  
would cause the original to be deleted. But once one started using  
shadows, it was possible to start making a shadow, interrupt, and the  
cleanup would blow away the original rather than cleaning up the  
incomplete shadow. Oops. Yes, the bug is fixed in production AFS and  
in the 1.5 line. But it's been there latent for years.

Similarly, we're convinced some of the issues that we have been  
working lately are present because recent fixes to the locking code  
uncovered bugs that had been there since Transarc days. Sort of a  
microcosm of what FreeBSD went through in implementing the removal of  
giant() so multiprocessing really worked right. Once we figured out  
what the problem class was, Dan spent a great deal of time poring  
over other code sections that might have similar issues. He verified  
some code is clean, found and fixed some others. Did we get  
everything? Good question.

All of which is a roundabout way of saying that as active work on AFS  
keeps ramping up, we'll keep finding, fixing, and unfortunately  
revealing bugs. Some of these would be best done by major refactoring  
of the code. We will not attempt that regional refactoring until we  
have a solid enough understanding of the code as a whole and we've  
convinced the elders that we're competent to do it. Why? Because  
reading a four-line bug fix is easy; verifying it doesn't break  
anything is easy. Reading 2,000 lines of replacement code is damned  
hard. Writing it is hard. Verifying that you're introducing fewer new  
bugs than you're fixing is even harder. So start small.

At this point, the topic has drifted pretty far from the original.  
I'll write a separate note on other things relating to this.

Steve