[OpenAFS] robustness in face of server failures

Russ Allbery rra@stanford.edu
Wed, 16 Nov 2005 10:31:44 -0800


Noel Yap <noel.yap@gmail.com> writes:

> I'm investigating whether or not OpenAFS would be a good solution for
> our needs.  One requirement is that the chances of catastrophic
> failure (e.g. the network goes down) ought to be minimal (~once every
> few years or less).  What have been people's experiences with this?  I
> know 1.4 hasn't been out that long, but has anyone noticed any good or
> bad things about it?

I'd say that there are two potentially worrisome aspects to OpenAFS from a
hard uptime requirement perspective:

 * You want to be sure to be running the latest version, particularly on
   Windows clients.  Older releases of the Windows client had various bugs
   that could cause them to really hammer a file server.

 * For the most part, AFS fails independently, so that if a particular
   file server goes down, everything else on other file servers is still
   accessible.  However, if the AFS file server gets into a state where
   it thinks it's still up but it can't answer client requests, clients
   that try to access replicated volumes from that file server will hang
   practically forever waiting for it rather than rolling over to another
   replica site.  It would be very nice to have a fix for this.  In the
   meantime, you really want your file servers to refuse UDP packets when
   they're sick, which is something that you can rig up with some
   monitoring and a local firewall.

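The workaround in that second bullet can be sketched as a small watchdog script. This is only a hypothetical sketch, assuming a Linux file server using iptables and the OpenAFS `rxdebug` utility; the probe command, port, and firewall rule are illustrative, not a supported recipe, and the details would need adapting to your own monitoring setup.

```shell
#!/bin/sh
# Hypothetical watchdog sketch: if the local AFS file server stops
# answering Rx queries, start dropping inbound UDP to its port (7000
# for the fileserver) so clients fail over to another replica site
# instead of hanging on the sick server.
# Assumes: root privileges, iptables, and rxdebug from OpenAFS.

PORT=7000

if ! rxdebug localhost $PORT -version >/dev/null 2>&1; then
    # Server looks sick: refuse its UDP traffic until an operator
    # investigates and removes the rule.
    iptables -I INPUT -p udp --dport $PORT -j DROP
fi
```

Run from cron or your monitoring system at short intervals; remember to remove the DROP rule once the file server is healthy again, or clients will keep failing over away from it.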
AFS is, in general, extremely stable apart from those two issues.

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>