[OpenAFS] robustness in face of server failures

Wed, 16 Nov 2005 10:44:29 -0800

Noel Yap <noel.yap@gmail.com> writes:
> On 11/16/05, Russ Allbery <rra@stanford.edu> wrote:

>>  * For the most part, AFS fails independently, so that if a particular
>>    file server goes down, everything else on other file servers is still
>>    accessible.  However, if the AFS file server gets into a state where
>>    it thinks it's still up but it can't answer client requests, clients
>>    that try to access replicated volumes from that file server will hang
>>    practically forever waiting for it rather than rolling over to another
>>    replica site.  It would be very nice to have a fix for this.  In the
>>    meantime, you really want your file servers to refuse UDP packets when
>>    they're sick, which is something that you can rig up with some
>>    monitoring and a local firewall.

> What have been the typical causes of the server reaching this state?
> Would you say that some of these have been addressed in 1.4?

Yes.  Most of the causes (such as clients with asymmetric firewalls,
older Windows clients, etc.) have been fixed via other means.  Usually this
state is caused by an extreme burst of activity that overloads the server.
It's very difficult to trigger with just normal traffic; it usually takes
some sort of bug on top of that to overwhelm the server.
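
For the "monitoring and a local firewall" workaround quoted above, here is
a rough Python sketch of the idea, not a tested production setup.  It
assumes Linux iptables and the standard fileserver port 7000/udp.  The
probe() callable is hypothetical and site-specific (it should return False
when the file server is up but not answering); it is not part of OpenAFS.

```python
import subprocess

AFS_PORT = 7000  # UDP port the AFS fileserver listens on
# Rule that refuses incoming fileserver traffic (assumes Linux iptables).
RULE = ["INPUT", "-p", "udp", "--dport", str(AFS_PORT), "-j", "REJECT"]


def firewall_command(healthy, blocked):
    """Return the iptables argv needed for this state, or None for no change."""
    if not healthy and not blocked:
        return ["iptables", "-I"] + RULE  # sick: start refusing UDP
    if healthy and blocked:
        return ["iptables", "-D"] + RULE  # recovered: remove the rule
    return None


def check_once(probe, blocked, run=subprocess.check_call):
    """Probe the server once, adjust the firewall, return the new blocked state.

    probe is a hypothetical site-specific health check supplied by the caller.
    """
    cmd = firewall_command(probe(), blocked)
    if cmd is None:
        return blocked
    run(cmd)
    return not blocked
```

You would call check_once() from whatever monitoring loop or cron job you
already have, carrying the returned blocked flag from one run to the next.
Keeping the decision in firewall_command() separate from the actual
iptables call makes it possible to dry-run the logic without root.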

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>