[OpenAFS] Fail over to replica sites

Jeffrey Hutzelman jhutz@cmu.edu
Fri, 9 Aug 2002 18:52:48 -0400 (EDT)


On 8 Aug 2002, Nathan Neulinger wrote:

> Note - I have not seen it when the server really cleanly goes down. In
> those cases, it fairly reliably switches. I have however seen the
> problem numerous times when a file server starts to not respond for some
> reason. However, it must be responding to some stuff, cause it doesn't
> ever completely go down. If I kill -STOP the fileserver, the clients see
> it instantaneously. (Quicker in my case with the RX_DEADTIME being
> small.) Immediate response on most clients to the -CONT as well.

This doesn't sound like Russ's problem.  This sounds like the fileserver
is overloaded, so it accepts RPCs but takes a long time (maybe ~forever)
to respond to them.  We've seen this situation when a fileserver has a
disk that is repeatedly dropping offline, causing any access to that disk
to take a very long time.
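
To make the distinction concrete, here is a rough Python sketch of the two
cases from a client's point of view.  It is only an illustration, not Rx or
cache manager code: the CallState fields, the classify() helper, and both
timeout values are made up for the example.  The point is that a
"no packets at all" check catches the kill -STOP case, while a server that
keeps sending something but never finishes the call is only caught by a
separate limit on the total call time.

```python
from dataclasses import dataclass


@dataclass
class CallState:
    """Hypothetical per-call bookkeeping; not an actual Rx structure."""
    started_at: float       # time the RPC was issued (seconds)
    last_packet_at: float   # last time any packet arrived from the server
    completed: bool = False


def classify(call: CallState, now: float,
             dead_time: float = 50.0, hard_timeout: float = 120.0) -> str:
    """Classify a call as 'ok', 'dead', 'hung', or 'waiting'.

    dead_time and hard_timeout are assumed values chosen for illustration.
    """
    if call.completed:
        return "ok"
    # Server is completely silent (e.g. stopped or crashed): easy to detect.
    if now - call.last_packet_at > dead_time:
        return "dead"
    # Server still sends packets, but the call itself never finishes.
    if now - call.started_at > hard_timeout:
        return "hung"
    return "waiting"


# Stopped server: nothing heard since the call began.
print(classify(CallState(started_at=0.0, last_packet_at=0.0), now=60.0))
# Overloaded server: heard from 5 seconds ago, call open for 10 minutes.
print(classify(CallState(started_at=0.0, last_packet_at=595.0), now=600.0))
```

Without the second check, the overloaded case would sit in "waiting"
indefinitely, which matches the behavior described above.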