[OpenAFS] Fail over to replica sites

Thu, 08 Aug 2002 19:01:29 -0700

Nathan Neulinger <nneul@umr.edu> writes:

> Yes. It's not reproducible though. I have yet to be able to "do"
> anything to the file/vol servers to trigger the symptom.

> Note - I have not seen it when the server really cleanly goes down. In
> those cases, it fairly reliably switches. I have however seen the
> problem numerous times when a file server starts to not respond for some
> reason. However, it must be responding to some stuff, cause it doesn't
> ever completely go down. If I kill -STOP the fileserver, the clients see
> it instantaneously. (Quicker in my case with the RX_DEADTIME being
> small.) Immediate response on most clients to the -CONT as well.

In our case, the server went away completely without any warning.
(Basically, the machine was powered off by accident.)  Many of our clients
never failed over: they couldn't see the replicated volumes located on
that server until it came back up, even though they were pointing to the
read-only path and should have been able to find one of the other two
replicas.

> In our cases though, it sometimes doesn't ever get to the 'connection
> timed out' point... It just hangs forever.

I've not seen that myself.  What we saw was more what I'd expect when a
read/write server is down.  When you tried to access something that was
replicated on that server, the system would respond "connection timed out"
immediately.  There was no delay; it was obvious that the client had
cached that the server was down and wasn't retrying network access.

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>