[OpenAFS] Fail over to replica sites

Nathan Neulinger nneul@umr.edu
08 Aug 2002 20:54:51 -0500


Yes, though I can't reproduce it on demand. I have yet to find anything
I can "do" to the file/vol servers that triggers the symptom.

Note: I have not seen it when the server goes down cleanly. In those
cases, the clients fail over fairly reliably. I have, however, seen the
problem numerous times when a file server stops responding for some
reason. It must still be responding to some requests, though, because
it never goes down completely. If I kill -STOP the fileserver, the
clients see it almost instantly (quicker in my case, since RX_DEADTIME
is small). Most clients respond immediately to the -CONT as well.

One of my thoughts on this is that it might be nice if the cache manager
could round-robin requests for replicated volumes. I.e. on every
request, talk to a different file server. Probably lots of problems with
that idea though. 
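
To make the round-robin idea concrete, here is a minimal sketch in
Python. It is not cache manager code and does not reflect how OpenAFS
picks a server today. The class name, the pick() method and the server
names are all made up for illustration.

```python
class RoundRobinSelector:
    """Hand out replica sites in rotation, one per request (hypothetical helper)."""

    def __init__(self, replicas):
        if not replicas:
            raise ValueError("need at least one replica site")
        self._replicas = list(replicas)
        self._next = 0  # index of the site the next request will use

    def pick(self):
        # Return the current site, then advance so the next request
        # goes to the following site, wrapping around at the end.
        server = self._replicas[self._next]
        self._next = (self._next + 1) % len(self._replicas)
        return server


# Hypothetical server names; each request lands on a different site.
selector = RoundRobinSelector(["fs1", "fs2", "fs3"])
picks = [selector.pick() for _ in range(4)]
```

One obvious cost of rotating on every request is that it ignores any
preference for a nearby or lightly loaded server, which is probably one
of the "lots of problems".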

In our case, though, it sometimes never even reaches the 'connection
timed out' point. It just hangs forever.

I wonder if there would be some way to say "on a replicated volume,
time out any rx read call after X seconds and reissue the call against
another server", where X is significantly smaller than for RW volumes.
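
A rough sketch of that "time out after X seconds and reissue" behavior,
again in Python rather than real rx or cache manager code. The
read_with_failover() function, the fetch callback and the
AllReplicasFailed exception are assumptions invented for this example.
A real implementation would also have to abort the stuck call instead
of just abandoning it.

```python
import concurrent.futures as cf


class AllReplicasFailed(Exception):
    """Raised when no replica answered within the timeout (hypothetical)."""


def read_with_failover(replicas, fetch, timeout_s):
    """Try each replica in order, giving each at most timeout_s seconds.

    fetch(server) stands in for a read call against one file server.
    Returns (server, data) from the first replica that answers in time.
    """
    errors = []
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(replicas)))
    try:
        for server in replicas:
            future = pool.submit(fetch, server)
            try:
                return server, future.result(timeout=timeout_s)
            except cf.TimeoutError:
                # The server is hanging: give up on it and move on.
                errors.append((server, "timed out"))
            except Exception as exc:
                # The server answered with an error: also move on.
                errors.append((server, repr(exc)))
    finally:
        # Do not wait for abandoned calls that are still hanging.
        pool.shutdown(wait=False)
    raise AllReplicasFailed(errors)
```

The timeout here is per replica, so the worst case is roughly X times
the number of sites before the caller sees an error. That is why X
would need to be much smaller than the RW timeout.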

-- Nathan

On Thu, 2002-08-08 at 18:39, Russ Allbery wrote:
> Are other people having trouble with OpenAFS's failover to replica sites
> when one server goes down?  We had one of our main replication servers
> (that also holds the read/write versions of many of the volumes) go down
> today, and rather than falling over to another server (even after a
> delay), we had quite a few systems that started just reporting "connection
> timed out" on any paths located in our AFS cell.
> 
> This seems less than ideal, and sort of punches a hole in the AFS
> reliability feature.  To have the client cache report "connection timed
> out" on a replicated volume when it hasn't tried all of the replicas
> strikes me as simply wrong....
> 
> -- 
> Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>
-- 

------------------------------------------------------------
Nathan Neulinger                       EMail:  nneul@umr.edu
University of Missouri - Rolla         Phone: (573) 341-4841
Computing Services                       Fax: (573) 341-4216