[OpenAFS] Re: "hard mounts"

Lyle Seaman lws@spinnakernet.com
Sat, 17 Mar 2001 11:34:47 -0500


Nathan Neulinger wrote:

> I presume that the hard mount support means that the timeouts are
> disabled, and the machine just hangs until the volume is available
> again?

The RPC times out, fails over to the next replica, times out, fails over,
times out, reaches the end of the list of replicas, and then starts over from
the beginning of the list.
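
A rough sketch of that loop, in Python rather than the actual cache manager
code.  Every name here (call_with_failover, try_rpc, RpcTimeout, max_passes)
is made up for illustration, and max_passes is just an assumed bound so the
sketch can terminate; None models retrying forever.

```python
from itertools import count


class RpcTimeout(Exception):
    """Raised by the hypothetical try_rpc callable when a call times out."""


def call_with_failover(replicas, try_rpc, max_passes=None):
    """Try each replica in order; after the last one, start over.

    max_passes=None retries forever; an integer bounds the number of full
    passes over the replica list (an assumption made for this sketch).
    """
    passes = count() if max_passes is None else range(max_passes)
    for _ in passes:
        for server in replicas:
            try:
                return try_rpc(server)
            except RpcTimeout:
                continue  # fail over to the next replica in the list
    raise RpcTimeout("all replicas timed out on every pass")


# Usage: every call times out until the fifth attempt overall.
attempts = []


def flaky_rpc(server):
    attempts.append(server)
    if len(attempts) < 5:
        raise RpcTimeout(server)
    return "data from " + server


result = call_with_failover(["ro1", "ro2", "ro3"], flaky_rpc)
# attempts is now ro1, ro2, ro3, ro1, ro2: the list wrapped around once.
```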

Theoretically, other processes should be unaffected, but I don't remember if
we actually achieved that goal 100% or not.  There might be an opportunity for
some improvement there.  If there is some lock which is held for the entire
duration, then perhaps someone can find a way to restructure things so that
the lock can be dropped and re-obtained.  It might be a little tricky.

For all intensive porpoises (e.g., if the affected process is the X server
trying to page in), the machine hangs until the volume is available again or
the "retry_count" is exhausted.

One approach that might be interesting would be to reduce the RX timeout to
about 20 seconds, and set the retry_count to 6 or so -- which would hasten
failover due to loss of a single replica while also increasing your resilience
to the loss of all your vldb servers...
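
To put rough numbers on that suggestion, here is a back-of-the-envelope
calculation.  It assumes that retry_count counts full passes over the replica
list and that nothing but the RX timeout contributes to the delay; both are
assumptions for illustration, not a statement of how the client actually
counts.  The replica count of 3 is also just an example.

```python
def failover_delay_s(rx_timeout_s):
    """Seconds lost before moving past one dead replica (one timeout)."""
    return rx_timeout_s


def worst_case_hang_s(rx_timeout_s, replicas, retry_count):
    """Seconds until retry_count is exhausted if every replica is down.

    Assumes retry_count means full passes over the replica list.
    """
    return rx_timeout_s * replicas * retry_count


single = failover_delay_s(20)          # 20 seconds to skip one dead replica
total = worst_case_hang_s(20, 3, 6)    # 20 * 3 * 6 = 360 seconds
```

Under those assumptions, losing one replica costs 20 seconds, and a process
blocks for about 6 minutes before giving up when all three are unreachable.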

Oh.  I remember the other thing:  Absent this change, it doesn't matter how
many RO replicas you have; you will always be vulnerable to the loss of all
your vldb servers.