[OpenAFS] Openafs failover

Thu, 11 Jun 2009 21:39:59 -0400

On Mon, Jun 8, 2009 at 7:23 PM, Harald Barth<haba@kth.se> wrote:
>
>> If server1 went down, server2:
>> 1. would mount vicepa ...
>> 2. would take over address 10.0.0.1
>> 3. finally would restart the vlserver volserver and fs processes.
>
> You have missed what to do with the outstanding callbacks that server1
> is holding (in memory). When server1 does shut down nicely, these are
> handled (clients notified) during the shutdown. If server1 crashes,

Hi Harald,

I know a lot of us have said this over the years (I'm pretty sure I'm
guilty as well), but it's not entirely accurate.  Yes, coherence is
maintained across crashes/restarts by sending one of the
InitCallBackState family of RPCs to the client.  However, the key
point is that it happens _after_ the new fileserver process starts up,
when the cache manager (cm) next makes contact.  When we walk the host
hash table, we
fail to find a host entry, and thus perform initialization of a new
host object, host cps data, etc.  This process forces the client to
invalidate its status entries, and thus results in a new round of
FetchStatus RPCs.  The net result is that 2-node active/passive failover
clusters can be equivalent to standalone fileservers in terms of cache
coherence (assuming proper Net{Info,Restrict} and rxbind
configuration).
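
The sequence above can be sketched as a toy model.  This is Python
rather than the fileserver's C, and every name in it (FileServer,
CacheManager, failover_demo, the "volume" dict standing in for
/vicepa) is hypothetical; it ignores callback expiry, server probes,
and everything else the real code does.  It only shows the ordering:
host lookup misses, a new host entry is built, the client is told to
reset its callback state, and its next reads go back to the server.

```python
class CacheManager:
    """Toy client: caches status per fid while it believes a callback is held."""

    def __init__(self, ident):
        self.ident = ident
        self.status = {}   # fid -> cached data version
        self.fetches = 0   # number of FetchStatus-like calls made

    def init_callback_state(self):
        # Stand-in for the InitCallBackState family: drop all status entries.
        self.status.clear()

    def read(self, server, fid):
        if fid not in self.status:
            self.fetches += 1
            self.status[fid] = server.fetch_status(self, fid)
        return self.status[fid]


class FileServer:
    """Toy fileserver: the host table lives in memory and dies with the process."""

    def __init__(self, volume):
        self.volume = volume  # fid -> data version; survives failover
        self.hosts = {}       # client ident -> fids with callbacks; does not

    def fetch_status(self, client, fid):
        if client.ident not in self.hosts:
            # Host lookup failed: initialize a new host entry and force the
            # client to invalidate its status entries before answering.
            self.hosts[client.ident] = set()
            client.init_callback_state()
        self.hosts[client.ident].add(fid)
        return self.volume[fid]


def failover_demo():
    volume = {"a": 1, "b": 1, "c": 1}
    cm = CacheManager("client1")

    server1 = FileServer(volume)
    cm.read(server1, "a")
    cm.read(server1, "b")

    # server1 crashes; server2 mounts the same data and starts with an
    # empty host table.  "a" changes after the takeover.
    server2 = FileServer(volume)
    volume["a"] = 2

    cm.read(server2, "c")            # first contact: host miss, state reset
    return cm.read(server2, "a"), cm.fetches
```

In this model the first contact with the new process is what clears
the client's cached status, so the later read of "a" is fetched again
rather than served from the old entry.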

> these are lost, so clients could in this case continue to use an
> outdated copy in cache. If I remember correctly, there has been work
> for the 1.5.x server series to write down callback information
> (continuously) to the /vicepX. That could then be used by a starting

Storing continuously is an excellent end-goal.  Unfortunately, we're
not there yet.  What we have at present (with dafs) is a mechanism to
serialize an atomic snapshot to disk.  The current implementation,
however, does not lend itself to continuous dumping.  To achieve
atomicity we quiesce all Rx worker threads and hold H_LOCK
across the entire operation.  Furthermore, the fsstate.dat on-disk
format is optimized for serialization, not random access.
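
As a rough illustration of why a snapshot taken this way is atomic
but expensive, here is a minimal Python sketch.  It is not the dafs
code and not the fsstate.dat format: HostTable, the JSON encoding,
and the write-to-temp-then-rename step are all assumptions made for
the example.  The one point it borrows from the description above is
that a single lock is held across the entire dump.

```python
import json
import os
import threading


class HostTable:
    """Toy host/callback table guarded by one global lock (an H_LOCK stand-in)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._hosts = {}  # host address -> list of fids with callbacks

    def add_callback(self, host, fid):
        # Worker threads take the same lock, so they block during a dump.
        with self._lock:
            self._hosts.setdefault(host, []).append(fid)

    def dump(self, path):
        # Hold the lock for the whole operation so the file reflects one
        # consistent point in time.
        with self._lock:
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(self._hosts, f, sort_keys=True)
            os.replace(tmp, path)  # assumption: publish the snapshot by rename


def load_state(path):
    # A sequential format like this is read back in one pass; there is
    # no way to update a single host entry in place.
    with open(path) as f:
        return json.load(f)
```

Every add_callback() call waits while dump() runs, and the file can
only be rewritten as a whole, which is roughly the shape of the
problem with doing this continuously.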

Continuous dumping is complicated from a number of perspectives.
First of all, we'd likely want tunable consistency modes.  Secondly,
there's the question of whether extended callback data should be
serialized or not (at present, dafs+osi+xcb does not dump xcb data; it
would not be particularly hard to add support in future).  Lastly,
there is the pertinent question of where to store the data.  If/when
partition uuid extensions become supported, the issue becomes
significantly more complicated because we will likely want the host
package data to be replicated across every partition in order to
support partition-level load balancing (which is further complicated
by the existence of unmounted, unsynchronized, out-of-date clones).

-Tom

--
Tom Keiser
tkeiser@sinenomine.net