[OpenAFS] Resilience

Hartmut Reuter reuter@rzg.mpg.de
Tue, 02 Jun 2009 12:24:12 +0200

Wheeler, JF (Jonathan) wrote:
> One of our (3) AFS servers has a mounted read-write volume which must be
> available 24x7 to our batch system.  The server is as resilient as we
> can make it, but still it may fail outside normal working hours for some
> reason.  For technical reasons related to the software installed on the
> volume it is not possible to use read-only volumes mounted from our
> other servers (the software must be installed and served from the same
> directory name), so I have devised the following plan in the event of a
> failure: 
> a) create read-only volumes on the other 2 servers, but do not mount
> them; use "vos release" whenever the software is updated
> b) in the event of a failure of server1 (which has the rw volume), drop
> the existing mount and mount one of the read-only volumes (we can live
> with the read-only copy whilst server1 is being repaired/replaced) in
> its place.
> Can anyone see problems with that scenario?  We could use "vos
> convertROtoRW"; how would that affect the process?
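
For readers following the plan quoted above, steps (a) and (b) might look
roughly like the sketch below. The volume name, the mount point path and
the dry-run wrapper are hypothetical placeholders, not taken from the
original message; only the vos and fs subcommands themselves are standard
OpenAFS commands. It assumes the read-only sites were already defined
with "vos addsite".

```shell
#!/bin/sh
# Sketch of the quoted plan. All names below are hypothetical placeholders.
VOLUME="${VOLUME:-software}"                          # assumed volume name
MOUNT_DIR="${MOUNT_DIR:-/afs/.example.org/software}"  # assumed mount point
DRY_RUN="${DRY_RUN:-1}"                               # 1 = only print commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# (a) After each software update, push the RW contents to the RO sites.
release_volume() {
    run vos release -id "$VOLUME"
}

# (b) If the RW server fails, replace the mount point with an explicit
#     mount of the read-only volume (the ".readonly" suffix forces RO).
switch_mount_to_ro() {
    run fs rmmount -dir "$MOUNT_DIR"
    run fs mkmount -dir "$MOUNT_DIR" -vol "$VOLUME.readonly"
}
```

With DRY_RUN left at 1 the functions only print what they would do. Note
that the mount point is stored in the parent volume, so if that parent is
itself replicated the change has to be made through its read-write path
and released as well.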

The problem with convertROtoRW is that a dying fileserver does not send
callbacks to the clients, as it would if you moved the RW volume to
another server. So you will have to run "fs checkvol" on all clients to
make sure they do not wait forever for the broken server but instead use
the newly created RW volume. Our backup strategy is based entirely on
the ability to do convertROtoRW. Cron jobs on the batch workers run
"fs checkvol" once in a while...
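
A minimal sketch of that procedure, under stated assumptions: the server
name, partition and volume name are hypothetical placeholders, and the
dry-run wrapper is only there so the commands can be inspected before
anything is executed. "fs checkvolumes" is the full name of the
"fs checkvol" command mentioned above.

```shell
#!/bin/sh
# Sketch only. All names below are hypothetical placeholders.
VOLUME="${VOLUME:-software}"                    # assumed volume name
RO_SERVER="${RO_SERVER:-server2.example.org}"   # assumed surviving server
RO_PARTITION="${RO_PARTITION:-/vicepa}"         # assumed partition with the RO copy
DRY_RUN="${DRY_RUN:-1}"                         # 1 = only print commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# On an admin machine: turn the RO copy on a surviving server into
# the new RW volume.
promote_ro_to_rw() {
    run vos convertROtoRW -server "$RO_SERVER" \
        -partition "$RO_PARTITION" -id "$VOLUME"
}

# On every client: discard cached volume location information so the
# client looks up the new RW site instead of waiting for the dead server.
refresh_client() {
    run fs checkvolumes
}

# Possible crontab entry on each batch worker (path and interval are
# assumptions):
#   */15 * * * * /usr/bin/fs checkvolumes
```

Because the failed server cannot break callbacks, the periodic
refresh_client step is what bounds how long a client can keep pointing at
the old RW site.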

> Jonathan Wheeler 
> e-Science Centre 
> Rutherford Appleton Laboratory

Hartmut Reuter                  e-mail  reuter@rzg.mpg.de
                                phone   +49-89-3299-1328
                                fax     +49-89-3299-1301
RZG (Rechenzentrum Garching)    web     http://www.rzg.mpg.de/~hwr
Computing Center of the Max-Planck-Gesellschaft (MPG) and the
Institut fuer Plasmaphysik (IPP)