[OpenAFS] vos convertROtoRW requires salvage ?

Hartmut Reuter reuter@rzg.mpg.de
Thu, 03 Apr 2008 09:17:59 +0200


John Tang Boyland wrote:
> As people on the list may know, I am in the process of recovering from
> a complete fileserver failure (lesson: don't use inode servers with
> Solaris 10 x86).  In what follows, "filip" is an inode Solaris 10 x86
> fileserver that cannot attach any of its volumes.  "eastside" is a namei
> Solaris 10 fileserver (a PC dragged in to fill the gap when the main
> fileservers failed) and "solomons" is an ancient Solaris 8
> (sparc) fileserver brought out of retirement for the same reason.
> "filip" was the second fileserver to fail -- after the first failed, I
> brought up eastside and released several important volumes (such as
> root.cell) to eastside, and thus when filip failed a week later, at
> least a RO copy was available.  (I wasn't actually expecting filip to
> fail since it had been working fine for two years without incident.)
> 
> I have been using the helpful new command "vos convertROtoRW" to
> convert volumes.  (BTW: thanks for the man page on openafs.org --
> maybe I should point out that the "PRIVILEGE REQUIRED" section looks
> like it was copied from "vos move").
> 
> The problem is that the conversion takes the volume offline, requiring
> a salvage.  I have been nervous about salvaging (see other messages),
> but fortunately salvage works uneventfully.  I've used vos convertROtoRW
> earlier on a less important volume.  In the end everything's OK,
> but still I'd like to ask: is this salvage requirement a known feature?


No, normally the volume does not go off-line. The only problem I know of
is that the clients do not automatically notice that the new RW volume
is on a different server than the old one. The reason is that the old,
dead fileserver didn't send a callback to the clients (as a vos move
would have done). So it's necessary to run "fs checkvol" on all clients.
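
A rough sketch of how the "fs checkvol" step could be pushed out to all
clients. The host names and the use of ssh are assumptions for
illustration only, and by default the script just prints the commands
instead of running them:

```shell
#!/bin/sh
# Sketch: ask every client to re-check its cached volume locations.
# CLIENTS is a hypothetical, space-separated host list; adjust to your site.
CLIENTS="${CLIENTS:-client1.example.org client2.example.org}"
# With DRY_RUN=1 (the default) the commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

refresh_clients() {
    for host in $CLIENTS; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "ssh $host fs checkvolumes"
        else
            # Assumes ssh access to the client and "fs" on its PATH.
            ssh "$host" fs checkvolumes
        fi
    done
}

refresh_clients
```

Setting DRY_RUN=0 would actually run the commands; "fs checkvolumes"
is the full name of the command abbreviated above.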

You may also bring off-line volumes on-line by doing a
"vos dump volume 0 > /dev/null" or by restarting the server.

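The "vos dump" trick might be wrapped like this. It is only a sketch:
the function name nudge_volume is made up, the follow-up "vos examine"
is an added step to look at the volume status afterwards, and by default
the commands are printed rather than run:

```shell
#!/bin/sh
# Sketch: touch a volume with a throwaway full dump, then show its status.
# With DRY_RUN=1 (the default) the commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

nudge_volume() {
    vol="$1"
    if [ -z "$vol" ]; then
        echo "usage: nudge_volume <volume name or ID>" >&2
        return 1
    fi
    if [ "$DRY_RUN" = "1" ]; then
        echo "vos dump -id $vol -time 0 > /dev/null"
        echo "vos examine -id $vol"
        return 0
    fi
    # -time 0 requests a full dump; the dump data itself is discarded.
    vos dump -id "$vol" -time 0 > /dev/null || return 1
    # vos examine prints the volume header, including its on-line status.
    vos examine -id "$vol"
}

nudge_volume root.cell
```

Whether this is enough depends on why the volume went off-line; if the
fileserver reports that a salvage is needed, the dump will fail in the
same way as the "vos release" in the transcript below.
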
I (also being the author of convertROtoRW) have already used it several
times when we lost RAID partitions because of hardware problems. We
have a strict policy that every volume gets two ROs: one on another
server (to allow convertROtoRW) and one in the same partition (to speed
up "vos release"). So we were able to "restore" several TB within half
an hour.
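
For illustration, the two-RO policy could be set up per volume roughly
as follows. The volume, server and partition names are placeholders,
the helper names are invented, and the script only prints the commands
unless DRY_RUN=0 is set:

```shell
#!/bin/sh
# Sketch of a "two ROs per volume" setup; all names below are placeholders.
# With DRY_RUN=1 (the default) the commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# replicate_volume <volume> <rw-server> <rw-part> <other-server> <other-part>
replicate_volume() {
    if [ "$#" -ne 5 ]; then
        echo "usage: replicate_volume vol rw-server rw-part ro-server ro-part" >&2
        return 1
    fi
    vol="$1"; rw_srv="$2"; rw_part="$3"; ro_srv="$4"; ro_part="$5"
    # RO site in the same partition as the RW volume (fast release).
    run vos addsite -server "$rw_srv" -partition "$rw_part" -id "$vol" || return 1
    # RO site on another server (the copy convertROtoRW can later use).
    run vos addsite -server "$ro_srv" -partition "$ro_part" -id "$vol" || return 1
    # Push the current RW contents to both RO sites.
    run vos release -id "$vol"
}

replicate_volume user.example fs1.example.org a fs2.example.org a
```

If fs1 in this example were lost, the remote RO would be converted with
"vos convertROtoRW fs2.example.org a user.example", in the same way as
in the transcript below.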

Hartmut

> 
> (In this transcript: /usr/afsws/bin is in AFS but /usr/afs/bin is on the
> local disk -- thankfully! -- and the former is on my path.  And yes,
> I'm running with admin tokens in a user account ON the new fileserver -- I
> said this was a temporary stopgap arrangement.) 
> 
> eastside.cs 71 % vos listvldb root.cell
> 
> root.cell 
>     RWrite: 536870915     ROnly: 536870916 
>     number of sites -> 4
>        server filip.cs.uwm.edu partition /vicepa RW Site 
>        server filip.cs.uwm.edu partition /vicepa RO Site 
>        server eastside.cs.uwm.edu partition /vicepa RO Site 
>        server solomons.cs.uwm.edu partition /vicepa RO Site  -- Not released
> eastside.cs 72 % vos convertROtoRW eastside a root.cell
> VLDB indicates that a RW volume exists already on filip.cs.uwm.edu in partition /vicepa.
> Overwrite this VLDB entry? [y|n] (n)
> y
> eastside.cs 73 % vos listvldb root.cell
> 
> root.cell 
>     RWrite: 536870915     ROnly: 536870916 
>     number of sites -> 3
>        server solomons.cs.uwm.edu partition /vicepa RO Site  -- Not released
>        server filip.cs.uwm.edu partition /vicepa RO Site 
>        server eastside.cs.uwm.edu partition /vicepa RW Site 
> eastside.cs 74 % vos remsite filip a root.cell
> /usr/afsws/etc/vos: No such device
> eastside.cs 75 % vos listvldb root.cell
> /usr/afsws/etc/vos: Connection timed out
> eastside.cs 76 % /usr/afs/bin/vos listvldb root.cell
> 
> root.cell 
>     RWrite: 536870915     ROnly: 536870916 
>     number of sites -> 3
>        server solomons.cs.uwm.edu partition /vicepa RO Site  -- Not released
>        server filip.cs.uwm.edu partition /vicepa RO Site 
>        server eastside.cs.uwm.edu partition /vicepa RW Site 
> eastside.cs 77 % /usr/afs/bin/vos remsite filip a root.cell
> Deleting the replication site for volume 536870915 ...Removed replication site filip /vicepa for volume root.cell
> eastside.cs 78 % /usr/afs/bin/vos listvldb root.cell
> 
> root.cell 
>     RWrite: 536870915     ROnly: 536870916 
>     number of sites -> 2
>        server solomons.cs.uwm.edu partition /vicepa RO Site  -- Not released
>        server eastside.cs.uwm.edu partition /vicepa RW Site 
> eastside.cs 79 % vos release root.cell
> /usr/afsws/etc/vos: Connection timed out
> eastside.cs 80 % /usr/afs/bin/vos release root.cell
> Failed to start transaction on volume 536870915
> Volume needs to be salvaged
> Error in vos release command.
> Volume needs to be salvaged
> eastside.cs 81 % /usr/afs/bin/vos listvldb root.cell
> 
> root.cell 
>     RWrite: 536870915     ROnly: 536870916 
>     number of sites -> 2
>        server solomons.cs.uwm.edu partition /vicepa RO Site  -- Not released
>        server eastside.cs.uwm.edu partition /vicepa RW Site 
> eastside.cs 82 % /usr/afs/bin/bos salvage eastside a root.cell
> Starting salvage.
> bos: salvage completed
> eastside.cs 83 % vos listvldb root.cell
> /usr/afsws/etc/vos: Connection timed out
> eastside.cs 84 % /usr/afs/bin/vos listvldb root.cell
> 
> root.cell 
>     RWrite: 536870915     ROnly: 536870916 
>     number of sites -> 2
>        server solomons.cs.uwm.edu partition /vicepa RO Site  -- Not released
>        server eastside.cs.uwm.edu partition /vicepa RW Site 
> eastside.cs 85 % /usr/afs/bin/vos addsite eastside a root.cell
> Added replication site eastside /vicepa for volume root.cell
> eastside.cs 86 % /usr/afs/bin/vos release root.cell
> Released volume root.cell successfully
> eastside.cs 87 % fs checkv  
> usage: /usr/openwin/bin/xfs [-config config_file] [-port tcp_port]
> eastside.cs 88 % /usr/afsws/bin/fs checkv
> /usr/afsws/bin/fs: Connection timed out
> eastside.cs 89 % /usr/afs/bin/fs checkv
> All volumeID/name mappings checked.
> eastside.cs 90 % /usr/afsws/bin/fs checks
> All servers are running.
> eastside.cs 91 % vos listvldb root.cell
> 
> root.cell 
>     RWrite: 536870915     ROnly: 536870916 
>     number of sites -> 3
>        server solomons.cs.uwm.edu partition /vicepa RO Site 
>        server eastside.cs.uwm.edu partition /vicepa RW Site 
>        server eastside.cs.uwm.edu partition /vicepa RO Site 


-- 
-----------------------------------------------------------------
Hartmut Reuter                  e-mail 		reuter@rzg.mpg.de
			   	phone 		 +49-89-3299-1328
			   	fax   		 +49-89-3299-1301
RZG (Rechenzentrum Garching)   	web    http://www.rzg.mpg.de/~hwr
Computing Center of the Max-Planck-Gesellschaft (MPG) and the
Institut fuer Plasmaphysik (IPP)
-----------------------------------------------------------------