[OpenAFS] Re: help, salvaged volume won't come back online, is it corrupt? [trimmed log]

Steve Simmons scs@umich.edu
Wed, 13 Sep 2006 12:15:20 -0400

On Sep 13, 2006, at 9:59 AM, Hartmut Reuter wrote:

> Juha Jäykkä wrote:
>>> Better you do a "vos convertROtoRW" on the RO-site as soon as
>>> possible to regain a valid RW-volume in this case.
>> Except that I'm unlikely to notice the corruption before it's
>> released, which happens automatically. Sounds like we need to change
>> our backup policy...
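(For context, that conversion is a single vos command; the server, partition, and volume names below are placeholders, not ones from this thread:)

```
vos convertROtoRW -server fs1.example.com -partition /vicepa \
    -id user.jdoe.readonly
```

It promotes the RO clone on that site to a new RW volume in place.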
> The best way to prevent the salvager from corrupting volumes is not
> to run it automatically. If you configure your OpenAFS with
> "--enable-fast-restart" then the fileserver will not salvage
> automatically after a crash. So if you find after a crash volumes
> which couldn't be attached, you salvage them by "bos salvage server
> partition volume" and examine the SalvageLog. I suppose in the case
> where it throws the root directory away you will see something in
> the log.
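For reference, the manual salvage described above looks something like this (server, partition, and volume names are placeholders; the log path is the traditional Transarc-style location, which may differ on your installation):

```
bos salvage -server fs1.example.com -partition /vicepa -volume user.jdoe
less /usr/afs/logs/SalvageLog
```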

In a former life I had some Transarc AFS servers which had persistent
problems starting after a major crash due to a multi-day power outage.
Some were along the same lines as reported here. The only cleanup
process that worked went something like this:

For every volume in the old server:
   vos dump the volume to a file
   Restore it from the file to a different name on an otherwise empty
      server
   Salvage that volume with orphans being attached. Since you're
      salvaging a copy, you have no risk of hosing the production volume
   If no problems, move the original volume to a new server
   If recoverable problems in the salvage:
      delete the copy you'd made
      move the volume to the empty server
      salvage it, and clean up the orphans with the end user
      move the volume to a new server
      thoroughly clean (mkfs) the empty server
   If unrecoverable problems in the volume salvage:
      tar up the existing volume as best you can
      apologize profusely to the user.
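A sketch of the happy path above in vos/bos commands (all names are placeholders: vol.user is the volume, afs1 the old server, afs2 the empty scratch server, afs3 the new home; check your vos/bos man pages, since flags have varied between releases):

```
# Full dump of the volume to a file (-time 0 = dump everything).
vos dump -id vol.user -time 0 -file /var/tmp/vol.user.dump

# Restore the dump under a DIFFERENT name onto the scratch server.
vos restore afs2 /vicepa vol.user.copy -file /var/tmp/vol.user.dump

# Salvage only the copy, attaching orphans; the production volume
# is never touched.
bos salvage -server afs2 -partition /vicepa -volume vol.user.copy \
    -orphans attach

# If the copy came through clean, move the original to its new home
# and delete the scratch copy.
vos move -id vol.user -fromserver afs1 -frompartition /vicepa \
    -toserver afs3 -topartition /vicepa
vos remove -server afs2 -partition /vicepa -id vol.user.copy
```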

Fortunately only about 1% of the volumes had problems and most were
easily remedied. But until we emptied the original servers and
rebuilt them from scratch with modern OpenAFS, we had lingering
oddball problems.

In my current position we have about 10.5TB of user files in almost
250,000 volumes spread across 22 servers of various sizes, running
1.4. We see various minor problems, which the OpenAFS developers seem
to be addressing quite well, but as a defensive measure we're just
starting a policy of periodically emptying a file server and
ruthlessly salvaging it. Ask me in a year and I'll let you know how
it goes.