[OpenAFS] Re: help, salvaged volume won't come back online, is it corrupt? [trimmed log]
Steve Simmons
scs@umich.edu
Wed, 13 Sep 2006 12:15:20 -0400
On Sep 13, 2006, at 9:59 AM, Hartmut Reuter wrote:
> Juha Jäykkä wrote:
>
>>> Better to do a "vos convertROtoRW" on the RO site as soon as
>>> possible to regain a valid RW volume in this case.
>> Except that I'm unlikely to notice the corruption before it's
>> released, which happens automatically. Sounds like we need to change
>> our backup policy...
>
>
> The best way to prevent the salvager from corrupting volumes is not
> to run it automatically. If you configure your OpenAFS with
> "--enable-fast-restart" then the fileserver will not salvage
> automatically after a crash. So if after a crash you find volumes
> which couldn't be attached, you salvage them with "bos salvage server
> partition volume" and examine the SalvageLog. I suppose that in the
> case where it throws the root directory away you will see something
> in the log.
In a former life I had some Transarc AFS servers which had persistent
problems starting after a major crash caused by a multi-day power
outage. Some showed the same error line reported here. The only cleanup
process that worked went something like this (a rough command sketch
follows the outline):
For every volume on the old server:
    vos dump the volume to a file.
    Restore it from the file under a different name on an otherwise
      empty server.
    Salvage that copy with orphans being attached. Since you're
      salvaging a copy, there is no risk of hosing the production volume.
    If there were no problems, move the original volume to a new server.
    If the salvage found recoverable problems:
        delete the copy you'd made
        move the original volume to the empty server
        salvage it, and clean up the orphans with the end user
        move the volume to a new server
        thoroughly clean (mkfs) the empty server
    If the salvage found unrecoverable problems:
        tar up the existing volume as best you can
        apologize profusely to the user.
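As a rough illustration of the per-volume steps above (all volume,
server, and path names here are made up, and the flags should be
checked against your vos/bos versions):

    # 1. Dump the suspect volume in full to a file.
    vos dump -id user.jdoe -time 0 -file /tmp/user.jdoe.dump

    # 2. Restore it under a scratch name on the otherwise-empty server.
    vos restore -server scratch.example.edu -partition /vicepa \
        -name user.jdoe.test -file /tmp/user.jdoe.dump

    # 3. Salvage the copy, attaching orphans; production is untouched.
    bos salvage -server scratch.example.edu -partition /vicepa \
        -volume user.jdoe.test -orphans attach

    # 4. If the copy salvaged cleanly, move the real volume to its new home
    #    and throw away the scratch copy.
    vos move -id user.jdoe -fromserver oldfs.example.edu -frompartition /vicepa \
        -toserver newfs.example.edu -topartition /vicepa
    vos remove -server scratch.example.edu -partition /vicepa -id user.jdoe.test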
Fortunately only about 1% of the volumes had problems and most were
easily remedied. But until we emptied the original servers and
rebuilt them from scratch with modern OpenAFS, we had lingering
oddball problems.
In my current position we have about 10.5TB of user files in almost
250,000 volumes spread across 22 servers of various sizes, running
OpenAFS 1.4. We see various minor problems which the OpenAFS developers
seem to be addressing quite well, but as a defensive measure we're just
starting a policy of periodically emptying a file server and
ruthlessly salvaging it. Ask me in a year and I'll let you know how
it goes.
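The drain step is essentially a loop over the partition; something like
the sketch below is the idea (server and partition names are
placeholders, and only RW volumes are moved, since RO sites get
re-released rather than moved):

    OLD=afs-old.example.edu    # server being emptied (assumed name)
    NEW=afs-new.example.edu    # server receiving the volumes (assumed name)
    PART=/vicepa

    # Move every read-write volume off the old partition.
    vos listvol -server $OLD -partition $PART -quiet |
    awk '{print $1}' |
    grep -v '\.readonly$' | grep -v '\.backup$' |
    while read vol; do
        [ -n "$vol" ] || continue
        vos move -id "$vol" \
            -fromserver $OLD -frompartition $PART \
            -toserver   $NEW -topartition   $PART
    done

    # With the partition empty, a full salvage (or mkfs and rebuild) is cheap.
    bos salvage -server $OLD -partition $PART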
Steve