[OpenAFS] Possible RO volume corruption, AFS 1.4.1 on Solaris 8

Thu, 03 Aug 2006 17:05:58 -0700

Kevin Hildebrand <kevin@umd.edu> writes:

> Hello, we've been having problems recently with one of our volumes
> having most or all of its RO replications go offline at approximately
> the same time.  The RW volume has remained stable, so it's only the ROs
> that we're having problems with.

> This volume is released on an hourly basis, and normally has 3 RO
> replications.  What's been happening is that, at some point in between
> replications, the volume is taken offline:

> FileLog:
> Thu Aug  3 12:46:42 2006 VAttachVolume: volume salvage flag is ON for
> /vicepc//V1970897351.vol; volume needs salvage

> VolserLog:
> Thu Aug  3 12:46:42 2006 VAttachVolume: volume salvage flag is ON for
> /vicepc/V1970897351.vol; volume needs salvage

> There is no other relevant entry in the logs as to WHY the volume is
> being taken offline.  I'll be adding some debug code to the fileserver
> shortly to see if I can nail down where this is occurring, if no one
> else has any leads.

Yeah, we've been seeing the same problem intermittently with the same
configuration.  There's a fix in 1.4.2-to-be that will hopefully take care
of this.  We don't think the volume is actually being corrupted; we think
it's being taken off-line unnecessarily because an error is being
misinterpreted.

Unfortunately, once it's taken off-line, because it's a replica, you
pretty much have to vos zap it and then re-release to get it properly
restored and on-line again.
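
As a rough sketch of that zap-and-re-release sequence (not a tested
procedure): the server name and RW volume name below are hypothetical
placeholders, and the RO volume ID is only assumed to match the
V1970897351.vol header named in the log above.  DRY_RUN defaults to 1 so
the script just prints the vos commands; set DRY_RUN=0 to actually run
them.

```shell
#!/bin/sh
# Hypothetical placeholders: substitute your own server and volume name.
SERVER="${SERVER:-fs1.example.edu}"
PARTITION="${PARTITION:-/vicepc}"
# Assumption: the ID in V1970897351.vol is the ID of the off-line RO replica.
RO_ID="${RO_ID:-1970897351}"
RW_NAME="${RW_NAME:-example.volume}"
# vos zap is destructive, so only print the commands unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

restore_replica() {
    # Remove the off-line replica from the partition (the VLDB is not changed).
    run vos zap -server "$SERVER" -partition "$PARTITION" -id "$RO_ID"
    # Re-release the RW volume so the replica is recreated at that site.
    run vos release -id "$RW_NAME"
    # Check the replica's status afterwards.
    run vos examine -id "$RO_ID"
}

restore_replica
```

Repeat with a different SERVER and PARTITION for each site whose replica
went off-line.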

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>