[OpenAFS] Possible RO volume corruption, AFS 1.4.1 on Solaris 8
Russ Allbery
rra@stanford.edu
Thu, 03 Aug 2006 17:05:58 -0700
Kevin Hildebrand <kevin@umd.edu> writes:
> Hello, we've been having problems recently with one of our volumes
> having most or all of its RO replications go offline at approximately
> the same time. The RW volume has remained stable, so it's only the ROs
> that we're having problems with.
> This volume is released on an hourly basis, and normally has 3 RO
> replications. What's been happening is that at some point in between
> releases, the volume is taken offline:
> FileLog:
> Thu Aug 3 12:46:42 2006 VAttachVolume: volume salvage flag is ON for
> /vicepc//V1970897351.vol; volume needs salvage
> VolserLog:
> Thu Aug 3 12:46:42 2006 VAttachVolume: volume salvage flag is ON for
> /vicepc/V1970897351.vol; volume needs salvage
> There is no other relevant entry in the logs as to WHY the volume is
> being taken offline. I'll be adding some debug code to the fileserver
> shortly to see if I can nail down where this is occurring, if no one
> else has any leads.
Yeah, we've been seeing the same problem intermittently with the same
configuration. There's a fix in 1.4.2-to-be that will hopefully take care
of this. We don't think the volume is actually being corrupted; we think
it's being taken off-line unnecessarily due to the misinterpretation of an
error.
Unfortunately, once it's taken off-line, because it's a replica, you
pretty much have to vos zap it and then re-release to get it properly
restored and on-line again.
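
As a rough sketch of that recovery, something like the following should
work. The partition and volume ID are taken from the log excerpt quoted
above, on the assumption that the ID is that of the RO replica; the server
name and the RW volume name are hypothetical placeholders, since neither
appears in this thread. The script only prints the commands unless RUN=1
is set, so they can be reviewed before anything is zapped.

```shell
#!/bin/sh
# Sketch only: zap a damaged RO replica, then re-release from the RW volume.
SERVER="fs1.example.edu"     # hypothetical file server holding the bad replica
PARTITION="/vicepc"          # partition named in the FileLog excerpt
RO_ID="1970897351"           # volume ID from the log; assumed to be the RO ID
RW_VOLUME="example.volume"   # hypothetical name of the RW volume

# Print each command by default; execute it only when RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

recover_replica() {
    # Remove the off-line replica from the server's partition.
    run vos zap -server "$SERVER" -partition "$PARTITION" -id "$RO_ID"
    # Re-release so the replica is recreated from the RW volume.
    run vos release -id "$RW_VOLUME"
}

recover_replica
```

Both vos commands need administrative privileges on the servers involved,
and the same steps would have to be repeated for each replica site that
went off-line.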
--
Russ Allbery (rra@stanford.edu) <http://www.eyrie.org/~eagle/>