[OpenAFS] Re: volume offline due to too low uniquifier (and salvage cannot fix it)
Andrew Deason
adeason@sinenomine.net
Tue, 16 Apr 2013 16:56:15 -0500
On Tue, 16 Apr 2013 19:27:33 +0000
Jakub Moscicki <Jakub.Moscicki@cern.ch> wrote:
> I am not sure, but if the uniquifier is just to uniquify the
> potentialy reused vnode id, then the risk of collisions is really low
> even without this check, right? One would have to have the uniquifier
> wrapped around 32bits in between the reuse of vnode id and hit exactly
> the same uniquifier number. It depends on the details of course of
> vnode id reuse algorithm but looks like very low probability.
I think there are two issues here, if that check is removed.
- If the salvager ever fails to fix the volume max uniquifier properly,
  or if somehow we don't track the max uniq correctly, or we otherwise
  end up with a file in the volume whose uniq is set to e.g. 10
  above the current max uniq... then that's potentially a serious
  problem. This is generally why the check in the fileserver is there
  (I assume), since getting this wrong can mean cache problems, file
  corruption, etc. Without that check on attachment, we don't really
  have a way to verify that it's correct (unless maybe we special-case
  some of these scenarios with really big uniqs, but then you still have
  the problem, just restricted to big-uniq volumes). If the max
  uniq on disk gets screwed up, it'll just always be wrong and you'll
  always be vulnerable to collisions.
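To make the first point concrete, here's a rough Python sketch (not the actual OpenAFS C code; the function name and shape are hypothetical) of the kind of attach-time sanity check being discussed: refuse to attach a volume if any vnode carries a uniquifier above the volume's recorded maximum, since a later allocation could otherwise hand out a (vnode, uniq) pair that already exists.

```python
def check_volume_uniq(max_uniq, vnode_uniqs):
    """Hypothetical attach-time check: every live vnode's uniquifier
    must be <= the volume header's recorded max uniquifier.  If not,
    the volume is left offline so the salvager can repair max_uniq."""
    for uniq in vnode_uniqs:
        if uniq > max_uniq:
            return False  # inconsistent header: do not attach
    return True

# A healthy volume passes:
assert check_volume_uniq(max_uniq=5000, vnode_uniqs=[1, 100, 4999]) is True
# A volume whose on-disk max uniq fell behind one of its files fails,
# which is the "uniq 10 above the current max" scenario above:
assert check_volume_uniq(max_uniq=5000, vnode_uniqs=[100, 4999, 5010]) is False
```

Without some check like this at attach time, the stale header is never detected and every future allocation is a potential FID collision.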
- Even without that, with everything behaving normally but with uniq
  rollover, I don't think the probability of collision is _that_ low. If
  you have an otherwise normal volume with, say, a few thousand files,
  and then somehow use up 2^32-1000 uniqs, your uniqs are now very
  close to what the original thousand files were, and the probability
  of collision is very high.
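A toy Python sketch of that rollover scenario (a hypothetical allocator, not OpenAFS code; the wrap behavior and starting values are assumptions for illustration): a thousand original files hold low uniqs, the 32-bit counter wraps, and the very next allocations land right back in the live range.

```python
MAX_UNIQ = 2**32 - 1  # assumed 32-bit uniquifier space

def next_uniq(counter):
    """Hypothetical allocator: increment, wrapping MAX_UNIQ back to 1."""
    return (counter % MAX_UNIQ) + 1

# The original thousand files got uniqs 1..1000 and still exist:
live_uniqs = set(range(1, 1001))

# After ~2^32-1000 further allocations the counter sits just below
# the wrap point; allocate ten more and count how many collide:
counter = 2**32 - 5
collisions = 0
for _ in range(10):
    counter = next_uniq(counter)
    if counter in live_uniqs:
        collisions += 1  # reused vnode id + this uniq == an old FID

# Six of the ten new uniqs (1..6) collide with live files.
```

So once the counter wraps into the neighborhood of long-lived files, collisions aren't a rare fluke; they're nearly guaranteed for a stretch of allocations.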
That may not seem very "likely", but what I don't like about it is
that its improbability depends on how you're using it, not on
random events. That is, to me it's not really about the likelihood of
the event happening; it means that a very specific file access
pattern in AFS can result in consistency problems. And if someone has
a workload that somehow creates that access pattern... well, there's
nothing they can do. That's really not a great situation, especially
since you may not even be aware that you are exercising such an
access pattern (and even if you are, you may not consider it to be an
unusual pattern; "why can't AFS work with blah blah blah").
Maybe those aren't the worst things in the world, but fixing this
during salvage seems to have fewer downsides. I guess the
primary downside is speed, though for this specific instance that
doesn't seem to have been a problem :)
--
Andrew Deason
adeason@sinenomine.net