[OpenAFS] Re: [OpenAFS-devel] 1.6 and post-1.6 OpenAFS branch
management and schedule
Rainer Toebbicke
rtb@pclella.cern.ch
Fri, 18 Jun 2010 11:47:25 +0200
Jeffrey Hutzelman wrote:
>
> Really, I consider enable-fast-restart to be extremely dangerous.
> It should have gone away long ago.
>
> I realize some people believe that speed is more important than not
> losing data, but I don't agree, and I don't think it's an appropriate
> position for a filesystem to take. Not losing your data is pretty much
> the defining difference between filesystems you can use and filesystems
> from which you should run away screaming as fast as you can. I do not
> want people to run away screaming from OpenAFS, at any speed.
>
I beg to disagree: the Volume/Vnode back-end by no means has the same problems
that a file system might have. Damage there will never wildly destroy random
items on disk, as you would have to fear with a file system. At least in
namei, damage to a volume is entirely contained within it; files themselves
are at worst replaced entirely by others, and they are never partly corrupted
other than by being half-written or the like. Of course, files on disk can
become unfindable, or directories can have bogus entries.
My experience is that damage to the vnode files usually results in
directories containing inaccessible entries and, on very rare occasions,
cross-linked files.
The link table is surprisingly robust (even with its header overwritten).
I reckon that in over 15 years of AFS service we've probably had more bit
errors in files due to uncaught memory errors and uncaught transmission
errors, not to mention the major culprit "programming errors", than nasty
inconsistencies after crashes which complete and immediate salvaging would
have caught.
We salvage volumes in the background at a low rate, and on file servers which
never crash the logs show the same odd issues as on those that crashed; hence
the added risk of running with potential damage is within the error bars. Even
between salvages, volumes drop out every now and then. The practical approach
is to detect this quickly and re-salvage, and when the rate exceeds the pain
threshold, find the bug and fix it.
For us, the delta does not justify keeping the service down for several hours
after a crash. Make that delta proportionally bigger by fixing the other
issues and I will revise my statement.
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985 Fax: +41 22 767 7155