[OpenAFS] CopyOnWrite failed. Workarounds?

Hartmut Reuter reuter@rzg.mpg.de
Tue, 28 May 2002 16:49:45 +0200

The fact that the CopyOnWrite failure is seen only on servers with very
low load indicates that it has something to do with the file descriptor
caching. As I mentioned some months ago, we had a similar problem in
MR-AFS, where a directory was not REALLYCLOSEd before it was unlinked.
CopyOnWrite then wrote into the unlinked directory instead of the newly
created one, because both had the identical UFS path. The effect was
that after the file descriptor was closed the data were lost, and the
newly created file had a length of 0 bytes.
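
To illustrate the mechanism only, here is a minimal sketch in Python (not
the fileserver's C code), assuming a POSIX file system where an open file
can be unlinked. The function name and the file name "dir_vnode" are
hypothetical. A descriptor is kept open across the unlink, the copy is
written through it, and the file newly created under the same path stays
empty:

```python
import os
import tempfile


def stale_descriptor_demo() -> int:
    """Write through a descriptor that survived an unlink and return the
    size of the file newly created under the same path."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dir_vnode")  # hypothetical file name

        # 1. Open the old file and keep the descriptor, as a cache would.
        cached_fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)

        # 2. Unlink the old file WITHOUT really closing the descriptor.
        os.unlink(path)

        # 3. Create the new file under the identical path.
        new_fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
        os.close(new_fd)

        # 4. The copy is written through the stale descriptor, so it
        #    lands in the unlinked inode, not in the new file.
        os.write(cached_fd, b"directory contents")

        # 5. Closing the last reference frees the unlinked inode and
        #    the data written to it.
        os.close(cached_fd)

        return os.path.getsize(path)


if __name__ == "__main__":
    print(stale_descriptor_demo())
```

On a POSIX system this should report a size of 0 for the new file, which
matches the symptom we saw in MR-AFS.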

I tried to reproduce this effect on an openafs-1.2.3 test server with
various combinations of directory updates and "vos backup", but never saw
the problem. It would be nice if someone could give a recipe for the
steps necessary to reproduce the failure!

Anyway, if you have a production environment where the number of files
opened over a day far exceeds the number of open file descriptors, you
probably won't see any errors. (Not a very helpful hint, I know!)

Hartmut Reuter

Friedrich Delgado Friedrichs wrote:
> Hi!
> On Sunday I sent a report about my whole home directory becoming orphaned.
> Derrick J Brashear guessed that it may be the "CopyOnWrite failure" bug
> that several people on this list have experienced; however, I could not prove this,
> having lost the logfiles.
> After reading several of those posts here and on openafs-devel, I am now pretty sure that I
> suffered from the same bug because, apart from the missing logfile entry, the behaviour
> on my box was pretty much the same as reported by Marco Foglia in <3C8F30DB.73BAD2EE@psi.ch>
> and Matthew N. Andrews in <BJEHJHBBLPOFPKCANEMGIEABCAAA.mnandrews@lbl.gov> on the openafs-devel
> list.
> Especially since the first report from Marco Foglia dates back to 10/31/2001, and I intend
> to use openafs on my home box and at work in a minor installation, which might serve as
> a testbed for a larger installation later on, I'd prefer not to wait for a fix, but would rather
> like to know which workaround seems viable, especially if you want to have regular backups...
> If I fail to resolve this issue, I'd have to decide not to use afs, since regular lossage
> of this dimension is clearly not desirable. ;-)
> From the other posts, I noticed a few points which might be helpful:
>         - Don't use backup volumes. Marco Foglia clearly stated that the problem never
>                 arises when the backup volumes are removed.
>           This probably means that in a smaller installation I will take regular dumps
>           directly from the home volumes, instead of using the backup volumes. Are there
>           any negative side-effects to be expected, e.g. because a mail delivery process
>           might hang in iowait for a long time, processes might time out, etc.?
>         - I can't see clearly whether using a single-threaded fileserver (is this the
>           same as the lwp fileserver?) will help. Several posters hinted in this direction,
>           but hoffman@cs.pitt.edu in <200203182232.g2IMWil23214@frack.cs.pitt.edu> stated
>           that it might make things worse.
> What else could help me get around the problem, whilst retaining most of afs's functionality?
> Best Regards
> --
>                 Friedrich Delgado Friedrichs

Hartmut Reuter                             e-mail reuter@rzg.mpg.de
                                           phone  +49-89-3299-1328
RZG (Rechenzentrum Garching)               fax    +49-89-3299-1301
Computing Center of the Max-Planck-Gesellschaft (MPG) and the
Institut fuer Plasmaphysik (IPP)