[OpenAFS-devel] CopyOnWrite failure

Hartmut Reuter hwr@rzg.mpg.de
Wed, 13 Mar 2002 10:34:31 +0100 (CET)


I had seen a similar effect on our MR-AFS servers some time ago. It also
happened only on servers with low traffic and I finally found out what it
was:

The ihandle layer tries to keep as many file descriptors open as
possible. When you create a new directory, it gets tag 0 in the link
table. When you create a backup volume, the link count for this tag 0
file is increased to 2. Then you create a new file in the directory,
which forces a copy on write that creates the directory file with
tag 1. The next night the reclone of the backup volume unlinks the
directory with tag 0, but the fileserver may still have it open!! If
the directory is now modified again, the copy on write creates a file
with tag 0 again. But because the old one is still open (although
removed from the /vicep.... directory), the copy on write writes into
the old dead file!
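The underlying POSIX behavior is easy to demonstrate outside AFS.
Here is a minimal standalone sketch (the file name and layout are
made up for illustration; this is not the actual ihandle code):

    /* Writes through a cached descriptor to an unlinked file go into
     * the dead inode, not into a new file that later reuses the name. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *name = "tag0";   /* stands in for the tag 0 file */

        /* descriptor cached, as the ihandle layer would keep it */
        int cached = open(name, O_CREAT | O_RDWR, 0600);

        unlink(name);                /* nightly reclone removes tag 0 */

        /* a new tag 0 file is created for the next copy on write ... */
        int fresh = open(name, O_CREAT | O_RDWR, 0600);

        /* ... but the write still goes through the stale descriptor */
        write(cached, "data", 4);

        /* the new file stays empty: the data went into the dead file */
        char buf[16];
        printf("bytes in new file: %zd\n", read(fresh, buf, sizeof(buf)));

        close(cached);
        close(fresh);
        return 0;
    }

Closing the cached descriptor before the tag is reused is what the
missing REALLYCLOSE would have to do.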

I guess a REALLYCLOSE must be missing somewhere.

All our servers are running MR-AFS, so I haven't seen this anymore.

Hartmut

On Wed, 13 Mar 2002, Marco Foglia wrote:

> Hi,
> 
> here is some additional information from our site about this
> problem. 
> 
> Derrick J Brashear wrote:
> > 
> > Some questions for those of you with this problem:
> > -always with non-replicated volumes that have a .backup?
> 
> Yes. It never happens if you remove the .backup volume.
> 
> > -if the above, was the backup being recreated at the time? (the VolserLog
> >  may be helpful here, as well as the vos examine info)
> 
> We recreate the backup volumes around midnight, but the 
> CopyOnWrite failure "happens" when users log in in the morning.
> BUT, there were cases where the volume was already corrupt before
> we saw the "CopyOnWrite failure" in the file server log (the
> last backup of these volumes was already corrupt).  
> 
> > -what if anything pertinent about access patterns?
> 
> We have one Linux file server (300 GB, 550 volumes) which 
> does not have the CopyOnWrite bug! I tried to clone this server 
> using exactly the same hardware and an rsync copy, but the 
> cloned file server suffers from the bug. So I don't think that 
> a special access pattern (or AFS client version) is 
> responsible for it. The likelihood is just higher for heavy 
> users. The only difference between our "stable" and any other 
> file server (= "unstable") is that the "stable" one is 60% full 
> and the "unstables" are more or less empty. Could the timing of 
> some file system functions be different and therefore trigger 
> the bug?
> 
> Marco 
> 
> --
> Marco Foglia | Paul Scherrer Institut  | phone     +41 56 310 36 39 
>              | CH-5232 Villigen        | mailto:marco.foglia@psi.ch
> -------------------------------------------------------------------
> _______________________________________________
> OpenAFS-devel mailing list
> OpenAFS-devel@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-devel
> 

-----------------------------------------------------------------
Hartmut Reuter                           e-mail reuter@rzg.mpg.de
					   phone +49-89-3299-1328
RZG (Rechenzentrum Garching)               fax   +49-89-3299-1301 
Computing Center of the Max-Planck-Gesellschaft (MPG) and the
Institut fuer Plasmaphysik (IPP)
-----------------------------------------------------------------