[OpenAFS] CopyOnWrite failed - orphaned files

Tue, 5 Mar 2002 13:38:29 +0100 (CET)

I guess these problems have nothing to do with your hardware.

Have you checked the file he tries to read exists in the /vicepX partition
and has the correct size?

If so, I would try to run the non-threaded version of the fileserver which
is built in src/viced. Also the problem of processes not going away looks
for me like a problem in the pthread environment.

Hartmut

On Mon, 4 Mar 2002 hoffman@cs.pitt.edu wrote:

> Marco Foglia (marco.foglia@psi.ch) wrote:
> 
> > we are having continuous troubles with CopyOnWrite failures on a linux
> > file server and as a result orphaned files. We have tried several setups
> > and even changed hardware but it is still there. The current setup is 
> > redhat-6.2 with 2.2.19-6.2.12smp and Openafs 1.2.2. 
> >
> > It only happens after a full backup (vos dump <volume>.backup -time 0) of
> > a backup volume. You can find the following message in the file server
> > log file:
> >
> > CopyOnWrite failed: volume 536872685 in partition /vicepa  (tried
> > reading 8192, read 0, wrote 0, errno 4) volume needs salvage
> >
> > Does anybody have the same problems? Solutions? 
> 
> 
> I have been fighting this problem for almost a year with no solution.
> I didn't post anything to this list before because I wasn't convinced
> that the problem wasn't in my hardware.  By now, I've tried enough
> different combinations that I'm convinced that it's a software problem.
> 
> My hardware:   Dual 800 MHz Pentium IIIs on two different motherboards,
>                1 GB SDRAM
>                DPT Century VI RAID controller
>                 or
>                Adaptec 3200S RAID controller
>                 or
>                Mylex AcceleRAID 352 RAID controller
>                 or
>                non-RAID Symbios SCSI controller
> 
> My software:   RedHat 6.2, 2.2.16-3
>                Transarc AFS Base configuration afs3.6 2.0
>                 or
>                RedHat 6.2, 2.2.16-3
>                Transarc AFS Base configuration afs3.6 2.3
>                 or
>                RedHat 6.2, 2.2.16-3
>                OpenAFS 1.0.3
>                 or
>                [ many different versions in between ]
>                 or
>                RedHat 7.2, kernel 2.4.9-21
>                OpenAFS 1.2.3
> 
> As you can see, I've tried many different hardware and software
> combinations and I still get corrupted volumes.  It can happen
> on any backup, not just full backups.  I have tried running the
> servers in uniprocessor mode and while that seemed to help, it
> did not eliminate the problem.
> 
> 
> Here are some excerpts from the log files on the most recent
> corruption, using the last software configuration above:
> 
> FileLog:
> Thu Feb 28 15:47:19 2002 CopyOnWrite failed: volume 536877621 in partition /vicepc  (tried reading 8192, read 0, wrote 0, errno 4) volume needs salvage
> Thu Feb 28 15:58:46 2002 VAttachVolume: volume salvage flag is ON for /vicepc//V0536877621.vol; volume needs salvage
> 
> VolserLog:
> Thu Feb 28 15:35:09 2002 1 Volser: Clone: Recloning volume 536876261 to volume 536876263
> [...]
> Thu Feb 28 15:58:46 2002 VAttachVolume: volume salvage flag is ON for /vicepc/V0536877621.vol; volume needs salvage
> Thu Feb 28 15:58:46 2002 1 Volser: ListVolumes: Could not attach volume 536877621 (V0536877621.vol) error=101
> 
> SalvageLog.old:
> @(#) OpenAFS 1.2.3 built  2002-02-01 
> 02/28/2002 16:00:48 STARTING AFS SALVAGER 2.4 (/usr/afs/bin/salvager /vicepc 536877621)
> 02/28/2002 16:00:48 CHECKING CLONED VOLUME 536877622.
> 02/28/2002 16:00:48 cs.usr0.naveen.backup (536877622) updated 02/28/2002 15:30
> 02/28/2002 16:00:48 Vnode 1: length incorrect; (is 8192 should be 0)
> 02/28/2002 16:00:48 SALVAGING VOLUME 536877621.
> 02/28/2002 16:00:48 cs.usr0.naveen (536877621) updated 02/28/2002 15:46
> 02/28/2002 16:00:48 Vnode 1198: version < inode version; fixed (old status)
> 02/28/2002 16:00:48 Vnode 1514: version < inode version; fixed (old status)
> 02/28/2002 16:00:48 Vnode 2754: version < inode version; fixed (old status)
> [ similar lines deleted ]
> 02/28/2002 16:00:48 Vnode 10078: version < inode version; fixed (old status)
> 02/28/2002 16:00:48 Vnode 1: length incorrect; changed from 8192 to 0
> 02/28/2002 16:02:19 First page in directory does not exist.
> 02/28/2002 16:02:19 Directory bad, vnode 1; salvaging...
> 02/28/2002 16:02:19 Salvaging directory 1...
> 02/28/2002 16:02:19 Failed to read first page of fromDir!
> 02/28/2002 16:02:19 Checking the results of the directory salvage...
> 02/28/2002 16:02:20 dir vnode 601: special old unlink-while-referenced file .__afsD8D is deleted (vnode 1120)
> 02/28/2002 16:02:20 dir vnode 601: special old unlink-while-referenced file .__afs8036 is deleted (vnode 780)
> 02/28/2002 16:02:20 dir vnode 623: special old unlink-while-referenced file .__afsDED1 is deleted (vnode 7988)
> [ similar lines deleted ]
> 02/28/2002 16:02:20 dir vnode 623: special old unlink-while-referenced file .__afs4975 is deleted (vnode 5862)
> 02/28/2002 16:02:20 Vnode 1: link count incorrect (was 45, now 2)
> 02/28/2002 16:02:30 Found 4784 orphaned files and directories (approx. 587713 KB)
> 02/28/2002 16:02:30 Salvaged cs.usr0.naveen (536877621): 4785 files, 587714 blocks
> 
> SalvageLog:
> @(#) OpenAFS 1.2.3 built  2002-02-01 
> 02/28/2002 16:05:06 STARTING AFS SALVAGER 2.4 (/usr/afs/bin/salvager /vicepc 536877621 -orphans attach)
> 02/28/2002 16:05:07 SALVAGING VOLUME 536877621.
> 02/28/2002 16:05:07 cs.usr0.naveen (536877621) updated 02/28/2002 15:46
> 02/28/2002 16:05:07 Attaching orphaned directory to volume's root dir as __ORPHANDIR__.3.207280
> 02/28/2002 16:05:07 Attaching orphaned directory to volume's root dir as __ORPHANDIR__.11.46709
> 02/28/2002 16:05:07 Attaching orphaned directory to volume's root dir as __ORPHANDIR__.13.7
> [ similar lines deleted ]
> 02/28/2002 16:05:07 Attaching orphaned file to volume's root dir as __ORPHANFILE__.7846.198967
> 02/28/2002 16:05:07 Attaching orphaned file to volume's root dir as __ORPHANFILE__.9320.219854
> 02/28/2002 16:05:07 Vnode 1: link count incorrect (was 2, now 45)
> 02/28/2002 16:05:07 Vnode 3: link count incorrect (was 2, now 3)
> 02/28/2002 16:05:07 Vnode 11: link count incorrect (was 1, now 2)
> 02/28/2002 16:05:07 Vnode 13: link count incorrect (was 7, now 8)
> 02/28/2002 16:05:07 Vnode 15: link count incorrect (was 2, now 3)
> [ similar lines deleted ]
> 02/28/2002 16:05:07 Vnode 9320: link count incorrect (was 0, now 1)
> 02/28/2002 16:05:07 Salvaged cs.usr0.naveen (536877621): 4785 files, 587715 blocks
> 02/28/2002 16:05:07 The volume header file V0536877622.vol is not associated with any actual data (deleted)
> 
> 
> The FileLog says "errno 4", which is Interrupted System Call.  Could this
> be a clue?
> 
> We have a lot of older machines of many architectures here running older
> client software.  Could that cause this problem?
> 
> Is there anything I can do to instrument the servers to help find the
> root cause of the problem?  
> 
> 
> Another problem that may or may not be related:
> 
> On these same machines, the fileserver processes can get into a state such
> that a bos restart, shutdown or stop will not kill them and neither will a
> simple "kill <pid>".  You must do a "kill -9" on each of the fileserver
> processes to make them go away.  In the case of a bos restart, new
> fileserver processes are started without the old ones having been killed,
> the new processes fail, causing the salvager to start.  When the salvager
> finishes, the fileserver processes are started again and fail again,
> leading to an endless fileserver-salvager-fileserver-salvager cycle.
> This has led to us disabling the 4:00 AM Sunday restarts.
> 
> 
> Thanks in advance for your help.
> 
> 	---Bob.
> --
> Bob Hoffman, N3CVL      University of Pittsburgh           Tel: +1 412 624 8404
> hoffman@cs.pitt.edu     Department of Computer Science     Fax: +1 412 624 8854
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info
> 

-----------------------------------------------------------------
Hartmut Reuter                           e-mail reuter@rzg.mpg.de
					   phone +49-89-3299-1328
RZG (Rechenzentrum Garching)               fax   +49-89-3299-1301 
Computing Center of the Max-Planck-Gesellschaft (MPG) and the
Institut fuer Plasmaphysik (IPP)
-----------------------------------------------------------------