[OpenAFS] CopyOnWrite failed - orphaned files

hoffman@cs.pitt.edu hoffman@cs.pitt.edu
Mon, 4 Mar 2002 12:32:00 -0500 (EST)


Marco Foglia (marco.foglia@psi.ch) wrote:

> we are having continuous troubles with CopyOnWrite failures on a linux
> file server and as a result orphaned files. We have tried several setups
> and even changed hardware but it is still there. The current setup is 
> redhat-6.2 with 2.2.19-6.2.12smp and Openafs 1.2.2. 
>
> It only happens after a full backup (vos dump <volume>.backup -time 0) of
> a backup volume. You can find the following message in the file server
> log file:
>
> CopyOnWrite failed: volume 536872685 in partition /vicepa  (tried
> reading 8192, read 0, wrote 0, errno 4) volume needs salvage
>
> Does anybody have the same problems? Solutions? 


I have been fighting this problem for almost a year with no solution.
I didn't post anything to this list before because I wasn't convinced
that the problem wasn't in my hardware.  By now, I've tried enough
different combinations that I'm convinced that it's a software problem.

My hardware:   Dual 800 MHz Pentium IIIs on two different motherboards,
               1 GB SDRAM
               DPT Century VI RAID controller
                or
               Adaptec 3200S RAID controller
                or
               Mylex AcceleRAID 352 RAID controller
                or
               non-RAID Symbios SCSI controller

My software:   RedHat 6.2, 2.2.16-3
               Transarc AFS Base configuration afs3.6 2.0
                or
               RedHat 6.2, 2.2.16-3
               Transarc AFS Base configuration afs3.6 2.3
                or
               RedHat 6.2, 2.2.16-3
               OpenAFS 1.0.3
                or
               [ many different versions in between ]
                or
               RedHat 7.2, kernel 2.4.9-21
               OpenAFS 1.2.3

As you can see, I've tried many different hardware and software
combinations and I still get corrupted volumes.  It can happen
on any backup, not just full backups.  I have tried running the
servers in uniprocessor mode and while that seemed to help, it
did not eliminate the problem.


Here are some excerpts from the log files on the most recent
corruption, using the last software configuration above:

FileLog:
Thu Feb 28 15:47:19 2002 CopyOnWrite failed: volume 536877621 in partition /vicepc  (tried reading 8192, read 0, wrote 0, errno 4) volume needs salvage
Thu Feb 28 15:58:46 2002 VAttachVolume: volume salvage flag is ON for /vicepc//V0536877621.vol; volume needs salvage

VolserLog:
Thu Feb 28 15:35:09 2002 1 Volser: Clone: Recloning volume 536876261 to volume 536876263
[...]
Thu Feb 28 15:58:46 2002 VAttachVolume: volume salvage flag is ON for /vicepc/V0536877621.vol; volume needs salvage
Thu Feb 28 15:58:46 2002 1 Volser: ListVolumes: Could not attach volume 536877621 (V0536877621.vol) error=101

SalvageLog.old:
@(#) OpenAFS 1.2.3 built  2002-02-01 
02/28/2002 16:00:48 STARTING AFS SALVAGER 2.4 (/usr/afs/bin/salvager /vicepc 536877621)
02/28/2002 16:00:48 CHECKING CLONED VOLUME 536877622.
02/28/2002 16:00:48 cs.usr0.naveen.backup (536877622) updated 02/28/2002 15:30
02/28/2002 16:00:48 Vnode 1: length incorrect; (is 8192 should be 0)
02/28/2002 16:00:48 SALVAGING VOLUME 536877621.
02/28/2002 16:00:48 cs.usr0.naveen (536877621) updated 02/28/2002 15:46
02/28/2002 16:00:48 Vnode 1198: version < inode version; fixed (old status)
02/28/2002 16:00:48 Vnode 1514: version < inode version; fixed (old status)
02/28/2002 16:00:48 Vnode 2754: version < inode version; fixed (old status)
[ similar lines deleted ]
02/28/2002 16:00:48 Vnode 10078: version < inode version; fixed (old status)
02/28/2002 16:00:48 Vnode 1: length incorrect; changed from 8192 to 0
02/28/2002 16:02:19 First page in directory does not exist.
02/28/2002 16:02:19 Directory bad, vnode 1; salvaging...
02/28/2002 16:02:19 Salvaging directory 1...
02/28/2002 16:02:19 Failed to read first page of fromDir!
02/28/2002 16:02:19 Checking the results of the directory salvage...
02/28/2002 16:02:20 dir vnode 601: special old unlink-while-referenced file .__afsD8D is deleted (vnode 1120)
02/28/2002 16:02:20 dir vnode 601: special old unlink-while-referenced file .__afs8036 is deleted (vnode 780)
02/28/2002 16:02:20 dir vnode 623: special old unlink-while-referenced file .__afsDED1 is deleted (vnode 7988)
[ similar lines deleted ]
02/28/2002 16:02:20 dir vnode 623: special old unlink-while-referenced file .__afs4975 is deleted (vnode 5862)
02/28/2002 16:02:20 Vnode 1: link count incorrect (was 45, now 2)
02/28/2002 16:02:30 Found 4784 orphaned files and directories (approx. 587713 KB)
02/28/2002 16:02:30 Salvaged cs.usr0.naveen (536877621): 4785 files, 587714 blocks

SalvageLog:
@(#) OpenAFS 1.2.3 built  2002-02-01 
02/28/2002 16:05:06 STARTING AFS SALVAGER 2.4 (/usr/afs/bin/salvager /vicepc 536877621 -orphans attach)
02/28/2002 16:05:07 SALVAGING VOLUME 536877621.
02/28/2002 16:05:07 cs.usr0.naveen (536877621) updated 02/28/2002 15:46
02/28/2002 16:05:07 Attaching orphaned directory to volume's root dir as __ORPHANDIR__.3.207280
02/28/2002 16:05:07 Attaching orphaned directory to volume's root dir as __ORPHANDIR__.11.46709
02/28/2002 16:05:07 Attaching orphaned directory to volume's root dir as __ORPHANDIR__.13.7
[ similar lines deleted ]
02/28/2002 16:05:07 Attaching orphaned file to volume's root dir as __ORPHANFILE__.7846.198967
02/28/2002 16:05:07 Attaching orphaned file to volume's root dir as __ORPHANFILE__.9320.219854
02/28/2002 16:05:07 Vnode 1: link count incorrect (was 2, now 45)
02/28/2002 16:05:07 Vnode 3: link count incorrect (was 2, now 3)
02/28/2002 16:05:07 Vnode 11: link count incorrect (was 1, now 2)
02/28/2002 16:05:07 Vnode 13: link count incorrect (was 7, now 8)
02/28/2002 16:05:07 Vnode 15: link count incorrect (was 2, now 3)
[ similar lines deleted ]
02/28/2002 16:05:07 Vnode 9320: link count incorrect (was 0, now 1)
02/28/2002 16:05:07 Salvaged cs.usr0.naveen (536877621): 4785 files, 587715 blocks
02/28/2002 16:05:07 The volume header file V0536877622.vol is not associated with any actual data (deleted)


The FileLog says "errno 4", which is Interrupted System Call.  Could this
be a clue?

We have a lot of older machines of many architectures here running older
client software.  Could that cause this problem?

Is there anything I can do to instrument the servers to help find the
root cause of the problem?  


Another problem that may or may not be related:

On these same machines, the fileserver processes can get into a state such
that a bos restart, shutdown or stop will not kill them and neither will a
simple "kill <pid>".  You must do a "kill -9" on each of the fileserver
processes to make them go away.  In the case of a bos restart, new
fileserver processes are started without the old ones having been killed,
the new processes fail, causing the salvager to start.  When the salvager
finishes, the fileserver processes are started again and fail again,
leading to an endless fileserver-salvager-fileserver-salvager cycle.
This has led to us disabling the 4:00 AM Sunday restarts.


Thanks in advance for your help.

	---Bob.
--
Bob Hoffman, N3CVL      University of Pittsburgh           Tel: +1 412 624 8404
hoffman@cs.pitt.edu     Department of Computer Science     Fax: +1 412 624 8854