[OpenAFS] Help!!! Files / Volumes are disappearing

Fri, 14 Jun 2002 23:14:08 +0200

Hi all,

we've got a serious problem here. Whole directories are disappearing.
Even a restore from a tape backup does not work properly -- the
internal AFS storage structure seems to be corrupted, so that a
restore reproduces the same kind of error!

Here are the details:

We have 3 servers with 150 users, many of them not very active. Accumulated
used space is 200 GB. We have been in production (after migrating from NFS /
AMD) for over 2 months now. We are using Red Hat 7.2 and 7.3 and OpenAFS
server 1.2.3 / 1.2.4.

During this time, entire user directories became unavailable twice (ls
results in "connection timed out"). The FileServer log contains:

Thu Jun 13 09:37:28 2002 ProbeUuid failed for host 172.22.85.135:7001
Thu Jun 13 09:46:05 2002 CopyOnWrite failed: volume 536871014 in
partition /vicepa  (tried reading 8192, read 0, wrote 0, errno 4) volume
needs salvage
Thu Jun 13 10:40:36 2002 VAttachVolume: volume salvage flag is ON for
/vicepa//V0536871014.vol; volume needs salvage
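
For anyone who wants to check their own FileServer log for the same pattern, here is a minimal Python sketch. It assumes the log lines look exactly like the three quoted above (with the wrapped lines rejoined); the function names and regular expressions are illustrative and not part of OpenAFS. The errno name is looked up on the machine running the script, so it is only a hint.

```python
import errno
import re

# Sample lines copied from the FileServer log above (wrapped lines rejoined).
SAMPLE_LOG = """\
Thu Jun 13 09:37:28 2002 ProbeUuid failed for host 172.22.85.135:7001
Thu Jun 13 09:46:05 2002 CopyOnWrite failed: volume 536871014 in partition /vicepa  (tried reading 8192, read 0, wrote 0, errno 4) volume needs salvage
Thu Jun 13 10:40:36 2002 VAttachVolume: volume salvage flag is ON for /vicepa//V0536871014.vol; volume needs salvage
"""

# Hypothetical patterns, written only against the lines quoted in this mail.
COW_RE = re.compile(
    r"CopyOnWrite failed: volume (\d+) in partition (\S+)\s+\(.*errno (\d+)\)"
)
ATTACH_RE = re.compile(r"volume salvage flag is ON for (\S+?)/+V(\d+)\.vol")


def volumes_needing_salvage(log_text):
    """Return {volume_id: set_of_partitions} for 'needs salvage' lines."""
    found = {}
    for line in log_text.splitlines():
        if "needs salvage" not in line:
            continue
        m = COW_RE.search(line)
        if m:
            found.setdefault(int(m.group(1)), set()).add(m.group(2))
            continue
        m = ATTACH_RE.search(line)
        if m:
            # The header file name carries the volume ID with a leading zero.
            found.setdefault(int(m.group(2)), set()).add(m.group(1))
    return found


def copy_on_write_errors(log_text):
    """Return (volume, partition, errno number, local errno name) tuples."""
    result = []
    for m in COW_RE.finditer(log_text):
        code = int(m.group(3))
        name = errno.errorcode.get(code, "unknown")
        result.append((int(m.group(1)), m.group(2), code, name))
    return result


if __name__ == "__main__":
    print(volumes_needing_salvage(SAMPLE_LOG))
    print(copy_on_write_errors(SAMPLE_LOG))
```

Run against a real log by reading the file into a string and passing it to both functions; the list of volumes is then a starting point for deciding what to salvage.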

We salvaged the volume, and that is where the disaster got worse:

@(#) OpenAFS 1.2.4 built  2002-06-01 
06/13/2002 10:43:14 STARTING AFS SALVAGER 2.4 (/usr/afs/bin/salvager
/vicepa 536871014 -tmpdir /tmp/ -orphans attach)
06/13/2002 10:43:27 CHECKING CLONED VOLUME 536871016.
06/13/2002 10:43:27 user.goetz.backup (536871016) updated 06/13/2002
01:02
06/13/2002 10:43:27 Vnode 1: length incorrect; (is 616448 should be 0)
06/13/2002 10:43:27 SALVAGING VOLUME 536871014.
06/13/2002 10:43:27 user.goetz (536871014) updated 06/13/2002 09:46
06/13/2002 10:43:27 Vnode 50318: version < inode version; fixed (old
status)
06/13/2002 10:43:27 Vnode 50336: version < inode version; fixed (old
status)
06/13/2002 10:43:27 Vnode 51128: version < inode version; fixed (old
status)

**** etc. ****

06/13/2002 10:43:27 Vnode 1: length incorrect; changed from 616448 to 0
06/13/2002 10:43:27 Vnode 3413: length incorrect; changed from 139264 to
0
06/13/2002 10:43:27 Vnode 4841: length incorrect; changed from 2048 to 0
06/13/2002 10:44:55 First page in directory does not exist.
06/13/2002 10:44:55 Directory bad, vnode 1; salvaging...
06/13/2002 10:44:55 Salvaging directory 1...
06/13/2002 10:44:55 Failed to read first page of fromDir!
06/13/2002 10:44:55 Checking the results of the directory salvage...
06/13/2002 10:44:57 dir vnode 3401: special old unlink-while-referenced
file .__afs7B72 is deleted (vnode 110664)
06/13/2002 10:44:57 dir vnode 3401: special old unlink-while-referenced
file .__afsF894 is deleted (vnode 92952)
06/13/2002 10:44:57 dir vnode 3401: special old unlink-while-referenced
file .__afs3A43 is deleted (vnode 97004)
06/13/2002 10:44:57 dir vnode 3401: special old unlink-while-referenced
file .__afs43D9 is deleted (vnode 99872)
06/13/2002 10:44:57 First page in directory does not exist.
06/13/2002 10:44:57 Directory bad, vnode 3413; salvaging...

**** etc. ****
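
To get an overview of how much the salvager zeroed out, a log like the one above can be summarized with a small script. This is only a sketch under the assumption that the salvager lines have the shape quoted in this mail (wrapped lines rejoined); the `summarize` function and its patterns are illustrative, not an OpenAFS tool.

```python
import re

# A few lines copied from the salvager log above (wrapped lines rejoined).
SAMPLE_LOG = """\
06/13/2002 10:43:27 Vnode 1: length incorrect; changed from 616448 to 0
06/13/2002 10:43:27 Vnode 3413: length incorrect; changed from 139264 to 0
06/13/2002 10:43:27 Vnode 4841: length incorrect; changed from 2048 to 0
06/13/2002 10:44:55 First page in directory does not exist.
06/13/2002 10:44:55 Directory bad, vnode 1; salvaging...
06/13/2002 10:44:57 Directory bad, vnode 3413; salvaging...
"""

# Hypothetical patterns, written only against the lines quoted in this mail.
LENGTH_RE = re.compile(
    r"Vnode (\d+): length incorrect; changed from (\d+) to (\d+)"
)
BAD_DIR_RE = re.compile(r"Directory bad, vnode (\d+); salvaging")


def summarize(log_text):
    """Return (truncated, bad_dirs).

    truncated maps vnode -> previous length for vnodes reset to length 0;
    bad_dirs lists directory vnodes the salvager reported as bad.
    """
    truncated = {}
    bad_dirs = []
    for line in log_text.splitlines():
        m = LENGTH_RE.search(line)
        if m:
            vnode, old, new = (int(g) for g in m.groups())
            if new == 0 and old > 0:
                truncated[vnode] = old
        m = BAD_DIR_RE.search(line)
        if m:
            bad_dirs.append(int(m.group(1)))
    return truncated, bad_dirs


if __name__ == "__main__":
    truncated, bad_dirs = summarize(SAMPLE_LOG)
    print("vnodes reset to length 0:", sorted(truncated))
    print("bytes previously recorded:", sum(truncated.values()))
    print("directories reported bad:", bad_dirs)
```

In the excerpt, vnodes 1 and 3413 appear in both lists, i.e. the same vnodes that were reset to length 0 are the ones later reported as bad directories.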

So vnode 1 is incorrect?! The system seems to like this idea and kills
all data in the root directory of the volume! Receiving all these
hundreds of __ORPHANDIR__ entries and files doesn't help.

Reconstructing all the information would have taken days, so we decided to
go back to the tape backup that was made from a backup volume the night
before. We restored everything... but when we mounted the volume, there
seemed to be no data in it. The fileserver said the same as for the
original volume 2 hours earlier... volume needs salvage! We salvaged again,
with the same result!!!

Our rescue came from a backup that was two days old. With that one there
was no problem anymore: just vos volrestore ...; fs mkmount ...; and enjoy
AFS ;)

This is the second incident of such disastrous dimensions. A third
occurred this morning, but only some directories were affected and,
strangely, __ORPHANDIR__ entries were created although the originals were
still there.

The errors occurred on different servers with different server software
versions (1.2.3 / 1.2.4). The clients that mainly used the volumes were
different, too.

Sorry for the cynicism, but people here at my site are giving me a hard
time, since I was the one who suggested AFS.

Your help and suggestions are very welcome, as many people at our institute
are very concerned about these issues. They have even suggested moving back
to NFS, because AFS seems not to be ready for a production environment!?

Thanks,
Ruby

--
Rubino Geiss, Universitaet Karlsruhe, IPD Goos
Postfach 6980, D-76128 Karlsruhe, GERMANY
Adenauerring 20a, 50.41 (AVG), Zi. 235 
rubino@ipd.info.uni-karlsruhe.de
Tel: (+49) 721 / 608-8352
Fax: (+49) 721 / 30047