[OpenAFS] can neither delete nor use backup volume (DA server 1.6.0pre2)

John Tang Boyland boyland@cs.uwm.edu
Mon, 29 Aug 2011 14:18:14 -0500


I accidentally moved a backed-up volume from partition a to b.
I moved it back, but now the backup volume is misbehaving.
(The process is REPEATABLE!)

I can't seem to remove all traces of it, but neither can I use it.
I can get root access on the (Linux) file server and start hacking files
directly, but perhaps someone on this list can help me avoid 
making a terrible blunder.

Here's a partial history.  Notice that after the "zap", vos backup gives
a different error once, but then it's back to the same set of errors
again.  I've been able to create backups of other volumes on the same
partition today.  So it seems that vos zap doesn't actually zap
everything.  Something's hanging around which is preventing vos backup
from working.  (And no, salvaging doesn't help either.)

Any advice?

John

P.S. If you cc me in a response, I will get it quicker.

------------------------------------------------------------------
admin 79 % rxdebug localhost -port 7001 -version
Trying 127.0.0.1 (port 7001):
AFS version:  OpenAFS 1.4.11 built  2009-07-13

admin 80 % rxdebug jeremiah -port 7000 -version
Trying 129.89.143.70 (port 7000):
AFS version:  OpenAFS 1.6.0pre2 built 2011-02-17 (91.pre2.sl6@fnal.gov)

admin 82 % vos examine fa11.cs351.backup
**** Could not attach volume 536875958 ****

    RWrite: 536875956     Backup: 536875958
    number of sites -> 1
       server jeremiah.cs.uwm.edu partition /vicepa RW Site

admin 83 % vos examine fa11.cs351
fa11.cs351                        536875956 RW       1679 K  On-line
    jeremiah.cs.uwm.edu /vicepa
    RWrite  536875956 ROnly          0 Backup  536875958
    MaxQuota     100000 K
    Creation    Thu Jun 23 20:15:28 2011
    Copy        Mon Aug 29 11:13:07 2011
    Backup      Mon Aug 29 13:43:23 2011
    Last Access Mon Aug 29 13:33:06 2011
    Last Update Mon Aug 29 13:28:10 2011
    294 accesses in the past day (i.e., vnode references)

    RWrite: 536875956     Backup: 536875958
    number of sites -> 1
       server jeremiah.cs.uwm.edu partition /vicepa RW Site
admin 84 % vos remove fa11.cs351.backup
vos: Missing required parameter '-id'

admin 85 % vos remove jeremiah a fa11.cs351.backup
WARNING: Volume 536875958 did not exist on the partition
Volume 536875958 on partition /vicepa server jeremiah.cs.uwm.edu deleted

admin 86 % vos examine fa11.cs351.backup
**** Could not attach volume 536875958 ****

Volume fa11.cs351.backup does not exist in VLDB
    RWrite: 536875956
    number of sites -> 1
       server jeremiah.cs.uwm.edu partition /vicepa RW Site

admin 87 % vos zap jeremiah a fa11.cs351.backup -force -verbose
vos: forcibly removing all traces of volume 536875958, please wait...done.

admin 88 % vos examine fa11.cs351.backup
Could not fetch the information about volume 536875958 from the server
: No such device
Volume fa11.cs351.backup does not exist

Dump only information from VLDB

fa11.cs351
    RWrite: 536875956
    number of sites -> 1
       server jeremiah.cs.uwm.edu partition /vicepa RW Site

admin 89 % vos backup fa11.cs351
Failed to clone the volume 536875956
Volume not attached, does not exist, or not on line
Error in vos backup command.
Volume not attached, does not exist, or not on line

admin 90 % vos examine fa11.cs351.backup
Could not fetch the information about volume 536875958 from the server
: No such device
Volume fa11.cs351.backup does not exist

Dump only information from VLDB

fa11.cs351
    RWrite: 536875956
    number of sites -> 1
       server jeremiah.cs.uwm.edu partition /vicepa RW Site

admin 91 % vos backup fa11.cs351
Failed to start a transaction on the backup volume 536875958
Volume not attached, does not exist, or not on line
Error in vos backup command.
Volume not attached, does not exist, or not on line

admin 92 % vos examine fa11.cs351.backup
**** Could not attach volume 536875958 ****

    RWrite: 536875956     Backup: 536875958
    number of sites -> 1
       server jeremiah.cs.uwm.edu partition /vicepa RW Site

admin 93 % vos create jeremiah a fa11.test
Volume 536876132 created on partition /vicepa of jeremiah

admin 94 % vos backup fa11.test
Created backup volume for fa11.test

admin 95 % vos examine fa11.test.backup
fa11.test.backup                  536876134 BK          2 K  On-line
    jeremiah.cs.uwm.edu /vicepa
    RWrite  536876132 ROnly          0 Backup          0
    MaxQuota       5000 K
    Creation    Mon Aug 29 13:59:13 2011
    Copy        Mon Aug 29 13:59:13 2011
    Backup      Mon Aug 29 13:59:13 2011
    Last Update Mon Aug 29 13:59:05 2011
    0 accesses in the past day (i.e., vnode references)

    RWrite: 536876132     Backup: 536876134
    number of sites -> 1
       server jeremiah.cs.uwm.edu partition /vicepa RW Site

[SOMEWHAT LATER, I had the bright idea to repeat my error, and to my
surprise and delight(?) the same error happens!]

admin 107 % vos move fa11.test jeremiah a jeremiah b
Volume 536876132 moved from jeremiah /vicepa to jeremiah /vicepb

admin 108 % vos move fa11.test jeremiah b jeremiah a
Volume 536876132 moved from jeremiah /vicepb to jeremiah /vicepa

admin 109 % vos backup fa11.test
Failed to start a transaction on the backup volume 536876134
Volume not attached, does not exist, or not on line
Error in vos backup command.
Volume not attached, does not exist, or not on line

admin 110 % vos examine fa11.test.backup
**** Could not attach volume 536876134 ****

    RWrite: 536876132     Backup: 536876134
    number of sites -> 1
       server jeremiah.cs.uwm.edu partition /vicepa RW Site


---------------------------------

NB: FileLog says:
Mon Aug 29 13:57:10 2011 FSYNC_com_VolOff: failed to get heavyweight reference to volume 536875958 (state=20, flags=0x18)

Salvaleg logs (I tried to salvage fa11.cs351.backup after the series of
commands here, which said it was successful but the logs show:

$ cat SalsrvLog
08/29/2011 13:43:00 dispatching child to salvage volume 536875956...
08/29/2011 13:43:00 2 nVolumesInInodeFile 56
08/29/2011 13:43:00 CHECKING CLONED VOLUME 536875958.
08/29/2011 13:43:00 SALVAGING VOLUME 536875956.
08/29/2011 13:43:00 fa11.cs351 (536875956) updated 08/29/2011 13:28
08/29/2011 13:43:00 totalInodes 513
08/29/2011 13:43:00 Salvaged fa11.cs351 (536875956): 506 files, 1679 blocks
08/29/2011 13:56:19 dispatching child to salvage volume 536875956...
08/29/2011 13:56:19 2 nVolumesInInodeFile 56
08/29/2011 13:56:19 CHECKING CLONED VOLUME 536875958.
08/29/2011 13:56:19 SALVAGING VOLUME 536875956.
08/29/2011 13:56:19 fa11.cs351 (536875956) updated 08/29/2011 13:28
08/29/2011 13:56:19 totalInodes 513
08/29/2011 13:56:20 Salvaged fa11.cs351 (536875956): 506 files, 1679 blocks

and

$ cat SalvageLog
@(#) OpenAFS 1.6.0pre2 built 2011-02-17 (91.pre2.sl6@fnal.gov)
08/29/2011 14:01:49 STARTING AFS SALVAGER 2.4 (/usr/afs/bin/salvager /vicepa 536875958)
08/29/2011 14:01:49 _VLockFd: conflicting lock held on fd 4, offset 0 by pid 2177 (locktype=2)

(process 2177 is  /usr/afs/bin/salvageserver)