[OpenAFS] AFS Fileserver Won't Start --> Can't Release root.cell or root.afs

Karl M. Davis karl@ridgetop-group.com
Thu, 4 Oct 2007 17:14:53 -0700


> Try vos release -id root.afs -verbose -local as root to get more info
> and use your KeyFile instead of user tokens.

Running that command gives me:
<<
karl@picacho:~$ sudo vos release -id root.afs -verbose -localauth

root.afs
    RWrite: 536870915     ROnly: 536870916     RClone: 536870916
    number of sites -> 4
       server picacho.ridgetop-group.local partition /vicepa RW Site  -- New release
       server picacho.ridgetop-group.local partition /vicepa RO Site  -- New release
       server picacho.ridgetop-group.local partition /vicepa RO Site  -- Old release
       server coronado.ridgetop-group.local partition /vicepa RO Site  -- New release
This is a completion of a previous release
Starting transaction on cloned volume 536870916... done
Failed to start a transaction on the RO volume.
VOLSER: volume is busy
The volume 536870915 could not be released to the following 1 sites:
               picacho.ridgetop-group.local /vicepa
VOLSER: release could not be completed
Error in vos release command.
VOLSER: release could not be completed
karl@picacho:~$
>>
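In the root.afs entry above, two RO sites both display as picacho.ridgetop-group.local /vicepa next to the RW site. The remsite commands later in this message show that both were registered under 127.0.0.1. A minimal sketch for spotting the same pattern in other entries, assuming the `vos listvldb` layout shown in this thread (volume names flush left as a single word, one `server HOST partition /vicepX RO Site` line per site). The helper name `find_duplicate_ro_sites` is hypothetical. It compares names as displayed, so two addresses that both show up as picacho count as the same site, which is the case it is meant to catch.

```shell
#!/bin/sh
# Sketch: report volumes whose VLDB entry lists more than one RO site on the
# same server and partition. Reads `vos listvldb` output on stdin.
# find_duplicate_ro_sites is a hypothetical helper name, not part of OpenAFS.
find_duplicate_ro_sites() {
    awk '
        # Volume name lines start in column 1 and hold a single word.
        /^[^ \t]/ && NF == 1 { vol = $1; next }
        # Site lines: server HOST partition /vicepX RO Site [-- flags]
        $1 == "server" && $5 == "RO" {
            count[vol " " $2 " " $4]++
        }
        END {
            for (key in count)
                if (count[key] > 1)
                    print key, count[key]
        }
    '
}

# Usage, as root on a server with the cell KeyFile:
#   vos listvldb -localauth | find_duplicate_ro_sites
```

Each output line is the volume, server, partition and number of RO sites found there. An RW site and one RO site on the same partition is normal, so RW sites are not counted.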

> Does vos listaddrs -noresolve print out?
Nope, nothing from that, either.

> You might need to vos remsite replicas attached to those IPs first.
Interesting... I tried this, but instead of specifying picacho, I passed
127.0.0.1 as the machine name and it worked.  It actually worked twice for
each volume.  Here's the output:
<<
karl@picacho:~$ sudo vos remsite -server 127.0.0.1 -partition a -id root.afs -localauth -verbose
Deleting the replication site for volume 536870915 ... done
Removed replication site 127.0.0.1 /vicepa for volume root.afs
karl@picacho:~$ sudo vos remsite -server 127.0.0.1 -partition a -id root.afs -localauth -verbose
Deleting the replication site for volume 536870915 ... done
Removed replication site 127.0.0.1 /vicepa for volume root.afs
karl@picacho:~$ sudo vos remsite -server 127.0.0.1 -partition a -id root.afs -localauth -verbose
This site is not a replication site
Error in vos remsite command.
VOLSER: illegal operation
karl@picacho:~$ vos listvldb
vsu_ClientInit: Could not get afs tokens, running unauthenticated.
VLDB entries for all servers

lib
    RWrite: 536870933     ROnly: 536870934
    number of sites -> 2
       server picacho.ridgetop-group.local partition /vicepa RW Site
       server coronado.ridgetop-group.local partition /vicepa RO Site

lib.pdks
    RWrite: 536870936     ROnly: 536870937
    number of sites -> 2
       server picacho.ridgetop-group.local partition /vicepa RW Site
       server coronado.ridgetop-group.local partition /vicepa RO Site

root.afs
    RWrite: 536870915     ROnly: 536870916     RClone: 536870916
    number of sites -> 2
       server picacho.ridgetop-group.local partition /vicepa RW Site  -- New release
       server coronado.ridgetop-group.local partition /vicepa RO Site  -- New release

root.cell
    RWrite: 536870918     ROnly: 536870919     RClone: 536870919
    number of sites -> 4
       server picacho.ridgetop-group.local partition /vicepa RW Site  -- New release
       server picacho.ridgetop-group.local partition /vicepa RO Site  -- New release
       server picacho.ridgetop-group.local partition /vicepa RO Site  -- Old release
       server coronado.ridgetop-group.local partition /vicepa RO Site  -- New release

Total entries: 4
karl@picacho:~$ sudo vos remsite -server 127.0.0.1 -partition a -id root.cell -localauth -verbose
Deleting the replication site for volume 536870918 ... done
Removed replication site 127.0.0.1 /vicepa for volume root.cell
karl@picacho:~$ sudo vos remsite -server 127.0.0.1 -partition a -id root.cell -localauth -verbose
Deleting the replication site for volume 536870918 ... done
Removed replication site 127.0.0.1 /vicepa for volume root.cell
karl@picacho:~$ sudo vos remsite -server 127.0.0.1 -partition a -id root.cell -localauth -verbose
This site is not a replication site
Error in vos remsite command.
VOLSER: illegal operation
karl@picacho:~$ vos listvldb
vsu_ClientInit: Could not get afs tokens, running unauthenticated.
VLDB entries for all servers

lib
    RWrite: 536870933     ROnly: 536870934
    number of sites -> 2
       server picacho.ridgetop-group.local partition /vicepa RW Site
       server coronado.ridgetop-group.local partition /vicepa RO Site

lib.pdks
    RWrite: 536870936     ROnly: 536870937
    number of sites -> 2
       server picacho.ridgetop-group.local partition /vicepa RW Site
       server coronado.ridgetop-group.local partition /vicepa RO Site

root.afs
    RWrite: 536870915     ROnly: 536870916     RClone: 536870916
    number of sites -> 2
       server picacho.ridgetop-group.local partition /vicepa RW Site  -- New release
       server coronado.ridgetop-group.local partition /vicepa RO Site  -- New release

root.cell
    RWrite: 536870918     ROnly: 536870919     RClone: 536870919
    number of sites -> 2
       server picacho.ridgetop-group.local partition /vicepa RW Site  -- New release
       server coronado.ridgetop-group.local partition /vicepa RO Site  -- New release

Total entries: 4
>>

After that, I tried "vos release" again.  Here's the output from that:
<<
karl@picacho:~$ sudo vos release -id root.cell -verbose -localauth

root.cell
    RWrite: 536870918     ROnly: 536870919     RClone: 536870919
    number of sites -> 2
       server picacho.ridgetop-group.local partition /vicepa RW Site  -- New release
       server coronado.ridgetop-group.local partition /vicepa RO Site  -- New release
This is a complete release of volume 536870918
Cloning RW volume 536870918 to temporary RO... done
Getting status of RW volume 536870918... done
Ending cloning transaction on RW volume 536870918... done
Starting transaction on cloned volume 536870919... done
Updating existing ro volume 536870919 on coronado.ridgetop-group.local ...
Starting ForwardMulti from 536870919 to 536870919 on coronado.ridgetop-group.local (as of Thu Sep  6 12:58:39 2007).
Deleting the releaseClone 536870919 ... done
updating VLDB ... done
Released volume root.cell successfully
karl@picacho:~$ vos listvldb
vsu_ClientInit: Could not get afs tokens, running unauthenticated.
VLDB entries for all servers

lib
    RWrite: 536870933     ROnly: 536870934
    number of sites -> 2
       server picacho.ridgetop-group.local partition /vicepa RW Site
       server coronado.ridgetop-group.local partition /vicepa RO Site

lib.pdks
    RWrite: 536870936     ROnly: 536870937
    number of sites -> 2
       server picacho.ridgetop-group.local partition /vicepa RW Site
       server coronado.ridgetop-group.local partition /vicepa RO Site

root.afs
    RWrite: 536870915     ROnly: 536870916     RClone: 536870916
    number of sites -> 2
       server picacho.ridgetop-group.local partition /vicepa RW Site  -- New release
       server coronado.ridgetop-group.local partition /vicepa RO Site  -- New release

root.cell
    RWrite: 536870918     ROnly: 536870919
    number of sites -> 2
       server picacho.ridgetop-group.local partition /vicepa RW Site
       server coronado.ridgetop-group.local partition /vicepa RO Site

Total entries: 4
karl@picacho:~$ sudo vos release -id root.afs -verbose -localauth

root.afs
    RWrite: 536870915     ROnly: 536870916     RClone: 536870916
    number of sites -> 2
       server picacho.ridgetop-group.local partition /vicepa RW Site  -- New release
       server coronado.ridgetop-group.local partition /vicepa RO Site  -- New release
This is a complete release of volume 536870915
Cloning RW volume 536870915 to temporary RO... done
Getting status of RW volume 536870915... done
Ending cloning transaction on RW volume 536870915... done
Starting transaction on cloned volume 536870916... done
Updating existing ro volume 536870916 on coronado.ridgetop-group.local ...
Starting ForwardMulti from 536870916 to 536870916 on coronado.ridgetop-group.local (as of Thu Aug  9 13:13:33 2007).
Deleting the releaseClone 536870916 ... done
updating VLDB ... done
Released volume root.afs successfully
karl@picacho:~$ vos listvldb
vsu_ClientInit: Could not get afs tokens, running unauthenticated.
VLDB entries for all servers

lib
    RWrite: 536870933     ROnly: 536870934
    number of sites -> 2
       server picacho.ridgetop-group.local partition /vicepa RW Site
       server coronado.ridgetop-group.local partition /vicepa RO Site

lib.pdks
    RWrite: 536870936     ROnly: 536870937
    number of sites -> 2
       server picacho.ridgetop-group.local partition /vicepa RW Site
       server coronado.ridgetop-group.local partition /vicepa RO Site

root.afs
    RWrite: 536870915     ROnly: 536870916
    number of sites -> 2
       server picacho.ridgetop-group.local partition /vicepa RW Site
       server coronado.ridgetop-group.local partition /vicepa RO Site

root.cell
    RWrite: 536870918     ROnly: 536870919
    number of sites -> 2
       server picacho.ridgetop-group.local partition /vicepa RW Site
       server coronado.ridgetop-group.local partition /vicepa RO Site

Total entries: 4
>>

I then tried running "vos changeaddr -oldaddr 127.0.0.1 -remove", but it
looks like some of my volumes are still "stuck" on the old IP:
<<
karl@picacho:~$ sudo vos changeaddr -oldaddr 127.0.0.1 -remove -localauth -verbose
Could not remove server 127.0.0.1 from the VLDB
VLDB: volume Id exists in the vldb
>>
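That error means at least one VLDB entry still has a site on 127.0.0.1. One way to find those entries is to ask the VLDB for every entry with a site on that address. A minimal sketch, assuming it runs as root with the server KeyFile (hence `-localauth`) and that volume names are the only flush-left, single-word lines in the `vos listvldb` output. The helper name `volumes_on_server` is hypothetical. If an entry references the address only through its RW site, `vos remsite` will not clear it, since remsite removes RO site definitions only.

```shell
#!/bin/sh
# Sketch: list the VLDB entries that still have a site on a given address,
# i.e. the entries that make `vos changeaddr -remove` fail.
# volumes_on_server is a hypothetical helper name; run it as root so that
# -localauth can use the server KeyFile.
volumes_on_server() {
    addr=$1
    # -server limits the listing to entries with a site on that machine.
    vos listvldb -server "$addr" -localauth |
        awk '/^[^ \t]/ && NF == 1 { print $1 }'   # keep volume name lines
}

# Usage:
#   volumes_on_server 127.0.0.1
```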

How would I go about resolving this?  By the way, thanks very much for all
of your help so far; you've really saved my ass on this.

Thanks,
Karl


-----Original Message-----
From: openafs-info-admin@openafs.org [mailto:openafs-info-admin@openafs.org] On Behalf Of Christopher D. Clausen
Sent: Thursday, October 04, 2007 6:46 AM
To: Karl M. Davis
Cc: openafs-info@openafs.org
Subject: Re: [OpenAFS] AFS Fileserver Won't Start --> Can't Release root.cell or root.afs

Karl M. Davis <karl@ridgetop-group.com> wrote:
> Well, after rebooting again, things suddenly seem to be working.  No
> idea why...
>
> I still have some problems with making RO copies of root.cell and
> root.afs, though.  Running "vos release" gives me:
> <<
> karl@picacho:~$ vos release -id root.cell
> Failed to start a transaction on the RO volume.
> VOLSER: volume is busy
> The volume 536870918 could not be released to the following 1 sites:
>               picacho.ridgetop-group.local /vicepa
> VOLSER: release could not be completed
> Error in vos release command.
> VOLSER: release could not be completed
> karl@picacho:~$ vos release -id root.afs
> Failed to start a transaction on the RO volume.
> VOLSER: volume is busy
> The volume 536870915 could not be released to the following 1 sites:
>               picacho.ridgetop-group.local /vicepa
> VOLSER: release could not be completed
> Error in vos release command.
> VOLSER: release could not be completed

Try vos release -id root.afs -verbose -local as root to get more info 
and use your KeyFile instead of user tokens.

Does vos listaddrs -noresolve print out?

And can you vos changeaddr -remove any incorrect IP addresses?  (You 
might need to vos remsite replicas attached to those IPs first.)
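The order suggested here (remove the replica sites first, then remove the address) can be scripted roughly as follows. This is a sketch under several assumptions: that `vos remsite` exits non-zero once no site is left (Karl's transcript at the top of this thread shows it printing "Error in vos remsite command." at that point), that ten passes per volume are plenty, and that the helper names `remove_ro_sites` and `retire_address` are hypothetical. Note that `vos remsite` only edits the VLDB; it does not delete the RO volume from the partition.

```shell
#!/bin/sh
# Sketch: drop every RO site a volume has on ADDR, then remove ADDR itself.
# Helper names and the 10-pass bound are assumptions, not OpenAFS features.
remove_ro_sites() {
    addr=$1 part=$2 vol=$3
    i=0
    # Assumed: vos remsite exits non-zero once no RO site is left there.
    # One entry can hold duplicate sites, so repeat until it fails.
    while [ "$i" -lt 10 ] &&
          vos remsite -server "$addr" -partition "$part" -id "$vol" -localauth
    do
        i=$((i + 1))
    done
    echo "$vol: removed $i site(s) on $addr partition $part"
}

retire_address() {
    addr=$1 part=$2
    shift 2
    for vol in "$@"; do
        remove_ro_sites "$addr" "$part" "$vol"
    done
    # Still fails ("volume Id exists in the vldb") while any entry,
    # for example through its RW site, references the address.
    vos changeaddr -oldaddr "$addr" -remove -localauth
}

# Usage:
#   retire_address 127.0.0.1 a root.afs root.cell
```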
You might still be having problems because the 127.* line in your
/etc/hosts lists the name of your AFS server, so the name resolves to a
loopback address instead of the server's actual IP.  In theory, you can
shut down both AFS servers, delete your VLDB, and have it regenerated
with the vos syncserv and vos syncvldb commands.  Of course, this could
also make things worse.
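To check whether a server name resolves to a loopback address, a small helper along these lines may be enough. It is a sketch, assuming a glibc-based system where `getent ahostsv4` is available; the helper names `is_loopback` and `check_hostname` are hypothetical. It only reports what the resolver returns, so run it on each server and compare the result with the address the fileserver is expected to register.

```shell
#!/bin/sh
# Sketch: warn if a host name resolves to a 127.* (loopback) address.
# Assumes glibc getent with the ahostsv4 database; helper names are
# hypothetical.
is_loopback() {
    case $1 in
        127.*) return 0 ;;
        *) return 1 ;;
    esac
}

check_hostname() {
    name=${1:-$(hostname)}
    # ahostsv4 prints one line per address and socket type; keep unique IPs.
    for ip in $(getent ahostsv4 "$name" | awk '{ print $1 }' | sort -u); do
        if is_loopback "$ip"; then
            echo "WARNING: $name resolves to loopback address $ip"
        else
            echo "ok: $name resolves to $ip"
        fi
    done
}

# Usage:
#   check_hostname picacho.ridgetop-group.local
```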

<<CDC 


_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info