[OpenAFS] AFS Fileserver Won't Start

Karl M. Davis karl@ridgetop-group.com
Thu, 4 Oct 2007 00:17:06 -0700


Thanks for the quick response,

Yeah, sorry for disappearing there on IRC but I needed to restart the
computer I was connected with.


> Is the above a correct assumption about your Realm?  I would expect you 
> to be using ridgetop-group.com.
Yes, it is correct: our realm is ridgetop-group.local.


> Check the /etc/hosts file on all machines and all CellServDB files for 
> incorrect entries.
The CellServDB file is correct:
/etc/openafs/CellServDB:
<<
>ridgetop-group.local
192.168.2.5             # coronado.ridgetop-group.local
192.168.2.6             # picacho.ridgetop-group.local
...
>>
/etc/openafs/server/CellServDB:
<<
>ridgetop-group.local   #Cell name
192.168.2.5    #coronado.ridgetop-group.local
192.168.2.6    #picacho.ridgetop-group.local
>>
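
In case anyone wants to double-check that the two copies really agree, here is a rough Python sketch that parses both excerpts and compares them. The function name parse_cellservdb and the inline sample strings are just for illustration (they are not part of OpenAFS), and it only assumes the simple ">cellname" header plus "address #comment" layout shown above.

```python
import ipaddress

def parse_cellservdb(text):
    """Return {cell_name: [ip, ...]} from CellServDB-style text (hypothetical helper)."""
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Cell header: ">cellname   #optional comment"
            current = line[1:].split("#", 1)[0].strip()
            cells[current] = []
        elif current is not None:
            addr = line.split("#", 1)[0].strip()
            try:
                ipaddress.ip_address(addr)
            except ValueError:
                continue  # skip anything that is not an address, e.g. "..."
            cells[current].append(addr)
    return cells

# Sample data copied from the two excerpts above.
CLIENT = """>ridgetop-group.local
192.168.2.5             # coronado.ridgetop-group.local
192.168.2.6             # picacho.ridgetop-group.local
...
"""
SERVER = """>ridgetop-group.local   #Cell name
192.168.2.5    #coronado.ridgetop-group.local
192.168.2.6    #picacho.ridgetop-group.local
"""

client_cells = parse_cellservdb(CLIENT)
server_cells = parse_cellservdb(SERVER)
print("match" if client_cells == server_cells else "MISMATCH")
```

On a real machine you would read the two files from disk instead of using the inline strings; the elided "..." part of the client file is skipped here, not filled in.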

The /etc/hosts file is correct, but I did add the second line to it
somewhere before things went to pot:
<<
127.0.0.1       localhost
192.168.2.6     picacho.ridgetop-group.local    picacho
127.0.1.1       picacho.ridgetop-group.local    picacho
...
>>
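
I don't know whether it matters, but the excerpt above maps picacho to both a LAN address and a 127.x address. Here is a small Python sketch that flags any hostname listed under both kinds of address. The function name and the sample string are made up for illustration, and it only looks at the hosts-file text you give it, not at what the resolver or AFS actually uses.

```python
import ipaddress
from collections import defaultdict

def names_on_loopback_and_lan(hosts_text):
    """Return {hostname: [addresses]} for names mapped to both loopback and non-loopback IPs."""
    by_name = defaultdict(set)
    for raw in hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank, comment-only, or elided ("...") line
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is not an IP address
        for name in fields[1:]:
            by_name[name].add(addr)
    flagged = {}
    for name, addrs in by_name.items():
        has_loopback = any(a.is_loopback for a in addrs)
        has_other = any(not a.is_loopback for a in addrs)
        if has_loopback and has_other:
            flagged[name] = sorted(str(a) for a in addrs)
    return flagged

# Sample data copied from the /etc/hosts excerpt above.
HOSTS = """127.0.0.1       localhost
192.168.2.6     picacho.ridgetop-group.local    picacho
127.0.1.1       picacho.ridgetop-group.local    picacho
...
"""

for name, addrs in names_on_loopback_and_lan(HOSTS).items():
    print(name, "->", ", ".join(addrs))
```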


> What is in VLLog?
Not much.  /var/log/openafs/VLLog:
<<
Wed Oct  3 23:45:05 2007 Using 192.168.2.6 as my primary address
Wed Oct  3 23:45:05 2007 Starting AFS vlserver 4 (/usr/lib/openafs/vlserver)
>>


I believe that moving volumes went well enough.  I started having trouble,
though, when I went to recreate the RO copies of root.cell and root.afs.
Unfortunately, I'm unclear on the exact order of all this now, but here's a
list of the things I did:
1. Set up picacho as a dbserver and fileserver.
2. "vos move"'d all of the RW volumes from coronado to picacho.
3. Ran "vos addsite" for root.afs and root.cell, but could not get "vos
release" to work.  It gave me some errors: "Failed to start a transaction on
the RO volume ... volume is busy".
4. Tried "vos syncvldb" and "vos syncserv" on both servers, but those didn't
seem to help.  Running syncvldb on picacho gave me errors: "Warning:
Orphaned RW volume ... exists on ...".
5. Further googling turned up some hits that suggested I should try "vos
changeaddr 127.0.0.1 192.168.2.6".  This is also around when I added the
above-mentioned line to /etc/hosts.  I can't recall exactly, but I may have
tried playing around with "bos addhost" and "bos removehost" here as well.
6. Tried running "bos salvage".  I'm pretty sure this is when things got
ugly and fs stopped starting.  Running "fs checkvolumes" now segfaults: very
fun.

I only have the two openafs servers: coronado (old VM) and picacho (new
box).  Both of them are dbservers and volservers; neither is multi-homed.

That's the saga so far.  I greatly appreciate any help you can offer!
-- Karl


-----Original Message-----
From: openafs-info-admin@openafs.org [mailto:openafs-info-admin@openafs.org]
On Behalf Of Christopher D. Clausen
Sent: Wednesday, October 03, 2007 8:19 PM
To: Karl M. Davis
Cc: openafs-info@openafs.org
Subject: Re: [OpenAFS] AFS Fileserver Won't Start

Karl M. Davis <karl@ridgetop-group.com> wrote:

Hi Karl.  I'm going to assume it was you in the #openafs IRC channel. 
I'd suggest staying logged in if you really want help.  You have to wait 
for people to have time to respond.  And more than the 15 minutes that 
you waited.  We do need to do things like eat and sleep.

> Somewhere towards the end of moving the volumes from the old server
> to the new server, things got badly goofed.  The fs process will no
> longer start on the new server and I find the following entry in the
> /var/log/openafs/FileLog file:
>
> Wed Oct  3 19:26:59 2007 afs_krb_get_lrealm failed, using
> ridgetop-group.local.

Is the above a correct assumption about your Realm?  I would expect you 
to be using ridgetop-group.com.

> Wed Oct  3 19:26:59 2007 VL_RegisterAddrs rpc failed; The IP address
> exists on a different server; repair it

Check the /etc/hosts file on all machines and all CellServDB files for 
incorrect entries.

> Wed Oct  3 19:26:59 2007 VL_RegisterAddrs rpc failed; See VLLog for
> details

What is in VLLog?

> Unfortunately, there's nothing helpful in VLLog.  Interestingly, "vos
> listaddrs" returns nothing on the new server, either.

vos listaddrs might not be working b/c of the above errors.

> Running "vos listvldb" returns the following:
> VLDB entries for all servers
> root.afs
>    RWrite: 536870915     ROnly: 536870916
>    number of sites -> 3
>       server picacho.ridgetop-group.local partition /vicepa RW Site
>       server picacho.ridgetop-group.local partition /vicepa RO Site
>       server picacho.ridgetop-group.local partition /vicepa RO Site
>
> root.cell
>    RWrite: 536870918     ROnly: 536870919
>    number of sites -> 3
>       server picacho.ridgetop-group.local partition /vicepa RW Site
>       server picacho.ridgetop-group.local partition /vicepa RO Site
>       server picacho.ridgetop-group.local partition /vicepa RO Site
>
> I'm unsure why there are duplicate RO entries, but the last thing I
> was working on was recreating RO volumes for root.cell and root.afs
> on the new server.

Well, it looks like something did not work out right.
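
If you want to spot these mechanically across all volumes, something like the following Python sketch could help. It assumes the "vos listvldb" text looks like what you pasted (volume name at the start of a line, indented "server ... partition ... RW/RO Site" lines under it); the function name and regex are my own, not anything shipped with OpenAFS.

```python
import re
from collections import Counter

# Matches lines such as:
#   "      server picacho.ridgetop-group.local partition /vicepa RO Site"
SITE_RE = re.compile(r"^\s*server\s+(\S+)\s+partition\s+(\S+)\s+(RW|RO)\s+Site")

def duplicate_sites(listvldb_text):
    """Return {volume: [(server, partition, type), ...]} for sites listed more than once."""
    counts = {}
    volume = None
    for line in listvldb_text.splitlines():
        m = SITE_RE.match(line)
        if m:
            if volume is not None:
                counts.setdefault(volume, Counter())[m.groups()] += 1
        elif line.strip() and not line[0].isspace():
            # A non-indented line is taken as the next volume name.
            volume = line.strip()
    dups = {}
    for vol, sites in counts.items():
        repeated = [site for site, n in sites.items() if n > 1]
        if repeated:
            dups[vol] = repeated
    return dups

# Sample data copied from the quoted output, with the "> " quoting removed.
SAMPLE = """VLDB entries for all servers
root.afs
   RWrite: 536870915     ROnly: 536870916
   number of sites -> 3
      server picacho.ridgetop-group.local partition /vicepa RW Site
      server picacho.ridgetop-group.local partition /vicepa RO Site
      server picacho.ridgetop-group.local partition /vicepa RO Site
"""

for vol, sites in duplicate_sites(SAMPLE).items():
    print(vol, sites)
```

This only reports the duplicates; it does not say which entry is the stale one or how to remove it.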

> I'm panicking because all of the volumes are now on the new server and
> non-accessible.  Anyone have some clue what I did wrong and how I can
> fix things?

Probably going to need more information about what happened, what you 
did to try and fix it, and other infrastructure questions, like how many 
AFS DB servers you actually have, and if any of them are multi-homed.

<<CDC 

