[OpenAFS] AFS Fileserver Won't Start

Wed, 3 Oct 2007 22:18:33 -0500

Karl M. Davis <karl@ridgetop-group.com> wrote:

Hi Karl.  I'm going to assume it was you in the #openafs IRC channel.
I'd suggest staying logged in if you really want help.  You have to wait
for people to have time to respond, and for longer than the 15 minutes
that you waited.  We do need to do things like eat and sleep.

> Somewhere towards the end of moving the volumes from the old server
> to the new server, things got badly goofed.  The fs process will no
> longer start on the new server and I find the following entry in the
> /var/log/openafs/FileLog file:
>
> Wed Oct  3 19:26:59 2007 afs_krb_get_lrealm failed, using
> ridgetop-group.local.

Is the above a correct assumption about your Realm?  I would expect you 
to be using ridgetop-group.com.

> Wed Oct  3 19:26:59 2007 VL_RegisterAddrs rpc failed; The IP address
> exists on a different server; repair it

Check the /etc/hosts file on all machines and all CellServDB files for 
incorrect entries.
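
If it helps, here is a rough Python sketch of the kind of cross-check I
mean.  It is only an illustration, not an OpenAFS tool: the function
names are made up, the 192.0.2.x address in the usage note is a
placeholder, and it assumes the usual layouts (hosts lines are
"address name [aliases]", CellServDB has ">cellname" headers followed
by "address  #hostname" lines).  It flags a server whose CellServDB
address is missing from the hosts file, and a server name that the
hosts file maps to a loopback address.

```python
import ipaddress


def parse_hosts(text):
    """Return {hostname: set of addresses} from /etc/hosts-style text."""
    names = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        for name in fields[1:]:
            names.setdefault(name.lower(), set()).add(fields[0])
    return names


def parse_cellservdb(text):
    """Return {cell: [(address, hostname), ...]} from CellServDB-style text."""
    cells = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Cell header: ">cellname   #optional description"
            current = line[1:].split("#", 1)[0].strip()
            cells[current] = []
        elif current is not None:
            # Server line: "address   #hostname"
            addr, _, comment = line.partition("#")
            cells[current].append((addr.strip(), comment.strip().lower()))
    return cells


def is_loopback(addr):
    """True if addr parses as an IP address in a loopback range."""
    try:
        return ipaddress.ip_address(addr).is_loopback
    except ValueError:
        return False


def find_problems(hosts_text, cellservdb_text):
    """List human-readable mismatches between the two files."""
    hosts = parse_hosts(hosts_text)
    problems = []
    for cell, servers in parse_cellservdb(cellservdb_text).items():
        for ip, name in servers:
            known = hosts.get(name, set())
            if known and ip not in known:
                problems.append(
                    f"{name}: CellServDB has {ip}, hosts file has {sorted(known)}"
                )
            for addr in sorted(known):
                if is_loopback(addr):
                    problems.append(f"{name}: hosts file maps it to loopback {addr}")
    return problems
```

You would feed it the contents of the two files from each machine,
e.g. find_problems(open("/etc/hosts").read(), cellservdb_text), and
look at whatever it prints for the new server's name.  An empty list
only means this particular check found nothing.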

> Wed Oct  3 19:26:59 2007 VL_RegisterAddrs rpc failed; See VLLog for
> details

What is in VLLog?

> Unfortunately, there's nothing helpful in VLLog.  Interestingly, "vos
> listaddrs" returns nothing on the new server, either.

vos listaddrs might not be working because of the above errors.

> Running "vos listvldb" returns the following:
> VLDB entries for all servers
> root.afs
>    RWrite: 536870915     ROnly: 536870916
>    number of sites -> 3
>       server picacho.ridgetop-group.local partition /vicepa RW Site
>       server picacho.ridgetop-group.local partition /vicepa RO Site
>       server picacho.ridgetop-group.local partition /vicepa RO Site
>
> root.cell
>    RWrite: 536870918     ROnly: 536870919
>    number of sites -> 3
>       server picacho.ridgetop-group.local partition /vicepa RW Site
>       server picacho.ridgetop-group.local partition /vicepa RO Site
>       server picacho.ridgetop-group.local partition /vicepa RO Site
>
> I'm unsure why there are duplicate RO entries, but the last thing I
> was working on was recreating RO volumes for root.cell and root.afs
> on the new server.

Well, it looks like something did not work out right.
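
To spot this sort of thing across a larger VLDB, a small Python sketch
like the one below can scan saved "vos listvldb" output for site lines
that repeat the same server, partition and type within one volume.
The function name is made up, and the parsing is a heuristic based
only on the layout of the output quoted above (a volume name alone on
a line, followed by "server ... partition ... RW|RO Site" lines).  It
only reports the duplicates; it does not say how they got there or how
to repair them.

```python
import re
from collections import Counter

# Matches a site line such as:
#   server picacho.ridgetop-group.local partition /vicepa RO Site
SITE_RE = re.compile(r"server\s+(\S+)\s+partition\s+(\S+)\s+(RW|RO)\s+Site")


def duplicate_sites(listvldb_text):
    """Return {volume: {(server, partition, type): count}} for repeated sites."""
    counts = {}
    volume = None
    for raw in listvldb_text.splitlines():
        line = raw.strip()
        if line.startswith(">"):
            # Tolerate output pasted from a quoted email.
            line = line[1:].strip()
        if not line:
            continue
        match = SITE_RE.search(line)
        if match and volume is not None:
            counts[volume][match.groups()] += 1
        elif len(line.split()) == 1:
            # Heuristic: a line holding a single word is a volume name.
            volume = line
            counts[volume] = Counter()
    return {
        vol: {site: n for site, n in sites.items() if n > 1}
        for vol, sites in counts.items()
        if any(n > 1 for n in sites.values())
    }
```

Run against the listing above, it should report the RO site on
picacho.ridgetop-group.local /vicepa twice for both root.afs and
root.cell.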

> I'm panicking because all of the volumes are now on the new server and
> non-accessible.  Anyone have some clue what I did wrong and how I can
> fix things?

We are probably going to need more information: what happened, what you
did to try to fix it, and some details of your infrastructure, like how
many AFS DB servers you actually have and whether any of them are
multi-homed.

<<CDC