[OpenAFS] "afs" and "admin" entries disappear from kaserver

Paul Blackburn mpb@est.ibm.com
Thu, 09 Jan 2003 09:27:37 +0000


Brian Sebby wrote:

>I'm setting up a small AFS cell to teach some people about how AFS works,
>and today we ran into a fairly bizarre problem.
>
>The systems involved are Linux servers, and we're using the stock kaserver,
>etc.  I haven't done anything with v5 to keep things simple since this cell
>isn't going into production.
>
>After setting up the first machine as a database and file server, everything
>seemed to be working ok.  We added a second db/file server, and again, it
>mounted AFS and everything looked like it was going smoothly.  We did the
>same steps on the third server (copying over the contents of /usr/afs/etc,
>etc.) and again could mount AFS.
>
>Then we noticed something bizarre.  When we tried to authenticate as admin,
>we got an error message that the "user does not exist".  I looked in kas,
>but couldn't get a listing of the users because I didn't have authorization.
>Looking in the protection database indicated that admin still existed there,
>with an AFS ID of 1.  I finally shut down the servers and started bosserver
>in -noauth mode and did a kas list, and the only things that came back were:
>
>AuthServer.Admin
>krbtgt.IMSA.EDU
>
>Any ideas what might have happened?  Could one of the other servers have
>overwritten the database when syncing with it?  What can I do to recover
>from this?  Any help would be appreciated.
>
>
>Thanks,
>
>Brian Sebby
>
>  
>
Hello Brian,

Total guess: could the third db server have somehow decided it was the
sync site without being in UBIK synchronisation with the other two db
servers?

Some of the "sanity" checks I would do when adding new servers:

a) Before starting AFS on a new server, make sure that time and date
    are correctly set. The simplest way I have found is:
        rdate -s $ntpserver
    or
        ntpdate -u $ntpserver

b) After starting AFS on the new database server, make sure all the
    required bos processes are running.

    Here is an example from a dedicated AFS database server (no
    fileserver process):

    $ bos status 10.33.33.25 -long
        Instance kaserver, (type is simple) currently running normally.
            Process last started at Sun Jan  5 04:00:48 2003 (1 proc starts)
            Command 1 is '/usr/afs/bin/kaserver'

        Instance buserver, (type is simple) currently running normally.
            Process last started at Sun Jan  5 04:00:48 2003 (1 proc starts)
            Command 1 is '/usr/afs/bin/buserver'

        Instance ptserver, (type is simple) currently running normally.
            Process last started at Sun Jan  5 04:00:48 2003 (1 proc starts)
            Command 1 is '/usr/afs/bin/ptserver'

        Instance vlserver, (type is simple) currently running normally.
            Process last started at Sun Jan  5 04:00:48 2003 (1 proc starts)
            Command 1 is '/usr/afs/bin/vlserver'
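
    A minimal sketch of how the check in (b) could be scripted. The
    function name check_bos_instances is my own invention, not part of
    AFS; it only greps for the "currently running normally" text shown
    in the output above, so treat it as a starting point rather than a
    complete health check.

```shell
#!/bin/sh
# Hypothetical helper: read `bos status -long` output on stdin and
# report whether each instance named on the command line is shown as
# "currently running normally". Returns non-zero if any is missing.
check_bos_instances() {
    status_text=$(cat)
    rc=0
    for inst in "$@"; do
        if printf '%s\n' "$status_text" |
            grep -q "Instance $inst, .*currently running normally"; then
            echo "ok: $inst"
        else
            echo "MISSING or not running: $inst"
            rc=1
        fi
    done
    return $rc
}

# Example use against a live server (needs the AFS bos client):
#   bos status 10.33.33.25 -long |
#       check_bos_instances kaserver buserver ptserver vlserver
```

    On a combined db/file server you would presumably add the
    fileserver instance name used in your cell to the list.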

c) After starting AFS on the new database server, make sure that the UBIK
    voting process has successfully elected the "sync site" (lead db server).
    This is done with something like (7004 is the kaserver port):
        udebug $afs-db-server 7004

    For example, here is the output for a cell with 3 AFS database servers:
    10.33.33.25   10.33.33.26   10.33.33.30

    In the following example, 10.33.33.25 has been voted "sync site".
    Note also, the database version active on each server is shown
    (in this case "1035994252.2"). This should be the same on all
    db servers.

        $ udebug 10.33.33.25 7004
        Host's addresses are: 10.33.178.25
        Host's 10.33.33.25 time is Thu Jan  9 04:03:44 2003
        Local time is Thu Jan  9 04:03:43 2003 (time differential -1 secs)
        Last yes vote for 10.33.33.25 was 6 secs ago (sync site);
        Last vote started 6 secs ago (at Thu Jan  9 04:03:37 2003)
        Local db version is 1035994252.2
        I am sync site until 49 secs from now (at Thu Jan  9 04:04:32 2003) (3 servers)
        Recovery state 1f
        Sync site's db version is 1035994252.2
        0 locked pages, 0 of them for write

        Server (10.33.33.30): (db 1035994252.2)
            last vote rcvd 8 secs ago (at Thu Jan  9 04:03:35 2003),
            last beacon sent 6 secs ago (at Thu Jan  9 04:03:37 2003), last vote was yes
            dbcurrent=1, up=1 beaconSince=1

        Server (10.33.33.26): (db 1035994252.2)
            last vote rcvd 11 secs ago (at Thu Jan  9 04:03:32 2003),
            last beacon sent 6 secs ago (at Thu Jan  9 04:03:37 2003), last vote was yes
            dbcurrent=1, up=1 beaconSince=1
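
    The version comparison in (c) can be sketched in shell along these
    lines. The function names are hypothetical, and the parsing relies
    only on the "Local db version is" and "I am sync site" lines as
    they appear in the udebug output above; other releases may format
    these lines differently.

```shell
#!/bin/sh
# Hypothetical helpers for comparing udebug output across db servers.

# Print the "Local db version" value from udebug output on stdin.
udebug_db_version() {
    sed -n 's/^[[:space:]]*Local db version is \([0-9.]*\).*$/\1/p'
}

# Succeed if the udebug output on stdin contains "I am sync site".
udebug_is_sync_site() {
    grep -q 'I am sync site'
}

# Succeed if every argument is identical to the first one.
versions_agree() {
    first=$1
    for v in "$@"; do
        [ "$v" = "$first" ] || return 1
    done
    return 0
}

# Example use against live servers (needs the AFS udebug client):
#   v1=$(udebug 10.33.33.25 7004 | udebug_db_version)
#   v2=$(udebug 10.33.33.26 7004 | udebug_db_version)
#   v3=$(udebug 10.33.33.30 7004 | udebug_db_version)
#   versions_agree "$v1" "$v2" "$v3" || echo "db versions differ"
```

    Exactly one server should pass udebug_is_sync_site; if none or more
    than one does, the election has not settled.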

If the UBIK voting process has completed OK then all should be well.
Why check this? Several things can go wrong:
    - server clocks not in time synchronisation
    - network connection problems
    - local firewall rules blocking UBIK synchronisation
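
For the first of those, udebug already reports a "time differential"
between the local clock and the server's clock, so a rough skew check
can be scripted. This is only a sketch: the function names are made up,
and the limit passed to skew_ok is an arbitrary value you choose, not
a threshold taken from UBIK itself.

```shell
#!/bin/sh
# Hypothetical helpers for checking clock skew from udebug output.

# Print the time differential (in seconds, possibly negative) from
# udebug output on stdin.
udebug_time_skew() {
    sed -n 's/.*(time differential \(-\{0,1\}[0-9][0-9]*\) secs).*/\1/p'
}

# skew_ok SKEW LIMIT: succeed if |SKEW| <= LIMIT (both in seconds).
skew_ok() {
    skew=$1
    limit=$2
    abs=${skew#-}          # drop a leading minus sign, if any
    [ "$abs" -le "$limit" ]
}

# Example use against a live server (needs the AFS udebug client);
# the limit of 5 seconds is an arbitrary choice:
#   s=$(udebug 10.33.33.25 7004 | udebug_time_skew)
#   skew_ok "$s" 5 || echo "clock skew of $s secs, check NTP"
```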

I hope this helps.
--
cheers
paul                                http://acm.org/~mpb