[OpenAFS] /afs does not exist

Dexter 'Kim' Kimball dhk@ccre.com
Wed, 14 Apr 2004 08:22:05 -0600


> >I do suggest creating an RO on "server rsl155 partition /vicepa RW
> >Site" -- same server, same partition as RW -- doesn't cost much
> Mmm not sure how to do this? I'm not exactly an AFS expert as you prob
> guessed!

You'll need admin tokens.  Use vos addsite to add the new site to the VLDB
entry for root.afs, then vos release the volume.

Syntax: vos addsite -server <machine name for new site> -partition
<partition name for new site> -id <volume name or ID>
Something like:
        vos addsite rsl155 a root.afs
        vos listvl root.afs     # lets you see the impact of vos addsite
        vos release root.afs -v
        vos listvl root.afs     # to see impact of vos rel on VLDB entry
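
If you want to check the result without eyeballing it, here is a rough sketch that reads "vos listvl" output and says whether an RO site sits on the same server and partition as the RW.  The function name has_local_ro is made up for illustration, and it assumes the "server X partition Y RW/RO Site" line layout shown in this thread.

```shell
# Sketch only: reads `vos listvl <volume>` output on stdin.
# Prints "yes" if some RO site shares server and partition with the RW
# site, otherwise "no". has_local_ro is a hypothetical helper name.
has_local_ro() {
    awk '
        $1 == "server" && $3 == "partition" {
            key = $2 " " $4              # server name + partition
            if ($5 == "RW") rw[key] = 1
            if ($5 == "RO") ro[key] = 1
        }
        END {
            found = "no"
            for (k in rw) if (k in ro) found = "yes"
            print found
        }
    '
}
```

Usage would be something like "vos listvl root.afs | has_local_ro", run before and after the vos addsite / vos release steps.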


> Hi Kim,
>
> Rebooted some of the bad clients last night to no avail. I have added
> some answers to your questions below.
> I think that the clients may have stopped working on 11 April actually
> (not 10 April).
>
> Thanks for your help on this.
>
> JS.
> >
> > >
> >
> >since such an RO is COW/clone and will allow rsl155 to be used as an RO
> >failover site.  (Client won't fail over to the RW if it's supposed to get
> >an RO, and when root.afs is replicated the client will be looking for an
> >RO.)
> Mmm not sure how to do this? I'm not exactly an AFS expert as you prob
> guessed!
>
> >
> >Are you able to "vos listvl root.afs" from a good client/bad client?
>
> Bad client:
> $ vos listvl root.afs
> vsu_ClientInit: Could not process files in configuration directory
> (/usr/vice/etc).
> could not initialize VLDB library (code=4294967295)
>

Above ... looks like the AFS kernel extensions aren't loaded (can't use some
resources normally provided by afs) so vos listvl fails.

> Good client:
> $ vos listvl root.afs
>
> root.afs
>     RWrite: 536870915     ROnly: 536870916
>     number of sites -> 3
>        server rsl155 partition /vicepa RW Site
>        server rsl156 partition /vicepa RO Site
>        server rsl59 partition /vicepa RO Site
>
> >
> > > >
> > > >rxdebug <hostname> 7001    will give you some info about activity on
> > > >the AFS client's callback port
> > >
> > > On a working client:
> >
> >Nothing conclusive here, but nothing unexpected either.  rsl55 has
> >talked to a fileserver (apparently itself) and still has an open
> >connection.
> >
> > >
> > > rsl55:/afs/.uk.baplc.com# rxdebug rsl55 7001
> > > Trying 167.156.154.55 (port 7001):
> > > Free packets: 130, packet reclaims: 0, calls: 101338, used FDs: 64
> > > not waiting for packets.
> > > 0 calls waiting for a thread
> > > 1 threads are idle
> > > Connection from host 167.156.154.55, port 7000, Cuid 9915c0ac/1817cfe8
> > >   serial 128760,  natMTU 1444, flags pktCksum, security index 2,
> > > client conn
> > >   rxkad: level clear, flags pktCksum
> > >   Received 271944 bytes in 2518 packets
> > >   Sent 180088696 bytes in 128706 packets
> > >     call 0: # 2518, state dally, mode: receiving, flags: receive_done
> > >     call 1: # 0, state not initialized
> > >     call 2: # 0, state not initialized
> > >     call 3: # 0, state not initialized
> > > Done.
> >
> >
> >While rsl56 either hasn't talked to a fileserver recently or at all -- in
> >any case there's no connection.
> >
> >?????  Someone correct me if I'm wrong ... IIRC the connections to 7001
> >from a given fileserver will time out after a period of non-use???
> >
> > >
> > > On a broken client:
> > >
> > > rsl56:/# rxdebug rsl56 7001
> > > Trying 167.156.154.56 (port 7001):
> > > Free packets: 130, packet reclaims: 0, calls: 79437, used FDs: 64
> > > not waiting for packets.
> > > 0 calls waiting for a thread
> > > 1 threads are idle
> > > Done.
> > >
> > >
> > > >
> > > >df to see if /afs still appears in the output
> > >
> > > /dev/logsarc     9469952   6327480   34%      244     1% /logs/archive
> > > AFS
> > > df: /afs: No such file or directory
> > >
> >
> >Expected from broken client.
> >
> >
> >Is there anything (like a reboot) that happened .... oh, I see, it looks
> >like you're doing the Sunday default fileserver restarts ... judging from
> >the dates ...
> >
> >rsl57:/usr/afs/local# ps -ef | grep afs
> > > > >     root 17314 40486   0   11 Apr      -  0:00
> >/usr/afs/bin/fileserver
> > > > >     root 17686 40486   0   11 Apr      -  0:00
> >/usr/afs/bin/volserver
> > > > >     root 20134     1   0   08 May      - 17:24 /usr/vice/etc/afsd
> > > > > -stat 2800
> > > > > -dcache 2400 -daemons 5 -volumes 128
> > > > >     root 20384     1   0   08 May      - 17:23 /usr/vice/etc/afsd
> > > > > -stat 2800
> >
> >
> >How about sending the output from bos status <fileserver> -long ... for
> >each
> >fileserver -- or at least telling me when the last restart times were for
> >each.
>
> $ bos status -server rsl155 -long
> Bosserver reports inappropriate access on server directories
> Instance fs, (type is fs) currently running normally.
>     Auxiliary status is: file server running.
>     Process last started at Sun Apr 11 04:01:10 2004 (2 proc starts)
>     Command 1 is '/usr/afs/bin/fileserver'
>     Command 2 is '/usr/afs/bin/volserver'
>     Command 3 is '/usr/afs/bin/salvager'
>
> Instance kaserver, (type is simple) currently running normally.
>     Process last started at Sun Apr 11 04:01:10 2004 (1 proc starts)
>     Command 1 is '/usr/afs/bin/kaserver'
>
> Instance buserver, (type is simple) currently running normally.
>     Process last started at Sun Apr 11 04:01:10 2004 (1 proc starts)
>     Command 1 is '/usr/afs/bin/buserver'
>
> Instance ptserver, (type is simple) currently running normally.
>     Process last started at Sun Apr 11 04:01:10 2004 (1 proc starts)
>     Command 1 is '/usr/afs/bin/ptserver'
>
>
> >
> >Is it possible that all your fileservers restarted at the same time, or
> >that
> >the two fileservers with root.afs.readonly restarted at the same time, or
> >were unavailable at the same time?
> >
> >If so, what were the broken clients doing during the restart?  Might be
> >worth checking uptime on the broken clients to see if they were restarted
> >while the fileservers were restarting.
> All of these are possible but it's unlikely 6 clients would have been
> restarted at the same time.

Don't know many operations that automagically reboot clients on a regular
basis, but ...
Anyway, unless this were scheduled intentionally it is unlikely 6 clients
would reboot simultaneously w/o power interruption.

> >
> >?????  Do any of the fs commands (fs checkv or fs checks, e.g.) work on
> >the client?  I don't know if those are blocked when /afs won't mount or
> >not.  Anyone???
>
> These ones worked:
> # fs checkv
> All volumeID/name mappings checked.
> # fs checks
> All servers are running.
>



> >
> >How about fs setcache?  (Gotta be root, and I'd try this on a
> broken client
>
> # fs setcache 1
> New cache size set.
> # fs setcache 0
> New cache size set.
> # fs checkv
> All volumeID/name mappings checked.
> # cd /afs
> ksh: /afs:  not found.


Flushing the cache didn't work :(

>
>
> >
> >Looks like afsd has been up for almost a year?  While AFS fileserver
> >procs were restarted 11 Apr?  Clients were good on Apr 10 ...
>
> I'm not sure but it looks as if it was OK until the 11th April.
>

OK.  From bos status info it looks like the fileserver and DB servers are
restarted at the same time ...  all DB servers also fileservers ??
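
To compare restart times across servers without reading each report by hand, something like the following could pull the instance names and start times out of "bos status -long" output.  This is only a sketch: bos_start_times is a made-up name, and it assumes the "Instance ..." / "Process last started at ..." lines look like the ones quoted in this thread.

```shell
# Sketch only: reads `bos status -server <host> -long` output on stdin and
# prints one "<instance><TAB><last start time>" line per instance.
# bos_start_times is a hypothetical helper name.
bos_start_times() {
    awk '
        $1 == "Instance" { inst = $2; sub(/,$/, "", inst) }
        /Process last started at/ {
            t = $0
            sub(/^.*Process last started at[ \t]*/, "", t)   # drop the label
            sub(/[ \t]*\(.*$/, "", t)                        # drop "(N proc starts)"
            print inst "\t" t
        }
    '
}
```

Run it as "bos status -server rsl155 -long | bos_start_times" for each server; identical times across fileserver and DB server instances point at a simultaneous restart.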

> >
> >While we're at it, has anything changed with your AFS DB servers?  Are
> >the CellServDB files correct on clients (/usr/vice/etc) and servers
> >(/usr/afs/etc)?  What does "bos listhosts" report from each of the
> >fileservers?  "fs listc" from the clients?
>
> # bos listhosts -server rsl155
> Cell name is uk.dd.com
>     Host 1 is rsl155
>     Host 2 is rsl156
>
> Although if I run this on rsl155 it gives:
> $ bos listhosts -server rsl155
> bos: can't open cell database (/usr/vice/etc)
> even though /usr/vice/etc/CellServDB is present.
>
> # fs listc
> Cell uk.dd.com on hosts rsl155.dd.com.
>
> # cat /usr/vice/etc/CellServDB
> >uk.dd.com   #Cell name
> 161.2.249.91    #rsl155.dd.com

The bos listhosts output is telling me that the fileserver is aware of 2
database servers, rsl155 and rsl156, while the client only knows about
rsl155.

Unless there's some reason (firewall) to subset the entries in the client
CellServDB all AFS DB servers should be listed there.

Pro forma -- I'd make sure that all CellServDB files contain entries for all
AFS DB servers (the ones that run kaserver, vlserver, ptserver ...)
You can use bos addhost to fix the servers (check the
/usr/afs/etc/CellServDB file afterwards) ... for the clients,
edit/distribute /usr/vice/etc/CellServDB and then, as root, use the "fs
newcell" command to inform the running afs client of the change.  Specify
all of the DB servers for the cell when using fs newcell.  Use fs listcell
afterwards to make sure the client lists the same DB servers you've put in
the CSDB.
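
A rough way to spot the mismatch described above is to reduce both views to sorted lists of short host names and compare them.  The sketch below assumes the CellServDB layout (">cell" line followed by "address  #hostname" lines) and the "Host N is ..." lines shown in this thread; csdb_hosts and bos_hosts are made-up names, and the /tmp file names in the comments are placeholders.

```shell
# Sketch only: short host names (domain stripped) listed for cell $1 in a
# CellServDB read from stdin. Entries without a "#hostname" comment are
# skipped. csdb_hosts is a hypothetical helper name.
csdb_hosts() {
    awk -v cell="$1" '
        /^>/ { in_cell = (substr($1, 2) == cell); next }
        in_cell && NF {
            host = $0
            if (sub(/^[^#]*#[ \t]*/, "", host) == 0) next   # no comment
            sub(/[ \t]+$/, "", host)                        # trailing blanks
            sub(/\..*$/, "", host)                          # strip domain
            print host
        }
    ' | sort -u
}

# Sketch only: short host names from `bos listhosts` output on stdin.
bos_hosts() {
    awk '$1 == "Host" && $3 == "is" { h = $4; sub(/\..*$/, "", h); print h }' |
        sort -u
}

# Example (needs the AFS tools; file names are placeholders):
#   bos listhosts -server rsl155 | bos_hosts > /tmp/servers.txt
#   csdb_hosts uk.dd.com < /usr/vice/etc/CellServDB > /tmp/client.txt
#   comm -23 /tmp/servers.txt /tmp/client.txt   # DB servers the client lacks
```

With the data quoted in this thread the comparison would report rsl156 as missing from the client's CellServDB.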

If the DB servers on rsl155 are unavailable, the clients currently cannot
find a DB server (no secondary DB server to fail over to), so they can't
get the VLDB entry for root.afs and can't mount root.afs at /afs.

Conventional wisdom is that three AFS DB servers are better than two.

BTW, lest I assume into a bad place -- how many DB servers are you running?

Kim



Gotta run

Kim

=================================
Kim (Dexter) Kimball
CCRE, Inc.
>
> >
> >Kim
> >
>
> > > > > Hi,
> > > > >
> > > > > I'm having problems getting in to the /afs directory on an AIX
> > > > > box and I'm
> > > > > not sure how to fix it:
> > > > >
> > > > > # cd /afs
> > > > > ksh: /afs:  not found.
> > > > >
> > > > > > This has been running fine until now though. The processes are
> > > > > > still running:
> > > > >
> > > > > rsl57:/usr/afs/local# ps -ef | grep afs
> > > > >     root 17314 40486   0   11 Apr      -  0:00
> >/usr/afs/bin/fileserver
> > > > >     root 17686 40486   0   11 Apr      -  0:00
> >/usr/afs/bin/volserver
> > > > >     root 20134     1   0   08 May      - 17:24 /usr/vice/etc/afsd
> > > > > -stat 2800
> > > > > -dcache 2400 -daemons 5 -volumes 128
> > > > >     root 20384     1   0   08 May      - 17:23 /usr/vice/etc/afsd
> > > > > -stat 2800
> > > > > -dcache 2400 -daemons 5 -volumes 128
> > > > >     ..
> > > > >     ..
> > > > >     root 40486     1   0   11 Apr      -  0:00
> >/usr/afs/bin/bosserver
> > > > >
> > > > > And the logs look normal. I can ping the AFS server also.
> > > > >
> > > > > Is there anything else I can try?
> > > > >
> > > > > Thanks for any help.
> > > > >
> > > > > JS.
>
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info
>