[OpenAFS-devel] Start script tidbits

Brett Johnson mlafsinfo@k50.net
Wed, Apr 18 2001 18:25:00 -0500


Probably the best way to stress-test a network setup is to let the power company work its magic (i.e., power outages longer than the UPSes can hold).

Having fallen victim to one of these this week (and madder than a wet hornet), I've found out some things that may help others and make AFS a little more fault tolerant under extreme conditions.

I'm running OpenAFS 1.03 and RH62 with a 2.2.18 kernel (all self compiled).  The two systems I will mention are still in lab and not yet in production.

#1) My AFS server is also set up as a client to itself.  The /cache partition is a dedicated ext2 partition of about 230 MB.  While it didn't have any activity at 5am, apparently some of the cache files were open at the time of the power failure.  When I came in, the AFS start script was hung and would not restart no matter what (i.e., it would panic the system).  I deleted everything out of /cache and let it recreate itself, and everything came back normally.

Paranoid Solution #1:  Add this line in the /etc/rc.d/init.d/afs start script just under the start) line: (watch out for line wrapping)

#######
find /usr/vice/cache -mindepth 1 -maxdepth 1 ! -name 'lost+found' -print0 2>/dev/null | xargs -0 rm -rf 2>/dev/null
#######
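
For anyone who would rather wrap the cache wipe in a guarded function, here is a minimal sketch. The function name clear_cache_dir is made up for illustration, and it assumes GNU find (for -mindepth and -maxdepth). It refuses an empty argument or "/", leaves the directory itself and a top-level lost+found in place, and removes everything else.

```shell
# Hypothetical helper (not part of OpenAFS): empty a cache directory but
# keep the directory itself and any top-level lost+found entry.
clear_cache_dir() {
    dir=$1
    # Refuse obviously dangerous targets.
    case "$dir" in
        ""|"/")
            echo "clear_cache_dir: refusing to clear '$dir'" >&2
            return 1
            ;;
    esac
    # Nothing to do if the directory does not exist.
    [ -d "$dir" ] || return 1
    # Assumes GNU find: visit only the top-level entries and let
    # rm -rf take care of whatever is underneath each one.
    find "$dir" -mindepth 1 -maxdepth 1 ! -name 'lost+found' -exec rm -rf {} +
}

# Possible use in the start script (path taken from the line above):
# clear_cache_dir /usr/vice/cache
```

Working on top-level entries only means file names never pass through a pipe, so odd characters in names cannot confuse the delete.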

On a side note, does anyone have an idea why corrupted cache files would not just be deleted???

#2) My faithful client box was a bit perplexed about not finding its server and had hung on startup.  My eventual goal for all this (like any admin's) is for the box not to come fully up until it can actually reach the server (otherwise I have to go and start it manually).  My solution is simple but still somewhat of a hack.

Paranoid Solution #2:  Test whether the AFS server is running; if not, wait and retry; if it still fails after an hour, exit without even trying to start.  This script bit is *reasonably* server aware (trying to keep it generic so it is easier to distribute on my side).  Add this section in the /etc/rc.d/init.d/afs start script just under the start) line: (watch out for line wrapping)

#######

        # Do a primitive "up or wait test" first for a non-server client.
        # This section should only be executed on a client system.
        if ! test -e /usr/afs/bin/bosserver ; then
        # Be careful in positioning this AFS start script in relation to other start scripts.
        SECONDS=0
        # Find a program that returns non-zero on failure (udebug someday?).
        if test -e /usr/vice/bin/vos ; then
           TESTME="/usr/vice/bin/vos listvol"
        else
           TESTME="vos listvol"
           fi
        # This is a generic kludge and may not work for everyone.
        # It also assumes the primary AFS server is the lowest IP number.
        THISCELL=`cat /usr/vice/etc/ThisCell | tr -d "\r" | tr -d "\n"`
        until ${TESTME} `grep -i ${THISCELL} /usr/vice/etc/CellServDB|grep -v "^>"|grep -v "^#"|sort|head -1|cut -f1 -d \ |cut -f1` >/dev/null 2>/dev/null
        # If the generic kludge does not work, use this line and spell out the server.
        # until ${TESTME} afs01.k50.net >/dev/null 2>/dev/null
           do
           echo "Searching for AFS server failed, retrying..."
           if test $SECONDS -gt 3600 ; then
              echo "Unable to contact AFS server, exiting."
              exit 1
              fi
           sleep 10s
           done
        fi #end of big client if.

#######
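
The same idea can be split into two small functions, sketched here under some assumptions. The names cell_servers and wait_for are hypothetical. The parser assumes the usual CellServDB layout, a ">cellname" header line followed by "address  #hostname" lines, and it returns servers in file order rather than sorting for the lowest IP as the script above does. The retry loop assumes a date command that supports +%s.

```shell
# Print the server addresses listed under one cell in a CellServDB-style
# file: everything between the ">cell" header and the next ">" header.
cell_servers() {
    cell=$1
    db=$2
    awk -v cell="$cell" '
        /^>/ {
            sub(/^>/, "", $1)                        # strip the leading ">"
            in_cell = (tolower($1) == tolower(cell)) # case-insensitive match
            next
        }
        in_cell && NF > 0 && $1 !~ /^#/ { print $1 }
    ' "$db"
}

# Retry a command every INTERVAL seconds until it succeeds or TIMEOUT
# seconds have passed.  Returns 0 on success, 1 on timeout.
wait_for() {
    timeout=$1
    interval=$2
    shift 2
    start=$(date +%s)
    until "$@" >/dev/null 2>&1; do
        now=$(date +%s)
        if [ $((now - start)) -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
    done
    return 0
}

# Possible use, mirroring the script above (paths and the vos command
# are taken from it):
# server=$(cell_servers "$(cat /usr/vice/etc/ThisCell)" /usr/vice/etc/CellServDB | head -n 1)
# wait_for 3600 10 vos listvol "$server" || exit 1
```

Matching on the ">" header avoids depending on the cell name showing up in each server's hostname comment, which the grep version relies on.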

Originally I was going to use the udebug program to do the ${TESTME} part, but udebug always seems to return 0.  When I pointed it at a dummy IP, it printed out a -1 error, but still returned 0.  Is that supposed to happen?
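
If a tool prints an error but still exits 0, one possible workaround is to judge success by its output as well as its exit status. The sketch below is generic: the function name ok_by_output is hypothetical, and the failure pattern is something you would have to pick after looking at what the tool actually prints. No particular udebug message text is assumed here.

```shell
# Return 0 only if the command exits 0 AND its combined output does not
# match the given failure pattern (an extended regular expression).
ok_by_output() {
    fail_pattern=$1
    shift
    # A non-zero exit status is always treated as a failure.
    output=$("$@" 2>&1) || return 1
    # A zero exit status with a matching error message is also a failure.
    if printf '%s\n' "$output" | grep -E -q -e "$fail_pattern"; then
        return 1
    fi
    return 0
}
```

It can stand in for the bare command in an until loop, for example "until ok_by_output 'PATTERN' some_command args; do ...; done", where PATTERN and some_command are placeholders.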

Now, I guess the big question is: can something like this be assimilated into the main distribution?  Maybe as an option or something?  Not so much deleting the cache on every start, but waiting until the server is there before the client starts?  Typically when my AFS client does a "half start" it will not do a full start later without a reboot (this is why I prefer all-or-nothing starts).

As always, this code is provided as is without any warranty.  Just because it works on my system doesn't mean it will work perfectly on yours.  If you don't know a lot about shell scripting, BACK UP and GET HELP.  This message will self destruct in 30 seconds...

B++/K90, Inc.