[OpenAFS-devel] Start script tidbits

Thu, 19 Apr 2001 17:03:53 GMT

Probably the best way to stress a network setup is to let the power company 
work their magic (i.e., power outages that last longer than the UPSes can hold).

Having fallen victim to one of these this week (and madder than a wet 
hornet), I've found out some things that may help others and make AFS a 
little more fault tolerant under extreme conditions. 

I'm running OpenAFS 1.03 and RH62 with a 2.2.18 kernel (all self compiled).  
The two systems I will mention are still in lab and not yet in production. 

#1) My AFS server is also set up as a client to itself.  The /cache 
partition is a dedicated ext2 partition of about 230 MB.  While it didn't 
have any activity at 5am, apparently some of the cache files were open at 
the time of the power failure.  When I came in, the AFS start script was 
hung and would not restart no matter what (i.e., it would panic the system).  
I deleted everything out of /cache and let it recreate itself, and all came 
back normally. 

Paranoid Solution #1:  Add this line in the /etc/rc.d/init.d/afs start 
script just under the start) line: (watch out for line wrapping) 

#######
# Empty the cache directory, keeping the directory itself and lost+found.
# (-mindepth and -maxdepth are GNU find options.)
find /usr/vice/cache -mindepth 1 -maxdepth 1 ! -name 'lost+found' \
     -exec rm -rf {} \; 2>/dev/null
####### 
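
For anyone who would rather not have a bare rm -rf in a start script, here 
is a minimal sketch of the same idea wrapped in a function with a couple of 
sanity checks.  The function name clear_afs_cache is hypothetical (it is not 
part of OpenAFS), and it assumes GNU find for -mindepth/-maxdepth. 

```shell
#!/bin/sh
# Sketch only: clear_afs_cache is a hypothetical helper, not an OpenAFS tool.
# It empties the given cache directory but keeps the directory itself and
# any ext2 lost+found entry at its top level.
clear_afs_cache() {
    dir="$1"
    # Refuse obviously dangerous targets.
    case "$dir" in
        ""|"/")
            echo "clear_afs_cache: refusing to clear '$dir'" >&2
            return 1
            ;;
    esac
    # Nothing to do if the directory is missing.
    [ -d "$dir" ] || return 1
    # Only look at top-level entries; rm -rf handles anything below them.
    find "$dir" -mindepth 1 -maxdepth 1 ! -name 'lost+found' \
        -exec rm -rf {} \;
}
```

In the start script this would be called as "clear_afs_cache /usr/vice/cache" 
before afsd is started. 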

On a side note, does anyone have an idea why corrupted cache files would not 
just be deleted??? 

#2) My faithful client box was a bit perplexed about not finding its server 
and had hung on startup.  My eventual goal for all this (as any admin) is to 
not have the box come fully up until it can adequately connect to the server 
(otherwise I have to go do it manually).  My solution is simple but still 
somewhat of a hack. 

Paranoid Solution #2:  Test whether the AFS server is answering; if not, 
wait and retry; if it still fails after an hour, exit and don't even try to 
start.  This script bit is *reasonably* server aware (trying to keep it 
generic so it is easier to distribute on my side).  Add this section in the 
/etc/rc.d/init.d/afs start script just under the start) line: (watch out for 
line wrapping) 

####### 

       # Do a primitive "up or wait test" first for a non-server client.
       # This section should only be executed on a client system.
       if ! test -e /usr/afs/bin/bosserver ; then
       # Be careful in positioning this AFS start script in relation to
       # other start scripts.
       SECONDS=0
       # Find a program that returns non-zero on failure (udebug someday?).
       if test -e /usr/vice/bin/vos ; then
          TESTME="/usr/vice/bin/vos listvol"
       else
          TESTME="vos listvol"
          fi
       # This is a generic kludge and may not work for everyone.
       # It also assumes the primary AFS server is the one that sorts first
       # (plain sort is lexical, so this is only roughly "lowest IP number").
       THISCELL=`cat /usr/vice/etc/ThisCell | tr -d "\r" | tr -d "\n"`
       until ${TESTME} `grep -i "${THISCELL}" /usr/vice/etc/CellServDB |
             grep -v "^>" | grep -v "^#" | sort | head -1 |
             cut -f1 -d ' ' | cut -f1` >/dev/null 2>/dev/null
       # If the generic kludge does not work, use this line and spell out
       # the server.
       # until ${TESTME} afs01.k50.net >/dev/null 2>/dev/null
          do
          echo "Searching for AFS server failed, retrying..."
          if test $SECONDS -gt 3600 ; then
             echo "Unable to contact AFS server, exiting."
             exit 1
             fi
          sleep 10s
          done
       fi #end of big client if. 

####### 
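
The retry loop above leans on bash's SECONDS variable.  As a rough sketch of 
the same wait-or-give-up idea for a plain /bin/sh, the loop can be pulled 
into a function that counts the time it has slept.  The name wait_until_ok 
and its argument order are made up for illustration, and the elapsed time is 
only approximate because it ignores how long the test command itself takes. 

```shell
#!/bin/sh
# Sketch only: wait_until_ok is a hypothetical helper.
# Usage: wait_until_ok TIMEOUT INTERVAL COMMAND [ARGS...]
# Reruns COMMAND until it exits 0, or gives up once about TIMEOUT
# seconds of sleeping have accumulated.
wait_until_ok() {
    timeout="$1"
    interval="$2"
    shift 2
    waited=0
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Unable to contact AFS server, giving up." >&2
            return 1
        fi
        echo "Searching for AFS server failed, retrying..."
        sleep "$interval"
        # Count only the time spent sleeping between attempts.
        waited=$((waited + interval))
    done
    return 0
}
```

With the values from the script above, the call would look something like 
"wait_until_ok 3600 10 /usr/vice/bin/vos listvol afs01.k50.net || exit 1". 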

Originally I was going to use the udebug program to do the ${TESTME} part, 
but udebug always seems to return 0.  When I pointed it at a dummy IP, it 
printed out a -1 error, but still returned 0.  Is that supposed to happen? 
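
One possible workaround for a tool whose exit status can't be trusted is to 
look at what it prints as well.  The sketch below is generic: the helper 
name succeeds_by_output is hypothetical, and the error pattern to match is an 
assumption that would have to be taken from what the tool really prints on 
failure. 

```shell
#!/bin/sh
# Sketch only: succeeds_by_output is a hypothetical helper.
# Usage: succeeds_by_output PATTERN COMMAND [ARGS...]
# Treats COMMAND as failed if it exits non-zero OR if its combined
# stdout/stderr matches the extended regex PATTERN.
succeeds_by_output() {
    pattern="$1"
    shift
    # A non-zero exit status is still a failure.
    output=$("$@" 2>&1) || return 1
    # An error message in the output counts as a failure too.
    if printf '%s\n' "$output" | grep -E -q -- "$pattern"; then
        return 1
    fi
    return 0
}
```

The udebug invocation, together with whatever error text it really prints, 
could then stand in for the vos listvol test in the until loop. 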

Now, I guess the big question is: can something like this be assimilated 
into the main distribution, maybe as an option or something?  Not so much 
deleting the cache on every start, but waiting until the server is there 
before the client starts.  Typically when my AFS client does a "half start" 
it will not do a full start later without a reboot (this is why I prefer 
all-or-nothing starts). 

As always, this code is provided as is without any warranty.  Just because 
it works on my system doesn't mean it will work perfectly on yours.  If you 
don't know a lot about shell scripting, BACK UP and GET HELP.  This message 
will self destruct in 30 seconds... 

B++/K90, Inc.