[OpenAFS-devel] Start script tidbits
Brett Johnson
mlafsdevel@k50.net
Thu, 19 Apr 2001 17:03:53 GMT
Probably the best way to stress a network setup is to let the power company
work their magic (i.e. power outages longer than the UPSes can hold).
Victim to one of these this week (and madder than a wet hornet), I've found
out some things that may help others and make AFS a little more fault tolerant
under extreme conditions.
I'm running OpenAFS 1.03 and RH62 with a 2.2.18 kernel (all self compiled).
The two systems I will mention are still in lab and not yet in production.
#1) My AFS server is also set up as a client to itself. The /cache
partition is a dedicated ext2 partition of about 230 MB. While it didn't
have any activity at 5am, apparently some of the cache files were open at
the time of the power failure. When I came in, the AFS start script was hung
and would not restart no matter what (i.e. it would panic the system). I
deleted everything out of /cache and let it recreate itself, and all came
back normally.
Paranoid Solution #1: Add this command to the /etc/rc.d/init.d/afs start
script just under the start) line: (watch out for line wrapping)
#######
find /usr/vice/cache/ -depth -print 2>/dev/null \
  | grep -v "^/usr/vice/cache/$" | grep -v "^/usr/vice/cache$" \
  | grep -v "^/usr/vice/cache/lost+found$" \
  | xargs -l20 rm -rf 2>/dev/null
#######
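For anyone who would rather avoid the grep chain, here is a minimal sketch of
the same idea as a shell function. The name clear_afs_cache is made up for
illustration, and it assumes a find that supports -mindepth and -maxdepth
(GNU find does; these are not in POSIX). It empties the directory it is given
but leaves the directory itself and a top-level lost+found alone.

```shell
# Hypothetical helper, not part of the stock afs start script.
# Empties a cache directory but keeps the directory itself and any
# top-level lost+found entry.
clear_afs_cache() {
    cache_dir="$1"
    # Refuse to run on an empty or missing path, so a typo can never
    # turn this into "rm -rf" somewhere unexpected.
    [ -n "$cache_dir" ] && [ -d "$cache_dir" ] || return 1
    # -mindepth/-maxdepth are find extensions (assumed available).
    find "$cache_dir" -mindepth 1 -maxdepth 1 ! -name 'lost+found' \
        -exec rm -rf {} + 2>/dev/null
}
```

In the start script this would be called as "clear_afs_cache /usr/vice/cache"
before afsd is started.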
On a side note, does anyone have an idea why corrupted cache files would not
simply be deleted?
#2) My faithful client box was a bit perplexed about not finding its server
and had hung on startup. My eventual goal for all this (as any admin) is to
not have the box come fully up until it can adequately connect to the server
(otherwise I have to go do it manually). My solution is simple but still
somewhat of a hack.
Paranoid Solution #2: Test whether the AFS server is running; if not, wait
and retry; if it still fails after an hour, exit and don't even try to
start. This script bit is *reasonably* server aware (I'm trying to keep it
generic so it is easier to distribute on my side). Add this section to the
/etc/rc.d/init.d/afs start script just under the start) line: (watch out for
line wrapping)
#######
# Do a primitive "up or wait test" first for a non-server client.
# This section should only be executed on a client system.
if ! test -e /usr/afs/bin/bosserver ; then
    # Be careful in positioning this AFS start script in relation to
    # other start scripts.
    # SECONDS is bash's built-in elapsed-time counter; reset it here.
    SECONDS=0
    # Find a program that will return nonzero on failure (udebug someday?).
    if test -e /usr/vice/bin/vos ; then
        TESTME="/usr/vice/bin/vos listvol"
    else
        TESTME="vos listvol"
    fi
    # This is a generic kludge and may not work for everyone.
    # It also assumes the primary AFS server is the lowest IP number.
    THISCELL=`cat /usr/vice/etc/ThisCell | tr -d "\r" | tr -d "\n"`
    until ${TESTME} `grep -i ${THISCELL} /usr/vice/etc/CellServDB \
        | grep -v "^>" | grep -v "^#" | sort | head -1 \
        | cut -f1 -d ' ' | cut -f1` >/dev/null 2>/dev/null
    # If the generic kludge does not work, use this line and spell out
    # the server.
    # until ${TESTME} afs01.k50.net >/dev/null 2>/dev/null
    do
        echo "Searching for AFS server failed, retrying..."
        if test $SECONDS -gt 3600 ; then
            echo "Unable to contact AFS server, exiting."
            exit 1
        fi
        sleep 10s
    done
fi #end of big client if.
#######
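The retry-until-timeout part of the section above can also be pulled out into
a small reusable function. This is only a sketch: the name wait_until_ok and
its argument order are made up for illustration. It keeps its own counter
instead of relying on bash's SECONDS, so it only counts the time spent
sleeping, not the time the probe command itself takes.

```shell
# Hypothetical helper: wait_until_ok TIMEOUT INTERVAL COMMAND [ARGS...]
# Runs COMMAND every INTERVAL seconds until it exits 0.
# Returns 0 on success, 1 once at least TIMEOUT seconds of sleeping
# have passed without a successful run.
wait_until_ok() {
    timeout="$1"
    interval="$2"
    shift 2
    elapsed=0
    until "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Possible use in the start script (server name taken from the
# commented-out line in the original section):
#   wait_until_ok 3600 10 vos listvol afs01.k50.net || exit 1
```

Passing the probe as separate arguments ("$@") avoids the word-splitting
surprises that come with storing a whole command line in one variable.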
Originally I was going to use the udebug program to do the ${TESTME} part,
but udebug always seems to return 0. When I pointed it at a dummy IP, it
printed out a -1 error, but still returned 0. Is that supposed to happen?
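One way to work around a program that reports trouble in its output but still
exits 0 is to check the output as well as the exit status. A rough sketch
follows; the function name ok_unless_output is made up, and the pattern to
look for in udebug's output is a guess that would need checking against what
udebug really prints on your system.

```shell
# Hypothetical helper: ok_unless_output PATTERN COMMAND [ARGS...]
# Succeeds only if COMMAND exits 0 AND its combined stdout/stderr
# does not contain a line matching PATTERN.
ok_unless_output() {
    pattern="$1"
    shift
    # A nonzero exit status is always a failure.
    output=$("$@" 2>&1) || return 1
    # An exit status of 0 still counts as failure if the output
    # matches the error pattern.
    if printf '%s\n' "$output" | grep -q -e "$pattern"; then
        return 1
    fi
    return 0
}

# Possible use (the pattern is an assumption, not verified udebug output):
#   ok_unless_output 'error' udebug -server afs01.k50.net -port 7003
```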
Now, I guess the big question is: can something like this be assimilated into
the main distribution, maybe as an option or something? Not so much
deleting the cache on every start, but waiting until the server is there
before the client starts. Typically when my AFS client does a "half start"
it will not do a full start later without a reboot (this is why I prefer
all-or-nothing starts).
As always, this code is provided as is without any warranty. Just because
it works on my system doesn't mean it will work perfectly on yours. If you
don't know a lot about shell scripting, BACK UP and GET HELP. This message
will self destruct in 30 seconds...
B++/K90, Inc.