[OpenAFS] problems with openafs 1.2.5 on Linux 2.4.18 (debian woody)

Dr A V Le Blanc <LeBlanc@mcc.ac.uk>
Fri, 12 Jul 2002 10:48:36 +0100


On Wed, 10 Jul 2002 13:47:05, Derrick J Brashear <shadow@dementia.org> wrote:
> As to removing unused ones:
> 
> vos changeaddr -remove

This doesn't allow you to remove machines which are in the VLDB but
no longer DNS registered.  Should one have to register old machines
temporarily to delete them?  I tried putting one in /etc/hosts,
but it's not enough.

On 10 Jul 2002 12:12:15, seph <seph@commerceflow.com> wrote:
>> Incidentally, when bosserver starts, it
>> begins with the message:
>> 
>>      Wed Jul 10 11:02:50 2002: Server directory access is not okay
> 
> I was getting that error, then I found
> http://www.transarc.ibm.com/Support/afs/news/boswarn.html setting the
> permissions as described there cleared up the warning.

This is particularly hard to deal with on a Debian system, since
the directories are not the same as the Transarc ones.  But I'll
try to work through everything.  Of course, it would be much easier
if there were a way to ask bos to name a problem directory or
file with the wrong permissions.
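In the absence of such an option, a rough check can be scripted by hand.
The following is only a sketch: it does not reproduce whatever test
bosserver itself applies (those rules are not spelled out in this thread).
It merely flags group- or world-writable entries among the paths it is
given. The function name and the choice of which bits to flag are
assumptions made for illustration; the two candidate paths are the server
configuration directories mentioned in this message.

```python
import os
import stat


def report_loose_permissions(paths):
    """Return (path, mode string, owner uid) for each existing path
    that is writable by group or others. Hypothetical helper, not part
    of OpenAFS."""
    findings = []
    for path in paths:
        try:
            info = os.stat(path)
        except OSError:
            continue  # paths that do not exist are silently skipped
        if info.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
            findings.append((path, stat.filemode(info.st_mode), info.st_uid))
    return findings


if __name__ == "__main__":
    # Debian and Transarc-style server config directories (assumed list;
    # extend it with whatever directories your installation uses).
    candidates = ["/etc/openafs/server", "/usr/afs/etc"]
    for path, mode, uid in report_loose_permissions(candidates):
        print(f"{path}: mode {mode}, owner uid {uid}")
```

Anything it prints is a candidate to compare against the permissions
described in the boswarn page quoted above.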

With respect to my original problem, I did in the end manage to
get the new dbserver running, but only at the cost of two days
of disrupted service.  The original sync site, which has the
lowest IP address (Oh, these stupid inflexibilities!), is a machine
which has long been having problems with processes dying.  For
example, it has a tendency for the bosserver process to die,
which can be fixed only by manually killing the db and file server
processes, then restarting bosserver.  Sometimes the file server
processes take 30 minutes or more to die.  You can kill them
with -9, but then the whole server is inaccessible for 30 or
40 minutes after restarting, while the salvager runs.  And often
during this process the bosserver dies again.

Anyway, for some reason, every time I tried to add the new db server
machine, the bosserver and most db server processes on the old
sync machine died (the logs say 'stack overflow').  In the end,
I changed both the CellServDB files (the one in /usr/vice/etc and
the one in /usr/afs/etc, or, on the new Debian machine, the one
in /etc/openafs and the one in /etc/openafs/server) and rebooted
all the servers one by one.  One of the problems with the old
configuration is that when the sync machine failed, failover of
replicated volume reads to the other servers didn't always work,
and this meant that sometimes all three servers went down.
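Since both copies of CellServDB have to be edited by hand, it is easy for
them to drift apart.  As a hedged illustration only, the sketch below
parses two CellServDB-format texts (a '>cellname' line followed by one
server address per line, with '#' starting a comment) and lists the cells
whose server sets differ.  The function names are hypothetical, and the
cell name and addresses in the usage lines are placeholders, not the real
ones from our cell.

```python
def parse_cellservdb(text):
    """Map each cell name to the set of server addresses listed under it."""
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                continue  # a bare '>' carries no cell name
            current = parts[0]
            cells[current] = set()
        elif current is not None:
            cells[current].add(line.split()[0])
    return cells


def differing_cells(client_text, server_text):
    """Return the cells in the server copy whose address set does not
    match the client copy (or which the client copy lacks)."""
    client = parse_cellservdb(client_text)
    server = parse_cellservdb(server_text)
    return sorted(c for c in server if client.get(c) != server[c])


if __name__ == "__main__":
    client = ">example.org #Example cell\n192.0.2.1 #db1\n192.0.2.2 #db2\n"
    server = ">example.org #Example cell\n192.0.2.1 #db1\n192.0.2.3 #db3\n"
    print(differing_cells(client, server))
```

In practice the two texts would be read from the client and server files
named above on each machine before restarting anything.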

Anyway, now we're back up with the problem machine running only
as a file server, and I have no further worries except the
copy-on-write bug.  Thanks to all who suggested help and solutions.

     -- Owen
     LeBlanc@mcc.ac.uk