[OpenAFS] 1.3.80 server strangeness (kernel 2.6.11-gentoo-r3)

Kevin openafs@gnosys.biz
Fri, 25 Mar 2005 14:10:43 -0500


On Fri, 2005-03-25 at 12:48 -0500, Derrick J Brashear wrote:
> On Fri, 25 Mar 2005, Kevin wrote:
> > But then I noticed that vlserver was not running (I had made the new
> > server a database server as described in the docs).
> 
> Were the instances bos stopped at some point? Did bos status (server) 
> -long indicate it had ever tried to run them, or why they had stopped?
> 

I didn't think to try that at the time, but since writing I've run the
following, with the results shown:

aphrodite bin # ps aux|grep afs
root     12138  0.0  0.0   1572   468 pts/0    S+   12:47   0:00 grep afs
aphrodite bin # bosserver -noauth
aphrodite bin # ps aux|grep afs
root     12142  0.0  0.1   2400  1332 ?        S    12:47   0:00 /usr/afs/bin/upclient zeus.domain.name /usr/afs/etc
root     12143  0.0  0.1   2520  1372 ?        S    12:47   0:00 /usr/afs/bin/upserver -clear /usr/afs/bin
root     12144  0.0  0.3   4992  3916 ?        S    12:47   0:00 /usr/afs/bin/buserver
root     12145  0.0  0.4   5572  4432 ?        S    12:47   0:00 /usr/afs/bin/ptserver
root     12146  0.5  0.5   6920  5692 ?        S    12:47   0:00 /usr/afs/bin/vlserver
root     12148  0.0  0.0   1572   468 pts/0    R+   12:47   0:00 grep afs
aphrodite bin # bos start -server aphrodite -instance fs
bos: a pioctl failed (getting tickets)
bos: running unauthenticated
aphrodite bin # ps aux|grep afs
root     12142  0.0  0.1   2400  1332 ?        S    12:47   0:00 /usr/afs/bin/upclient zeus.domain.name /usr/afs/etc
root     12143  0.0  0.1   2520  1372 ?        S    12:47   0:00 /usr/afs/bin/upserver -clear /usr/afs/bin
root     12144  0.0  0.3   4992  3924 ?        S    12:47   0:00 /usr/afs/bin/buserver
root     12145  0.0  0.4   5736  4612 ?        S    12:47   0:00 /usr/afs/bin/ptserver
root     12146  0.0  0.5   7436  5724 ?        S    12:47   0:00 /usr/afs/bin/vlserver
root     12154  0.2  0.6 155516  6744 ?        S<l  12:47   0:00 /usr/afs/bin/fileserver
root     12158  0.0  0.1 109632  1664 ?        Sl   12:47   0:00 /usr/afs/bin/volserver
root     12188  0.0  0.0   1572   468 pts/0    R+   12:47   0:00 grep afs
aphrodite bin # cat /usr/afs/local/BosConfig
restarttime 11 0 4 0 0
checkbintime 3 0 5 0 0
bnode simple upclientetc 1
parm /usr/afs/bin/upclient  zeus.folkvang.org /usr/afs/etc
end
bnode simple upserver 1
parm /usr/afs/bin/upserver -clear /usr/afs/bin
end
bnode fs fs 1
parm /usr/afs/bin/fileserver
parm /usr/afs/bin/volserver
parm /usr/afs/bin/salvager
end
bnode simple buserver 1
parm /usr/afs/bin/buserver
end
bnode simple ptserver 1
parm /usr/afs/bin/ptserver
end
bnode simple vlserver 1
parm /usr/afs/bin/vlserver
end

My question here: shouldn't starting bosserver automatically start every
process listed in BosConfig?  If so, why didn't fileserver and volserver
start along with bosserver?  Why did I have to issue that second command
to start fs?
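
To make the question concrete, here is a minimal Python sketch that
lists what the BosConfig above asks bosserver to run. It is not part of
OpenAFS; the parse_bosconfig name is made up for illustration, and I am
assuming that the trailing number on each "bnode" line is the
run-state flag (1 meaning the instance should be running).

```python
from typing import List, NamedTuple


class Bnode(NamedTuple):
    kind: str            # bnode type, e.g. "simple" or "fs"
    name: str            # instance name, e.g. "vlserver"
    goal: int            # trailing number on the bnode line (assumed: 1 = run)
    commands: List[str]  # one entry per "parm" line


def parse_bosconfig(text: str) -> List[Bnode]:
    """Collect each bnode ... end stanza; other lines are ignored."""
    bnodes: List[Bnode] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("bnode "):
            _, kind, name, goal = line.split()
            current = Bnode(kind, name, int(goal), [])
        elif line.startswith("parm ") and current is not None:
            current.commands.append(line[len("parm "):].strip())
        elif line == "end" and current is not None:
            bnodes.append(current)
            current = None
    return bnodes


# The BosConfig contents exactly as shown in this message.
BOSCONFIG = """\
restarttime 11 0 4 0 0
checkbintime 3 0 5 0 0
bnode simple upclientetc 1
parm /usr/afs/bin/upclient  zeus.folkvang.org /usr/afs/etc
end
bnode simple upserver 1
parm /usr/afs/bin/upserver -clear /usr/afs/bin
end
bnode fs fs 1
parm /usr/afs/bin/fileserver
parm /usr/afs/bin/volserver
parm /usr/afs/bin/salvager
end
bnode simple buserver 1
parm /usr/afs/bin/buserver
end
bnode simple ptserver 1
parm /usr/afs/bin/ptserver
end
bnode simple vlserver 1
parm /usr/afs/bin/vlserver
end
"""

for bnode in parse_bosconfig(BOSCONFIG):
    print(bnode.name, bnode.kind, bnode.goal, len(bnode.commands))
```

If that reading of the flag is right, the fs instance is marked the same
way as the five simple instances, so nothing in the file itself explains
why it was the only one left out.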

Following the above, I did this:
aphrodite bin # bos status aphrodite -long
bos: a pioctl failed (getting tickets)
bos: running unauthenticated
Instance upclientetc, (type is simple) currently running normally.
    Process last started at Fri Mar 25 12:47:23 2005 (1 proc starts)
    Command 1 is '/usr/afs/bin/upclient  zeus.folkvang.org /usr/afs/etc'

Instance upserver, (type is simple) currently running normally.
    Process last started at Fri Mar 25 12:47:23 2005 (1 proc starts)
    Command 1 is '/usr/afs/bin/upserver -clear /usr/afs/bin'

Instance fs, (type is fs) currently running normally.
    Auxiliary status is: file server running.
    Process last started at Fri Mar 25 12:47:48 2005 (3 proc starts)
    Last exit at Fri Mar 25 12:47:48 2005
    Command 1 is '/usr/afs/bin/fileserver'
    Command 2 is '/usr/afs/bin/volserver'
    Command 3 is '/usr/afs/bin/salvager'

Instance buserver, (type is simple) currently running normally.
    Process last started at Fri Mar 25 12:47:23 2005 (1 proc starts)
    Command 1 is '/usr/afs/bin/buserver'

Instance ptserver, (type is simple) currently running normally.
    Process last started at Fri Mar 25 12:47:23 2005 (1 proc starts)
    Command 1 is '/usr/afs/bin/ptserver'

Instance vlserver, (type is simple) currently running normally.
    Process last started at Fri Mar 25 12:47:23 2005 (1 proc starts)
    Command 1 is '/usr/afs/bin/vlserver'
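
The start times are the interesting part of that output. A rough Python
sketch (the function name and regular expressions are mine, written
against the output format shown above, not taken from any AFS tool)
pulls out the "Process last started" line per instance and compares
them:

```python
import re
from datetime import datetime
from typing import Dict, Tuple

INSTANCE_RE = re.compile(r"^Instance ([^,\s]+), \(type is (\w+)\)")
STARTED_RE = re.compile(r"^\s+Process last started at (.+) \((\d+) proc starts\)$")


def last_starts(status_text: str) -> Dict[str, Tuple[datetime, int]]:
    """Map instance name -> (last start time, number of proc starts)."""
    result: Dict[str, Tuple[datetime, int]] = {}
    current = None
    for line in status_text.splitlines():
        m = INSTANCE_RE.match(line)
        if m:
            current = m.group(1)
            continue
        m = STARTED_RE.match(line)
        if m and current is not None:
            when = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
            result[current] = (when, int(m.group(2)))
    return result


# Two of the stanzas from the output above, copied verbatim.
STATUS = """\
Instance fs, (type is fs) currently running normally.
    Auxiliary status is: file server running.
    Process last started at Fri Mar 25 12:47:48 2005 (3 proc starts)
    Last exit at Fri Mar 25 12:47:48 2005
    Command 1 is '/usr/afs/bin/fileserver'

Instance vlserver, (type is simple) currently running normally.
    Process last started at Fri Mar 25 12:47:23 2005 (1 proc starts)
    Command 1 is '/usr/afs/bin/vlserver'
"""

starts = last_starts(STATUS)
gap = starts["fs"][0] - starts["vlserver"][0]
print("fs started", int(gap.total_seconds()), "seconds after vlserver")
```

The simple instances all started at 12:47:23, when I launched
bosserver, while fs shows 12:47:48, which is when I ran "bos start" by
hand.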

Then it occurred to me that part or all of the problem with the strange
copy operations might have been due to the 1.3.77 client code on the
workstation I did the copy from, rather than the server code on the new
1.3.80 server.

So I tried the copy from one of the other two 1.2.11 client/server
machines, and there were no apparent complaints during the copy.

So then, thinking that perhaps all of the problems were due to the
1.3.77 client code, I did what I've been trying to do for a long time
now, which is this (from Quick Beginnings, under adding a new server):

===========================================================
11.
Issue the bos restart command on every database server machine in
the cell, including the new machine. The command restarts the
Authentication, Backup, Protection, and VL Servers, which forces
an election of a new Ubik coordinator for each process. The new
machine votes in the election and is considered as a potential
new coordinator. 

A cell-wide service outage is possible during the election of a new
coordinator for the VL Server, but it normally lasts less than five
minutes. Such an outage is particularly likely if you are installing
your cell's second database server machine. Messages tracing the
progress of the election appear on the console.

Repeat this command on each of your cell's database server machines in
quick succession. Begin with the machine with the lowest IP address. 

      %  bos restart <machine name> kaserver buserver ptserver vlserver
===========================================================

I restarted only buserver, ptserver, and vlserver, since I'm running
Kerberos rather than kaserver.
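
For the "lowest IP address first" ordering, a small Python sketch like
the one below can generate the command list. The addresses are
placeholders from the documentation range, not my real ones, and "hera"
is a made-up name for the third machine; only the command syntax comes
from the Quick Beginnings excerpt above. The point is to sort
numerically, because sorting the addresses as plain strings puts
192.0.2.100 before 192.0.2.9.

```python
import ipaddress
from typing import List, Sequence, Tuple


def restart_commands(servers: Sequence[Tuple[str, str]],
                     instances: Sequence[str]) -> List[str]:
    """Return one 'bos restart' line per server, lowest IP address first."""
    ordered = sorted(servers, key=lambda item: ipaddress.ip_address(item[1]))
    return ["bos restart {} {}".format(name, " ".join(instances))
            for name, _addr in ordered]


# Hypothetical name/address pairs, for illustration only.
DB_SERVERS = [
    ("aphrodite", "192.0.2.100"),
    ("zeus", "192.0.2.10"),
    ("hera", "192.0.2.9"),
]

for command in restart_commands(DB_SERVERS, ["buserver", "ptserver", "vlserver"]):
    print(command)
```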

When I did them in the order described, the 1.3.80 server came last.
The first two servers (1.2.11) went off without a hitch, but the last
server said:

aphrodite bin # bos restart -server aphrodite -instance buserver ptserver vlserver
bos: a pioctl failed (getting tickets)
bos: running unauthenticated
*** glibc detected *** free(): invalid pointer: 0xb7c9c010 ***
bos: failed to restart instance buserver (communications failure (-1))
bos: failed to restart instance ptserver (communications failure (-1))
bos: failed to restart instance vlserver (communications failure (-1))
aphrodite bin # bos status aphrodite -long
bos: a pioctl failed (getting tickets)
bos: running unauthenticated
bos: failed to contact host's bosserver (communications failure (-1)).
aphrodite bin # ps aux|grep afs
root     12142  0.0  0.1   2400  1332 ?        S    12:47   0:00 /usr/afs/bin/upclient zeus.folkvang.org /usr/afs/etc
root     12143  0.0  0.1   2520  1372 ?        S    12:47   0:00 /usr/afs/bin/upserver -clear /usr/afs/bin
root     12145  0.0  0.4   5736  4620 ?        S    12:47   0:00 /usr/afs/bin/ptserver
root     12146  0.0  0.5   7572  6036 ?        S    12:47   0:00 /usr/afs/bin/vlserver
root     12154  0.5  0.7 155748  7364 ?        S<l  12:47   0:16 /usr/afs/bin/fileserver
root     12158  0.0  0.1 109760  1860 ?        Sl   12:47   0:00 /usr/afs/bin/volserver
aphrodite bin # ps aux|grep boss
root     12303  0.0  0.0   1572   468 pts/0    R+   14:02   0:00 grep boss

So it looks like this restart command killed the bosserver and the
buserver, which I suppose explains the communications failure with
buserver.  But why the failure with the other two, when ptserver and
vlserver are still running under their original PIDs?

None of this involved the 1.3.77 client code.

> > Fri Mar 25 12:00:22 2005 VShutdown:  shutting down on-line volumes...
> > Fri Mar 25 12:00:22 2005 VShutdown:  complete.
> > Fri Mar 25 12:00:22 2005 File server has terminated normally at Fri Mar
> > 25 12:00:22 2005
> 
> This is a clean shutdown. I assume something shut down your file server 
> deliberately.
> 

Yeah.  I did that manually, but wasn't sure if the messages were normal
or not.

> > 1) server instances won't stay running (as described above);
> 
> That's unclear.

Hopefully the above makes it a bit clearer.  Let me know if not and I
can provide other details.

> 
> > 2) after kinit'ing as the afs admin, aklog fails with:
> > aklog: unable to obtain tokens for cell folkvang.org (status: 11862788).
> 
> 11862788 (ktc).4 = a pioctl failed
> if no client is running, that's expected.
> 

Ok.
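
For the archives, this is how I understand the "(ktc).4" reading of
11862788. The sketch below assumes the usual com_err packing (low 8 bits
are the offset within the error table, the remaining bits are the table
name packed 6 bits per character); the function name is mine and this
is not AFS code.

```python
from typing import Tuple

# Character set used by com_err to pack table names; index 0 is unused.
CHARSET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_"
ERRCODE_RANGE = 8   # bits reserved for the per-table offset
BITS_PER_CHAR = 6   # bits per packed character of the table name


def split_error_code(code: int) -> Tuple[str, int]:
    """Split a com_err style code into (table name, offset in table)."""
    offset = code & ((1 << ERRCODE_RANGE) - 1)
    packed = code >> ERRCODE_RANGE
    name = ""
    for i in range(4, -1, -1):
        ch = (packed >> (BITS_PER_CHAR * i)) & 0x3F
        if ch:
            name += CHARSET[ch - 1]
    return name, offset


print(split_error_code(11862788))
```

Under that assumption the status from aklog splits into table KTC and
offset 4, which agrees with the reading quoted above.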

> > And because of 2) I've been starting bosserver locally on the new
> > server, but when trying to start or restart other server instances on
> > the new server, I issue the commands from the sysctl machine with the
> > -server argument listing the new server.
> 
> you can of course use -localauth on a server machine without needing to 
> get tokens

I'll try that next.

Thanks for your reply, Derrick.