[OpenAFS] Re: Ubik problem
Atro Tossavainen
atro.tossavainen+openafs@helsinki.fi
Sat, 17 Apr 2010 00:07:01 +0300 (EEST)
> Andrew talks a bit about "errors that appear after the server's been
> running for a while". If this is a memory corruption problem, then
> there is a good likelihood of random seg faults, possible core dumps,
> and server restarts.
There are no coredumps. (Fileserver and volserver have dumped core
previously, and I've got them saved away, so I figure if there were
going to be any, at least I am not doing anything to stop it.)
I only just restarted all servers deliberately after changing the
faulty NetRestrict, but my previous AuthLog on the sunx86_510 extends
from Wed Apr 14 15:15 to Fri Apr 16 23:40, which is when I restarted them.
I don't think kaserver is restarting spontaneously.
> paths and data matter here. Just knowing that the software is restarting
> spontaneously (cat /var/log/openafs/BosLog ?) would help a lot.
sunx86_510 # less BosLog
Sun Apr 11 04:00:58 2010: Server directory access is okay
Mon Apr 12 15:09:23 2010: kaserver exited on signal 15
Mon Apr 12 15:11:08 2010: kaserver exited on signal 15
Wed Apr 14 13:07:52 2010: kaserver exited on signal 15
Wed Apr 14 15:14:57 2010: kaserver exited on signal 15
Fri Apr 16 23:44:22 2010: upserverS10x86 exited on signal 15
Fri Apr 16 23:44:22 2010: vlserver exited on signal 15
Fri Apr 16 23:44:22 2010: kaserver exited on signal 15
Fri Apr 16 23:44:22 2010: ptserver exited on signal 15
Fri Apr 16 23:44:22 2010: fs:vol exited on signal 15
Fri Apr 16 23:44:22 2010: upclientetc exited on signal 15
Fri Apr 16 23:45:02 2010: fs:file exited with code 0
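
Signal 15 is SIGTERM, which is consistent with a requested stop; a crash would more typically show up as SIGSEGV or SIGABRT. A minimal Python sketch for sorting BosLog entries along those lines, assuming the line format shown above. The function name and the set of "crash" signals are illustrative choices, not anything OpenAFS provides:

```python
import re
from collections import Counter

# Matches lines such as:
#   Mon Apr 12 15:09:23 2010: kaserver exited on signal 15
EXIT_RE = re.compile(
    r"^(?P<when>.+ \d{4}): (?P<proc>\S+) exited on signal (?P<sig>\d+)$"
)

# Signals that usually point to a crash rather than a requested stop.
# Assumption for illustration: Solaris numbering (SIGABRT=6, SIGBUS=10, SIGSEGV=11).
CRASH_SIGNALS = {6, 10, 11}


def summarize_exits(lines):
    """Count (process, signal) pairs and collect entries that look like crashes."""
    counts = Counter()
    crashes = []
    for line in lines:
        m = EXIT_RE.match(line.strip())
        if not m:
            continue  # e.g. "exited with code 0" or informational lines
        sig = int(m.group("sig"))
        counts[(m.group("proc"), sig)] += 1
        if sig in CRASH_SIGNALS:
            crashes.append((m.group("when"), m.group("proc"), sig))
    return counts, crashes
```

Fed the BosLog excerpt above, the crash list comes back empty, since every signal exit in it is signal 15.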
> Some other problems that could cause intermittent behavior include:
>
> /1/ flapping network routes. We already know there are multiple addresses...
And a static route.
> /2/ DNS. Unlikely, but ubik likely depends on dns. if "host `hostname`"
> lists more than one ip address, round robin behavior in dns
> might result in oddness.
It doesn't.
From DNS, the hostname returns exactly one address. Even if host name
resolution was somehow involved, which seems unlikely to my untrained
mind, /etc/hosts takes preference, and since it's Solaris, you *have*
to have a separate name for each IP address you want to configure on
a network interface. Like this:
# ls /etc/hostname.nge*
hostname.nge0 hostname.nge1 hostname.nge2
# cat /etc/hostname.nge*
replicon-dev
replicon-rfc1918
replicon
# cat /etc/hosts
# grep replicon /etc/hosts
128.214.209.84 replicon-dev
128.214.58.174 replicon
10.0.0.20 replicon-rfc1918
nge0 is down and unplumbed now that the "development" server is no
more, nge1 is the RFC1918 address, and nge2 is the real McCoy.
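
The "exactly one address" check can also be scripted. A minimal Python sketch, assuming the system resolver is what matters here (it follows the host's configured lookup order, so /etc/hosts is consulted where that is set up). The function name and the injectable `resolver` parameter are illustrative:

```python
import socket


def distinct_addresses(hostname, resolver=socket.gethostbyname_ex):
    """Return the sorted, de-duplicated IPv4 addresses the resolver reports."""
    # gethostbyname_ex returns (canonical_name, alias_list, ip_address_list).
    _name, _aliases, addresses = resolver(hostname)
    return sorted(set(addresses))


# Example use on the server itself:
#   addrs = distinct_addresses(socket.gethostname())
#   if len(addrs) != 1: round-robin or multi-homed answers may be in play
```

More than one entry in the result would be the situation the quoted question is worried about.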
> But since we know the key files aren't consistent,
You "know" that? That's a misassumption at best.
sun4x_58 # cksum /usr/afs/etc/KeyFile
2143645127 100 /usr/afs/etc/KeyFile
sunx86_510 # cksum /usr/afs/etc/KeyFile
2143645127 100 /usr/afs/etc/KeyFile
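
The same comparison can be done with a cryptographic digest rather than a CRC. A minimal Python sketch, run separately on each server and compared by eye; it uses SHA-256, so its output is not comparable with the cksum values above, and the function name is illustrative:

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at path, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Example use on each host:
#   print(file_digest("/usr/afs/etc/KeyFile"))
```

Identical digests on both hosts mean the key files are byte-for-byte the same.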
--
Atro Tossavainen (Mr.) / The Institute of Biotechnology at
Systems Analyst, Techno-Amish & / the University of Helsinki, Finland,
+358-9-19158939 UNIX Dinosaur / employs me, but my opinions are my own.
<URL:http://www.helsinki.fi/%7Eatossava/> NO FILE ATTACHMENTS