[OpenAFS] Re: Ubik problem

Atro Tossavainen atro.tossavainen+openafs@helsinki.fi
Sat, 17 Apr 2010 00:07:01 +0300 (EEST)


> Andrew talks a bit about "errors that appear after the server's been
> running for a while".  If this is a memory corruption problem, then
> there is a good likelihood of random seg faults, possible core dumps,
> and server restarts.

There are no coredumps.  (Fileserver and volserver have dumped core
previously, and I've got those cores saved away, so I figure that if
there were going to be any more, at least I am not doing anything to
prevent them.)

I only just restarted all servers deliberately after changing the
faulty NetRestrict, but my previous AuthLog on the sunx86_510 extends
from Wed Apr 14 15:15 to Fri Apr 16 23:40, which is when I did the
restart.  I don't think kaserver is restarting spontaneously.

> paths and data matter here.  Just knowing that the software is restarting
> spontaneously (cat /var/log/openafs/BosLog ?) would help a lot.

sunx86_510 # less BosLog
Sun Apr 11 04:00:58 2010: Server directory access is okay
Mon Apr 12 15:09:23 2010: kaserver exited on signal 15
Mon Apr 12 15:11:08 2010: kaserver exited on signal 15
Wed Apr 14 13:07:52 2010: kaserver exited on signal 15
Wed Apr 14 15:14:57 2010: kaserver exited on signal 15
Fri Apr 16 23:44:22 2010: upserverS10x86 exited on signal 15
Fri Apr 16 23:44:22 2010: vlserver exited on signal 15
Fri Apr 16 23:44:22 2010: kaserver exited on signal 15
Fri Apr 16 23:44:22 2010: ptserver exited on signal 15
Fri Apr 16 23:44:22 2010: fs:vol exited on signal 15
Fri Apr 16 23:44:22 2010: upclientetc exited on signal 15
Fri Apr 16 23:45:02 2010: fs:file exited with code 0
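
To make the reading of that log explicit: signal 15 is SIGTERM, which is
what a deliberate stop or restart looks like, whereas a crash would
normally show up as something like signal 11 (SIGSEGV) or 6 (SIGABRT).
A minimal sketch of that check follows. The function name and the
regular expression are my own illustration, not anything shipped with
OpenAFS, and the line format is assumed to be exactly what is shown
above.

```python
import re

# Matches BosLog lines such as
#   "Mon Apr 12 15:09:23 2010: kaserver exited on signal 15"
# Lines like "fs:file exited with code 0" are deliberately not matched.
EXIT_RE = re.compile(
    r"^(?P<when>.+?\d{4}): (?P<proc>\S+) exited on signal (?P<sig>\d+)$"
)

SIGTERM = 15  # an orderly, requested termination


def unexpected_exits(lines):
    """Return (when, proc, sig) for every exit on a signal other than SIGTERM."""
    hits = []
    for line in lines:
        m = EXIT_RE.match(line.strip())
        if m and int(m.group("sig")) != SIGTERM:
            hits.append((m.group("when"), m.group("proc"), int(m.group("sig"))))
    return hits


# The entries from the log excerpt above (abbreviated).
sample = [
    "Sun Apr 11 04:00:58 2010: Server directory access is okay",
    "Mon Apr 12 15:09:23 2010: kaserver exited on signal 15",
    "Wed Apr 14 15:14:57 2010: kaserver exited on signal 15",
    "Fri Apr 16 23:44:22 2010: fs:vol exited on signal 15",
    "Fri Apr 16 23:45:02 2010: fs:file exited with code 0",
]

print(unexpected_exits(sample))
```

Fed the whole file instead of the sample (any iterable of lines will do),
an empty result would mean every signal exit was a SIGTERM.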

> Some other problems that could cause intermittent behavior include:
> 
> /1/ flapping network routes.  We already know there are multiple addresses...

And a static route.

> /2/ DNS.  Unlikely, but ubik likely depends on dns.  if "host `hostname`"
> 	lists more than one ip address, round robin behavior in dns
> 	might result in oddness.

It doesn't.

From DNS, the hostname returns exactly one address.  Even if host name
resolution were somehow involved, which seems unlikely to my untrained
mind, /etc/hosts takes precedence, and since it's Solaris, you *have*
to have a separate name for each IP address you want to configure on
a network interface.  Like this:

# ls /etc/hostname.nge*
hostname.nge0  hostname.nge1  hostname.nge2  

# cat /etc/hostname.nge*
replicon-dev
replicon-rfc1918
replicon

# cat /etc/hosts
# grep replicon /etc/hosts
128.214.209.84  replicon-dev
128.214.58.174  replicon
10.0.0.20       replicon-rfc1918

nge0 is down and unplumbed now that the "development" server is no
more, nge1 is the RFC1918 address, and nge2 is the real McCoy.
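
For anyone who wants to repeat the "more than one IP address" check
without eyeballing host(1) output, here is a rough sketch.  It uses
Python's socket.gethostbyname_ex, which goes through the system
resolver (so /etc/hosts is consulted according to the local name
service configuration) rather than querying DNS directly.  The helper
names and the injectable resolver argument are my own, purely for
illustration.

```python
import socket


def resolved_addresses(name, resolver=socket.gethostbyname_ex):
    """Return the sorted, de-duplicated IPv4 addresses that `name` resolves to."""
    _canonical, _aliases, addresses = resolver(name)
    return sorted(set(addresses))


def is_round_robin(name, resolver=socket.gethostbyname_ex):
    """True if the name resolves to more than one address."""
    return len(resolved_addresses(name, resolver)) > 1


# Stand-in resolver so the example runs without touching the network;
# it mimics the single-address answer described above.
def fake_resolver(name):
    return (name, [], ["128.214.58.174"])


print(resolved_addresses("replicon", fake_resolver))
print(is_round_robin("replicon", fake_resolver))
```

On the real host you would drop the second argument and let the
default resolver do the lookup.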

> 	But since we know the key files aren't consistent,

You "know" that?  That's a mistaken assumption at best.

sun4x_58 # cksum /usr/afs/etc/KeyFile
2143645127      100     /usr/afs/etc/KeyFile

sunx86_510 # cksum /usr/afs/etc/KeyFile
2143645127      100     /usr/afs/etc/KeyFile
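
The comparison above is done by eye; with more than two servers it can
be sketched mechanically.  The following assumes each host's cksum
output has been collected as a single "checksum size path" line, as in
the transcripts above.  The function names are hypothetical.

```python
def parse_cksum(line):
    """Split one line of cksum output into (checksum, size, path)."""
    checksum, size, path = line.split(None, 2)
    return int(checksum), int(size), path.strip()


def keyfiles_match(outputs):
    """outputs maps host name -> cksum output line for that host's KeyFile.

    Returns True when every host reports the same checksum and size.
    """
    signatures = {parse_cksum(line)[:2] for line in outputs.values()}
    return len(signatures) == 1


outputs = {
    "sun4x_58": "2143645127      100     /usr/afs/etc/KeyFile",
    "sunx86_510": "2143645127      100     /usr/afs/etc/KeyFile",
}

print(keyfiles_match(outputs))
```

A matching checksum and size is strong evidence that the files are
identical, though strictly speaking only a byte-for-byte comparison
proves it.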

-- 
Atro Tossavainen (Mr.)               / The Institute of Biotechnology at
Systems Analyst, Techno-Amish &     / the University of Helsinki, Finland,
+358-9-19158939  UNIX Dinosaur     / employs me, but my opinions are my own.
< URL : http : / / www . helsinki . fi / %7E atossava / > NO FILE ATTACHMENTS