[OpenAFS-devel] fileserver crash on Solaris 2.6 with 1.2.7

Martin MOKREJŠ mmokrejs@natur.cuni.cz
Sun, 8 Dec 2002 19:01:57 +0100 (CET)


Hi,
  our main AFS database server went down, and because only one database
server remained alive, weird things started to happen. Being unable to find
the real cause, I have upgraded from the old IBM binaries to the OpenAFS
1.2.7 binaries for Solaris 2.6.

  Here's some meat for you:

$ gdb /usr/afs/bin/fileserver ./corevol.fs
GNU gdb 4.17
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "sparc-sun-solaris2.6"...
(no debugging symbols found)...
Core was generated by `/usr/afs/bin/fileserver'.
Program terminated with signal 9, Killed.
Reading symbols from /usr/lib/libpthread.so.1...(no debugging symbols found)...
done.
Reading symbols from /usr/lib/libsocket.so.1...(no debugging symbols found)...
done.
Reading symbols from /usr/lib/libresolv.so.2...(no debugging symbols found)...
done.
Reading symbols from /usr/lib/libnsl.so.1...(no debugging symbols found)...
done.
Reading symbols from /usr/lib/libintl.so.1...
warning: Lowest section in /usr/lib/libintl.so.1 is .dynamic at 0x74
(no debugging symbols found)...
done.
Reading symbols from /usr/lib/libdl.so.1...(no debugging symbols found)...done.
Reading symbols from /usr/lib/libc.so.1...(no debugging symbols found)...done.
Reading symbols from /usr/lib/libmp.so.2...(no debugging symbols found)...done.
Reading symbols from /usr/lib/libthread.so.1...(no debugging symbols found)...
done.
Reading symbols from /usr/lib/nss_files.so.1...(no debugging symbols found)...
done.
#0  0xef554c9c in __sigprocmask ()
(gdb) where
#0  0xef554c9c in __sigprocmask ()
#1  0xef54b684 in __bounceself ()
#2  0xef5468bc in cond_wait ()
#3  0xef5467c8 in _ti_pthread_cond_wait ()
#4  0x86034 in rxi_ReadProc ()
#5  0x75818 in rx_EndCall ()
#6  0x454cc in VL_RegisterAddrs ()
#7  0x6716c in ubik_Call ()
#8  0x29640 in Do_VLRegisterRPC ()
#9  0x29930 in InitVL ()
#10 0x27014 in main ()
(gdb)


I think this might be related (FileLog):
Sun Dec  8 17:47:45 2002 File server starting
Sun Dec  8 17:47:45 2002 afs_krb_get_lrealm failed, using natur.cuni.cz.
Sun Dec  8 17:51:15 2002 VL_RegisterAddrs rpc failed; will retry periodically (code=-1, err=2)
Sun Dec  8 17:51:16 2002 Partition /vicepa: attached 1 volumes; 0 volumes not attached
Sun Dec  8 17:51:16 2002 Getting FileServer name...
Sun Dec  8 17:51:16 2002 FileServer host name is 'var400'
Sun Dec  8 17:51:16 2002 Getting FileServer address...
Sun Dec  8 17:51:16 2002 FileServer var400 has address 195.113.59.121 (0xc3713b79 or 0xc3713b79 in host byte order)
Sun Dec  8 17:51:16 2002 File Server started Sun Dec  8 17:51:16 2002
Sun Dec  8 17:57:57 2002 VL_RegisterAddrs rpc failed; will retry periodically (code=-1, err=0)
Sun Dec  8 18:04:42 2002 VL_RegisterAddrs rpc failed; will retry periodically (code=-1, err=0)
Sun Dec  8 18:12:38 2002 VL_RegisterAddrs rpc failed; will retry periodically (code=-1, err=0)
Sun Dec  8 18:19:58 2002 VL_RegisterAddrs rpc failed; will retry periodically (code=-1, err=0)
Sun Dec  8 18:26:39 2002 VL_RegisterAddrs rpc failed; will retry periodically (code=-1, err=0)
Sun Dec  8 18:30:24 2002 Shutting down file server at Sun Dec  8 18:30:24 2002
Sun Dec  8 18:30:24 2002 Vice was last started at Sun Dec  8 17:51:16 2002
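
As a sanity check on the "FileServer var400 has address" line in the FileLog
above, here is a minimal Python sketch that converts the dotted address to
hex. The helper name ipv4_to_hex is made up for illustration. SPARC is
big-endian, so network and host byte order coincide there, which would
explain why the log prints the same value twice; the little-endian value is
shown only for contrast.

```python
import socket
import struct

def ipv4_to_hex(addr):
    """Return (network-order hex, little-endian hex) for a dotted IPv4 address."""
    packed = socket.inet_aton(addr)          # 4 bytes in network (big-endian) order
    (net,) = struct.unpack("!I", packed)     # read the bytes as big-endian
    (little,) = struct.unpack("<I", packed)  # read the same bytes as little-endian
    return hex(net), hex(little)

# Address taken from the FileLog line above.
net_hex, le_hex = ipv4_to_hex("195.113.59.121")
print(net_hex, le_hex)
```

So the address in the log at least looks consistent with the host's real IP.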

$ bos status -server var400 -long
Instance fs, (type is fs) currently running normally.
    Auxiliary status is: file server running.
    Process last started at Sun Dec  8 18:33:51 2002 (22 proc starts)
    Last exit at Sun Dec  8 18:33:51 2002
    Last error exit at Sun Dec  8 18:33:51 2002, by vol, by exiting with code 1
    Command 1 is '/usr/afs/bin/fileserver'
    Command 2 is '/usr/afs/bin/volserver'
    Command 3 is '/usr/afs/bin/salvager'

Instance ptserver, (type is simple) temporarily disabled, stopped for too many errors, currently shutdown.
    Process last started at Sun Dec  8 18:36:13 2002 (138 proc starts)
    Last exit at Sun Dec  8 18:36:14 2002
    Last error exit at Sun Dec  8 18:36:14 2002, by exiting with code 2
    Command 1 is '/usr/afs/bin/ptserver'

Instance vlserver, (type is simple) temporarily disabled, stopped for too many errors, currently shutdown.
    Process last started at Sun Dec  8 18:36:20 2002 (231 proc starts)
    Last exit at Sun Dec  8 18:36:20 2002
    Last error exit at Sun Dec  8 18:36:20 2002, by exiting with code 2
    Command 1 is '/usr/afs/bin/ptserver'

$ more PtLog
ptserver: problems with host name Ubik init failed
primary address
Sun Dec  8 18:36:20 2002 Inconsistent Cell Info on server: Sun Dec  8 18:36:20 2002 195.113.59.251 Sun Dec  8 18:36:20 2002
$


  The main AFS database machine (the one with the lowest IP) is back up
again. In the meantime I had removed it using "bos removehost" on the
remaining 2 machines, and we can access /afs again. However, the ptserver
and vlserver do not run, as seen above.

  I also see in BosLog:

Sun Dec  8 17:46:25 2002: ptserver exited with code 2
Sun Dec  8 17:46:39 2002: ptserver exited with code 2
Sun Dec  8 17:46:53 2002: ptserver exited with code 2
Sun Dec  8 17:47:06 2002: ptserver exited on signal 15
Sun Dec  8 17:47:13 2002: vlserver exited on signal 15
Sun Dec  8 17:47:17 2002: fs:vol exited on signal 15
Sun Dec  8 17:47:18 2002: fs:file exited on signal 3 (core dumped)
Sun Dec  8 17:47:44 2002: fs:salv exited with code 0
Sun Dec  8 17:51:11 2002: fs:vol exited with code 1
Sun Dec  8 18:30:24 2002: fs:vol exited on signal 15
Sun Dec  8 18:30:24 2002: fs:file exited with code 0
[...]
Sun Dec  8 18:59:42 2002: ptserver exited with code 2
Sun Dec  8 18:59:42 2002: ptserver exited with code 2
Sun Dec  8 18:59:42 2002: ptserver exited with code 2
Sun Dec  8 18:59:42 2002: BNODE 'ptserver' repeatedly failed to start, perhaps missing executable.
Sun Dec  8 18:59:43 2002: ptserver exited with code 2
Sun Dec  8 18:59:43 2002: BNODE 'ptserver' repeatedly failed to start, perhaps missing executable.
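
To make the restart pattern in the BosLog excerpt easier to see, here is a
small Python sketch that tallies exits per process. The sample lines are
copied from the first part of the excerpt above; the names BOSLOG_SAMPLE,
EXIT_RE and tally_exits are made up for illustration, and the regular
expression assumes only the two line shapes visible here ("exited with code
N" and "exited on signal N").

```python
import re
from collections import Counter

# Lines copied verbatim from the BosLog excerpt above.
BOSLOG_SAMPLE = """\
Sun Dec  8 17:46:25 2002: ptserver exited with code 2
Sun Dec  8 17:46:39 2002: ptserver exited with code 2
Sun Dec  8 17:46:53 2002: ptserver exited with code 2
Sun Dec  8 17:47:06 2002: ptserver exited on signal 15
Sun Dec  8 17:47:13 2002: vlserver exited on signal 15
Sun Dec  8 17:47:17 2002: fs:vol exited on signal 15
Sun Dec  8 17:47:18 2002: fs:file exited on signal 3 (core dumped)
Sun Dec  8 17:47:44 2002: fs:salv exited with code 0
Sun Dec  8 17:51:11 2002: fs:vol exited with code 1
Sun Dec  8 18:30:24 2002: fs:vol exited on signal 15
Sun Dec  8 18:30:24 2002: fs:file exited with code 0
"""

# Matches "<proc> exited with code N" or "<proc> exited on signal N".
EXIT_RE = re.compile(r": (\S+) exited (?:with (code)|on (signal)) (\d+)")

def tally_exits(text):
    """Count (process, 'code' or 'signal', number) occurrences in BosLog text."""
    counts = Counter()
    for line in text.splitlines():
        m = EXIT_RE.search(line)
        if m:
            kind = m.group(2) or m.group(3)
            counts[(m.group(1), kind, int(m.group(4)))] += 1
    return counts

counts = tally_exits(BOSLOG_SAMPLE)
for (proc, kind, num), n in sorted(counts.items()):
    print(f"{proc:10s} {kind:6s} {num:3d}  x{n}")
```

Run against the full BosLog, this should show whether ptserver ever exits
with anything other than code 2.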


Any ideas what's wrong?
-- 
Martin Mokrejs <mmokrejs@natur.cuni.cz>, <m.mokrejs@gsf.de>
PGP5.0i key is at http://www.natur.cuni.cz/~mmokrejs
MIPS / Institute for Bioinformatics <http://mips.gsf.de>
GSF - National Research Center for Environment and Health
Ingolstaedter Landstrasse 1, D-85764 Neuherberg, Germany
tel.: +49-89-3187 3683 , fax: +49-89-3187 3585