[OpenAFS] Port probing our AFS cell...

Rodney M Dyer rmdyer@uncc.edu
Wed, 07 Sep 2005 14:35:00 -0400


At 01:13 PM 9/7/2005, you wrote:
>* Rodney M Dyer [2005-09-06 13:33:42 -0400]:
> > We've also seen our BOS process die leaving strange last-write-times on
> > some of the logs.  Notice the date on the core, BosLog, and VolserLog 
> files:
> >
> >      # ls -l
> >      total 9012
> >      -rw-r--r--   1 root     root         124 Dec 30  1967 BosLog
>
>What hardware is this?

Sun Ultra 5

>I once saw similar behaviour from a SPARC with a dying TOD (i.e., clock)
>chip.

Could be.


>To confirm or refute this wild guess, you could take a look at the
>current clock and at the NTP daemon's logs. Or simply look for jumps
>in the timestamps of successive entries in your syslogd output.

The logs show no abnormalities with NTP times being off.


>If it isn't the clock acting funny, then one has to wonder about the
>filesystem these oddly-timestamped files reside on.
>
> >      -rw-------   1 root     root     1817440 Dec 30  1967 core
> >      -rw-------   1 root     other     747172 Aug 31 06:57 corefile.fs
> >      -rw-------   1 root     root     1882976 Dec 30  1967 corevol.fs
>
>"other" does indeed smell of Solaris.

You guessed correctly.


>If these are genuine corefiles, the timestamp should be set by the
>kernel as it writes the post-mortem dump; unless something else changes
>the timestamp after the fact. I don't see how the crashed process itself
>could be responsible for it. That reinforces my suspicion that the
>kernel's notion of time has been erratic.

That is what I thought.  However I'm a bit confused about how time is 
measured on Solaris now (I'm a PC guy in our shop).  I thought 'nix time 
was stored as a 32 bit unsigned value that increased as the number of 
seconds since 1970.  If that is so, how can the kernel ever report earlier 
than 1970???  Is the 32 bit time value signed?

For what it's worth, I now know a couple of extra things...

The core files were generated by the VOL server, not the BOS server, but 
both processes died.  Only a week before this incident the FS process 
died.  This server, as well as others, are showing the anomalous call back 
logging, but it was this particular server that appears to have also 
suffered from a hardware failure (CPU).  When the server was shutdown and 
restarted it didn't come back to life.  The screen was just blank.  We have 
since switched out the hardware, but are using the original SCSI disk pack 
from the original.  So you are right it may have been possible that the 
hardware was erroring.  The disk appears to have recovered from salvage, 
and the UFS fsck seemed to be ok.  I still have no idea of how Solaris 
reports any time less than 1970.

Just for the curious (Derrick), the backtrace (which tells me nothing) on 
the core file generated by the dying vol server process is...

% gdb volserver core
GNU gdb 5.3
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you
are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for
details.
This GDB was configured as "sparc-sun-solaris2.8"...
(no debugging symbols found)...
Core was generated by `/usr/afs/bin/volserver'.
Program terminated with signal 6, Aborted.
Reading symbols from /usr/lib/libresolv.so.2...(no debugging symbols
found)...
done.
Loaded symbols for /usr/lib/libresolv.so.2
Reading symbols from /usr/lib/libsocket.so.1...(no debugging symbols
found)...
done.
Loaded symbols for /usr/lib/libsocket.so.1
Reading symbols from /usr/lib/libnsl.so.1...(no debugging symbols
found)...
done.
Loaded symbols for /usr/lib/libnsl.so.1
Reading symbols from /usr/lib/libintl.so.1...
warning: Lowest section in /usr/lib/libintl.so.1 is .hash at 00000074
(no debugging symbols found)...
done.
Loaded symbols for /usr/lib/libintl.so.1
Reading symbols from /usr/lib/libdl.so.1...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libdl.so.1
Reading symbols from /usr/lib/libc.so.1...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libc.so.1
Reading symbols from /usr/lib/libmp.so.2...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libmp.so.2
Reading symbols from /usr/platform/SUNW,Ultra-5_10/lib/libc_psr.so.1...
(no debugging symbols found)...done.
Loaded symbols for /usr/platform/SUNW,Ultra-5_10/lib/libc_psr.so.1
#0  0xff19ffb0 in _libc_kill () from /usr/lib/libc.so.1
(gdb) backtrace
#0  0xff19ffb0 in _libc_kill () from /usr/lib/libc.so.1
#1  0xff136ce8 in abort () from /usr/lib/libc.so.1
#2  0x00066f5c in IOMGR_AllocFDSet ()
#3  0x00064e74 in LWP_StackUsed ()
#4  0x00065d28 in savecontext ()
#5  0x00067780 in IOMGR_Initialize ()
#6  0x00053c88 in rx_StartServer ()
#7  0x00026bf0 in main ()


Rodney