[OpenAFS] 1.3.80 server strangeness (kernel 2.6.11-gentoo-r3)

Fri, 25 Mar 2005 12:40:12 -0500

Well, given that my problems with oops seemed to happen when running
1.3.79 and 1.3.80 as both server and client, I tried running 1.3.80 as a
server only.  The initial indications were that it was working OK, but
it now seems to be acting up again (although I have had no oops today,
which I guess is a good thing).

When running 1.3.79 as both client and server (maybe a week ago), I had
stepped my way through the docs for adding a new server and I then
created a volume for storing the afs binaries and mounted it
at /afs/cellname/sysname (sysname=i386_linux26).  Still back in 1.3.79
(and running both server and client), I had tried copying the afs
binaries for the new sysname into this mounted volume.  This took a
great deal of time (>10 min) and didn't seem to finish normally.  After
that, I also copied several large files (totaling something like 500MB
or 750MB) to another volume (created in the same way as the sysname
volume), and this really acted up, causing several oops to get logged
in /var/log/messages.

After many tries with 1.3.79 and various kernels (2.6.10-r6 and
2.6.11-r3) and various kernel config settings (specifically, I turned
off PREEMPT and SMP as someone here recommended), I finally gave up on
1.3.79.

After upgrading to 1.3.80 and starting bosserver (which I presumed
started all of the necessary server instances that I had created in
1.3.79 and which still showed up in /usr/afs/local/BosConfig), I deleted
all of the binaries that I had originally copied into the sysname volume
and copied them over again.  Afterwards, I released the volume
successfully.  After that, I compared the md5sums of the originals
against the copies in /afs and everything matched.
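
For reference, the checksum comparison amounts to something like the
following Python sketch.  The function names are my own placeholders
(nothing here comes from the AFS tools), and it only looks at regular
files found under the source tree.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def compare_trees(src, dst):
    """List relative paths under src whose copy under dst is missing or differs."""
    mismatches = []
    for root, _dirs, files in os.walk(src):
        for name in files:
            original = os.path.join(root, name)
            relative = os.path.relpath(original, src)
            copy = os.path.join(dst, relative)
            if not os.path.isfile(copy) or md5_of(original) != md5_of(copy):
                mismatches.append(relative)
    return sorted(mismatches)
```

Calling compare_trees() with the directory holding the originals and
the mount point under /afs should return an empty list when every file
matches.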

Then, I deleted the 500-750MB of files that I had originally copied into
another volume under 1.3.79 (which had acted up, throwing oops) and
attempted to copy them over again (as I had just finished doing with the
afs binaries).  This caused problems and when the copy operation
completed (with errors about preserving settings in non-existent target
files during the cp -a), I saw no files in the directory.  When I tried
rmdir'ing the subdirectory into which I had copied them on this new
volume, rmdir complained that the directory was not empty (in spite of
ls -a showing only . and ..).  Then I noticed that the volserver process
on the new server was not running, so I shut down all the server
instances on the new server and restarted them, this time checking for
volserver, which was running.  vos listvol new_server showed what I
expected.

But then I noticed that vlserver was not running (I had made the new
server a database server as described in the docs).
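
To stop finding these by accident, a check like the one below could be
run on the new server.  It is only a sketch: it compares the output of
ps against a list of process names, and the list in the usage part is
my assumption of what should be running here, taken from the processes
I mention in this message.  It does not ask bosserver for anything.

```python
import subprocess


def running_commands():
    """Return the set of command names that ps reports for all processes."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}


def missing_processes(expected, running):
    """Return the expected names that do not appear in running, in order."""
    return [name for name in expected if name not in running]


if __name__ == "__main__":
    # Assumed list; adjust to whatever BosConfig says should be running.
    expected = ["bosserver", "volserver", "vlserver"]
    for name in missing_processes(expected, running_commands()):
        print("not running:", name)
```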

Basically, to make a long story a little less long, the new 1.3.80
server machine is having real difficulty keeping server instances
running.  When I attempt to restart them (from the sysctl machine), I
often see this message on the new server:

*** glibc detected *** free(): invalid pointer: 0xb7c9c010 ***

The pointer address is always the same.

I feel like I'm flailing here in trying to get this thing working right.

Can anyone recommend a specific plan to find out exactly what's going
wrong and how to fix it?  I'd be glad to provide the BosConfig file and
other config files if anyone would like them, but I've looked these over
and they seem to be as I would expect (consider my level of expertise:
about a year running afs 1.2.11).  The /usr/afs/logs/* files don't show
anything terribly illuminating either, but there are some messages in
FileLog:
Fri Mar 25 12:00:22 2005 Vice was last started at Fri Mar 25 11:54:30 2005

Fri Mar 25 12:00:22 2005 Large vnode cache, 400 entries, 0 allocs, 0 gets (0 reads), 0 writes
Fri Mar 25 12:00:22 2005 Small vnode cache,400 entries, 0 allocs, 0 gets (0 reads), 0 writes
Fri Mar 25 12:00:22 2005 Volume header cache, 400 entries, 0 gets, 0 replacements
Fri Mar 25 12:00:22 2005 Partition /vicepa: 168725134 available 1K blocks (minfree=9030130), Fri Mar 25 12:00:22 2005 149837534 free blocks
Fri Mar 25 12:00:22 2005 With 90 directory buffers; 0 reads resulted in 0 read I/Os
Fri Mar 25 12:00:22 2005 Total Client entries = 2, blocks = 1; Host entries = 2, blocks = 1
Fri Mar 25 12:00:22 2005 There are 2 connections, process size 133164
Fri Mar 25 12:00:22 2005 There are 2 workstations, 2 are active (req in < 15 mins), 0 marked "down"
Fri Mar 25 12:00:22 2005 VShutdown:  shutting down on-line volumes...
Fri Mar 25 12:00:22 2005 VShutdown:  complete.
Fri Mar 25 12:00:22 2005 File server has terminated normally at Fri Mar 25 12:00:22 2005
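
Going by the two timestamps in that excerpt, the fileserver ran for
just under six minutes between "last started" and "terminated
normally".  A quick check of the arithmetic in Python, using only
values copied from the log (the variable names are mine, and the
format string assumes an English/C locale):

```python
from datetime import datetime

# Timestamp layout used in the FileLog lines quoted above.
LOG_FORMAT = "%a %b %d %H:%M:%S %Y"

started = datetime.strptime("Fri Mar 25 11:54:30 2005", LOG_FORMAT)
stopped = datetime.strptime("Fri Mar 25 12:00:22 2005", LOG_FORMAT)

uptime = stopped - started
print(uptime)  # 0:05:52
```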

The main problems that I see happening on the new server are:

1) server instances won't stay running (as described above);
2) after kinit'ing as the afs admin, aklog fails with:
aklog: unable to obtain tokens for cell folkvang.org (status: 11862788).

[About 2), I'm not sure whether that is expected, since I'm not
running any client processes.]
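
In case it helps anyone read that status number: assuming it is a
com_err-style code (error table number in the upper bits, message
offset in the low 8 bits), which I believe is the scheme the AFS tools
follow, it can be split as in the sketch below.  The function name is
mine, and I have not looked up which message offset 4 corresponds to.

```python
# Character set com_err uses to pack an error table name, 6 bits per character.
CHARSET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789_"
)


def decode_com_err(code):
    """Split a com_err-style code into (table name, offset within the table)."""
    offset = code & 0xFF   # low 8 bits: index of the message in the table
    table_number = code >> 8  # remaining bits: packed table name
    name = ""
    for shift in (24, 18, 12, 6, 0):
        value = (table_number >> shift) & 0x3F
        if value:
            name += CHARSET[value - 1]
    return name, offset


print(decode_com_err(11862788))  # ('KTC', 4)
```

If that reading is right, the status belongs to a table named KTC
rather than to anything Kerberos-specific.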

And because of 2) I've been starting bosserver locally on the new
server, but when trying to start or restart other server instances on
the new server, I issue the commands from the sysctl machine with the
-server argument listing the new server.

Any suggestions?

TIA.

-- 
-Kevin
http://www.gnosys.us