[OpenAFS] Problems with 1.4.1 (built 2006-04-11) on AIX 5.3 ML03

Tom Keiser tkeiser@gmail.com
Fri, 9 Jun 2006 17:26:04 -0400

On 6/9/06, Tomasz Skarupa <tomasz.skarupa@gmail.com> wrote:
> I help in the administration of the cell enea.it.
> We are migrating our AIX fileservers [AIX 5.3 ML03] from AFS Transarc
> 3.6 2.38 (inode) to OpenAFS (namei).
> Version OpenAFS 1.4.0 works well but we have tried also version 1.4.1
> built 2006-04-11 and there we have two problems:

Hi Tomasz,

If you want to run 1.4.1 on AIX in production, I strongly recommend
recompiling with largefile support disabled.  When I packaged 1.4.1
for openafs.org, I left largefile support turned on, which I admit was
a mistake.  I thought I had solved the AIX largefile problems a few
months ago, but since then several new bugs have been uncovered.  I'm
working on a patch this week to clean up the AIX largefile builds, and
with any luck that patch will make it into 1.4.2 (but it's not my call
to make).

> 1) frequent [every 2-3 days] coredumps:
> Segmentation fault in rxkad_DecryptPacket
> the output of dbx (with showProcInfo script) is shown at the end of
> the E-mail

Is there any chance you can reproduce the rxkad crash on a fileserver
compiled with debug symbols?  Stack traces without symbols are of
little use.  If you don't feel like rebuilding openafs yourself,
let me know and I can supply you with a debug build.

I have a hunch it's caused by a false assumption in the code that
struct rxkad_cprivate and struct rxkad_sprivate keep the field "type"
at the same offset, but it would be nice to get confirmation on this...
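
To illustrate the kind of bug I mean, here is a minimal sketch.  The
structs demo_cprivate and demo_sprivate and all the demo_* functions
are hypothetical stand-ins I made up for this example; they are NOT
the real rxkad definitions, and the real field lists differ.  The
sketch only shows how code that reads "type" at one struct's offset
can pick up unrelated bytes when it is handed the other struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical client-side private data: `type` sits after two ints. */
struct demo_cprivate {
    int kvno;
    int ticket_len;
    unsigned char type;
};

/* Hypothetical server-side private data: `type` is the first field. */
struct demo_sprivate {
    unsigned char type;
    unsigned char level;
    int expiration;
    int reserved;
};

/* Returns 1 if both layouts keep `type` at the same byte offset. */
int demo_type_offsets_match(void)
{
    return offsetof(struct demo_cprivate, type) ==
           offsetof(struct demo_sprivate, type);
}

/* Reads `type` from a private-data blob using the CLIENT offset only.
 * The caller must pass at least sizeof(struct demo_cprivate) bytes. */
unsigned char demo_type_via_client_offset(const void *priv)
{
    const unsigned char *bytes = priv;
    return bytes[offsetof(struct demo_cprivate, type)];
}

/* Hands a server struct to the client-offset reader.
 * Returns 1 if the value read differs from the real `type`. */
int demo_misread_server_type(void)
{
    struct demo_sprivate sp;
    unsigned char seen;

    /* Guard: the client offset must fall inside the server struct. */
    assert(offsetof(struct demo_cprivate, type) < sizeof sp);

    memset(&sp, 0, sizeof sp);
    sp.type = 3;

    seen = demo_type_via_client_offset(&sp);
    printf("client offset %zu, server offset %zu, real %u, seen %u\n",
           offsetof(struct demo_cprivate, type),
           offsetof(struct demo_sprivate, type),
           (unsigned)sp.type, (unsigned)seen);
    return seen != sp.type;
}
```

In this made-up layout the reader lands in the zeroed "reserved"
field rather than in "type", so it sees the wrong value.  In real code
the bytes at that offset could be anything, which is the sort of thing
that might end in a segfault further down the call chain.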

> 2) after about a day it does not answer to vos listvol commands, while
> bos status results are normal.

Yeah, all bos status will tell you is whether or not the volserver
process is still running.  It would be really helpful if you could
send stack backtraces (with debug symbols) for all the volserver
threads, and all the fileserver threads.  This sounds vaguely like a
bug in fssync.