[OpenAFS] OpenAFS 1.3.87 and 1.4.0-rc6 stability issues on Solaris 10

Loic Tortay tortay@cc.in2p3.fr
Tue, 11 Oct 2005 18:11:08 +0200


Hello,
I'm facing a somewhat severe stability problem with OpenAFS 1.3.87 and
1.4.0-rc6 on Solaris 10, on both i386 and Sparc.

Using one of the new "SMF" commands can easily trigger a panic on
Solaris 10 when OpenAFS is running.

Specifically, the problem happens when running the "svcs -p" command
when (and only when) OpenAFS is up and running.

About one time out of three, the system will panic immediately.

The panic message always looks the same, as does the stack trace, and it
happens on both i386 (actually AMD64 booted in 32-bit mode) and Sparc
(in 64-bit mode).

Besides that, access to AFS works correctly.

The stack trace on a V40z running Solaris 10 in i386/32-bit
mode is:
 # mdb unix.10 vmcore.10
 Loading modules: [ unix krtld genunix specfs ufs ip sctp usba fctl lofs nfs random ptm ]
 > $c
 contract_process_status+0x126(d1411c40, fed43330, 2, d0f10348, d0aa4e7c, 100000)
 ctfs_stat_ioctl+0x9d()
 fop_ioctl+0x1e(d1c21300, 63747300, 809dd88, 102001, d12ced90, d0aa4f80)
 ioctl+0x199()
 sys_sysenter+0xdc()
 >

The problem also occurs with the latest Solaris 10 recommended patch
cluster. The kernel release for the above-mentioned machine is
"Generic_118844-08" (it also happens with older kernel releases, and
with kernel releases up to and including "Generic_118822-18" on Sparc).

OpenAFS is compiled with Sun Studio 10 (with the following
patches on i386: 117831-03, 117837-05, 117846-07 and 118682-01).

The "configure" options used are "--enable-transarc-paths" and
"--with-afs-sysname=sunx86_510" ("--with-afs-sysname=sun4x_510" on
Sparc).

I have several "crash dumps" on both i386 and Sparc if needed.

The problem does not occur without AFS: I ran a simple
"while :; do svcs -p > /dev/null;done" for about 24 hours (> 1.6
million calls to "svcs -p") without a panic.

The same loop needs less than one second to trigger a panic with
OpenAFS running.
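
For anyone who wants to compare the two cases with an actual call count,
the loop above can be wrapped in a small counting function. This is only
a sketch: "run_repeat" is a hypothetical helper name, not something from
OpenAFS or Solaris, and the count is only printed if the machine survives
the run (so it is mostly useful for the control run without AFS).

```shell
# run_repeat MAX CMD [ARGS...]
# Hypothetical helper: run CMD up to MAX times, discarding its output,
# then print how many invocations completed.
run_repeat() {
    max=$1
    shift
    count=0
    while [ "$count" -lt "$max" ]; do
        "$@" > /dev/null 2>&1
        count=$((count + 1))
    done
    echo "$count"
}

# Example (not run here): run_repeat 1000000 svcs -p
```

With OpenAFS running, the panic is expected well before the function
returns, so no count will be shown in that case.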


I can't find anything related to this in either the list archives or
Google.

I've found a few things about people actually running OpenAFS on
Solaris 10, including people running AFS cells on Solaris 10 servers,
but none mentioning such an issue.

So my question is: am I the only one with this issue?

If so, does anyone have a clue where to look for the origin of this
problem?


Loïc.
-- 
| Loïc Tortay <tortay@cc.in2p3.fr> -     IN2P3 Computing Centre     |