[OpenAFS] Re: 1.4.1-rc2 feedback
Christopher D. Clausen
cclausen@acm.org
Tue, 13 Dec 2005 15:56:00 -0600
Terry McCoy wrote:
>>> I have sent openafs-bugs@openafs.org several reports with fileserver
>>> core dumps. I wish I could find out why it is dumping core so
>>> often.
>
>> On what OS?
>
> Solaris 8
Have you tried other 1.3.x builds? See if 1.3.81 is stable. I believe
that the changes that possibly caused problems were made between builds
.82 and .84.
Are you by chance running on a single processor system? I had all kinds
of problems using an Ultra 60 with a single proc. Not sure if that was
the issue, but I'm going to assume that just about everyone running
Solaris on sparc isn't using single-proc machines. Things got much
better when I replaced the Ultra 60 with an E420.
> Let me know if you get your fileserver process to run more than 24
> hours without it dumping core.
It appears as though all AFS server processes on alnilam.acm.uiuc.edu
have been running since Nov 23rd without crashing. This machine is a
Sun E3000 running Solaris 10 and the binary 1.4.0 builds from
openafs.org. My E420R and E450 aren't doing as well though. The
ptserver and bosserver processes have been crashing on these servers.
This is an example crashed bosserver stack trace:
[cclausen@alnitak:/usr/afs/logs]% sudo pstack core
Password:
core 'core' of 349: /usr/afs/bin/bosserver
ff1c0f90 _lwp_kill (6, 0, ff1a4a98, ffffffff, ff1e8284, 6) + 8
ff13ff98 abort (6ac00, 1, 68564, a83f0, ff1eb298, 0) + 110
0003bbcc ???????? (c000, ffbff188, 78a80, ac8c8, 2, 69000)
0003c158 ???????? (3ba5c, ac91c, 0, 3, 3b800, 0)
0003b870 LWP_MwaitProcess (1, 0, 0, 78a80, ac8c8, 4) + 1c8
0003b688 LWP_WaitProcess (7c4e8, 126, 7ab1c, 125, 0, 0) + 20
00027fd8 rx_GetCall (5, 7681c, ffbff37c, 7c4e8, 0, 68f80) + 310
00027aa0 rxi_ServerProc (5, 0, ffbff37c, 9a01, 78800, 68c00) + 44
00026844 rx_ServerProc (68c00, 37, 68c00, 153528, 20, ffffffff) + 6c
00026f58 rx_StartServer (6b800, 7a800, 0, 1, 68c00, 68c00) + c8
0001abd8 main (76800, 19800, 68800, 76800, 76800, 0) + 60c
000195c0 _start (0, 0, 0, 0, 0, 0) + 108
I assume you have similar crashes?
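One rough way to check whether two crashes are "similar" is to reduce each pstack trace to its chain of function names and diff the results. A minimal sketch follows; the helper name "frames" is made up for illustration, and it assumes every frame line starts with a hex address followed by the symbol name, as in the trace above:

```shell
#!/bin/sh
# frames: hypothetical helper, not part of OpenAFS or Solaris.
# Reads pstack output on stdin and prints one function name per frame.
# A line counts as a frame only if its first field is a hex address,
# which skips the "core 'core' of ..." header and any prompt noise.
frames() {
    awk '$1 ~ /^[0-9a-f]+$/ && NF >= 2 { print $2 }'
}
```

Something like "pstack core | frames > alnitak.txt" on each server, followed by a diff of the resulting files, would show whether the call chains match. Unresolved frames still show up as ????????, so stripped binaries will limit how much this tells you.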
Hmm... just checked... It looks like there aren't very many volumes on
alnilam. I'm guessing the processes don't crash if they aren't doing
anything :-)
> Here are the configure options I compiled with
>
> --enable-bos-new-config
> --enable-largefile-fileserver
> --enable-supergroups
> --enable-fast-restart
> --enable-bitmap-later
> --enable-transarc-paths
>
> --with-krb5-conf=/home/tmmccoy/Kerberos/krb5_1.4.2/krb5/bin/krb5-config
>
> --enable-debug-lwp (currently grinding away recompiling)
I'm not sure how it relates to debug-lwp, but you might also want to
try --disable-optimize-lwp and see if that helps any.
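Put together, the rebuild I have in mind would look roughly like the sketch below. The options are the ones quoted above plus --disable-optimize-lwp; the krb5-config path is Terry's and will differ elsewhere. It only prints the command, on the assumption that you would run it by hand from the top of an unpacked OpenAFS source tree:

```shell
#!/bin/sh
# Options from the quoted list, plus --disable-optimize-lwp as the
# extra thing to try. The krb5-config path is site-specific.
CONFIGURE_OPTS="--enable-bos-new-config \
--enable-largefile-fileserver \
--enable-supergroups \
--enable-fast-restart \
--enable-bitmap-later \
--enable-transarc-paths \
--with-krb5-conf=/home/tmmccoy/Kerberos/krb5_1.4.2/krb5/bin/krb5-config \
--enable-debug-lwp \
--disable-optimize-lwp"

# Print the command rather than running it; remove "echo" to run
# configure for real from the source directory.
echo ./configure $CONFIGURE_OPTS
```

Changing one option per rebuild makes it easier to tell which one, if any, affects the crashes.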
-----
Is there a way to restart the bosserver process WITHOUT first stopping
all other AFS server processes (and thus causing downtime)? I really
don't like to wait for a salvage or have to vos move everything to
another server. Or, is there any disadvantage to NOT restarting the
bosserver process (other than losing the auto-restart if something else
crashes and the use of the bos command to control things)? And, if
something else did crash (likely ptserver), can it be manually restarted
without using bosserver?
<<CDC
--
Christopher D. Clausen
ACM@UIUC SysAdmin