[OpenAFS] Re: 1.4.1-rc2 feedback

Christopher D. Clausen cclausen@acm.org
Tue, 13 Dec 2005 15:56:00 -0600


Terry McCoy wrote:
>>> I have sent openafs-bugs@openafs.org several reports with fileserver
>>> core dumps.  I wish I could find out why it is dumping core so
>>> often.
>
>> On what OS?
>
> Solaris 8

Have you tried other 1.3.x builds?  See if 1.3.81 is stable.  I believe 
the changes that may have caused problems were made between 1.3.82 and 
1.3.84.

Are you by chance running on a single processor system?  I had all kinds 
of problems using an Ultra 60 with a single proc.  Not sure if that was 
the issue, but I'm going to assume that just about everyone running 
Solaris on sparc isn't using single-proc machines.  Things got much 
better when I replaced the Ultra 60 with an E420.

> Let me know if you get your fileserver process to run more than 24
> hours without it dumping core.

It appears as though all AFS server processes on alnilam.acm.uiuc.edu 
have been running since Nov 23rd without crashing.  This machine is a 
Sun E3000 running Solaris 10 and the binary 1.4.0 builds from 
openafs.org.  My E420R and E450 aren't doing as well though.  The 
ptserver and bosserver processes have been crashing on these servers. 
Here is an example stack trace from a crashed bosserver:

[cclausen@alnitak:/usr/afs/logs]% sudo pstack core
Password:
core 'core' of 349:     /usr/afs/bin/bosserver
 ff1c0f90 _lwp_kill (6, 0, ff1a4a98, ffffffff, ff1e8284, 6) + 8
 ff13ff98 abort    (6ac00, 1, 68564, a83f0, ff1eb298, 0) + 110
 0003bbcc ???????? (c000, ffbff188, 78a80, ac8c8, 2, 69000)
 0003c158 ???????? (3ba5c, ac91c, 0, 3, 3b800, 0)
 0003b870 LWP_MwaitProcess (1, 0, 0, 78a80, ac8c8, 4) + 1c8
 0003b688 LWP_WaitProcess (7c4e8, 126, 7ab1c, 125, 0, 0) + 20
 00027fd8 rx_GetCall (5, 7681c, ffbff37c, 7c4e8, 0, 68f80) + 310
 00027aa0 rxi_ServerProc (5, 0, ffbff37c, 9a01, 78800, 68c00) + 44
 00026844 rx_ServerProc (68c00, 37, 68c00, 153528, 20, ffffffff) + 6c
 00026f58 rx_StartServer (6b800, 7a800, 0, 1, 68c00, 68c00) + c8
 0001abd8 main     (76800, 19800, 68800, 76800, 76800, 0) + 60c
 000195c0 _start   (0, 0, 0, 0, 0, 0) + 108

I assume you have similar crashes?
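
If you want to compare, something like the following should print a 
trace for every core file in the logs directory.  This is only a sketch: 
I'm assuming the cores live in /usr/afs/logs and have names starting 
with "core", which may not match your setup.

```shell
#!/bin/sh
# Print a pstack trace for each core file in a directory.
# LOGDIR and the core* naming pattern are assumptions; adjust as needed.
LOGDIR="${LOGDIR:-/usr/afs/logs}"

dump_traces() {
    dir="$1"
    for core in "$dir"/core*; do
        # Skip the unexpanded pattern when no core files exist
        [ -f "$core" ] || continue
        echo "=== $core ==="
        pstack "$core"
    done
}

dump_traces "$LOGDIR"
```

You will probably need to run it under sudo, as in the session above.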

Hmm... just checked...  It looks like there aren't very many volumes on 
alnilam.  I'm guessing the processes don't crash if they aren't doing 
anything :-)

> Here are the configure options I compiled with
>
>    --enable-bos-new-config
>    --enable-largefile-fileserver
>    --enable-supergroups
>    --enable-fast-restart
>    --enable-bitmap-later
>    --enable-transarc-paths
>
> --with-krb5-conf=/home/tmmccoy/Kerberos/krb5_1.4.2/krb5/bin/krb5-config
>
>    --enable-debug-lwp   (currently grinding away recompiling)

I'm not sure how it relates to debug-lwp, but you might also want to 
try --disable-optimize-lwp and see if that helps any.
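
For reference, the full invocation with that flag added would look 
roughly like this.  The options are the ones you listed; I'm assuming 
--disable-optimize-lwp can simply be given alongside --enable-debug-lwp, 
and I haven't tested this exact combination.

```shell
#!/bin/sh
# Options copied from the list quoted above, plus --disable-optimize-lwp.
# The krb5-config path is specific to that build host.
OPTS="--enable-bos-new-config \
--enable-largefile-fileserver \
--enable-supergroups \
--enable-fast-restart \
--enable-bitmap-later \
--enable-transarc-paths \
--with-krb5-conf=/home/tmmccoy/Kerberos/krb5_1.4.2/krb5/bin/krb5-config \
--enable-debug-lwp \
--disable-optimize-lwp"

# Only prints the command; remove "echo" to actually run configure
# from the top of the source tree.
echo ./configure $OPTS
```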

-----

Is there a way to restart the bosserver process WITHOUT first stopping 
all other AFS server processes (and thus causing downtime)?  I really 
don't like having to wait for a salvage or vos move everything to 
another server.  Or, is there any disadvantage to NOT restarting the 
bosserver process (other than losing the auto-restart if something else 
crashes and the ability to use the bos command to control things)?  And 
if something else did crash (likely ptserver), can it be manually 
restarted without using bosserver?

<<CDC
-- 
Christopher D. Clausen
ACM@UIUC SysAdmin