[OpenAFS] Fileserver process hung on startup

John Morris openafs@butchwax.com
30 Mar 2004 00:34:12 -0600


Cool, got the strace.  After the expected loading of .so files, config
files and such, we see it contact the vldbs of the other two servers,
calling 'rt_sigsuspend' after each one; it never returns from the
second call.  Is this a thread locking issue?

If it's relevant, this is openafs 1.2.11 built from a nearly-stock SRPM
on this machine, which runs RH8 with a custom kernel and the stock RH
glibc 2.2.93.

Sorry for the line-wrapping, I'm not in my preferred environment right
now....

sendmsg(7, {msg_name(16)={sin_family=AF_INET, sin_port=htons(7003),
sin_addr=inet_addr("67.67.198.201")}},\
msg_iov(2)=[{"\257\356\r\2001\177\254\274\0\0\0\1\0\0\0\1\0\0\0\1\1\5"..., 28}, {"\0\0\2\24\0kK\356\0\0\3\34\241\0\0\37&\377\377\377\276\0"..., 64}], msg_controllen=0, msg_flags=0}, 0) = 92
time(NULL)                              = 1080626914
kill(5685, SIGRTMIN)                    = 0
kill(5685, SIGRTMIN)                    = 0
time(NULL)                              = 1080626914
rt_sigprocmask(SIG_SETMASK, NULL, [RTMIN], 8) = 0
rt_sigsuspend([] <unfinished ...>
--- SIGRTMIN (Real-time signal 0) ---
<... rt_sigsuspend resumed> )           = -1 EINTR (Interrupted system call)
sigreturn()                             = ? (mask now [RTMIN])
gettimeofday({1080626914, 364860}, NULL) = 0
gettimeofday({1080626914, 365028}, NULL) = 0
time(NULL)                              = 1080626914
gettimeofday({1080626914, 365310}, NULL) = 0
gettimeofday({1080626914, 365488}, NULL) = 0
sendmsg(7, {msg_name(16)={sin_family=AF_INET, sin_port=htons(7003),
sin_addr=inet_addr("66.227.12.105")}},\
msg_iov(2)=[{"\257\356\r\2001\177\254\270\0\0\0\1\0\0\0\1\0\0\0\1\1\5"..., 28}, {"\0\0\2\24\0kK\356\0\0\3\34\241\0\0\37&\377\377\377\276\0"..., 64}], msg_controllen=0, msg_flags=0}, 0) = 92
time(NULL)                              = 1080626914
time(NULL)                              = 1080626914
rt_sigprocmask(SIG_SETMASK, NULL, [RTMIN], 8) = 0
rt_sigsuspend([]
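
For anyone wanting to check a longer strace log for the same symptom, here is a minimal sketch in Python that reports the final syscall if it never returned. It assumes an unwrapped, single-process log (no -f), and the function name last_blocked_call is just something made up for illustration.

```python
import re


def last_blocked_call(lines):
    """Return the name of the final syscall if it has no return value, else None.

    Assumes one strace record per line (no mail-client wrapping) and a
    single traced process.
    """
    for line in reversed(lines):
        line = line.strip()
        if not line:
            continue  # skip trailing blank lines
        m = re.match(r"(\w+)\(", line)
        if m is None:
            # Last record is a signal marker or a "<... resumed>" line,
            # so the process was not left sitting inside a fresh call.
            return None
        # A completed call ends with ") = <retval>"; a blocked one does not.
        if re.search(r"\)\s*=\s*\S", line):
            return None
        return m.group(1)
    return None


if __name__ == "__main__":
    tail = [
        "time(NULL)                              = 1080626914",
        "rt_sigprocmask(SIG_SETMASK, NULL, [RTMIN], 8) = 0",
        "rt_sigsuspend([]",
    ]
    print(last_blocked_call(tail))
```

Fed the last three lines of the trace above, it names rt_sigsuspend as the call that never came back.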

As for the environment, the LD_ASSUME_KERNEL variable is in there:

root      5681  0.0  0.6 17324 3544 ?        S<   00:08   0:00
/usr/afs/bin/fileserver HOSTNAME=kugioga.butchwax.com TERM=dumb
SHELL=/bin/bash HISTSIZE=1000 SSH_CLIENT=192.168.3.101 48003 22
SSH_TTY=/dev/pts/2 EMACS=t USER=root LS_COLORS= TERMCAP= USERNAME=root
COLUMNS=80
PATH=/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin MAIL=/var/spool/mail/root LC_COLLATE=C PWD=/usr/afs/logs INPUTRC=/etc/inputrc LANG=en_US SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass HOME=/root SHLVL=3 LD_ASSUME_KERNEL=2.4.1 BASH_ENV=/root/.bashrc LOGNAME=root LESSOPEN=|/usr/bin/lesspipe.sh %s DISPLAY=localhost:10.0 G_BROKEN_FILENAMES=1 _=/usr/bin/strace
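
An alternative to eyeballing the ps output is to read the process environment straight from /proc on Linux. The following is only a sketch: parse_environ and process_env are hypothetical helper names, and it assumes you run it as root (or as the process owner) so that /proc/<pid>/environ is readable. Note that this file reflects the environment the process was started with.

```python
def parse_environ(raw: bytes) -> dict:
    """Split the NUL-separated NAME=value blob from /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if not entry:
            continue  # the blob normally ends with a trailing NUL
        name, sep, value = entry.partition(b"=")
        if sep:
            env[name.decode(errors="replace")] = value.decode(errors="replace")
    return env


def process_env(pid: int) -> dict:
    """Read and parse the startup environment of a running process (Linux only)."""
    with open(f"/proc/{pid}/environ", "rb") as fh:
        return parse_environ(fh.read())


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python envcheck.py <pid>
    env = process_env(int(sys.argv[1]))
    print(env.get("LD_ASSUME_KERNEL", "LD_ASSUME_KERNEL is not set"))
```

Run against the fileserver's pid (5681 in the ps line above), this should print the value of LD_ASSUME_KERNEL if it is set.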

Thanks!  Think we're getting closer?

	John



On Mon, 2004-03-29 at 21:30, Derrick J Brashear wrote:
> On Mon, 29 Mar 2004, John Morris wrote:
> 
> > What I'm guessing, judging by the fileserver's complete
> > unresponsiveness and the fact that it's not even making system calls,
> > is that something is hanging up, maybe a system call?  Is there any
> > documentation on the '-d #' argument to the fileserver executable's
> > commandline?  What debug level could I give it so it would give me a
> > clue what it's doing?  Where else can I look for debugging information
> > besides the sources I listed in my first email (included below for a
> > reminder)?
> 
> it won't really help. the debug messages probably won't give you a hint
> where it's hanging.
> 
> attach before it hangs; better yet wrap fileserver itself with a shell
> script which straces to a file.
> 
> I really like the LD_ASSUME_KERNEL idea; run
>   ps auxwwe | grep fileserver | grep -v grep
> and see if the LD_ASSUME_KERNEL variable is in the fileserver's
> environment.