[OpenAFS-devel] Why do afsd daemons loop tightly after receiving a SIGHUP?

Matt Peterson matt@calderasystems.com
Thu, 2 Aug 2001 23:37:04 -0600 (MDT)


Matthew Peterson
On Thu, 2 Aug 2001, Daniel Jacobowitz wrote:

> On Thu, Aug 02, 2001 at 10:50:15PM -0400, Derek Atkins wrote:
> > Well, yea.  It looks like we should be able to flush_signals() on the
> > current thread context.  I _thought_ that's what we were doing
> > already.  Looking at src/afs/LINUX/osi_sleep.c, in afs_osi_Wait() we
> > do actually call flush_signals() if osi_TimedSleep() returns non-zero
> > and aintok (the third argument to afs_osi_Wait()) is zero.
> > 
> > So, this _should_ be doing the right thing, provided aintok is zero.
> > And indeed, it definitely looks like all the calls to afs_osi_Wait
> > indeed pass zero as the third argument.  So, we should be flushing
> > the signals.
> > 
> > AHH, the flush_signals() code is only activated if AFS_GLOBAL_SUNLOCK
> > is defined.  And that is only defined if AFS_SMP is defined.  This
> > means that signals are only flushed properly on SMP machines!  I bet
> > that's the problem.  :)
> 
> That might do it, yeah :)  I thought I'd never seen this problem, and I
> know I've sent signals to afsd on my SMP Linux machines.

Actually, I happened to be using an SMP kernel on a single-processor
machine when I saw this problem.  I spent some more time looking at the
problem and I think that it is a bit deeper than you've discussed (I'd be
really happy if it weren't).

The Linux 2.4.x scheduler code appears to check for pending signals
when determining a task's run state, and those signals are cleared by
flush_signals().  Just for fun, try commenting out flush_signals() and
you'll notice tight looping, but the loop executes a very different code
path than the one that produces the CPU hog when flush_signals() is in
place.

Another thing that makes me suspicious is that we see the same behavior
regardless of the Linux kernel version.  2.2 and 2.4 have significantly
different signal handling code so you'd expect to see differing behaviors
if the problem was indeed limited solely to the Linux kernel code in
question.

Ultimately, I was able to get around the immediate problem by modifying my
init scripts to ensure that SIGHUP is never sent to afsd.  Considering the
number and diversity of Linux init scripts, I'm surprised there have not
been many more requests for help on this subject.  IMO this is only a
temporary workaround, as there are several other situations (process
monitoring tools, etc.) that would require that afsd act like a real
process when receiving signals.

Cheers,

--
Matt Peterson
Sr. Software Engineer
Caldera, Inc