[OpenAFS-devel] Why do afsd daemons loop tightly after receiving a SIGHUP?

Jeffrey Hutzelman jhutz@cmu.edu
Thu, 2 Aug 2001 12:59:46 -0400 (EDT)


On Wed, 1 Aug 2001, Matt Peterson wrote:

> From what I could tell from a few minutes of debugging, it looks like
> problems start as soon as the flush_signals() call is made in
> LINUX/osi_sleep.c:106.  I tried for a bit longer to find an
> obvious while or for loop problem, but was unable to come up with
> anything. 

The afsd processes aren't really normal processes.  They're there to
provide a process context in which the AFS kernel code can run.  In that
role, they make one syscall (afs_syscall(AFSCALL_CALL, AFSOP_START_xxx) 
for various values of xxx) which returns only when AFS has been shut down. 

Now, several of these processes spend most of their time sleeping inside
the kernel, waiting for some event to happen, like an incoming Rx packet
or some work for a background thread.  When such a process wakes up, it
checks to see what happened, does any appropriate work, and then goes back
to sleep.  Keep in mind that all of this happens inside the afs_syscall()
call; no user-mode code is involved.
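
To make that loop concrete, here is a rough user-space model of it.  This
is only a sketch: struct daemon_model, daemon_loop() and EV_SHUTDOWN are
made-up names for illustration, not anything from the OpenAFS source, and
the "sleep" is simulated by walking a scripted list of wakeups rather than
by actually blocking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one afsd kernel thread.  None of these names
 * come from the OpenAFS source; they exist only for this sketch. */
enum { EV_SHUTDOWN = -1 };

struct daemon_model {
    int handled;    /* work items processed so far */
    int sleeps;     /* number of times the thread went to sleep */
    bool shutdown;  /* set once AFS shutdown has been observed */
};

/* Run the daemon loop over a scripted list of wakeups.  Each entry is
 * the number of work items that arrived during that sleep, or
 * EV_SHUTDOWN.  Entries after EV_SHUTDOWN are never looked at, because
 * shutdown is the only way out of the loop (short of the script
 * running out, which a real daemon would not experience). */
int daemon_loop(struct daemon_model *d, const int *wakeups, size_t n)
{
    assert(d != NULL && wakeups != NULL);

    for (size_t i = 0; i < n; i++) {
        d->sleeps++;                    /* sleep until something happens */

        if (wakeups[i] == EV_SHUTDOWN) {
            d->shutdown = true;
            return d->handled;          /* the "syscall" finally returns */
        }

        d->handled += wakeups[i];       /* do any appropriate work ... */
    }                                   /* ... and go back to sleep */
    return d->handled;
}
```

Calling daemon_loop() with the script {2, 0, 3, EV_SHUTDOWN, 5} handles
five work items over four sleeps and then returns; the trailing 5 is never
seen.  Note that a wakeup with nothing to do (the 0 entry) is harmless: the
thread just goes back to sleep.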

The trouble starts when you send a signal to one of these
processes.  Linux can't deliver the signal while the process is in a
syscall -- it has to wait for the syscall to exit.  Since the syscall is
actually sleeping inside the kernel, Linux wakes up that process
prematurely to give the syscall a chance to notice the pending signal,
clean up, and exit.  The problem is that the AFS syscall doesn't want to
do that -- if it returns, then that thread is gone forever, which in some
cases could be pretty bad.  As it turns out, the AFS syscall code contains
no provision for ever returning unless AFS is shut down.  It checks for
something to do, and finding nothing, goes back to sleep.  At which point
the kernel promptly wakes it up again, because the signal is still
pending and still undelivered.  This cycle repeats indefinitely, which is
why a signalled afsd process spins in a tight loop instead of sleeping.
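
The cycle described above can be modelled in a few lines of user-space C.
Again, this is a sketch under stated assumptions, not the real code:
struct task_model, interruptible_sleep() and spin_count() are hypothetical
names, and the model simply assumes that a sleep returns early whenever a
signal is pending and that nothing in the loop ever clears or acts on that
signal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; these are not OpenAFS or Linux kernel APIs. */
struct task_model {
    bool signal_pending;    /* a signal is waiting to be delivered */
    long real_sleeps;       /* sleeps that actually blocked */
    long spurious_wakeups;  /* sleeps cut short by the pending signal */
};

/* Model of an interruptible sleep.  Returns true if the task blocked
 * normally, false if it came straight back because a signal is pending. */
static bool interruptible_sleep(struct task_model *t)
{
    if (t->signal_pending) {
        t->spurious_wakeups++;
        return false;
    }
    t->real_sleeps++;
    return true;
}

/* Run `iterations` passes of a daemon loop that never returns to user
 * mode.  The signal is therefore never delivered and never cleared.
 * Returns the number of wakeups caused by the pending signal. */
long spin_count(struct task_model *t, long iterations)
{
    assert(t != NULL);

    for (long i = 0; i < iterations; i++) {
        interruptible_sleep(t);
        /* Check for work; in this model there is never any.
         * There is deliberately no "signal pending, so clean up and
         * return" branch here: that is the missing provision. */
    }
    return t->spurious_wakeups;
}
```

With signal_pending clear, every pass through the loop is a real sleep,
which in the kernel would block until an event arrives.  With
signal_pending set, every pass is an immediate return, so the loop never
blocks and just burns CPU.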


Moral of the story -- don't send signals to afsd.  It's pretty much never
actually what you want.  Usually what you want is to simply unmount AFS,
wait for all the afsd processes to exit, and unload the module.  The
various binary distributions for Linux should all come with an rc script
that does this.

-- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
   Sr. Research Systems Programmer
   School of Computer Science - Research Computing Facility
   Carnegie Mellon University - Pittsburgh, PA