[OpenAFS-devel] Solaris fixes for 1.4.x / AFS_SUN510_ENV
Jeffrey Hutzelman
jhutz@cmu.edu
Wed, 30 Jan 2008 14:44:34 -0500
--On Wednesday, January 30, 2008 06:14:02 PM +1100 Mike Battersby
<mib@unimelb.edu.au> wrote:
> 1. SSYS process exiting considered harmful
>
> The first problem is that setting process flag SSYS on a process that
> exits, as the afs_osi_Invisible routine on Solaris 10 does, causes the
> system not to clean up the contract state of the process. This leaves
> a dangling kernel-memory pointer in the contract table which used to
> point to the process struct.
>
> Any user can corrupt kernel memory and cause a panic with the 'ctstat'
> command, and the system cannot shut down without either panicking or
> going into an infinite loop as svc.startd repeatedly tries to kill the
> non-existent process.
>
> I really don't know why the code would set SSYS on a userland process
> that's about to exit in the first place. Can anyone shed any light?
Threads that call afs_osi_Invisible are not about to exit; they're about to
become long-lived AFS kernel threads. Setting SSYS is correct; we just
need to figure out how to clean it up when the process exits. The right
thing to do here is probably to introduce a new osi-layer function to be
called just before such a daemon exits, which on Solaris could reasonably
turn SSYS back off.
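As a rough userland sketch of that pairing (not real kernel code): the
struct, the flag value, and every function name below are hypothetical
stand-ins, and the "visible" routine in particular is only an assumed
shape for the new osi-layer function, which does not exist yet.

```c
#include <assert.h>

/* Stand-in for the kernel's per-process flag word; NOT the real proc_t. */
struct mock_proc {
    unsigned int p_flag;
};

/* Illustrative bit value, assumed for this sketch only. */
#define MOCK_SSYS 0x1u

/* Models what afs_osi_Invisible is described as doing on Solaris 10. */
static void mock_osi_invisible(struct mock_proc *p)
{
    p->p_flag |= MOCK_SSYS;
}

/* Hypothetical counterpart, called just before the daemon exits. */
static void mock_osi_visible(struct mock_proc *p)
{
    p->p_flag &= ~MOCK_SSYS;
}

/* Daemon lifetime: become a system process, work, undo it before exit. */
static unsigned int mock_daemon_lifetime(struct mock_proc *p)
{
    mock_osi_invisible(p);
    assert(p->p_flag & MOCK_SSYS);  /* long-lived kernel-thread phase */
    /* ... the daemon's main loop would run here ... */
    mock_osi_visible(p);
    return p->p_flag;               /* what process-exit cleanup would see */
}
```

The point of the pairing is that only the SSYS bit is touched, so any
other flags the process had are left as they were when it exits.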
There's another issue here, which is that AFS's kernel threads probably
should not be considered part of the contract under which afsd is started.
That is certain to cause all sorts of havoc as SMF tries to kill off the
contract if afsd should die prematurely. I'll leave it somewhat up in the
air whether the right place to fix this is in afsd or in the kernel code.
> I'm not sure of the placing of the cleanup code for case #2, as no
> spot seems particularly better than another in afs_shutdown().
On the contrary, the shutdown process is carefully orchestrated to ensure
that each subsystem is shut down only when nothing is depending on it still
being up. The required order is similar to the reverse of startup order,
but not exactly the same.
In this case, shutting down the interface poll task fairly late is probably
the right thing. You probably should do it before setting afs_termState to
AFSOP_STOP_COMPLETE, though. More importantly, you destroy the task queue
and the lock it uses without making sure the task isn't currently running!
Simply returning at the start of the task if afs_shuttingdown is true isn't
good enough; in fact, that does almost nothing -- if the task is _not_
running when you shut down, then destroying the queue should prevent it
from being started again. If it _is_ running, then it's almost certainly
past that check, and is eventually going to end up touching the lock and/or
the task queue you've already destroyed.
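The ordering I mean can be sketched in userland with POSIX threads. This
is only an analogy, assuming a single poller thread in place of the
Solaris task queue; the struct and function names are hypothetical and
none of this is the patch's code. The stop routine sets the flag, waits
for any in-flight run to finish, and only then destroys the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

/* Hypothetical stand-in for the interface poll task and its lock. */
struct poller {
    pthread_mutex_t lock;   /* the lock the task uses */
    pthread_t thread;       /* stands in for the task queue's worker */
    bool shutting_down;     /* protected by lock */
    int polls;              /* protected by lock */
};

static void *poll_task(void *arg)
{
    struct poller *p = arg;
    for (;;) {
        pthread_mutex_lock(&p->lock);
        if (p->shutting_down) {         /* checked under the lock */
            pthread_mutex_unlock(&p->lock);
            return NULL;
        }
        p->polls++;                     /* one "poll" of the interfaces */
        pthread_mutex_unlock(&p->lock);
        sched_yield();
    }
}

static int poller_start(struct poller *p)
{
    p->shutting_down = false;
    p->polls = 0;
    if (pthread_mutex_init(&p->lock, NULL) != 0)
        return -1;
    if (pthread_create(&p->thread, NULL, poll_task, p) != 0) {
        pthread_mutex_destroy(&p->lock);
        return -1;
    }
    return 0;
}

static int poller_stop(struct poller *p)
{
    /* Step 1: ask the task to stop. */
    pthread_mutex_lock(&p->lock);
    p->shutting_down = true;
    pthread_mutex_unlock(&p->lock);

    /* Step 2: wait until the task is really finished. */
    if (pthread_join(p->thread, NULL) != 0)
        return -1;

    /* Step 3: only now is it safe to destroy what the task was using. */
    return pthread_mutex_destroy(&p->lock);
}
```

Skipping step 2 gives exactly the failure described above: a run that is
already past the flag check goes on to touch a lock that has been
destroyed underneath it.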
> Since it is fairly small I've included it here. I apologise if that's
> against list etiquette.
Including it here is fine, but a better approach would have been to send it
to openafs-bugs and then mention the ticket number here; that way it makes
its way into the bug-tracking system.
-- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
Carnegie Mellon University - Pittsburgh, PA