[OpenAFS-devel] Solaris fixes for 1.4.x / AFS_SUN510_ENV

Wed, 30 Jan 2008 14:44:34 -0500

--On Wednesday, January 30, 2008 06:14:02 PM +1100 Mike Battersby 
<mib@unimelb.edu.au> wrote:

> 1. SSYS process exiting considered harmful
>
>   The first problem is that setting process flag SSYS on a process that
>   exits, as the afs_osi_Invisible routine on Solaris 10 does, causes the
>   system not to clean up the contract state of the process.  This leaves
>   a dangling kernel-memory pointer in the contract table which used to
>   point to the process struct.
>
>   Any user can corrupt kernel memory and cause a panic with the 'ctstat'
>   command, and the system cannot shut down without either panicking or
>   going into an infinite loop as svc.startd repeatedly tries to kill the
>   non-existent process.
>
> I really don't know why the code would set SSYS on a userland process
> that's about to exit in the first place.  Can anyone shed any light?

Threads that call afs_osi_Invisible are not about to exit; they're about to 
become long-lived AFS kernel threads.  Setting SSYS is correct; we just 
need to figure out how to clean it up when the process exits.  The right 
thing to do here is probably to introduce a new osi-layer function to be 
called just before such a daemon exits, which on Solaris could reasonably 
turn SSYS back off.
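
As a rough illustration of the paired osi-layer calls described above, here is a minimal user-space sketch. Everything in it is a stand-in: MOCK_SSYS and struct mock_proc only model the Solaris SSYS flag and the p_flag field of the process structure, and mock_osi_Visible is a hypothetical name for the new "undo" function, not an existing OpenAFS interface. Real kernel code would also have to take whatever lock protects p_flag, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the Solaris SSYS "system process" flag (assumed value). */
#define MOCK_SSYS 0x1u

/* Stand-in for the process structure; only the flag word is modelled. */
struct mock_proc {
    unsigned int p_flag;
};

/* Models what afs_osi_Invisible does on Solaris 10: mark the process SSYS. */
void mock_osi_Invisible(struct mock_proc *p)
{
    assert(p != NULL);
    p->p_flag |= MOCK_SSYS;
}

/*
 * Hypothetical counterpart, to be called just before the daemon exits:
 * clear SSYS again so normal exit-time cleanup applies to the process.
 * Other flag bits are left untouched.
 */
void mock_osi_Visible(struct mock_proc *p)
{
    assert(p != NULL);
    p->p_flag &= ~MOCK_SSYS;
}
```

On platforms where afs_osi_Invisible does nothing special, the counterpart would presumably be a no-op as well.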

There's another issue here, which is that AFS's kernel threads probably 
should not be considered part of the contract under which afsd is started. 
That is certain to cause all sorts of havoc as SMF tries to kill off the 
contract if afsd should die prematurely.  I'll leave it somewhat up in the 
air whether the right place to fix this is in afsd or in the kernel code.

> I'm not sure of the placing of the cleanup code for case #2, as no
> spot seems particularly better than another in afs_shutdown().

On the contrary, the shutdown process is carefully orchestrated to ensure 
that each subsystem is shut down only when nothing is depending on it still 
being up.  The required order is similar to the reverse of startup order, 
but not exactly the same.

In this case, shutting down the interface poll task fairly late is probably 
the right thing.  You probably should do it before setting afs_termState to 
AFSOP_STOP_COMPLETE, though.  More importantly, you destroy the task queue 
and the lock it uses without making sure the task isn't currently running! 
Simply returning at the start of the task if afs_shuttingdown is true isn't 
good enough; in fact, that does almost nothing -- if the task is _not_ 
running when you shut down, then destroying the queue should prevent it 
from being started again.  If it _is_ running, then it's almost certainly 
past that check, and is eventually going to end up touching the lock and/or 
the task queue you've already destroyed.
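
To make the race concrete, here is a minimal user-space sketch of one way to close it, using POSIX threads rather than the Solaris kernel primitives. All names (struct poller, poller_run_once, poller_shutdown) are hypothetical and are not taken from the patch. The idea is that the task checks the shutdown flag and marks itself running under one lock, and the shutdown path sets the flag and then waits for the running mark to clear before it destroys anything.

```c
#include <pthread.h>
#include <stddef.h>

struct poller {
    pthread_mutex_t state_lock;  /* protects the fields below */
    pthread_cond_t  idle;        /* signalled when a pass finishes */
    int shutting_down;           /* set once by poller_shutdown */
    int running;                 /* nonzero while a pass is in progress */
    int resources_alive;         /* models the task queue and its lock */
    int runs;                    /* number of completed passes */
};

void poller_init(struct poller *p)
{
    pthread_mutex_init(&p->state_lock, NULL);
    pthread_cond_init(&p->idle, NULL);
    p->shutting_down = 0;
    p->running = 0;
    p->resources_alive = 1;
    p->runs = 0;
}

/* One pass of the poll task. Returns 1 if it did work, 0 if it refused. */
int poller_run_once(struct poller *p)
{
    pthread_mutex_lock(&p->state_lock);
    if (p->shutting_down) {
        pthread_mutex_unlock(&p->state_lock);
        return 0;
    }
    /* Check and mark happen in the same critical section. */
    p->running = 1;
    pthread_mutex_unlock(&p->state_lock);

    /* ... the real work, which may touch the queue and its lock ... */

    pthread_mutex_lock(&p->state_lock);
    p->runs++;
    p->running = 0;
    pthread_cond_broadcast(&p->idle);
    pthread_mutex_unlock(&p->state_lock);
    return 1;
}

/* Refuse new passes, wait out any pass in progress, then tear down. */
void poller_shutdown(struct poller *p)
{
    pthread_mutex_lock(&p->state_lock);
    p->shutting_down = 1;
    while (p->running)
        pthread_cond_wait(&p->idle, &p->state_lock);
    p->resources_alive = 0;      /* nothing can be using them now */
    pthread_mutex_unlock(&p->state_lock);
}
```

The difference from a bare check of afs_shuttingdown at the top of the task is that a pass which has already got past the check is visible to the shutdown path, which blocks until that pass is done.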

> Since it is fairly small I've included it here.  I apologise if that's
> against list etiquette.

Including it here is fine, but a better approach would have been to send it 
to openafs-bugs and then mention the ticket number here; that way it makes 
its way into the bug-tracking system.

-- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
   Carnegie Mellon University - Pittsburgh, PA