[OpenAFS-devel] Solaris fixes for 1.4.x / AFS_SUN510_ENV

Mike Battersby mib@unimelb.edu.au
Thu, 31 Jan 2008 15:44:16 +1100


Dear AFS Devs,

Thanks for your responses.  Rather than reply individually I'll collect
some followups here.

Jeffrey Hutzelman <jhutz@cmu.edu> wrote:
> Threads that call afs_osi_Invisible are not about to exit; they're about to 
> become long-lived AFS kernel threads.

That's not quite what happens: it's set on at least one thread that
exits immediately.  On reflection, I assume this is because the code
expects the newly created kernel thread to inherit the flag from the
userspace process.

> There's another issue here, which is that AFS's kernel threads probably 
> should not be considered part of the contract under which afsd is started. 

I can't see any indication that the kernel threads are included in
contracts at all, so that probably isn't an issue.  thread_create()
definitely doesn't do it (if it did we'd be screwed creating the
taskq too).

> More importantly, you destroy the task queue 
> and the lock it uses without making sure the task isn't currently running! 

No, it doesn't work that way.  The ddi_taskq_destroy call will block
until the taskq doesn't have any more scheduled tasks.  From
ddi_taskq_destroy(9F):

      The ddi_taskq_destroy() function waits for any scheduled
      tasks to complete, then destroys the taskq. The caller
      should guarantee that no new tasks are scheduled for the
      closing taskq.

Hence the need to stop the task from scheduling more copies of itself;
otherwise the destroy call would block forever.
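
To illustrate the point, here is a rough single-threaded userspace
model of the pattern.  It is not kernel code and does not use the real
ddi_taskq API; every name in it (sim_taskq, sim_dispatch, sim_destroy,
periodic_task) is made up for the sketch.  The "destroy" step simply
runs pending tasks until none remain, with an iteration limit standing
in for "would block forever".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a taskq: a small FIFO of pending tasks. */
#define SIM_QUEUE_MAX 8

typedef void (*sim_task_fn)(void *arg);

struct sim_taskq {
    sim_task_fn fn[SIM_QUEUE_MAX];
    void *arg[SIM_QUEUE_MAX];
    size_t head, count;
};

struct periodic_state {
    struct sim_taskq *tq;
    bool stop;   /* set by the owner before destroying the queue */
    int runs;    /* how many times the task body has executed */
};

/* Queue one task; returns false if the queue is full. */
static bool sim_dispatch(struct sim_taskq *tq, sim_task_fn fn, void *arg)
{
    if (tq->count == SIM_QUEUE_MAX)
        return false;
    size_t tail = (tq->head + tq->count) % SIM_QUEUE_MAX;
    tq->fn[tail] = fn;
    tq->arg[tail] = arg;
    tq->count++;
    return true;
}

/*
 * Models only the "wait for scheduled tasks to complete" part of a
 * destroy call: run pending tasks until the queue is empty.  Returns
 * the number of tasks executed, or -1 once `limit` tasks have run and
 * work is still pending (the model's version of blocking forever).
 */
static int sim_destroy(struct sim_taskq *tq, int limit)
{
    int executed = 0;
    while (tq->count > 0) {
        if (executed == limit)
            return -1;
        sim_task_fn fn = tq->fn[tq->head];
        void *arg = tq->arg[tq->head];
        tq->head = (tq->head + 1) % SIM_QUEUE_MAX;
        tq->count--;
        fn(arg);
        executed++;
    }
    return executed;
}

/* A task that reschedules itself unless it has been told to stop. */
static void periodic_task(void *arg)
{
    struct periodic_state *st = arg;
    st->runs++;
    if (!st->stop)
        sim_dispatch(st->tq, periodic_task, st);
}
```

In this model, calling sim_destroy() while stop is still false never
drains the queue, whereas setting stop first lets the one pending copy
run a final time and the queue empties.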


chas williams - CONTRACTOR <chas@cmf.nrl.navy.mil> wrote:
> when i looked at this problem some time ago, i couldnt find a way to
> drop an association with a contract.

It looks like you might be able to do

     contract_process_exit(pstr->p_ct_process, pstr, 0)

to pretend to any watchers that you exited with exit value 0.  This
probably isn't part of the documented API, but then again neither is
thread_create.  Alternatively, even contract_exit(pstr) might work.
Frankly I wouldn't go near it myself. :)


I also had a couple of private emails asking about reproducing
this problem.  To tell the truth I don't understand why everyone
doesn't see it, but this is the ticket we logged for it last year:

   http://rt.central.org/rt/Ticket/Display.html?id=79232

We are using the precompiled OpenAFS client for Solaris 10,
starting afsd up with "-afsdb" and no other flags.  We've been
able to reproduce it every time on:

   - Solaris 10u4 x86 kernel 120012-14
   - Solaris 10u4 x86 kernel 127112-06
   - Solaris 10u3 x86 kernel 118855-33
   - Solaris 10u4 sparc kernel 120011-14

The symptom, other than the panic, is the presence of obviously
non-pid numbers (huge or negative) in the member processes section
of 'ctstat -va' output under the contract running the afsd processes.

Turning on kernel debugging as recommended by Sun, by putting the
line "set kmem_flags=0xf" into /etc/system, causes the machine to
panic as soon as afsd starts.
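
For reference, 0xf turns on the four low debugging bits of kmem_flags.
The small userspace sketch below decodes such a value.  The KMF_*
names and bit values are reproduced from memory of the Solaris
<sys/kmem_impl.h> header and should be treated as assumptions to check
against your own release; kmem_flags_describe is a hypothetical helper,
not a system function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed bit values; verify against <sys/kmem_impl.h> on your system. */
#define KMF_AUDIT    0x1u  /* record allocation/free transactions */
#define KMF_DEADBEEF 0x2u  /* fill freed buffers with a known pattern */
#define KMF_REDZONE  0x4u  /* guard word after each buffer */
#define KMF_CONTENTS 0x8u  /* log buffer contents on free */

struct kmf_name {
    unsigned bit;
    const char *name;
};

static const struct kmf_name kmf_names[] = {
    { KMF_AUDIT,    "KMF_AUDIT" },
    { KMF_DEADBEEF, "KMF_DEADBEEF" },
    { KMF_REDZONE,  "KMF_REDZONE" },
    { KMF_CONTENTS, "KMF_CONTENTS" },
};

/*
 * Hypothetical helper: write the names of the known bits set in `flags`
 * into buf, separated by '|'.  Returns the number of known bits set,
 * or -1 if buf is too small.
 */
static int kmem_flags_describe(unsigned flags, char *buf, size_t len)
{
    int n = 0;
    size_t used = 0;

    if (len > 0)
        buf[0] = '\0';
    for (size_t i = 0; i < sizeof(kmf_names) / sizeof(kmf_names[0]); i++) {
        if (!(flags & kmf_names[i].bit))
            continue;
        int w = snprintf(buf + used, len - used, "%s%s",
                         n ? "|" : "", kmf_names[i].name);
        if (w < 0 || (size_t)w >= len - used)
            return -1;
        used += (size_t)w;
        n++;
    }
    return n;
}
```

Under those assumed values, passing 0xf to the helper reports all four
flags as set.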

Derrick Brashear has been nice enough to send me a more official
fix for both problems that I'll look into now, so I'll refrain
from sending anything to openafs-bugs for the time being unless
anyone tells me otherwise.

Best wishes,

  - Mike