[OpenAFS-devel] FreeBSD: rxk_ListenerPid not dying fixed(?)

Benjamin Kaduk kaduk@MIT.EDU
Thu, 20 Jan 2011 00:41:58 -0500 (EST)


On Wed, 19 Jan 2011, Derrick Brashear wrote:

> On Wed, Jan 19, 2011 at 10:30 PM, Toby Burress <kurin@delete.org> wrote:
>> I was wondering if I could trouble to have someone double check my
>> diagnostic.
>>
>> So when dismounting /afs, the master branch hangs.  It looks like
>> this is happening because osi_StopListener() in src/rx/FBSD/rx_knet.c
>> calls osi_NetSend() telling the Listener to go away, and then
>> afs_osi_Sleep().
>>
>> Then in src/rx/rx_kcommon.c, rxk_ListenerProc() gets the signal and
>> calls osi_rxWakeup(), allowing osi_StopListener() to return and umount
>> to exit.
>>
>> However, it looks like afs_osi_Sleep() is being called with
>> rxk_ListenerPid as its argument, and osi_rxWakeup() with afs_termState.
>> This causes afs_getevent to return the wrong event to osi_rxWakeup,
>> and as a result wakeup() is never called and umount hangs.
>>
>> Editing rx_kcommon.c to use rxk_ListenerPid instead of afs_termState
>> allows umount to exit cleanly (although afsd isn't able to restart after
>> that;
>
> it's not supposed to. unload the module, then reload it.
>
>> it looks like after the restart afs_getevent is being called with
>> something that just points to zeroed memory).
>>
>> Is this all wrong?  I spend most of my time in pythonland, so kernel
>> debugging is, uh, new to me.
>
> i'd have to look but that sounds correct

The same from me, with the addition that the shutdown code is known to be
buggy in its present state and I haven't had much time to look at it.
There is a lock order reversal involved in the psignal() call, IIRC, which
has not been closely examined for deadlock potential.  (Between the vnode
lock for the ufs vnode of the /afs directory, and the allproc lock, IIRC.
It should be in the jabber logs.)
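
To make the mismatch Toby describes concrete, here is a minimal userspace
sketch of why sleeping on one address and waking on another never wakes the
sleeper.  This is not OpenAFS code: every toy_* name below is hypothetical,
the table is a stand-in for whatever afs_getevent() really does, and nothing
actually blocks.  It only models the key lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (hypothetical, not the OpenAFS implementation): each event
 * is identified purely by the address passed to sleep/wakeup. */
#define TOY_MAX_EVENTS 8

struct toy_event {
    const void *key;   /* address used as the sleep/wakeup channel */
    int waiters;       /* sleepers currently parked on this key */
};

struct toy_event toy_table[TOY_MAX_EVENTS];

/* Find the event for key; optionally create it in a free slot. */
struct toy_event *toy_getevent(const void *key, int create)
{
    struct toy_event *free_slot = NULL;
    int i;

    for (i = 0; i < TOY_MAX_EVENTS; i++) {
        if (toy_table[i].key == key)
            return &toy_table[i];
        if (toy_table[i].key == NULL && free_slot == NULL)
            free_slot = &toy_table[i];
    }
    if (create && free_slot != NULL) {
        free_slot->key = key;
        free_slot->waiters = 0;
        return free_slot;
    }
    return NULL;
}

/* Record a sleeper on key (a real sleep would block here). */
void toy_sleep(const void *key)
{
    struct toy_event *ev = toy_getevent(key, 1);

    assert(ev != NULL);   /* table full would be a bug in this sketch */
    ev->waiters++;
}

/* Wake everything parked on key; returns how many sleepers were woken. */
int toy_wakeup(const void *key)
{
    struct toy_event *ev = toy_getevent(key, 0);
    int woken;

    if (ev == NULL)
        return 0;         /* nobody ever slept on this key */
    woken = ev->waiters;
    ev->waiters = 0;
    return woken;
}

/* Stand-ins for rxk_ListenerPid and afs_termState; only their
 * addresses matter here. */
int toy_listener_pid;
int toy_term_state;
```

After toy_sleep(&toy_listener_pid), calling toy_wakeup(&toy_term_state)
finds no matching event and wakes nobody, while
toy_wakeup(&toy_listener_pid) wakes the one sleeper.  That is the shape of
the hang as diagnosed above, assuming the real lookup is keyed by address in
a comparable way.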

Shutdown sometimes works by chance, when that codepath doesn't need to
run.  If you can reliably get it to (1) use that codepath and (2) shut down
cleanly, please submit to gerrit.
I might also recommend using the rc script found in
http://web.mit.edu/freebsd/openafs/openafs.shar .

-Ben Kaduk