[OpenAFS-devel] volserver hangs possible fix

Ted Anderson TedAnderson@mindspring.com
Tue, 12 Apr 2005 11:11:29 -0400


A scenario and request for clarification below...

On 4/11/2005 15:21, Tom Keiser wrote:
> On Apr 11, 2005 2:35 PM, Horst Birthelmer <horst@riback.net> wrote:
>>On Apr 6, 2005, at 1:40 PM, Tom Keiser wrote:
>>>While auditing the callback code, I noticed some funky condition
>>>variable usage between FsyncCheckLWP and BreakVolumeCallbacksLater.
>>>One issue is BreakVolumeCallbacksLater calls pthread_cond_broadcast on
>>>fsync_cond without holding the associated lock.  Here's a quick patch
>>>for that race:
>>
>>I don't actually think that would be a race condition. It just leads to
>>some undefined scheduling behavior.
>>If you don't have locked the mutex with which the other thread is doing
>>it's wait you may just continue with that thread depending on your
>>schedulers mood ;-)
> 
> It is a race condition.  pthread_cond_wait() is not an atomic
> operation; it relies on the mutex to synchronize a set of operations
> that enqueue the calling thread onto the list of threads blocked on
> the cond var, drop the associated lock, and become blocked by the
> kernel scheduler.  If you call signal/broadcast without holding the
> lock, you can race against these bookkeeping operations in
> pthread_cond_wait().  Thus, a thread in the process of blocking will
> never find out about that signal/broadcast, because cond vars are not
> stateful like semaphores.
> 
>>Locking that mutex means you _will_ definitely continue when you call
>>the unlock and not at some undefined point in time.
>>Here's what the POSIX standard says about this:
> 
> Actually, it cannot guarantee that type of ordering.  Precisely when
> the thread will be scheduled to run is an implementation detail of
> whatever platform you're running, the number of processors, etc.  The
> core distinction is whether you will _always_ wake up on a call to
> signal/broadcast, or only _sometimes_ wake up on a call to
> signal/broadcast.  In the general case, there are algorithms where
> there will be no future signal/broadcast unless the thread wakes up
> every time.  Luckily, the callback algorithm only experiences delays
> in the current implementation.
> 
> 
>><quote>
>>If more than one thread is blocked on a condition variable, the
>>scheduling policy shall determine the order in which threads are
>>unblocked. When each thread unblocked as a result of a
>>pthread_cond_broadcast() or pthread_cond_signal() returns from its call
>>to pthread_cond_wait() or pthread_cond_timedwait(), the thread shall
>>own the mutex with which it called pthread_cond_wait() or
>>pthread_cond_timedwait(). The thread(s) that are unblocked shall
>>contend for the mutex according to the scheduling policy (if
>>applicable), and as if each had called pthread_mutex_lock().
>>
>>The pthread_cond_broadcast() or pthread_cond_signal() functions may be
>>called by a thread whether or not it currently owns the mutex that
>>threads calling pthread_cond_wait() or pthread_cond_timedwait() have
>>associated with the condition variable during their waits; however, if
>>predictable scheduling behavior is required, then that mutex shall be
>>locked by the thread calling pthread_cond_broadcast() or
>>pthread_cond_signal().
>><\quote>
> 
> The paragraph directly following your quote is extremely important:
> 
> <quote>
> The pthread_cond_broadcast() and pthread_cond_signal() functions shall
> have no effect if there are no threads currently blocked on cond.
> </quote>
> 
> And here's an important related paragraph from phtread_cond_wait() in the spec:
> 
> <quote>
> These functions atomically release mutex and cause the calling thread
> to block on the condition variable cond; atomically here means
> "atomically with respect to access by another thread to the mutex and
> then the condition variable". That is, if another thread is able to
> acquire the mutex after the about-to-block thread has released it,
> then a subsequent call to pthread_cond_broadcast() or
> pthread_cond_signal() in that thread shall behave as if it were issued
> after the about-to-block thread has blocked.
> </quote>
> 
> By "predictable scheduling" they are referring to whether there is a
> deterministic relationship between calling signal/broadcast, and a
> thread waking up.  Calling signal/broadcast outside of the lock is
> nondeterministic; it may or may not have any effect.
> 
> When you use signal/broadcast outside of the lock you have no
> guarantee that the thread you're attempting to wake up is busy,
> preparing to block, or blocked.  In the callback code we've been
> discussing the result is that you could unnecessarily wait 5 minutes
> to break callbacks that could just as easily have been broken right
> now, and on busy servers you've just increased your callback break
> backlog as well :-(

Just to make sure I understand this problem, let me try providing a 
specific scenario:
     thread A        thread B
     --------        --------
  1:                 lock M;
  2:                 check queue;
  3:                 cond_wait(C,M);     // queue is empty
  4:                  -> kernel_scheduler {
  5:                         unlock M;
  6: lock M;                 ...
  7: add item;               ...
  8: unlock M;               ...
  9: broadcast on C;         ...         // noop, nobody blocked on C
10:                         put B on C;
11:                         desched B;
12:          DEADLOCK

Is this a problem that can occur with a corrently operating 
implementation of POSIX threads, i.e. an example of correct but 
non-deterministic scheduling?  I've expanded the operation of cond_wait 
as I understand it.  I observe that M correctly protects the queue 
itself.  Treating "broadcast C" and "put B on C" as black boxes, 
however, it doesn't seem to help to move broadcast(C) inside M, i.e. by 
swapping 8 & 9.  On the other hand, tt would seem a simple matter to 
reverse the sequence of 5 & 10, but perhaps that is inconvenient for 
some multi-processor schedulers.

As I understand your comment, you are saying that the broadcast must be 
protected by the mutex or race conditions are possible.  What am I 
missing here?

Ted Anderson