[OpenAFS-devel] volserver hangs possible fix
Ted Anderson
TedAnderson@mindspring.com
Tue, 12 Apr 2005 11:11:29 -0400
A scenario and request for clarification below...
On 4/11/2005 15:21, Tom Keiser wrote:
> On Apr 11, 2005 2:35 PM, Horst Birthelmer <horst@riback.net> wrote:
>>On Apr 6, 2005, at 1:40 PM, Tom Keiser wrote:
>>>While auditing the callback code, I noticed some funky condition
>>>variable usage between FsyncCheckLWP and BreakVolumeCallbacksLater.
>>>One issue is BreakVolumeCallbacksLater calls pthread_cond_broadcast on
>>>fsync_cond without holding the associated lock. Here's a quick patch
>>>for that race:
>>
>>I don't actually think that would be a race condition. It just leads to
>>some undefined scheduling behavior.
>>If you don't have locked the mutex with which the other thread is doing
>>it's wait you may just continue with that thread depending on your
>>schedulers mood ;-)
>
> It is a race condition. pthread_cond_wait() is not an atomic
> operation; it relies on the mutex to synchronize a set of operations
> that enqueue the calling thread onto the list of threads blocked on
> the cond var, drop the associated lock, and become blocked by the
> kernel scheduler. If you call signal/broadcast without holding the
> lock, you can race against these bookkeeping operations in
> pthread_cond_wait(). Thus, a thread in the process of blocking will
> never find out about that signal/broadcast, because cond vars are not
> stateful like semaphores.
>
>>Locking that mutex means you _will_ definitely continue when you call
>>the unlock and not at some undefined point in time.
>>Here's what the POSIX standard says about this:
>
> Actually, it cannot guarantee that type of ordering. Precisely when
> the thread will be scheduled to run is an implementation detail of
> whatever platform you're running, the number of processors, etc. The
> core distinction is whether you will _always_ wake up on a call to
> signal/broadcast, or only _sometimes_ wake up on a call to
> signal/broadcast. In the general case, there are algorithms where
> there will be no future signal/broadcast unless the thread wakes up
> every time. Luckily, the callback algorithm only experiences delays
> in the current implementation.
>
>
>><quote>
>>If more than one thread is blocked on a condition variable, the
>>scheduling policy shall determine the order in which threads are
>>unblocked. When each thread unblocked as a result of a
>>pthread_cond_broadcast() or pthread_cond_signal() returns from its call
>>to pthread_cond_wait() or pthread_cond_timedwait(), the thread shall
>>own the mutex with which it called pthread_cond_wait() or
>>pthread_cond_timedwait(). The thread(s) that are unblocked shall
>>contend for the mutex according to the scheduling policy (if
>>applicable), and as if each had called pthread_mutex_lock().
>>
>>The pthread_cond_broadcast() or pthread_cond_signal() functions may be
>>called by a thread whether or not it currently owns the mutex that
>>threads calling pthread_cond_wait() or pthread_cond_timedwait() have
>>associated with the condition variable during their waits; however, if
>>predictable scheduling behavior is required, then that mutex shall be
>>locked by the thread calling pthread_cond_broadcast() or
>>pthread_cond_signal().
>><\quote>
>
> The paragraph directly following your quote is extremely important:
>
> <quote>
> The pthread_cond_broadcast() and pthread_cond_signal() functions shall
> have no effect if there are no threads currently blocked on cond.
> </quote>
>
> And here's an important related paragraph from phtread_cond_wait() in the spec:
>
> <quote>
> These functions atomically release mutex and cause the calling thread
> to block on the condition variable cond; atomically here means
> "atomically with respect to access by another thread to the mutex and
> then the condition variable". That is, if another thread is able to
> acquire the mutex after the about-to-block thread has released it,
> then a subsequent call to pthread_cond_broadcast() or
> pthread_cond_signal() in that thread shall behave as if it were issued
> after the about-to-block thread has blocked.
> </quote>
>
> By "predictable scheduling" they are referring to whether there is a
> deterministic relationship between calling signal/broadcast, and a
> thread waking up. Calling signal/broadcast outside of the lock is
> nondeterministic; it may or may not have any effect.
>
> When you use signal/broadcast outside of the lock you have no
> guarantee that the thread you're attempting to wake up is busy,
> preparing to block, or blocked. In the callback code we've been
> discussing the result is that you could unnecessarily wait 5 minutes
> to break callbacks that could just as easily have been broken right
> now, and on busy servers you've just increased your callback break
> backlog as well :-(
Just to make sure I understand this problem, let me try providing a
specific scenario:
thread A thread B
-------- --------
1: lock M;
2: check queue;
3: cond_wait(C,M); // queue is empty
4: -> kernel_scheduler {
5: unlock M;
6: lock M; ...
7: add item; ...
8: unlock M; ...
9: broadcast on C; ... // noop, nobody blocked on C
10: put B on C;
11: desched B;
12: DEADLOCK
Is this a problem that can occur with a corrently operating
implementation of POSIX threads, i.e. an example of correct but
non-deterministic scheduling? I've expanded the operation of cond_wait
as I understand it. I observe that M correctly protects the queue
itself. Treating "broadcast C" and "put B on C" as black boxes,
however, it doesn't seem to help to move broadcast(C) inside M, i.e. by
swapping 8 & 9. On the other hand, tt would seem a simple matter to
reverse the sequence of 5 & 10, but perhaps that is inconvenient for
some multi-processor schedulers.
As I understand your comment, you are saying that the broadcast must be
protected by the mutex or race conditions are possible. What am I
missing here?
Ted Anderson