[OpenAFS-devel] volserver hangs possible fix
Ted Anderson
TedAnderson@mindspring.com
Mon, 18 Apr 2005 10:30:00 -0400
On 4/18/2005 08:58, Horst Birthelmer wrote:
> The problem isn't whether cond_wait is atomic. It's what happens to the
> algorithm if it's not.
> Imagine the scenario where it's not atomic (and this was the part where
> I agreed with Tom) and you have the mutex locked in the cond_wait call,
> but the thread isn't in the queue yet.
> Now this thread gets interrupted by whatever event and you perform a
> ...cond_broadcast(). All the threads are woken up except that one not in
> there yet. You have a thread waiting a cond_var you weren't aware of...
> actually you have that thread waiting on that condition variable while
> you already performed a broadcast. That's pretty weird for the algorithm.
Okay, but I don't agree that this situation would cause a problem in
correctly written CV code. As long as the mutex is still held by the
trying-to-sleep thread when the broadcast() occurs, no other thread can
have changed the state it tested, so its *reason* for sleeping will
still be true and hence another thread will eventually come along and
wake it up.
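
To make "correctly written CV code" concrete, here is a minimal sketch
of the producer/consumer pattern I have in mind, using POSIX threads.
All of the names (lock, nonempty, items, consumer, produce, run_demo)
are made up for illustration and are not taken from the OpenAFS callback
code. It assumes the predicate is only ever read or changed with the
mutex held, and it deliberately calls broadcast() outside the mutex.

```c
#include <pthread.h>

/* Hypothetical shared state; names are illustrative, not from OpenAFS. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;
static int items = 0;               /* the predicate, guarded by lock */

/* Consumer: sleeps only while its reason for sleeping (items == 0)
 * holds, and re-tests that reason every time it wakes up. */
static void *consumer(void *arg)
{
    int *taken = arg;

    pthread_mutex_lock(&lock);
    while (items == 0)
        pthread_cond_wait(&nonempty, &lock);
    items--;
    (*taken)++;                     /* updated under lock */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Producer: changes the predicate under the mutex, then broadcasts
 * after dropping it. */
static void produce(void)
{
    pthread_mutex_lock(&lock);
    items++;
    pthread_mutex_unlock(&lock);
    pthread_cond_broadcast(&nonempty);
}

/* Start n consumers (at most 16), produce n items, wait for all the
 * consumers, and return how many items were consumed. */
static int run_demo(int n)
{
    pthread_t tid[16];
    int taken = 0;
    int i;

    if (n > 16)
        n = 16;
    for (i = 0; i < n; i++)
        pthread_create(&tid[i], NULL, consumer, &taken);
    for (i = 0; i < n; i++)
        produce();
    for (i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    return taken;
}
```

A consumer that has not yet locked the mutex when produce() runs will
find items > 0 and never sleep; one that is already asleep is woken by
the broadcast. The while loop is what makes an extra or early wakeup
harmless.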
However, I am concerned that you introduce this scenario with "where
it's not atomic". Are there cases where cond_wait() is not atomic and it
is necessary to write code to take that into account? So my question is
still: what are these correct but not atomic cond_wait() implementations
like, and how would putting the broadcast() under the protection of the
mutex help?
I should also say that I have not looked at the particular callback
code at issue here. Perhaps it is using broadcast() in some unusual
fashion (i.e. not using the producer/consumer model) that affects this
discussion.
Ted