[OpenAFS-devel] volserver hangs possible fix

Ted Anderson TedAnderson@mindspring.com
Mon, 18 Apr 2005 10:30:00 -0400


On 4/18/2005 08:58, Horst Birthelmer wrote:
> The problem isn't whether cond_wait is atomic. It's what happens to the 
> algorithm if it's not.
> Imagine the scenario where it's not atomic (and this was the part where 
> I agreed with Tom) and you have the mutex locked in the cond_wait call, 
> but the thread isn't in the queue yet.
> Now this thread gets interrupted by whatever event and you perform a 
> ...cond_broadcast(). All the threads are woken up except that one not in 
> there yet. You have a thread waiting on a cond_var you weren't aware of... 
> actually you have that thread waiting on that condition variable while 
> you already performed a broadcast. That's pretty weird for the algorithm.

Okay, but I don't agree that this situation would generate a problem in 
correctly written CV code.  As long as the mutex is still held by the 
trying-to-sleep thread when the broadcast() occurs, then its *reason* 
for sleeping will still be true and hence there will eventually be 
another thread to come along and wake it up.
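
A minimal sketch of what "correctly written CV code" means here, assuming
POSIX threads. The names (lock, cv, items, consumer, produce, run_demo) are
made up for illustration and are not taken from the volserver or callback
code. The waiter rechecks its reason for sleeping in a loop while holding the
mutex, and the producer changes that reason only under the same mutex:

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical shared state, invented for this sketch. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
static int items = 0;   /* the waiter's "reason for sleeping" is items == 0 */

/* Consumer: sleeps only while its reason for sleeping is still true. */
static void *consumer(void *arg)
{
    int *got = arg;

    pthread_mutex_lock(&lock);
    while (items == 0)                  /* recheck after every wakeup */
        pthread_cond_wait(&cv, &lock);  /* drops lock while blocked,
                                           reacquires it before returning */
    items--;
    *got = 1;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Producer: changes the predicate under the mutex, then wakes waiters. */
static void produce(void)
{
    pthread_mutex_lock(&lock);
    items++;
    pthread_mutex_unlock(&lock);
    pthread_cond_broadcast(&cv);        /* issued outside the mutex here */
}

/* Runs one consumer and one producer; returns 1 if the item was consumed. */
int run_demo(void)
{
    pthread_t t;
    int got = 0;
    int rc = pthread_create(&t, NULL, consumer, &got);

    assert(rc == 0);
    produce();
    pthread_join(t, NULL);
    return got;
}
```

If the consumer holds the mutex and has found items == 0, no producer can
have made progress yet, so a producer that still has to call broadcast() is
guaranteed to come along later. If the producer ran first, the consumer sees
items != 0 and never sleeps at all.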

However, I am concerned that you introduce this scenario with "where 
it's not atomic".  Are there cases where cond_wait() is not atomic and 
it is necessary to write code to take that into account?  So my question 
is still what these correct but not atomic cond_wait() implementations 
are like, and how putting the broadcast() under the protection of the 
mutex would help.
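
For reference, the two placements of broadcast() under discussion can be
sketched as follows, again assuming POSIX threads. All names (m, c, ready,
signal_outside, signal_inside, wait_ready) are hypothetical and do not come
from the code at issue. POSIX allows the broadcast in either position and
notes only that holding the mutex is needed if predictable scheduling
behavior is required:

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical shared state, invented for this sketch. */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static int ready = 0;

/* Variant A: change the predicate under the mutex, broadcast after unlock. */
void signal_outside(void)
{
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_mutex_unlock(&m);
    pthread_cond_broadcast(&c);
}

/* Variant B: broadcast while still holding the mutex. */
void signal_inside(void)
{
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_broadcast(&c);
    pthread_mutex_unlock(&m);
}

/* Waiter used with either variant: predicate loop under the mutex. */
void wait_ready(void)
{
    pthread_mutex_lock(&m);
    while (!ready) {
        int rc = pthread_cond_wait(&c, &m);
        assert(rc == 0);
    }
    pthread_mutex_unlock(&m);
}
```

In both variants the predicate itself is only modified with the mutex held;
the variants differ solely in whether the wakeup is issued before or after
the unlock.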

I should also say that I have not looked at the particular callback 
code at issue here.  Perhaps it is using broadcast() in some unusual 
fashion (i.e. not using the producer/consumer model) that affects this 
discussion.

Ted