[OpenAFS-devel] volserver hangs possible fix

Horst Birthelmer horst@riback.net
Mon, 18 Apr 2005 17:01:06 +0200


On Apr 18, 2005, at 4:30 PM, Ted Anderson wrote:

> On 4/18/2005 08:58, Horst Birthelmer wrote:
>> The problem isn't whether cond_wait is atomic. It's what happens to 
>> the algorithm if it's not.
>> Imagine the scenario where it's not atomic (and this was the part 
>> where I agreed with Tom) and you have the mutex locked in the 
>> cond_wait call, but the thread isn't in the queue yet.
>> Now this thread gets interrupted by whatever event and you perform a
>> ...cond_broadcast(). All the threads are woken up except that one not
>> in there yet. You have a thread waiting on a cond_var you weren't aware
>> of... actually you have that thread waiting on that condition 
>> variable while you already performed a broadcast. That's pretty weird 
>> for the algorithm.
>
> Okay, but I don't agree that this situation would generate a problem 
> in correctly written CV code.  As long as the mutex is still held by 
> the trying-to-sleep thread when the broadcast() occurs, then its 
> *reason* for sleeping will still be true and hence there will 
> eventually be another thread to come along and wake it up.

Right, that's what I meant by "weird for your algorithm". If you 
designed it the wrong way it'll hang here.
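
For illustration, Ted's point about the reason for sleeping can be
sketched as below. This is only a minimal sketch, assuming POSIX
threads; the names lock, cv, work_available, consumer and run_once are
made up for the example and do not come from the OpenAFS sources. The
consumer re-checks its predicate in a loop, so a broadcast that arrives
while the predicate is unchanged cannot let it leave the loop, and a
later producer still wakes it.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical names for illustration only; not from the OpenAFS sources. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
static int work_available = 0;  /* reason for sleeping: work_available == 0 */

/* Consumer: sleeps only while the predicate says there is nothing to do. */
static void *consumer(void *arg)
{
    int *got = arg;

    pthread_mutex_lock(&lock);
    while (!work_available)             /* re-check after every wakeup */
        pthread_cond_wait(&cv, &lock);  /* mutex is released while blocked */
    *got = work_available;              /* mutex is held again here */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Starts one consumer, broadcasts once with nothing changed, then
   produces work. Returns what the consumer saw, or -1 on error. */
static int run_once(void)
{
    pthread_t t;
    int got = 0;

    work_available = 0;
    if (pthread_create(&t, NULL, consumer, &got) != 0)
        return -1;

    /* Broadcast with the predicate unchanged: whether or not the
       consumer is already waiting, it cannot get past its loop. */
    pthread_cond_broadcast(&cv);

    /* The thread that "comes along" later: change state, then wake. */
    pthread_mutex_lock(&lock);
    work_available = 1;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    return got;
}
```

Calling run_once() should return 1 regardless of whether the first
broadcast happens before or after the consumer has gone to sleep.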

>
> However, I am concerned that you introduce this scenario with "where 
> it's not atomic".  Are there cases where cond_wait() is not atomic and
> it is necessary to write code to take that into account?  So my question
> is still, what these correct but not atomic cond_wait() 
> implementations are like and how putting the broadcast() into the 
> protection of the mutex would help.
>

Most implementations don't have an atomic cond_wait, since POSIX
doesn't mandate it ;-)
You simply have to treat it as non-atomic, since there's no guarantee
that you can rely on an atomic implementation.
It's no "problem" at all, just one aspect you have to keep in mind.

I introduced that with "where it's not atomic" because that's the very 
assumption in this discussion. None of the arguments would be true if 
it was.

By definition, a cond_wait is used for waiting on some event inside a
critical section. This means you entered the cond_wait with the mutex
held. The cond_wait call then enqueues the thread into the queue of
threads waiting on this condition variable and unlocks the mutex. If
you make the broadcast call under the protection of the same mutex, you
will always have a "settled" environment. That is, you can be sure no
thread is inside the critical section, since otherwise you wouldn't get
to do the broadcast (you would still be waiting in mutex_lock), and
during the broadcast no thread can enter the critical section. I can
only repeat myself: it's safer, but from my point of view not a fix for
the problem we had.
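
A rough sketch of that "settled" environment, assuming POSIX threads.
NWAITERS, m, cv, event_happened, woken and the three functions are
hypothetical names chosen for the example; this is not the callback
code in question. The only point it shows is that signal_event() takes
the same mutex the waiters use before it broadcasts.

```c
#include <assert.h>
#include <pthread.h>

#define NWAITERS 4   /* arbitrary number of waiting threads for the demo */

static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int event_happened = 0;
static int woken = 0;

static void *wait_for_event(void *unused)
{
    (void)unused;
    pthread_mutex_lock(&m);           /* enter the critical section */
    while (!event_happened)
        pthread_cond_wait(&cv, &m);   /* wait; mutex is dropped meanwhile */
    woken++;                          /* mutex is held again here */
    pthread_mutex_unlock(&m);
    return NULL;
}

static void signal_event(void)
{
    pthread_mutex_lock(&m);           /* blocks while a thread is inside */
    event_happened = 1;
    pthread_cond_broadcast(&cv);      /* nobody can enter while we hold m */
    pthread_mutex_unlock(&m);
}

/* Starts the waiters, signals once, and returns how many woke up. */
static int demo(void)
{
    pthread_t t[NWAITERS];
    int n, i;

    event_happened = 0;
    woken = 0;
    for (n = 0; n < NWAITERS; n++)
        if (pthread_create(&t[n], NULL, wait_for_event, NULL) != 0)
            break;                    /* join only what was started */
    signal_event();
    for (i = 0; i < n; i++)
        pthread_join(t[i], NULL);
    return woken;
}
```

If every thread could be created, demo() should return NWAITERS: each
waiter either sees event_happened already set or is woken by the
broadcast.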


> I should also say that I have not looked at the particular callback
> code at issue here.  Perhaps it is using broadcast() in some unusual 
> fashion (i.e. not using the producer/consumer model) that affects this 
> discussion.
>

Well, it did not hold the mutex during the broadcast, so for some
people there is a theoretical possibility of a problem.

Horst