[OpenAFS-devel] volserver hangs possible fix

Horst Birthelmer horst@riback.net
Mon, 18 Apr 2005 23:55:01 +0200


On Apr 18, 2005, at 11:30 PM, Jeffrey Hutzelman wrote:

>
> On Monday, April 18, 2005 10:04:45 PM +0200 Horst Birthelmer 
> <horst@riback.net> wrote:
>
>> That's one passage I didn't post in my last postings, which actually
>> started the fire... ;-)
>> I still don't see the confusion. It's sort of what I said in the first
>> place.
>> You still can hold the mutex but miss the broadcast and wait forever
>> there ...
>
> Well, one bit of confusion is that people keep talking about how it 
> doesn't work if pthread_cond_wait is not atomic.  That's not a 
> problem, because pthread_cond_wait is NEVER not atomic.  It is ALWAYS 
> atomic.

Well, I just adopted that idea to show that not even that would be a 
race condition, and somehow it happens every time: I get held 
responsible for stuff I didn't mean to say or do ;-)

OK, I reread my postings. Maybe I wasn't clear enough in a few places, 
but I wouldn't call that being confused :-)

>
>
>> That's one point the other is, you can be in the critical section with
>> one thread and broadcasting the others,
>> which as I pointed out for I have no idea how many times now, is 
>> _not_ a
>> race condition.
>
> Sure you can, but never in a situation where it matters.
>
> Suppose again that thread A is the broadcasting thread, and thread B 
> is the waiter thread that we are interested in.
>
> Now, in the example under discussion, thread A looks like this:
>
> {
>  ...
>  acquire mutex
>  update queue
>  release mutex
>  cond_broadcast
>  ...
> }
>
> And thread B looks like this:
>
> acquire mutex
> while (1) {
>  while (queue is not empty) {
>    pop work from queue
>    release mutex
>    do work
>    acquire mutex
>  }
>  cond_wait
> }
>
>
> Note that thread B must release the mutex to do work, but calls 
> cond_wait only if it has observed the queue to be empty since the 
> mutex was last acquired.  So, I see about three possible cases:
>
> Case I - Everything happens in the expected order:
>
>  Thread A               Thread B
>                         acquire mutex
>                         queue is empty
>                         cond_wait -> SLEEP (with release)
>  acquire mutex
>  add item N
>  release mutex
>  cond_broadcast         WAKEUP (with acquire)
>                         queue is not empty
>                         pop item N from queue
>                         release mutex
>                         process item N
>                         acquire mutex
>                         queue is empty
>                         cond_wait -> SLEEP (with release)
>
>
> Case II - Not really a deadlock
>
>  acquire mutex
>  add item N
>  release mutex
>                         acquire mutex
>                         queue is not empty
>                         pop item N from queue
>                         release mutex
>                         process item N
>                         acquire mutex
>                         queue is empty
>  cond_broadcast         NO EFFECT
>                         cond_wait -> SLEEP (with release)
>
>
> Case III - Also OK
>
>  acquire mutex
>  add item N
>  release mutex
>                         acquire mutex
>                         queue is not empty
>                         pop item N from queue
>                         release mutex
>                         process item N
>                         acquire mutex
>                         queue is empty
>                         cond_wait -> SLEEP (with release)
>  cond_broadcast         WAKEUP (with acquire)
>                         queue is empty
>                         cond_wait -> SLEEP (with release)
>
>
> Note that it is possible (as in case II) for the cond_broadcast to 
> have no effect on thread B, because it is not in cond_wait yet.  But 
> it is not possible for this to result in item N not being processed, 
> because thread B will never call cond_wait unless it has observed an 
> empty queue since last acquiring the mutex.
>
>
>> There still is this theoretical possibility where the thread will be
>> waiting forever on the cv, but let's put that aside.
>
> Not while there's work in the queue; see above.
>

So you agree with my initial posting that we didn't really have a 
problem here and that this is not the cause of those volserver hangs?
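
For what it's worth, the thread A / thread B pattern quoted above can be 
sketched in C with pthreads roughly as follows. This is only an 
illustration and not the actual volserver code: the struct, the counters 
standing in for the queue, the shutdown flag (added so the demo can 
terminate) and run_demo() are all made up for the example.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical shared state; none of these names come from volserver. */
struct work_queue {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    int pending;    /* items added but not yet popped */
    int processed;  /* items the worker has finished */
    int shutdown;   /* demo-only flag so the worker can exit */
    int to_add;     /* how many items thread A will add */
};

/* Thread A: update the queue under the mutex, broadcast after release. */
static void *producer(void *arg)
{
    struct work_queue *q = arg;
    for (int i = 0; i < q->to_add; i++) {
        pthread_mutex_lock(&q->lock);
        q->pending++;                    /* "add item N" */
        pthread_mutex_unlock(&q->lock);
        pthread_cond_broadcast(&q->cv);  /* may find nobody waiting (case II) */
    }
    pthread_mutex_lock(&q->lock);
    q->shutdown = 1;
    pthread_mutex_unlock(&q->lock);
    pthread_cond_broadcast(&q->cv);
    return NULL;
}

/* Thread B: only waits after seeing an empty queue with the mutex held. */
static void *worker(void *arg)
{
    struct work_queue *q = arg;
    pthread_mutex_lock(&q->lock);
    for (;;) {
        while (q->pending > 0) {
            q->pending--;                    /* "pop work from queue" */
            pthread_mutex_unlock(&q->lock);
            /* "do work" would happen here, without the mutex */
            pthread_mutex_lock(&q->lock);
            q->processed++;
        }
        if (q->shutdown)
            break;
        /* Atomically releases the mutex and sleeps; reacquires on wakeup. */
        pthread_cond_wait(&q->cv, &q->lock);
    }
    pthread_mutex_unlock(&q->lock);
    return NULL;
}

/* Runs one producer and one worker; returns the number of items processed. */
int run_demo(int n_items)
{
    struct work_queue q;
    pthread_t a, b;
    int rc;

    rc = pthread_mutex_init(&q.lock, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&q.cv, NULL);
    assert(rc == 0);
    q.pending = q.processed = q.shutdown = 0;
    q.to_add = n_items;

    rc = pthread_create(&b, NULL, worker, &q);
    assert(rc == 0);
    rc = pthread_create(&a, NULL, producer, &q);
    assert(rc == 0);
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    pthread_cond_destroy(&q.cv);
    pthread_mutex_destroy(&q.lock);
    return q.processed;
}
```

Calling run_demo(1000) from a small main() should always give back 1000, 
however the two threads interleave, because the worker never reaches 
pthread_cond_wait() without having seen the queue empty while holding 
the mutex.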

Horst