[OpenAFS-devel] volserver hangs possible fix

Jeffrey Hutzelman jhutz@cmu.edu
Mon, 18 Apr 2005 17:30:45 -0400


On Monday, April 18, 2005 10:04:45 PM +0200 Horst Birthelmer 
<horst@riback.net> wrote:

> That's one passage I didn't post in my last postings, which actually
> started the fire... ;-)
> I still don't see the confusion. It's sort of what I said in the first
> place.
> You still can hold the mutex but miss the broadcast and wait forever
> there ...

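Well, one bit of confusion is that people keep talking about how it doesn't
work if pthread_cond_wait is not atomic.  That's not a problem, because
pthread_cond_wait is ALWAYS atomic: it releases the mutex and starts
waiting on the condition variable as a single operation.  A thread that
acquires the mutex after the waiter gave it up, and then broadcasts, is
guaranteed to find the waiter already waiting.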


> That's one point the other is, you can be in the critical section with
> one thread and broadcasting the others,
> which as I pointed out for I have no idea how many times now, is _not_ a
> race condition.

Sure you can, but never in a situation where it matters.

Suppose again that thread A is the broadcasting thread, and thread B is the 
waiter thread that we are interested in.

Now, in the example under discussion, thread A looks like this:

{
  ...
  acquire mutex
  update queue
  release mutex
  cond_broadcast
  ...
}

And thread B looks like this:

acquire mutex
while (1) {
  while (queue is not empty) {
    pop work from queue
    release mutex
    do work
    acquire mutex
  }
  cond_wait
}


Note that thread B must release the mutex to do work, but calls cond_wait 
only if it has observed the queue to be empty since the mutex was last 
acquired.  So, I see about three possible cases:

Case I - Everything happens in the expected order:

  Thread A               Thread B
                         acquire mutex
                         queue is empty
                         cond_wait -> SLEEP (with release)
  acquire mutex
  add item N
  release mutex
  cond_broadcast         WAKEUP (with acquire)
                         queue is not empty
                         pop item N from queue
                         release mutex
                         process item N
                         acquire mutex
                         queue is empty
                         cond_wait -> SLEEP (with release)


Case II - Not really a deadlock

  Thread A               Thread B
  acquire mutex
  add item N
  release mutex
                         acquire mutex
                         queue is not empty
                         pop item N from queue
                         release mutex
                         process item N
                         acquire mutex
                         queue is empty
  cond_broadcast         NO EFFECT
                         cond_wait -> SLEEP (with release)


Case III - Also OK

  Thread A               Thread B
  acquire mutex
  add item N
  release mutex
                         acquire mutex
                         queue is not empty
                         pop item N from queue
                         release mutex
                         process item N
                         acquire mutex
                         queue is empty
                         cond_wait -> SLEEP (with release)
  cond_broadcast         WAKEUP (with acquire)
                         queue is empty
                         cond_wait -> SLEEP (with release)


Note that it is possible (as in case II) for the cond_broadcast to have no 
effect on thread B, because it is not in cond_wait yet.  But it is not 
possible for this to result in item N not being processed, because thread B 
will never call cond_wait unless it has observed an empty queue since last 
acquiring the mutex.
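
For anyone who wants to poke at this, here is a minimal C sketch of the
two threads described above.  It is an illustration only, not the
volserver code: the names work_queue, enqueue, worker and run_demo are
made up, the "queue" is just a counter, and the shutdown flag is an
addition so that thread B can exit and the demo can finish.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the real queue; names are illustrative. */
struct work_queue {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    int pending;    /* items waiting in the queue */
    int processed;  /* items thread B has finished */
    int shutdown;   /* demo-only flag so thread B can exit */
};

/* Thread A: update the queue under the mutex, broadcast after release. */
static void enqueue(struct work_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->pending++;
    pthread_mutex_unlock(&q->lock);
    pthread_cond_broadcast(&q->cv);
}

/* Thread B: only calls cond_wait after seeing an empty queue while
 * holding the mutex, so an unnoticed broadcast cannot strand an item. */
static void *worker(void *arg)
{
    struct work_queue *q = arg;

    pthread_mutex_lock(&q->lock);
    for (;;) {
        while (q->pending > 0) {
            q->pending--;                      /* pop work from queue */
            pthread_mutex_unlock(&q->lock);
            /* do work here, without the mutex held */
            pthread_mutex_lock(&q->lock);
            q->processed++;
        }
        if (q->shutdown)
            break;
        pthread_cond_wait(&q->cv, &q->lock);   /* atomic release + sleep */
    }
    pthread_mutex_unlock(&q->lock);
    return NULL;
}

/* Queue n_items from the calling thread, then stop the worker and
 * return how many items it processed. */
int run_demo(int n_items)
{
    struct work_queue q;
    pthread_t b;
    int i, rc;

    rc = pthread_mutex_init(&q.lock, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&q.cv, NULL);
    assert(rc == 0);
    q.pending = q.processed = q.shutdown = 0;

    rc = pthread_create(&b, NULL, worker, &q);
    assert(rc == 0);
    for (i = 0; i < n_items; i++)
        enqueue(&q);

    /* Same discipline as enqueue: change state under the mutex,
     * broadcast after releasing it. */
    pthread_mutex_lock(&q.lock);
    q.shutdown = 1;
    pthread_mutex_unlock(&q.lock);
    pthread_cond_broadcast(&q.cv);

    rc = pthread_join(b, NULL);
    assert(rc == 0);
    pthread_cond_destroy(&q.cv);
    pthread_mutex_destroy(&q.lock);
    return q.processed;
}
```

However the scheduler interleaves the two threads, run_demo(n) should
return n.  Some of the broadcasts will typically land while thread B is
busy or has not started yet (case II), and those have no effect, but the
items they announced still get picked up the next time B checks the queue
under the mutex.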


> There still is this theoretical possibility where the thread will be
> waiting forever on the cv, but let's put that aside.

Not while there's work in the queue; see above.

-- Jeff