[OpenAFS-devel] volserver hangs possible fix
Horst Birthelmer
horst@riback.net
Mon, 18 Apr 2005 23:55:01 +0200
On Apr 18, 2005, at 11:30 PM, Jeffrey Hutzelman wrote:
>
> On Monday, April 18, 2005 10:04:45 PM +0200 Horst Birthelmer
> <horst@riback.net> wrote:
>
>> That's one passage I didn't post in my last postings, which actually
>> started the fire... ;-)
>> I still don't see the confusion. It's sort of what I said in the first
>> place.
>> You still can hold the mutex but miss the broadcast and wait forever
>> there ...
>
> Well, one bit of confusion is that people keep talking about how it
> doesn't work if pthread_cond_wait is not atomic. That's not a
> problem, because pthread_cond_wait is NEVER not atomic. It is ALWAYS
> atomic.
Well, I just adopted that idea to show that not even that would be a
race condition, and somehow it happens every time: I get held
responsible for stuff I didn't mean to say or do ;-)
OK, I reread my postings. Maybe I wasn't clear enough in a few places,
but I wouldn't call that being confused :-)
>
>
>> That's one point. The other is that you can be in the critical
>> section with one thread while broadcasting to the others, which, as
>> I have pointed out I have no idea how many times now, is _not_ a
>> race condition.
>
> Sure you can, but never in a situation where it matters.
>
> Suppose again that thread A is the broadcasting thread, and thread B
> is the waiter thread that we are interested in.
>
> Now, in the example under discussion, thread A looks like this:
>
> {
>     ...
>     acquire mutex
>     update queue
>     release mutex
>     cond_broadcast
>     ...
> }
>
> And thread B looks like this:
>
> acquire mutex
> while (1) {
>     while (queue is not empty) {
>         pop work from queue
>         release mutex
>         do work
>         acquire mutex
>     }
>     cond_wait
> }
>
>
> Note that thread B must release the mutex to do work, but calls
> cond_wait only if it has observed the queue to be empty since the
> mutex was last acquired. So, I see about three possible cases:
>
> Case I - Everything happens in the expected order:
>
> Thread A             Thread B
>                      acquire mutex
>                      queue is empty
>                      cond_wait -> SLEEP (with release)
> acquire mutex
> add item N
> release mutex
> cond_broadcast       WAKEUP (with acquire)
>                      queue is not empty
>                      pop item N from queue
>                      release mutex
>                      process item N
>                      acquire mutex
>                      queue is empty
>                      cond_wait -> SLEEP (with release)
>
>
> Case II - Not really a deadlock
>
> Thread A             Thread B
> acquire mutex
> add item N
> release mutex
>                      acquire mutex
>                      queue is not empty
>                      pop item N from queue
>                      release mutex
>                      process item N
>                      acquire mutex
>                      queue is empty
> cond_broadcast       NO EFFECT
>                      cond_wait -> SLEEP (with release)
>
>
> Case III - Also OK
>
> Thread A             Thread B
> acquire mutex
> add item N
> release mutex
>                      acquire mutex
>                      queue is not empty
>                      pop item N from queue
>                      release mutex
>                      process item N
>                      acquire mutex
>                      queue is empty
>                      cond_wait -> SLEEP (with release)
> cond_broadcast       WAKEUP (with acquire)
>                      queue is empty
>                      cond_wait -> SLEEP (with release)
>
>
> Note that it is possible (as in case II) for the cond_broadcast to
> have no effect on thread B, because it is not in cond_wait yet. But
> it is not possible for this to result in item N not being processed,
> because thread B will never call cond_wait unless it has observed an
> empty queue since last acquiring the mutex.
>
>
>> There still is this theoretical possibility where the thread will be
>> waiting forever on the cv, but let's put that aside.
>
> Not while there's work in the queue; see above.
>
So you agree with my initial posting that we didn't really have a
problem here and that this is not the cause of those volserver hangs.
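
For reference, here is a minimal sketch in C with POSIX threads of the
thread A / thread B pattern quoted above. It is not the volserver code.
All names (queued, done, processed, producer, worker, run_demo, N_ITEMS)
are made up for illustration, the queue is reduced to a counter, and the
done flag is my own addition so that the demo can terminate. The point it
tries to show is that the worker only calls pthread_cond_wait after it has
seen an empty queue while holding the mutex, so a broadcast that arrives
"too early" cannot cause an item to be missed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define N_ITEMS 1000            /* hypothetical amount of work */

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static int queued;              /* stand-in for the queue: items waiting */
static int done;                /* assumption: lets the demo shut down */
static int processed;           /* touched only by the worker thread */

/* Thread A: update the queue under the mutex, broadcast after release. */
static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < N_ITEMS; i++) {
        pthread_mutex_lock(&mtx);
        queued++;                       /* add item N */
        pthread_mutex_unlock(&mtx);
        pthread_cond_broadcast(&cv);    /* may find nobody waiting (case II) */
    }
    pthread_mutex_lock(&mtx);
    done = 1;
    pthread_mutex_unlock(&mtx);
    pthread_cond_broadcast(&cv);
    return NULL;
}

/* Thread B: drain the queue, wait only after observing it empty. */
static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mtx);
    for (;;) {
        while (queued > 0) {
            queued--;                   /* pop work from queue */
            pthread_mutex_unlock(&mtx);
            processed++;                /* do work, mutex not held */
            pthread_mutex_lock(&mtx);
        }
        if (done)                       /* queue empty and no more to come */
            break;
        /* Releases mtx and sleeps atomically; re-acquires before returning. */
        pthread_cond_wait(&cv, &mtx);
    }
    pthread_mutex_unlock(&mtx);
    return NULL;
}

/* Runs one producer and one worker; returns the number of items processed. */
int run_demo(void)
{
    pthread_t a, b;
    queued = done = processed = 0;
    pthread_create(&b, NULL, worker, NULL);
    pthread_create(&a, NULL, producer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return processed;
}
```

Calling run_demo() from a main() and printing the result (build with
-pthread) should always report N_ITEMS, whichever of the three
interleavings the scheduler happens to produce.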
Horst