[OpenAFS-devel] volserver hangs possible fix
Jeffrey Hutzelman
jhutz@cmu.edu
Tue, 12 Apr 2005 12:04:23 -0400
On Tuesday, April 12, 2005 11:11:29 AM -0400 Ted Anderson
<TedAnderson@mindspring.com> wrote:
> Just to make sure I understand this problem, let me try providing a
> specific scenario:
> thread A thread B
> -------- --------
> 1: lock M;
> 2: check queue;
> 3: cond_wait(C,M); // queue is empty
> 4: -> kernel_scheduler {
> 5: unlock M;
> 6: lock M; ...
> 7: add item; ...
> 8: unlock M; ...
> 9: broadcast on C; ... // noop, nobody blocked on C
> 10: put B on C;
> 11: desched B;
> 12: DEADLOCK
>
> Is this a problem that can occur with a corrently operating
> implementation of POSIX threads, i.e. an example of correct but
> non-deterministic scheduling? I've expanded the operation of cond_wait
> as I understand it. I observe that M correctly protects the queue
> itself. Treating "broadcast C" and "put B on C" as black boxes, however,
> it doesn't seem to help to move broadcast(C) inside M, i.e. by swapping 8
> & 9. On the other hand, tt would seem a simple matter to reverse the
> sequence of 5 & 10, but perhaps that is inconvenient for some
> multi-processor schedulers.
>
> As I understand your comment, you are saying that the broadcast must be
> protected by the mutex or race conditions are possible. What am I
> missing here?
What you're missing here is that cond_wait() is atomic. The lines you list
as 5 and 10 happen at the same time, so it's not possible for thread A to
acquire the lock before thread B is on the wait-list for the CV. This is
why it is necessary to hold the lock through the broadcast in order to
prevent a race -- otherwise, the broadcast could indeed occur between lines
2 and 10, and thread B would miss the wakeup.
-- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
Sr. Research Systems Programmer
School of Computer Science - Research Computing Facility
Carnegie Mellon University - Pittsburgh, PA