[OpenAFS-devel] volserver hangs possible fix

Jeffrey Hutzelman jhutz@cmu.edu
Wed, 20 Apr 2005 10:54:17 -0400


On Monday, April 18, 2005 11:55:01 PM +0200 Horst Birthelmer 
<horst@riback.net> wrote:

> So you agree with my initial posting that we didn't really have a problem
> here and that this is not the cause for those volserver hangs.

Yes, and no.

You originally said that not holding the mutex when calling cond_broadcast 
does not introduce a race.  I agree with that statement.

However, there actually is a race, because the FSYNC lock is not held 
continuously from when BreakLaterCallBacks() decides there is no more work 
to do until FsyncCheckLWP() calls pthread_cond_timedwait.  So it is 
possible for more callbacks to be added and the broadcast to be sent during 
this window, which will result in no work being done until FsyncCheckLWP() 
wakes up on the 5-minute timeout.
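
To make the window concrete, here is a minimal sketch of the usual way to close it. It is not the OpenAFS code: work_lock, work_cond, pending, add_work(), worker() and run_demo() are all hypothetical stand-ins. The worker keeps the lock held from the moment it sees "no work" until pthread_cond_timedwait() atomically releases it, and the producer changes the counter only under that same lock. The broadcast itself is sent without the lock, which is safe for the reason given above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical stand-ins; none of these names come from OpenAFS. */
static pthread_mutex_t work_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t work_cond = PTHREAD_COND_INITIALIZER;
static int pending;        /* queued items, protected by work_lock */
static int processed;      /* finished items, protected by work_lock */
static int shutting_down;  /* set once, protected by work_lock */

/* Producer: change the predicate under the lock, then broadcast. */
static void add_work(int n)
{
    pthread_mutex_lock(&work_lock);
    pending += n;
    pthread_mutex_unlock(&work_lock);
    pthread_cond_broadcast(&work_cond);  /* lock not needed here */
}

/* Consumer: the lock is held continuously from the "no work" check
 * until the wait, so a broadcast cannot slip in between the two. */
static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&work_lock);
    for (;;) {
        while (pending == 0 && !shutting_down) {
            struct timespec deadline;
            clock_gettime(CLOCK_REALTIME, &deadline);
            deadline.tv_sec += 300;  /* 5-minute fallback timeout */
            pthread_cond_timedwait(&work_cond, &work_lock, &deadline);
        }
        if (pending == 0)
            break;  /* shutting down and nothing left to do */

        int batch = pending;
        pending = 0;
        pthread_mutex_unlock(&work_lock);
        /* ... the real work on the batch would happen here, unlocked ... */
        pthread_mutex_lock(&work_lock);
        processed += batch;
        /* Loop back and re-check pending while still holding the lock. */
    }
    pthread_mutex_unlock(&work_lock);
    return NULL;
}

/* Queue `items` units one at a time and return how many were processed. */
int run_demo(int items)
{
    pthread_t tid;
    int i, rc;

    pending = processed = shutting_down = 0;
    rc = pthread_create(&tid, NULL, worker, NULL);
    assert(rc == 0);
    for (i = 0; i < items; i++)
        add_work(1);

    pthread_mutex_lock(&work_lock);
    shutting_down = 1;
    pthread_mutex_unlock(&work_lock);
    pthread_cond_broadcast(&work_cond);

    rc = pthread_join(tid, NULL);
    assert(rc == 0);
    return processed;
}
```

In this sketch the worker drains every queued item before it exits, so run_demo(n) returns n without ever depending on the timeout. If the worker instead dropped the lock between its last look at pending and the wait, an add_work() call landing in that gap would go unnoticed until the timeout expired, which is the race described above.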

In practice, I think that race should be uncommon, but I haven't worked out 
how likely the various possible scheduling variants are.

-- Jeff