[OpenAFS-devel] volserver hangs possible fix
Derrick J Brashear
shadow@dementia.org
Mon, 11 Apr 2005 14:10:01 -0400 (EDT)
> to BreakVolumeCallbacksLater, which doesn't do any callbacks itself.
> Rather, it walks the callback hash chains, and sets the callback
> delayed flag for all the callbacks for the requested volumeId.
> BreakVolumeCallbacksLater and BreakLaterCallBacks both need H_LOCK,
> but MultiBreakCallBack_r drops H_LOCK before the multirx loop. So, if
> I'm reading the code correctly, recent 1.3's aren't vulnerable to this
> deadlock.
That was in fact the point of the code.
> While auditing the callback code, I noticed some funky condition
> variable usage between FsyncCheckLWP and BreakVolumeCallbacksLater.
> One issue is BreakVolumeCallbacksLater calls pthread_cond_broadcast on
> fsync_cond without holding the associated lock. Here's a quick patch
> for that race:
There was a reason we did this:
>
> diff -uNr openafs-1.3.81-dist/src/viced/callback.c
> openafs-1.3.81-callback-race-fix/src/viced/callback.c
> --- openafs-1.3.81-dist/src/viced/callback.c 2005-03-11
> 02:03:46.000000000 -0500
> +++ openafs-1.3.81-callback-race-fix/src/viced/callback.c
> 2005-04-06 05:21:02.358508897 -0400
> @@ -1408,7 +1408,9 @@
>
> ViceLog(25, ("Fsync thread wakeup\n"));
> #ifdef AFS_PTHREAD_ENV
> + FSYNC_LOCK;
> assert(pthread_cond_broadcast(&fsync_cond) == 0);
> + FSYNC_UNLOCK;
> #else
> LWP_NoYieldSignal(fsync_wait);
> #endif
>
>
> Basically, FsyncCheckLWP treats the cond var like a semaphore. One
> solution that comes to mind is to add a global counter of outstanding
> delayed callbacks that is protected by FSYNC_LOCK, and have
> BreakVolumeCallbacksLater and other related functions increment it,
> and have BreakLaterCallBacks decrement it. Then, have FsyncCheckLWP
> call pthread_cond_timedwait only if the counter is zero.
that sounds like a decent way to deal. maybe later this week.