[OpenAFS-devel] volserver hangs possible fix

Derrick J Brashear shadow@dementia.org
Mon, 11 Apr 2005 14:10:01 -0400 (EDT)


> to BreakVolumeCallbacksLater, which doesn't do any callbacks itself.
> Rather, it walks the callback hash chains, and sets the callback
> delayed flag for all the callbacks for the requested volumeId.
> BreakVolumeCallbacksLater and BreakLaterCallBacks both need H_LOCK,
> but MultiBreakCallBack_r drops H_LOCK before the multirx loop.  So, if
> I'm reading the code correctly, recent 1.3's aren't vulnerable to this
> deadlock.

That was in fact the point of the code.

> While auditing the callback code, I noticed some funky condition
> variable usage between FsyncCheckLWP and BreakVolumeCallbacksLater.
> One issue is BreakVolumeCallbacksLater calls pthread_cond_broadcast on
> fsync_cond without holding the associated lock.  Here's a quick patch
> for that race:

There was a reason we did this:

>
> diff -uNr openafs-1.3.81-dist/src/viced/callback.c
> openafs-1.3.81-callback-race-fix/src/viced/callback.c
> --- openafs-1.3.81-dist/src/viced/callback.c    2005-03-11
> 02:03:46.000000000 -0500
> +++ openafs-1.3.81-callback-race-fix/src/viced/callback.c
> 2005-04-06 05:21:02.358508897 -0400
> @@ -1408,7 +1408,9 @@
>
>     ViceLog(25, ("Fsync thread wakeup\n"));
> #ifdef AFS_PTHREAD_ENV
> +    FSYNC_LOCK;
>     assert(pthread_cond_broadcast(&fsync_cond) == 0);
> +    FSYNC_UNLOCK;
> #else
>     LWP_NoYieldSignal(fsync_wait);
> #endif
>
>
> Basically, FsyncCheckLWP treats the cond var like a semaphore.  One
> solution that comes to mind is to add a global counter of outstanding
> delayed callbacks that is protected by FSYNC_LOCK, and have
> BreakVolumeCallbacksLater and other related functions increment it,
> and have BreakLaterCallBacks decrement it.  Then, have FsyncCheckLWP
> call pthread_cond_timedwait only if the counter is zero.

that sounds like a decent way to deal. maybe later this week.