[OpenAFS-devel] volserver hangs possible fix

Marcus Watts mdw@umich.edu
Thu, 07 Apr 2005 03:27:27 -0400


Tom Keiser <tkeiser@gmail.com> writes:
...
> 
> 
> Your stack trace looks like an LWP volserver, so I'm assuming that's
> either a production version, or an old devel build.

Sorry, my traceback was indeed from 1.2.13.  Our production staff
believes that's the latest "stable" version and therefore the preferable
version to run.

> 
> In newer 1.3.x versions, V_BreakVolumeCallbacks is a function pointer
> to BreakVolumeCallbacksLater, which doesn't do any callbacks itself. 
> Rather, it walks the callback hash chains, and sets the callback
> delayed flag for all the callbacks for the requested volumeId. 
> BreakVolumeCallbacksLater and BreakLaterCallBacks both need H_LOCK,
> but MultiBreakCallBack_r drops H_LOCK before the multirx loop.  So, if
> I'm reading the code correctly, recent 1.3's aren't vulnerable to this
> deadlock.

That's very encouraging to hear, and sounds like at least potentially a
better solution than what I had, assuming H_LOCK is not a problem.

> 
> As a side note, if you find annoying problems like this, and you're
> running an OS where you can get a core without destroying the process
> (e.g. with gcore on solaris), drop the core snapshots to disk for
> future analysis ;)

This is from linux, and the core dump I had was from "gcore" in gdb.
I didn't think to get one from the fileserver as well, but I wish I had
now.

> 
> While auditing the callback code, I noticed some funky condition
> variable usage between FsyncCheckLWP and BreakVolumeCallbacksLater. 
> One issue is BreakVolumeCallbacksLater calls pthread_cond_broadcast on
> fsync_cond without holding the associated lock.  Here's a quick patch
> for that race:
> 
> 
> diff -uNr openafs-1.3.81-dist/src/viced/callback.c
> openafs-1.3.81-callback-race-fix/src/viced/callback.c
> --- openafs-1.3.81-dist/src/viced/callback.c    2005-03-11
> 02:03:46.000000000 -0500
> +++ openafs-1.3.81-callback-race-fix/src/viced/callback.c      
> 2005-04-06 05:21:02.358508897 -0400
> @@ -1408,7 +1408,9 @@
>  
>      ViceLog(25, ("Fsync thread wakeup\n"));
>  #ifdef AFS_PTHREAD_ENV
> +    FSYNC_LOCK;
>      assert(pthread_cond_broadcast(&fsync_cond) == 0);
> +    FSYNC_UNLOCK;
>  #else
>      LWP_NoYieldSignal(fsync_wait);
>  #endif
> 
> 
> Basically, FsyncCheckLWP treats the cond var like a semaphore.  One
> solution that comes to mind is to add a global counter of outstanding
> delayed callbacks that is protected by FSYNC_LOCK, and have
> BreakVolumeCallbacksLater and other related functions increment it,
> and have BreakLaterCallBacks decrement it.  Then, have FsyncCheckLWP
> call pthread_cond_timedwait only if the counter is zero.

Sounds complicated.  I'll have to look at it before I have a useful opinion.
Since this is not our first problem with callbacks, I want to understand
this a lot better anyways.

> 
> instead of implementing yet another linked list, I'd recommend using
> the rx_queue package.  It has a clean API, and a fast implementation.

I did think of this, but was in a hurry...  It probably doesn't
matter now assuming 1.3.8x has a better solution.

...
> > +       VOL_UNLOCK;
> > +       VATTACH_UNLOCK;
> 
> instead of synchronizing list accesses with these big global mutexes,
> why not use the lock you created above?

FSYNC_com was using those "big global mutexes" already, and the condition 
variable was to give FSYNC_cbthread something to wait for.  But--see below,
I appreciate what you're getting at here.

> 
> > +       if (!fp) {
> 
> for pthreads this should be a while(!fp) because cond_wait can return spuriously

fp is a local variable, so a while would mean if it wasn't set, it would
never get set.  The logic I had means a spurious return from cond_wait
would loop back through the "big global mutex" logic, safely fail to
remove the first element from fcblist, and go back to waiting.
So an extra return is ok...--but:

> 
> > +#ifdef AFS_PTHREAD_ENV
> > +           pthread_cond_wait(&fssync_cblist_cond, &fssync_cblist_mutex);
> 
> doing cond_wait without having the lock leads to undefined behavior

Ah hah.  Now I see what you're getting at; indeed, if a call to
FSYNC_add_cblist happens entirely between the time when the list is
discovered to be empty and the call to cond_wait, then the
cond_broadcast would happen first, and that's bad.  So yes, I should
acquire the mutex before adding things to the list.

Hm.  The code I was looking at for LWP_NoYieldSignal vs.
pthread_cond_broadcast was in vol/volume.c.  It uses vol_glock_mutex to
protect vol_put_volume_cond in cond_wait, but VPutVolume_r doesn't
acquire vol_glock_mutex before clearing vp->goingOffline and calling
cond_broadcast.  Unless there's another lock running around in this
that I don't understand (which is quite possible) this appears to have
the same problem as my code.  Probably it's a good thing it's a global
so if it ever gets stuck here, additional activity will unstick
things.

...

So, yes, cargo cult programming at 3am is not ideal, but I may have
stumbled up against something of interest anyways.

On a somewhat separate note: this logic is using an inet domain socket
which is in fact reachable via the network (at least by other hosts
on the same network segment).  It should probably be using a unix domain
socket instead, and that could also simplify some of the other logic.

			-Marcus Watts
			UM ITCS Umich Systems Group