[OpenAFS-devel] volserver hangs possible fix

Horst Birthelmer horst@riback.net
Mon, 11 Apr 2005 22:26:49 +0200


On Apr 11, 2005, at 9:21 PM, Tom Keiser wrote:

> On Apr 11, 2005 2:35 PM, Horst Birthelmer <horst@riback.net> wrote:
>>
>> On Apr 6, 2005, at 1:40 PM, Tom Keiser wrote:
>>
>>> Your stack trace looks like an LWP volserver, so I'm assuming that's
>>> either a production version, or an old devel build.
>>>
>>> In newer 1.3.x versions, V_BreakVolumeCallbacks is a function pointer
>>> to BreakVolumeCallbacksLater, which doesn't do any callbacks itself.
>>> Rather, it walks the callback hash chains, and sets the callback
>>> delayed flag for all the callbacks for the requested volumeId.
>>> BreakVolumeCallbacksLater and BreakLaterCallBacks both need H_LOCK,
>>> but MultiBreakCallBack_r drops H_LOCK before the multirx loop.  So,
>>> if I'm reading the code correctly, recent 1.3's aren't vulnerable to
>>> this deadlock.
>>>
>>> As a side note, if you find annoying problems like this, and you're
>>> running an OS where you can get a core without destroying the process
>>> (e.g. with gcore on solaris), drop the core snapshots to disk for
>>> future analysis ;)
>>>
>>> While auditing the callback code, I noticed some funky condition
>>> variable usage between FsyncCheckLWP and BreakVolumeCallbacksLater.
>>> One issue is that BreakVolumeCallbacksLater calls
>>> pthread_cond_broadcast on fsync_cond without holding the associated
>>> lock.  Here's a quick patch for that race:
>>>
>>
>> I don't actually think that would be a race condition. It just leads
>> to some undefined scheduling behavior.
>> If you don't hold the mutex with which the other thread is doing its
>> wait, you may just continue with that thread depending on your
>> scheduler's mood ;-)
>>
>
> It is a race condition.  pthread_cond_wait() is not an atomic
> operation; it relies on the mutex to synchronize a set of operations
> that enqueue the calling thread onto the list of threads blocked on
> the cond var, drop the associated lock, and become blocked by the
> kernel scheduler.  If you call signal/broadcast without holding the
> lock, you can race against these bookkeeping operations in
> pthread_cond_wait().  Thus, a thread in the process of blocking will
> never find out about that signal/broadcast, because cond vars are not
> stateful like semaphores.
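
To make that window concrete, here's a minimal sketch of the scenario
(the names are made up for illustration; this is not the actual
fileserver code):

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int work_pending = 0;

    /* waiter, in the style of a checker thread */
    static void *waiter(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!work_pending) {
            /* (1) the predicate was false and the thread is about to
             * block; pthread_cond_wait() has not finished putting it
             * on the cond var's queue yet */
            pthread_cond_wait(&cond, &lock);
        }
        work_pending = 0;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    /* racy signaller: broadcast without holding the lock */
    static void kick_racy(void)
    {
        work_pending = 1;               /* unsynchronized store */
        pthread_cond_broadcast(&cond);  /* (2) if this lands inside
                                         * window (1), no thread is
                                         * "currently blocked on cond",
                                         * so it has no effect; the
                                         * wakeup is simply lost */
    }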

The mutex passed to a condition variable isn't for guarding the 
bookkeeping, but that's another story, I guess.

>
>> Locking that mutex means the waiter _will_ definitely continue when
>> you call the unlock, and not at some undefined point in time.
>> Here's what the POSIX standard says about this:
>>
>
> Actually, it cannot guarantee that type of ordering.  Precisely when
> the thread will be scheduled to run is an implementation detail of
> whatever platform you're running, the number of processors, etc.  The
> core distinction is whether you will _always_ wake up on a call to
> signal/broadcast, or only _sometimes_ wake up on a call to
> signal/broadcast.  In the general case, there are algorithms where
> there will be no future signal/broadcast unless the thread wakes up
> every time.  Luckily, the callback algorithm only experiences delays
> in the current implementation.

That's right, but only on multiprocessor systems with, for my taste, a
poor POSIX implementation.

>
>> <quote>
>> If more than one thread is blocked on a condition variable, the
>> scheduling policy shall determine the order in which threads are
>> unblocked. When each thread unblocked as a result of a
>> pthread_cond_broadcast() or pthread_cond_signal() returns from its
>> call to pthread_cond_wait() or pthread_cond_timedwait(), the thread shall
>> own the mutex with which it called pthread_cond_wait() or
>> pthread_cond_timedwait(). The thread(s) that are unblocked shall
>> contend for the mutex according to the scheduling policy (if
>> applicable), and as if each had called pthread_mutex_lock().
>>
>> The pthread_cond_broadcast() or pthread_cond_signal() functions may be
>> called by a thread whether or not it currently owns the mutex that
>> threads calling pthread_cond_wait() or pthread_cond_timedwait() have
>> associated with the condition variable during their waits; however, if
>> predictable scheduling behavior is required, then that mutex shall be
>> locked by the thread calling pthread_cond_broadcast() or
>> pthread_cond_signal().
>> </quote>
>>
>
> The paragraph directly following your quote is extremely important:
>
> <quote>
> The pthread_cond_broadcast() and pthread_cond_signal() functions shall
> have no effect if there are no threads currently blocked on cond.
> </quote>
>

This might be important but not for the problem we're having.

> And here's an important related paragraph from pthread_cond_wait() in
> the spec:
>
> <quote>
> These functions atomically release mutex and cause the calling thread
> to block on the condition variable cond; atomically here means
> "atomically with respect to access by another thread to the mutex and
> then the condition variable". That is, if another thread is able to
> acquire the mutex after the about-to-block thread has released it,
> then a subsequent call to pthread_cond_broadcast() or
> pthread_cond_signal() in that thread shall behave as if it were issued
> after the about-to-block thread has blocked.
> </quote>
>
> By "predictable scheduling" they are referring to whether there is a
> deterministic relationship between calling signal/broadcast, and a
> thread waking up.  Calling signal/broadcast outside of the lock is
> nondeterministic; it may or may not have any effect.
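
In code, the deterministic variant is just this (still only a sketch,
reusing the made-up names from above; not the actual patch):

    /* safe signaller: state change and broadcast under the same mutex
     * the waiter uses */
    static void kick_safe(void)
    {
        pthread_mutex_lock(&lock);
        work_pending = 1;
        /* holding the lock means the waiter is either (a) not yet past
         * its predicate check, so it will see work_pending set and
         * never block, or (b) already blocked on cond, in which case
         * this broadcast behaves as if issued after it blocked */
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
    }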
>
> When you use signal/broadcast outside of the lock you have no way of
> knowing whether the thread you're attempting to wake up is busy,
> preparing to block, or blocked.  In the callback code we've

Isn't that exactly what I said??
You have no idea when your thread is gonna wake up without that lock 
held...
Looks like we were talking about the same thing but using different 
arguments.

> been discussing, the result is that you could unnecessarily wait 5
> minutes to break callbacks that could just as easily have been broken
> right now, and on busy servers you've just increased your callback
> break backlog as well :-(

Right, too!!
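
The only reason it's a delay rather than a hang is that the checker
presumably wakes up on a timeout anyway. Roughly, with the same made-up
names as above (just a sketch, assuming a five-minute poll like the one
being described):

    #include <time.h>

    /* waiter variant that tolerates a lost wakeup by re-checking the
     * predicate on a timer */
    static void *checker(void *arg)
    {
        struct timespec deadline;

        (void)arg;
        pthread_mutex_lock(&lock);
        for (;;) {
            while (!work_pending) {
                clock_gettime(CLOCK_REALTIME, &deadline);
                deadline.tv_sec += 300;      /* five minutes */
                /* a broadcast lost in the race sketched earlier isn't
                 * fatal here: this returns ETIMEDOUT eventually and
                 * the flag gets noticed then -- late, but not never */
                pthread_cond_timedwait(&cond, &lock, &deadline);
            }
            work_pending = 0;
            pthread_mutex_unlock(&lock);
            /* break_delayed_callbacks();  -- hypothetical placeholder
             * for the real work */
            pthread_mutex_lock(&lock);
        }
        return NULL;  /* not reached */
    }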

I wasn't trying to start any wars. Are we in a POSIX contest here?? I
don't think so ;-)
I was just trying to make people aware that this might not be the fix
for the volserver hang some of us were experiencing.

Horst