[OpenAFS-devel] Thundering herds and the vnode state machine

Simon Wilkinson simonxwilkinson@gmail.com
Tue, 21 Feb 2012 18:05:34 +0000


I've been looking recently at reasons why the fileserver performs badly =
on multi processor systems. As part of this, I've been taking a look at =
the way in which the vnode state machine is implemented as part of the =
demand-attach fileserver. On cursory inspection, our current =
implementation seems to have some significant implementation problems, =
including being susceptible to a variation of the thundering herd =
problem.

Firstly, some background (those familiar with DAFS can cut to the next =
paragraph!). In 1.4 a vnode can be unlocked, read locked, or write =
locked. These locks are handled using the standard AFS locks.h locking =
model - using either pthreads or LWP, depending on the build of the =
fileserver. With the demand-attach file server, these locks change into =
being a set of vnode states. Whilst many different states are defined, =
these states broadly divide into exclusive states (roughly akin to =
holding the write lock), STATE_READ (roughly similar to the read lock), =
STATE_ONLINE (similar to unlocked), and STATE_ERROR (which means =
something has gone wrong)

Threads use a pair of functions to either wait until a vnode is =
quiescent (in a non-exclusive state, and with no readers), or until it =
is non-exclusive (there is a third function which allows a thread to =
wait upon any state change, but that appears to be unused). These waits =
are implemented using a single, per-vnode, condition variable. Whenever =
a vnode's state is changed, we broadcast and wake up all threads waiting =
on that variable.

It's these broadcasts that cause us problems on multi-processor systems. =
Firstly, we broadcast regardless of the state change that has just =
occurred. If we have gone into an exclusive state, then we're waking up =
a load of threads that will be unable to make any progress. Secondly, =
broadcasting wakes up all pending threads, but the volume global lock =
means that only one can make progress. If the one that wins this race =
requires exclusive access, then all of the other woken threads will in =
turn acquire the global lock, note that they can't gain access to the =
vnode, and go back to sleep again. On a contended system, this will lead =
to a huge number of false wakeups. Thirdly, there are some situations =
where we broadcast multiple times for a single state change.

I think any solution to this would require threads to indicate what they =
are going to do once they have waited. This allow us to selectively wake =
threads requiring exclusive access but broadcast to threads requiring =
read access. These wakeups would then only be performed if the state =
that we have transitioned in to would allow those threads to make =
forward progress.

I'd welcome input from others more familiar with this code as to whether =
this is actually a problem, or if I'm missing something with the pthread =
condvar implementation that mitigates the problem.

Cheers,

Simon.