[OpenAFS-devel] Thundering herds and the vnode state machine

Hartmut Reuter reuter@rzg.mpg.de
Fri, 24 Feb 2012 11:54:30 +0100


Simon,

Before you start redesigning the locking of vnodes, it would perhaps be
worth measuring how often the situation actually arises that many threads
are trying to access the same vnode. In my AFS/OSD implementation I also
do a kind of file locking which has to stay in place over the life of a single
RPC, so I couldn't (or at least that was what I thought) use the normal
vnode locking for it and had to invent something new. The structure used for
locking also stores the current number of waiters and of readers (when it is
read-locked), and I also keep variables with the maximum number of files ever
simultaneously locked during the lifetime of the fileserver and the maximum
number of simultaneous asynchronous transactions.
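
Something along these lines (a simplified sketch, not the exact AFS/OSD
fields):

    /* per-file lock used by AFS/OSD, simplified */
    struct osd_file_lock {
        int readers;            /* current number of read lockers */
        int waiters;            /* threads currently waiting for this lock */
        int write_locked;       /* non-zero while held exclusively */
    };

    /* statistics kept over the lifetime of the fileserver */
    int max_files_locked;       /* most files ever locked at the same time */
    int max_async_transactions; /* most simultaneous async transactions */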

It turns out these numbers are moderate. So in our case it certainly isn't
something which could slow down the fileserver.

In AFS/OSD any access to the data, not only asynchronous access, has to use
this locking; access that only touches the vnode (FetchStatus...), however,
does not. So the load on the real vnode locking could be heavier than these
numbers suggest.

I would suggest adding to struct Vnode a field for the number of waiters,
updated by the locking macros or routines. These routines could also compare
the current number of waiters with the maximum seen so far, update that
maximum, and write a log message (if you don't want to have a Variable-RPC
such as we have in AFS/OSD).
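
A sketch of what that could look like (the field and helper names here are
invented; only ViceLog is the existing fileserver log macro):

    static unsigned int vnWaitersMax;   /* highest waiter count seen so far */

    /* call with the vnode/volume lock held, just before going to sleep */
    static void
    VnNoteWaiter_r(struct Vnode *vnp)
    {
        vnp->nWaiters++;                /* the new field in struct Vnode */
        if (vnp->nWaiters > vnWaitersMax) {
            vnWaitersMax = vnp->nWaiters;
            ViceLog(0, ("new maximum of %u threads waiting on one vnode\n",
                        vnWaitersMax));
        }
    }

    /* ... and the matching vnp->nWaiters-- after the wait returns */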

Cheers,
Hartmut


Simon Wilkinson wrote:
> I've been looking recently at reasons why the fileserver performs badly on
> multi processor systems. As part of this, I've been taking a look at the way
> in which the vnode state machine is implemented as part of the demand-attach
> fileserver. On cursory inspection, our current implementation seems to have
> some significant implementation problems, including being susceptible to a
> variation of the thundering herd problem.
>
> Firstly, some background (those familiar with DAFS can cut to the next
> paragraph!). In 1.4 a vnode can be unlocked, read locked, or write locked.
> These locks are handled using the standard AFS locks.h locking model - using
> either pthreads or LWP, depending on the build of the fileserver. With the
> demand-attach fileserver, these locks become a set of vnode
> states. Whilst many different states are defined, these states broadly divide
> into exclusive states (roughly akin to holding the write lock), STATE_READ
> (roughly similar to the read lock), STATE_ONLINE (similar to unlocked), and
> STATE_ERROR (which means something has gone wrong).
>
> Threads use a pair of functions to either wait until a vnode is quiescent (in
> a non-exclusive state, and with no readers), or until it is non-exclusive
> (there is a third function which allows a thread to wait upon any state
> change, but that appears to be unused). These waits are implemented using a
> single, per-vnode, condition variable. Whenever a vnode's state is changed,
> we broadcast and wake up all threads waiting on that variable.
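
In outline, the mechanism described above amounts to something like this
(a simplified sketch, not the actual vnode.c code; the field names are
approximate):

    /* on every state transition, with the volume global lock held */
    static void
    vn_set_state(struct Vnode *vnp, int new_state)
    {
        vnp->vn_state = new_state;                  /* whatever the new state */
        pthread_cond_broadcast(&vnp->vn_state_cv);  /* wakes *all* waiters */
    }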
>
> It's these broadcasts that cause us problems on multi-processor systems.
> Firstly, we broadcast regardless of the state change that has just occurred.
> If we have gone into an exclusive state, then we're waking up a load of
> threads that will be unable to make any progress. Secondly, broadcasting
> wakes up all pending threads, but the volume global lock means that only one
> can make progress. If the one that wins this race requires exclusive access,
> then all of the other woken threads will in turn acquire the global lock,
> note that they can't gain access to the vnode, and go back to sleep again. On
> a contended system, this will lead to a huge number of false wakeups.
> Thirdly, there are some situations where we broadcast multiple times for a
> single state change.
>
> I think any solution to this would require threads to indicate what they are
> going to do once they have waited. This would allow us to selectively wake threads
> requiring exclusive access but broadcast to threads requiring read access.
> These wakeups would then only be performed if the state that we have
> transitioned into would allow those threads to make forward progress.
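
One possible shape for this, sketched with pthreads (all structure, field and
function names here are invented for illustration, not existing OpenAFS code):

    #include <pthread.h>

    /* Per-vnode wait bookkeeping: readers and exclusive waiters kept apart. */
    struct vn_waitq {
        pthread_cond_t rd_cv;       /* threads waiting for read access */
        pthread_cond_t ex_cv;       /* threads waiting for exclusive access */
        int rd_waiters;
        int ex_waiters;
    };

    /*
     * Called with the volume global lock held, after the vnode has moved
     * into its new state.  'is_exclusive' says whether that state is an
     * exclusive one, 'readers' is the current reader count.
     */
    static void
    vn_wakeup(struct vn_waitq *w, int is_exclusive, int readers)
    {
        if (is_exclusive)
            return;                 /* nobody we wake could make progress */

        if (readers == 0 && w->ex_waiters > 0) {
            /* vnode is quiescent: hand it to exactly one exclusive waiter */
            pthread_cond_signal(&w->ex_cv);
        } else if (w->rd_waiters > 0) {
            /* non-exclusive state: every pending reader can proceed */
            pthread_cond_broadcast(&w->rd_cv);
        }
    }

Waiters would still have to re-check the vnode state under the global lock
when they wake, but a transition into an exclusive state would no longer wake
anybody, and readers and exclusive waiters would no longer be woken for each
other's transitions.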
>
> I'd welcome input from others more familiar with this code as to whether this
> is actually a problem, or if I'm missing something with the pthread condvar
> implementation that mitigates the problem.
>
> Cheers,
>
> Simon.
>
> _______________________________________________ OpenAFS-devel mailing list
> OpenAFS-devel@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-devel


-- 
-----------------------------------------------------------------
Hartmut Reuter                  e-mail 		reuter@rzg.mpg.de
			   	phone 		 +49-89-3299-1328
			   	fax   		 +49-89-3299-1301
RZG (Rechenzentrum Garching)   	web    http://www.rzg.mpg.de/~hwr
Computing Center of the Max-Planck-Gesellschaft (MPG) and the
Institut fuer Plasmaphysik (IPP)
-----------------------------------------------------------------