[OpenAFS-devel] fileserver profiling

Tom Keiser <tkeiser@gmail.com>
Wed, 2 Mar 2005 12:25:22 -0500


On Wed, 02 Mar 2005 11:11:36 -0500, Jeffrey Hutzelman <jhutz@cmu.edu> wrote:
> 
> 
> On Wednesday, March 02, 2005 10:55:09 AM -0500 Tom Keiser
> <tkeiser@gmail.com> wrote:
> 
> > I've been profiling the pthreaded fileserver for a while now, and one
> > thing continues to bother me.  We call the time() syscall so
> > frequently that it's often 20% of our total execution time.
> > There are plenty of hacks to reduce this overhead, but every one I
> > know of will introduce some amount of non-determinism.  Is that an
> > acceptable risk?
> >
> > Having a thread in a nanosleep; gettimeofday loop might be acceptable,
> > but during periods of high load, you couldn't guarantee that your
> > global epoch time variable would be _monotonically_ increasing.  I
> > suppose a combination of a SIGALRM handler and gettimeofday; setitimer
> > might mitigate that problem to some extent.  Anyone have any better
> > suggestions on how to reduce this overhead?
> >
> > Certainly there are parts of the code that should continue to call
> > time() directly, but how much do we care if we accept a kerberos
> > ticket that expired a few milliseconds ago, when acceptable clock
> > drift is many orders of magnitude larger?
> 
> I'm not sure where all of the places are that we currently call
> time().  I can think of four likely situations in which we care about time:
> 
> (1) validating authentication
> (2) ubik voting
> (3) file timestamps
> (4) various timing-sensitive bits in rx
> 
> For (1) and (2) we don't care about sub-second accuracy.  This is pretty
> much true of any place we currently call time(), which returns an integer
> number of seconds.
> 
> For (2), we _do_ care about being accurate within a few seconds.  There is
> a certain small amount of skew allowed in the voting protocol, but it is
> intended to deal mostly with potential network latency, and throwing it
> away on clock skew can hurt.
> 

Yeah, I figured ubik would not be a good candidate for this kind of
optimization.

> For (2) and (3) we _do_ care that the clock we actually use is
> monotonically increasing.  As it stands right now, some fairly unhappy
> things can happen if the system clock is stepped backwards for some reason.
> 

When you say stepped backwards, do you mean with respect to previous
values of a userspace "clock", or with respect to the real system
clock?  I can certainly guarantee that a userspace clock is
non-decreasing, but I cannot guarantee a non-negative difference with
respect to the system clock at any given time.
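
To make that concrete, here's a minimal sketch of the sort of thing I
have in mind (assuming POSIX threads; the names are made up for
illustration and aren't actual fileserver symbols):

    #include <pthread.h>
    #include <time.h>
    #include <sys/time.h>

    /* Illustrative only -- hot paths would read this instead of
     * calling time() directly.  Assumes time_t loads/stores are
     * atomic on the target platform. */
    static volatile time_t cached_epoch;

    static void *
    clock_thread(void *arg)
    {
        struct timespec delay = { 0, 100000000 };  /* ~10 updates/sec */

        for (;;) {
            struct timeval tv;

            gettimeofday(&tv, NULL);
            /* Only move forward, so the cached value is
             * non-decreasing even if the system clock is stepped
             * backwards underneath us. */
            if (tv.tv_sec > cached_epoch)
                cached_epoch = tv.tv_sec;
            nanosleep(&delay, NULL);
        }
        return NULL;
    }

The catch is the one I mentioned above: under heavy load the updater
thread can get starved, so cached_epoch lags the real clock.  It never
runs backwards, but it isn't guaranteed to keep up.  The
setitimer/SIGALRM variant would update the same variable from a signal
handler instead, which should make the starvation window smaller.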

> As for (4), there are any number of things in rx that are timing sensitive,
> but for the most part they don't work by calling time().
> 

According to the DTrace pid provider, these are the top ten callers of
time() over a sampling period of a few seconds on a 1.3.78 fileserver:

     11  fileserver`rxi_ReadProc+0x2b0
     13  fileserver`TryAttach+0x80
     15  fileserver`rxi_SendXmitList+0x2a4
     16  fileserver`common_StoreData64+0x83c
     21  fileserver`VGetVnode_r+0x130
     29  fileserver`VGetVnode_r+0xad8
    151  fileserver`rx_ReadvProc+0x40
   5966  fileserver`rxi_ListenerProc+0x118
   7969  fileserver`rxi_SendAck+0x5c0
  23760  fileserver`rxi_FillReadVec+0x10c

For rxi_FillReadVec at least, the real caller isn't rx itself but the
rxkad security object's op_CheckPacket member; DTrace attributes the
call to the rx frame because op_CheckPacket is reached through a
function pointer.  I have full stack traces for ~38000 calls to
time().  If anyone wants to see more than just the immediate caller, I
can post them somewhere.
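
For anyone who hasn't looked at that code path, the dispatch is
roughly the following shape (paraphrased for illustration, not the
literal rx declarations).  Since the check routine is reached through
a function pointer, an aggregation on time()'s immediate caller
charges the call to the rx frame that made the indirect call:

    /* Illustrative sketch of the rx security-object dispatch; the
     * type names and signatures here are made up, not the real
     * OpenAFS declarations. */
    struct secops_sketch {
        int (*op_CheckPacket)(void *obj, void *call, void *packet);
        /* ... other ops ... */
    };

    struct secobj_sketch {
        struct secops_sketch *ops;
    };

    /* The call site lives in rx, so that's the frame DTrace sees. */
    #define CHECK_PACKET(obj, call, p) \
        ((obj)->ops->op_CheckPacket((obj), (call), (p)))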

-- 
Tom Keiser
tkeiser@gmail.com