[OpenAFS-devel] fileserver profiling

Tom Keiser <tkeiser@gmail.com>
Fri, 4 Mar 2005 16:01:52 -0500


On Thu, 3 Mar 2005 21:12:00 -0500 (EST), Nickolai Zeldovich
<kolya@mit.edu> wrote:
> On Wed, 2 Mar 2005, Tom Keiser wrote:
> 
> > I've been profiling the pthreaded fileserver for a while now, and one
> > thing continues to bother me.  We call the time() syscall so
> > frequently that it's often 20% of our total execution time.
> 
> I feel that you might be barking up the wrong tree here.  From what I've
> seen, the fileserver is rarely limited by CPU speed.  Making it use 20%
> less CPU, or 20% more, isn't going to make a big difference -- there are
> lots of processor cycles that aren't being used anyway.  The bottleneck is
> more likely to be in disk seeks, context switches, the network stack, or
> what have you.
> 

Whether the code is CPU-bound is very dependent upon your CPU
architecture.  It's fairly easy for me to use about 1.25 UltraSPARCs'
worth of clock cycles on a 1.3.79 fileserver running Solaris 10 with
several 100 Mbps-attached clients.  If all the connections are using
rxkad_crypt, even more cycles get used, but total throughput drops
significantly (fcrypt is pretty slow on SPARC).  So far, my traces
have shown that scaling to multiple CPUs is limited by lock contention
and context switches on SPARC.  My profiling thus far points the
finger at three major problems:

(1) rx_freePktQ_lock
(2) rx_stats_mutex
(3) call->lock, particularly in rxi_ReceivePacket (line 2563 in
src/rx/rx.c version 1.68):

2559     channel = np->header.cid & RX_CHANNELMASK;
2560     call = conn->call[channel];
2561 #ifdef	RX_ENABLE_LOCKS
2562     if (call)
2563 	MUTEX_ENTER(&call->lock);
2564     /* Test to see if call struct is still attached to conn. */
2565     if (call != conn->call[channel]) {
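
To give a flavor of the direction I have in mind for (1), here's a
rough sketch of per-thread packet free lists.  The names and
structures below are made up for illustration (the real rx packet
code is more involved), but the point is that the common case never
touches the shared queue, so the equivalent of rx_freePktQ_lock is
only taken on the refill path; the free path would be symmetric,
handing packets back to the private list first.

#include <pthread.h>
#include <stdlib.h>

struct pkt {
    struct pkt *next;
};

/* small private free list per thread; hypothetical structure */
struct thread_pkt_cache {
    struct pkt *head;
    int len;
};

static pthread_key_t pkt_cache_key;
static pthread_once_t pkt_cache_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t global_freePktQ_lock = PTHREAD_MUTEX_INITIALIZER;
static struct pkt *global_freePktQ;

static void
pkt_cache_key_init(void)
{
    pthread_key_create(&pkt_cache_key, free);
}

static struct thread_pkt_cache *
pkt_cache_get(void)
{
    struct thread_pkt_cache *c;

    pthread_once(&pkt_cache_once, pkt_cache_key_init);
    c = pthread_getspecific(pkt_cache_key);
    if (c == NULL) {
        c = calloc(1, sizeof(*c));
        pthread_setspecific(pkt_cache_key, c);
    }
    return c;
}

static struct pkt *
pkt_alloc(void)
{
    struct thread_pkt_cache *c = pkt_cache_get();
    struct pkt *p;

    if (c != NULL && c->head != NULL) {
        /* fast path: no global lock touched at all */
        p = c->head;
        c->head = p->next;
        c->len--;
        return p;
    }

    /* slow path: fall back to the shared queue under its lock */
    pthread_mutex_lock(&global_freePktQ_lock);
    p = global_freePktQ;
    if (p != NULL)
        global_freePktQ = p->next;
    pthread_mutex_unlock(&global_freePktQ_lock);
    return p;
}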


The time() issue is rather small compared to some of these lock
contention problems.  I've been working on patches to make as many
data structures as possible thread-specific (see RT tickets 17547 and
17805 for more info; more patches coming soon).  While total CPU
utilization is often not an issue on x86, it is on other
architectures.  Furthermore, the fact that many threads cannot operate
in parallel due to lock contention increases latency in all cases,
and decreases throughput in some.
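
To make "thread-specific" concrete for something like the counters
behind rx_stats_mutex, the shape of the change is roughly the
following.  This is hypothetical code, not the actual patch (see the
RT tickets for that): each thread bumps its own counters with no lock
at all, and a reader sums across the per-thread slots only when
somebody actually asks for totals.

#include <pthread.h>
#include <string.h>

#define MAX_THREADS 128

struct rx_thread_stats {
    unsigned long packets_received;
    unsigned long packets_sent;
    char pad[64];        /* keep threads off each other's cache lines */
};

static struct rx_thread_stats thread_stats[MAX_THREADS];

/* hot path: each worker thread owns one slot, assigned at startup,
 * so no mutex is needed here */
static void
note_packet_received(int my_index)
{
    thread_stats[my_index].packets_received++;
}

/* cold path: only run when totals are actually requested, so walking
 * all the slots there is cheap */
static void
sum_stats(struct rx_thread_stats *out)
{
    int i;

    memset(out, 0, sizeof(*out));
    for (i = 0; i < MAX_THREADS; i++) {
        out->packets_received += thread_stats[i].packets_received;
        out->packets_sent += thread_stats[i].packets_sent;
    }
}

The totals are slightly stale while threads are still counting, but
for statistics that seems like an acceptable trade for getting the
mutex off the packet path.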


> It looks like you used dtrace for profiling the fileserver, so it might be
> interesting to see if you can use dtrace to get some insight into what's
> going on with the system as a whole -- is the fileserver spending lots of
> time waiting for disk seeks, going through the UDP send/receive code path,
> or something else?  Perhaps we aren't coalescing small reads and writes
> somewhere?  I haven't used dtrace myself, so I can't say whether this
> would be easy or hard to do..
> 

I'm in the process of writing a library of DTrace scripts to help find
bottlenecks in different parts of OpenAFS.  Once it's more polished,
I'll release it publicly.  All of what you've mentioned is possible;
I'll try to find time to make sure all of those areas are covered.

One of my primary goals is to improve the parallelism of the code.  I
set up a very contrived scenario the other day to show that this is
indeed a problem.  To simulate load, I wrote a stupid pthreads app
that has 64 threads doing while(1){open();write();close();} against a
bunch of files on a single volume (roughly sketched below).  Then I
had another client running a typical workload over a separate network
connection to a different volume on a different vice partition on a
different I/O bus, and quality of service for that client degraded
substantially every time I ran the pthreads load-testing app.  The
standard deviations of throughput and I/O latency shot way up, even
though the only point of contention in this scenario was the
fileserver process itself.  The fileserver threads were blocked on
mutexes most of the time, and the SCSI tagged command queues for both
vice disks were not being fully utilized.
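
For what it's worth, the load generator was essentially the following
(the AFS path, buffer size, and error handling here are arbitrary
stand-ins, not the exact program I ran):

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS 64

/* placeholder path to the test volume */
static const char *dir = "/afs/.example.com/test/loadvol";

static void *
hammer(void *arg)
{
    long id = (long) arg;
    char path[256], buf[8192];
    int fd;

    memset(buf, 'x', sizeof(buf));
    snprintf(path, sizeof(path), "%s/file.%ld", dir, id);

    /* each thread loops on open/write/close against its own file */
    for (;;) {
        fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0)
            continue;
        (void) write(fd, buf, sizeof(buf));
        close(fd);
    }
    return NULL;
}

int
main(void)
{
    pthread_t tid[NTHREADS];
    long i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, hammer, (void *) i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}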

Perhaps the time() call issue isn't that important, but it definitely
increases the latency of each RPC, and that's why I brought it up as a
potential place for improvement.
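
Just to sketch the kind of thing I mean (this is purely illustrative,
not what my patches do): one cheap way to take time() off the packet
path is a clock thread that refreshes a cached timestamp once a
second, so hot paths read a variable instead of making a syscall per
packet.  Obviously that's only usable where one-second granularity is
good enough.

#include <pthread.h>
#include <time.h>
#include <unistd.h>

static volatile time_t cached_now;

/* refresh the cached timestamp once a second */
static void *
clock_thread(void *arg)
{
    (void) arg;
    for (;;) {
        cached_now = time(NULL);
        sleep(1);
    }
    return NULL;
}

static void
start_clock_thread(void)
{
    pthread_t tid;

    cached_now = time(NULL);
    pthread_create(&tid, NULL, clock_thread, NULL);
}

/* hot-path replacement for time(NULL) where coarse time is acceptable */
static time_t
coarse_time(void)
{
    return cached_now;
}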

-- 
Tom Keiser
tkeiser@gmail.com