[OpenAFS] Re: Debugging a network performance problem that affects AFS

Fri, 14 Jan 2011 08:44:58 -0500

On 01/13/2011 04:42 PM, Andrew Deason wrote:
> On Thu, 13 Jan 2011 15:24:00 -0500
> Dale Pontius <pontius@btv.ibm.com> wrote:
>
>> I'm wondering if it's possible to collect access time statistics out
>> of an OpenAFS Linux client.
> "access time" is a bit vague to me; you just want to see how quickly it
> is getting a response from the fileserver? There are numerous steps
> involved in fetching data, and the cause of bad performance could be in
> many places.
I guess I'm thinking "round-trip time" from data request to data
response.  I installed and fired up wireshark a while back, with the
thought of tying together request and response packets to measure
response time.  But this is far from my normal job; I was just starting
to play with wireshark when normal-job demands pushed it to the back burner.
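
For illustration, the request/response pairing I had in mind might look
something like the sketch below.  It assumes the packets have already been
exported from a capture into (time, call id, is-reply) records; the Packet
layout, the field names, and the sample values are all hypothetical, not
anything wireshark or OpenAFS produces directly.

```python
from collections import namedtuple

# Hypothetical record for one captured packet: capture time in seconds,
# an identifier for the call it belongs to, and whether it is a reply.
Packet = namedtuple("Packet", "time call_id is_reply")


def response_times(packets):
    """Map each call id to the delay between its first request and first reply."""
    first_request = {}
    elapsed = {}
    for pkt in sorted(packets, key=lambda p: p.time):
        if not pkt.is_reply:
            # Remember only the earliest request seen for this call.
            first_request.setdefault(pkt.call_id, pkt.time)
        elif pkt.call_id in first_request and pkt.call_id not in elapsed:
            # First reply to a known request: record the elapsed time.
            elapsed[pkt.call_id] = pkt.time - first_request[pkt.call_id]
    return elapsed


# Made-up sample data, for illustration only.
sample = [
    Packet(0.0, 1, False),
    Packet(0.5, 2, False),
    Packet(0.25, 1, True),
    Packet(1.0, 2, True),
]
print(response_times(sample))
```

Replies with no matching request are ignored, and retransmitted requests
do not reset the clock, so the figure is the time the caller actually waited.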
>> A little time with google and I see the "-enable_peer_stats" and
>> "-enable_process_stats" options when starting the client daemon, and
>> this very well may furnish the information that I need.
> You don't need to start the client with those options; see the 'fs
> rxstatpeer' and 'fs rxstatproc' commands to turn the stats on and off.
>
> However, your bigger problem is retrieving the statistics. I don't think
> we offer much in the tree that's very useful; you can try
> src/libadmin/samples/rxstat_get_peer and rxstat_get_process, but I don't
> expect them to be very robust. Of course, I'm not sure if there are
> other tools to retrieve the data floating around somewhere (or in
> IBM...).
Perhaps there are tools, but I don't have them.  In fact, the "standard"
deployments don't even have some of the standard OpenAFS tools like
afsmonitor and its underlying programs.  My system is multiboot, with
one of my options being Gentoo, and its OpenAFS install is more
complete.  I have played with afsmonitor a little, rapidly getting
swamped in information.  At the time I was hoping to tune my cache
parameters, and again normal-job demands pushed that to the back burner,
too.
>> A subsequent search gets me to the "rxdebug" document, though that
>> document appears to be server-centric as opposed to querying the
>> client.  Nor does it tell me what information I can collect or if
>> access time is part of that information - only mentioning several
>> parameters that it does collect.
> rxdebug is useful for clients and servers. The 'rxdebug -rxstats'
> statistics and other information are useful for debugging performance
> problems, but won't tell you much about time taken to process RPCs. It's
> more useful for just indicating if there's a problem with packets
> getting lost or if there's some other problems interfering with packets
> and such.
>
> If you just want the RTT to the various fileservers, 'rxdebug -peers'
> can tell you that. The RTT calculated by Rx isn't always accurate
> (depending on the version in use and other factors), but it will tell
> you what Rx thinks the RTT is.
>
> Oh, and also, 'rxdebug' can be used as a simple test of fileserver
> overloaded-ness. If you just run 'rxdebug <fileserver>', you'll see a
> couple of lines that say
>
> X calls waiting for a thread
>
> and
>
> Y calls have waited for a thread
>
> Which is how many calls are currently not being serviced due to a lack
> of available threads, and a running count of how many calls have waited,
> respectively. You normally want them to be 0; the higher they are, the
> slower the fileserver is going to be.
I'll have to give this a try.  I know that "thread waiting" is one of 
the things that they have looked at and occasionally found, but is not 
all of the problem that we see.
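
A minimal sketch of pulling those two counters out of saved rxdebug
output, so they can be logged over time.  It assumes the lines read
exactly as quoted above; the function name and the sample text are made
up for illustration, and real rxdebug output contains more than these
two lines.

```python
import re

# Patterns for the two lines quoted above; the exact wording is assumed
# to match what rxdebug prints.
WAITING = re.compile(r"(\d+) calls waiting for a thread")
HAVE_WAITED = re.compile(r"(\d+) calls have waited for a thread")


def thread_wait_counts(rxdebug_output):
    """Return (currently waiting, total that have waited); None if a line is absent."""
    waiting = WAITING.search(rxdebug_output)
    waited = HAVE_WAITED.search(rxdebug_output)
    return (
        int(waiting.group(1)) if waiting else None,
        int(waited.group(1)) if waited else None,
    )


# Made-up sample text standing in for captured rxdebug output.
sample_output = "3 calls waiting for a thread\n17 calls have waited for a thread\n"
print(thread_wait_counts(sample_output))
```

Both numbers should normally be 0; a nonzero first value means calls are
queued right now, while the second is a running total.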
>> Can someone toss me a bone here - or a link?
> If you want something quick, you can look at the output of
>
> $ xstat_cm_test <client> -collID 2 -onceonly
>
> Which will give you a bunch of statistics for the client. Many of the
> fields are briefly described here:
> <http://docs.openafs.org/AdminGuide/apc.html#HDRWQ618>.
>
> For RPC timings, for reading data you probably want to be looking at
> FetchStatus, FetchData, and InlineBulkStatus.
I'm currently running Fedora Core 13 on a multiboot machine, and:
[user@hostname~]$ xstat_cm_test hostname -collID 2 -onceonly

Starting up the xstat_cm service, no debugging, one-shot operation

-----------------------------------------------------------
** Data size mismatch in performance collection!** Expecting 1064, got 759
** Version mismatch with Cache Manager
[user@hostname~]$

I'll have to reboot with Gentoo and give this another try.

Dale
-- 

Dale Pontius
Senior Engineer
IBM Corporation
Phone: (802) 769-6850
Tie-Line: 446-6850
email: pontius@us.ibm.com
