[OpenAFS] Openafs 1.4.2 on Debian Etch kernel 2.6.18 slow

Sun, 11 Feb 2007 19:07:27 -0500

I've got a bunch of questions.  Even if you only have time to answer a
few of them, it will help us to narrow down the root cause.

On 2/10/07, Derek Harkness <dharknes@umd.umich.edu> wrote:
> I'm attempting to deploy/update a new AFS fileserver.  The new server is the
> first to be upgraded from Debian sarge, OpenAFS 1.3.xx, kernel 2.4 to Etch,
> 2.6.18, AFS 1.4.2, reiserfs and a new 7 terabyte XRaid.
>
> The upgrade went fine, except file writes to the new system are so slow the
> system is unusable.  On the server iostat shows a transfer rate of ~40KB/s
> and an iowait of 20 during AFS operations.  If I stop the fileserver and

First and foremost, do local volume package operations (e.g. the
salvager, vos backup, fileserver startup/shutdown, etc.) run slowly, or
is it only operations that involve Rx?  What about vos dump foo localhost
on the ailing fileserver?  The fact that iowait is going through the
roof may indicate an I/O subsystem problem, so ruling network/Rx
problems in or out at the top of the decision tree will be useful.

I'm not familiar with the Linux iostat utility, but if it supports
per-disk stats similar to the -x option on Solaris, or the -D
option on AIX, then please post some data captured while the problem
is occurring.
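
If iostat turns out not to offer that, something along these lines
would give the per-device numbers I'm after.  This is only a rough
sketch: it assumes the 2.6-style /proc/diskstats layout (device name
in the third column, 512-byte sector counts, milliseconds spent doing
I/O), and the function names are my own invention.

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats(text):
    """Map device name -> (sectors_read, sectors_written, ms_doing_io)."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:  # skip short lines (e.g. partitions on older kernels)
            continue
        stats[f[2]] = (int(f[5]), int(f[9]), int(f[12]))
    return stats


def rates(before, after, interval_s):
    """Per-device (read KB/s, write KB/s, % busy) between two samples."""
    out = {}
    for dev, (r1, w1, b1) in after.items():
        if dev not in before:
            continue
        r0, w0, b0 = before[dev]
        out[dev] = (
            (r1 - r0) * SECTOR_BYTES / 1024.0 / interval_s,
            (w1 - w0) * SECTOR_BYTES / 1024.0 / interval_s,
            (b1 - b0) / (interval_s * 1000.0) * 100.0,
        )
    return out


def sample(path="/proc/diskstats", interval_s=5.0):
    """Take two snapshots interval_s apart and return the rates."""
    with open(path) as fh:
        before = parse_diskstats(fh.read())
    time.sleep(interval_s)
    with open(path) as fh:
        after = parse_diskstats(fh.read())
    return rates(before, after, interval_s)
```

Calling sample() while a slow AFS write is in progress, and again
while idle, should show whether the XRaid LUN is actually busy or
mostly waiting.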

> perform io directly on the XRaid I can read and write between
> 100MB/s-500MB/s.
>

A single fibre channel port (excepting 10Gb E-ports) can't transmit
500MB/s.  From what I've heard, Apple's FC RAID products only provide
a single 2Gb SFP per controller, and don't support FC multipathing.
So, you're limited to a theoretical maximum of ~203MB/s (less in AL
mode).  Thus, I'm guessing your tests are, at least in some cases,
only stressing the page cache, rather than anything across the fabric
(for that matter, is there a fabric?).  In order to declare the storage
subsystem OK, we need to be sure you've tested every layer of the
storage stack.
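
For reference, here is roughly where that ceiling comes from.  The
sketch assumes the standard 2Gb FC line rate of 2.125 Gbaud with
8b/10b encoding, and ignores frame headers and inter-frame gaps, so
real payload throughput will be somewhat lower still.

```python
# Upper bound on 2Gb Fibre Channel throughput in one direction.
LINE_RATE_BAUD = 2.125e9           # 2GFC signalling rate
ENCODING_EFFICIENCY = 8.0 / 10.0   # 8b/10b: 8 data bits per 10 line bits


def fc_max_bytes_per_sec(baud=LINE_RATE_BAUD, efficiency=ENCODING_EFFICIENCY):
    """Data bytes per second before any framing overhead."""
    return baud * efficiency / 8.0


bytes_per_sec = fc_max_bytes_per_sec()
print("%.1f MB/s (10^6 bytes)" % (bytes_per_sec / 1e6))
print("%.1f MiB/s (2^20 bytes)" % (bytes_per_sec / 2 ** 20))
```

The second figure is the ~203 quoted above; either way, a single port
cannot deliver anything close to 500MB/s.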

Please tell us specifically what you did to verify "direct" I/O.  For example:

* Were you running some well-known benchmark suite?  If so, what
options did you pass?
* Did it involve one file or many?
* Were any fsync()s issued?
* Did it modify any filesystem metadata, or only file data?
* Was it single threaded or multi-threaded?
* How much data was read/written?
* How big were the files involved?
* Did you do anything to mitigate/bypass caching?
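
To illustrate what I mean by taking the cache out of the picture,
here is a minimal sketch of a sequential write test that does not
stop the clock until fsync() returns.  The function name, block size
and path are placeholders of mine, not a standard benchmark.

```python
import os
import time


def timed_write(path, total_bytes, block_bytes=1024 * 1024, sync=True):
    """Write total_bytes sequentially to path; return (bytes, seconds).

    With sync=True the timing includes fsync(), so data still sitting
    in the page cache is not counted as written.
    """
    block = b"\0" * block_bytes
    written = 0
    start = time.time()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        while written < total_bytes:
            chunk = block[:min(block_bytes, total_bytes - written)]
            written += os.write(fd, chunk)
        if sync:
            os.fsync(fd)  # force dirty pages out to the device
    finally:
        os.close(fd)
    return written, time.time() - start
```

Run it against a file on the XRaid partition with a total size well
above the 4 Gigs of RAM in the box, once with sync=True and once with
sync=False.  A large gap between the two runs would suggest the
earlier numbers were mostly measuring memory.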

Other questions that might be useful:

* How deep are the tagged command queues for the Xserve LUN(s)?
* Do all the disks pass surface scans?
* Are the disks and/or controllers reporting SMART events?
* If this stuff is fabric attached, have you looked at port error
counts, port performance data, etc?

> Does anyone have any suggestions on how I might trouble shoot this problem?
> So far I've checked network performance, io performance directly to the
> XRaid, and the reiserfs filesystem.  It all seems to be pointing me back to

How have you verified that "network performance" is OK?  What are the
Ethernet port error counts like?  What are the packet retransmit rates
like?
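
On the host side, the port error counts can be pulled out of
/proc/net/dev.  The sketch below assumes the usual Linux layout (two
header lines, then "iface: <8 receive counters> <8 transmit
counters>"); the helper names are mine.  It says nothing about switch
port counters or Rx-level retransmits, which have to come from
elsewhere.

```python
def parse_net_dev(text):
    """Map interface name -> dict of error-related counters."""
    result = {}
    for line in text.splitlines()[2:]:      # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)     # handles "eth0:123" and "eth0: 123"
        f = [int(x) for x in data.split()]
        if len(f) < 16:
            continue
        result[name.strip()] = {
            "rx_packets": f[1], "rx_errs": f[2], "rx_drop": f[3],
            "tx_packets": f[9], "tx_errs": f[10], "tx_drop": f[11],
            "collisions": f[13],
        }
    return result


def interfaces_with_errors(stats):
    """Names of interfaces reporting any errors, drops or collisions."""
    keys = ("rx_errs", "rx_drop", "tx_errs", "tx_drop", "collisions")
    return sorted(n for n, c in stats.items() if any(c[k] for k in keys))


def read_net_dev(path="/proc/net/dev"):
    """Parse the live counters from the running system."""
    with open(path) as fh:
        return parse_net_dev(fh.read())
```

Comparing the output of read_net_dev() before and after a slow
transfer is more telling than the absolute values, since the counters
accumulate from boot.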

I don't know much of anything about Apple's storage line, but if they
have any sort of performance analysis and/or problem determination
tools, what do they say?

> the problem being the AFS fileserver.
>
> Hardware:
> HP DL380
> 2x2.8ghz Hyperthreaded Xeon CPU
> 4 Gigs of RAM
> Gigabit ethernet
> MPTFusion fiber channel card
> Apple XRaid
>
> I've got 2 other identical boxes currently running AFS and working fine.  The only
> difference is the other boxes are running an old OS.
>

Are the machines running the older kernel still running 1.3.x?

Until we can better understand your testing methodology, I'd have to
say this could be a hardware problem, a kernel driver problem, an AFS
problem, or even a network problem.  We need more information to
narrow it down.

Regards,

-- 
Tom Keiser
tkeiser@gmail.com