[OpenAFS] Windows cache problem revisited...

Stephen Joyce stephen@physics.unc.edu
Sat, 13 Mar 2004 14:56:22 -0500 (EST)


Hi all,

Back in December '03 Rodney Dyer at UNC-C reported a problem with the OpenAFS
Windows client.  I wanted to follow up and confirm this rather annoying
problem.  For those who might have forgotten: the AFS cache operates
normally while it is filling, but once the cache is full, additional file
operations leave Windows file handles open.

This problem can be reproduced consistently (details below).  Once this
handle problem has occurred on a client, AFS performance drops considerably
as the handle count increases.  File open and copy operations take 10-20x
as long as normal.  Leaving the client idle does not mitigate the problem--the
handles never close--so once the problem occurs, the client does not recover
until the AFS service is restarted or the client is rebooted.

...to reproduce this problem, simply install OpenAFS for Windows with the
default 20MB cache.  Open the Windows task manager, view the process list,
and add a "Handles" column (View->Select Columns->Handle Count).  The
handle count for afsd_service.exe should be reasonable (way under 1000)...
Now, copy files from AFS to the local hard drive until the AFS cache is
full (or close) as revealed by "fs getcacheparms".  Along the way, while
the cache is filling, the number of handles in use by afsd_service will
increase a bit, but not abnormally.
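
For anyone who would rather log the numbers than watch Task Manager, here is
a minimal Python sketch of the same observation.  It is not part of OpenAFS.
It assumes that PowerShell is available on the client, that the service
process is visible to Get-Process as "afsd_service", and that
"fs getcacheparms" replies in the form "AFS using N of the cache's available
M 1K byte blocks."  Treat all three as assumptions and adjust the parsing if
your client prints something different.  The function names are made up for
this example.

```python
import re
import subprocess
import time

# ASSUMPTION: "fs getcacheparms" prints something like
#   AFS using 18432 of the cache's available 20000 1K byte blocks.
CACHE_RE = re.compile(r"using (\d+) of the cache's available (\d+)")


def parse_cache_usage(text):
    """Return (used_blocks, total_blocks) parsed from getcacheparms output."""
    match = CACHE_RE.search(text)
    if match is None:
        raise ValueError("unrecognized getcacheparms output: %r" % text)
    return int(match.group(1)), int(match.group(2))


def read_cache_usage():
    """Run 'fs getcacheparms' and parse its reply."""
    result = subprocess.run(["fs", "getcacheparms"],
                            capture_output=True, text=True, check=True)
    return parse_cache_usage(result.stdout)


def read_handle_count(process_name="afsd_service"):
    """Ask PowerShell for the process's handle count.

    ASSUMPTION: PowerShell is installed and exactly one process matches.
    """
    command = "(Get-Process -Name %s).HandleCount" % process_name
    result = subprocess.run(["powershell", "-NoProfile", "-Command", command],
                            capture_output=True, text=True, check=True)
    return int(result.stdout.strip())


def log_samples(samples, interval, read_handles=read_handle_count,
                read_cache=read_cache_usage, sleep=time.sleep):
    """Record (used_blocks, total_blocks, handles) `samples` times."""
    rows = []
    for i in range(samples):
        used, total = read_cache()
        handles = read_handles()
        rows.append((used, total, handles))
        print("cache %d/%d blocks, handles %d" % (used, total, handles))
        if i + 1 < samples:
            sleep(interval)
    return rows
```

Running something like log_samples(60, 10) in one window while copying files
in another should show the cache filling first and the handle count climbing
only afterwards, if the behavior matches what is described here.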

Once the cache is full, however, the handle count for afsd_service will
increase for each file copied from AFS to the local disk... note that
copying a single 2000MB file will still give reasonable performance (well,
reasonable for AFS) and the handle-count increases by only 1-3.  Copying
2000 1MB files (or 2000 1K files!) will increase the handle-count
substantially.  It's the actual file open op that increases the
handle-count, but the handle-count does not decrease when the file is
closed.  (I use the source trees of gcc and other large software packages
to test this, but any directory trees with thousands of individual files
should suffice.)
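
If no large source tree is handy, the many-small-files workload can be
approximated with a short Python sketch like the one below.  It is only an
illustration under assumptions: the helper names are invented, and the
source tree has to live in AFS (and the destination on local disk) for the
test to mean anything.  make_small_files just builds a tree of tiny files;
copy_tree_timed opens and copies each file individually, which is the
operation that appears to drive the handle count.

```python
import os
import shutil
import time


def make_small_files(root, count, size=1024, per_dir=100):
    """Create `count` files of `size` bytes under `root`, `per_dir` per subdir."""
    payload = b"x" * size
    for i in range(count):
        subdir = os.path.join(root, "d%04d" % (i // per_dir))
        os.makedirs(subdir, exist_ok=True)
        with open(os.path.join(subdir, "f%06d.dat" % i), "wb") as handle:
            handle.write(payload)


def copy_tree_timed(src, dst):
    """Copy every file under `src` to `dst` one at a time.

    Returns (files_copied, elapsed_seconds) so runs before and after the
    cache fills can be compared.
    """
    copied = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(src):
        rel = os.path.relpath(dirpath, src)
        target_dir = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            # Each copyfile call is one open/read/close cycle on the source.
            shutil.copyfile(os.path.join(dirpath, name),
                            os.path.join(target_dir, name))
            copied += 1
    return copied, time.perf_counter() - start
```

Timing the same copy once on a freshly restarted client and again after the
cache has filled gives a rough measure of the slowdown per file.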

I have browsed the OpenAFS Windows code and identified a couple of places
in the code which look like they could contribute to this problem, but
before I spend more hours diagnosing this (and rebuilding the Windows
client--I'm not a Windows programmer!), I wanted to ask to make sure that
no one else had already identified and fixed this problem.

If anyone believes they have fixed this problem, or has additional insight
which would make fixing the problem easier, please either contact me, or
send it to the list (depending on whether it's of general interest or not).
If anyone is an expert with the OpenAFS Windows client code and is willing
to help out, a response would be quite appreciated.


*** extra info:
I've replicated this problem on OpenAFS 1.2.6 through 1.2.10 and also on Transarc
AFS 3.6-something (I can look this up if important, but suffice it to say
it's reproducible on Transarc AFS).

Adjusting the size and params of the AFS cache on Windows can affect
performance and delay the onset of this issue (by delaying the cache being
filled) with the tradeoff of making performance suck more right from the
get-go.

Once afsd_service starts spiraling out of control (with thousands of
file handles and many MB of mem usage) neither number ever decreases, even
when the client is left alone... I've left clients alone over a weekend; the
handle count is the same on Monday that it was on Friday...

I have NOT tested the OpenAFS 1.3.x client.

...I know I could "just use linux/macos/other Windows-alternative",
but for now, at least, Windows is the only viable option for some of my
users, so please no flame-bait responses.

Thanks for reading.

Cheers,
Stephen
--
Stephen Joyce
Systems Administrator                                            P A N I C
Physics & Astronomy Department                         Physics & Astronomy
University of North Carolina at Chapel Hill         Network Infrastructure
voice: (919) 962-7214                                        and Computing
fax: (919) 962-0480

When solving a system "panic", you must first ask yourself what you
were doing that could possibly frighten an operating system.