[OpenAFS] Windows cache rehashed...
Rodney M Dyer
rmdyer@uncc.edu
Tue, 16 Dec 2003 02:02:02 -0500
Everyone,
I know we've discussed this already but I'd like to bring some new
information to the table that I believe shows a definite bug instead of
simply a "that's the way it works for now" issue.
(Jeffrey, thanks for adding an incident report for me. I'm just adding new
information here.)
At the end of our previous conversation in the thread...
"Windows cache problem revisited..."
https://lists.openafs.org/pipermail/openafs-info/2003-December/011370.html
...I had resigned myself to using a small AFS cache of 48 Meg instead of
256 Meg. This seemed to solve some of the issues that I was having. But
now, after further testing I find that I was wrong.
Here is our situation. In our IT group, we actually run applications from
AFS. We have some applications installed locally, but others
(many others) are installed and run from AFS. Some of these applications
are quite large and on execution require multiple megabytes of download
(from AFS) to run. We have one application in particular that is causing
us grief because of its size. The application is called "ProEngineer", or
ProE. Over the course of the last year we've had reports from the
professors teaching the classes that it can take as long as 10 to 15
minutes to start up. We found it odd, but assumed it was because of network
loading and the effect of everyone trying to run ProE at the same time.
To try and eliminate the problem we've thoroughly replicated the
application to multiple servers in the same building, and set our AFS
preferences so that loading would be minimized. This hasn't had any
effect. Well, the end of the semester has arrived and we need to fix this
problem because the professor is now saying we should install ProE
locally. We don't like running applications locally if they are runnable
from the net.
Since Friday, three members of our IT staff (including me) have been
testing various scenarios of starting times for the ProE application. We've
tested large AFS cache sizes and small.
Here is what we've found...
* We tested on new Dell OptiPlex GX 270 P4 3.0 GHz machines with 1 Gig RAM
and 100 MBit connections to our file servers.
* We set the AFS cache to 256 Meg and 48 Meg.
* With a fresh restart of AFS and an empty cache, fs getcacheparms returns...
AFS using 100 of the cache's available 256000 1K byte blocks.
* We started ProE. The load time on average was 30 seconds. This is
on par with the load time on our Sun Solaris 9 Blade 150s. The cache setting
had little to no effect (as expected on first load).
* The resultant fs getcacheparms after ProE is loaded is (for 256 Meg
cache)...
AFS using 57685 of the cache's available 256000 1K byte blocks.
* Starting ProE again resulted in a load time of 10-15
seconds...excellent, cache works.
* Even with a 48 Meg cache, the load time was a decent average of 25 to 30
seconds.
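
For anyone who wants to track the cache numbers above over the course of a day, here is a minimal sketch in Python that pulls the used and total block counts out of a line in the format quoted above. The function names are my own invention for illustration, and the pattern assumes the output looks exactly like the two lines shown; it is not taken from the AFS code.

```python
import re

# Matches the line format quoted above from "fs getcacheparms".
# Assumption: the output always has exactly this wording.
_PATTERN = re.compile(
    r"AFS using (\d+) of the cache's available (\d+) 1K byte blocks"
)


def parse_cache_parms(line):
    """Return (used_blocks, total_blocks) parsed from one output line."""
    match = _PATTERN.search(line)
    if match is None:
        raise ValueError("unrecognized output: %r" % line)
    return int(match.group(1)), int(match.group(2))


def percent_used(line):
    """Return cache utilization as a percentage of total blocks."""
    used, total = parse_cache_parms(line)
    return 100.0 * used / total


# The line reported after the first ProE load with the 256 Meg cache.
after_load = "AFS using 57685 of the cache's available 256000 1K byte blocks."
used, total = parse_cache_parms(after_load)
print("%d of %d blocks (%.1f%% full)" % (used, total, percent_used(after_load)))
```

By this arithmetic a single ProE load fills a little under a quarter of the 256 Meg cache, and 57685 blocks is already more than a 48 Meg cache can hold.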
Now this is not what we observed when we walked into our labs and started
our testing. When we first sat down at the machines and ran ProE cold it
loaded in about 2 to 5 minutes. So we thought something must be happening
to the AFS client during the day that would cause it to go into "slow
mode". We always restart our AFS service at 4:00am and delete the cache
(via a task scheduled script), so we are assured of a fresh cache in the
morning. The problem is, we have various students logging on to the lab
machines during the day, which is causing some anomaly.
So we immediately thought to check and see what would happen if we
overflowed the cache. What I mean here is simply to set the cache to some
size, then load many files from AFS, enough to cause the cache to be fully
utilized, more than the cache size value. When we did this our load time
for ProE suddenly went down the drain. Instead of loading quickly, or even
at the average of 30 seconds, it was starting to take upwards of a minute. This
was even after we had stopped ProE and restarted it again. It is almost as
if there is a "leak" somewhere in the service that is causing the service
to slow to a crawl, using up all the CPU.
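
If anyone wants to reproduce the overflow step, it can be roughly sketched in Python as below. This is only an illustration of the idea (read files until more bytes have been pulled than the cache can hold), not the exact procedure we followed. The function name and the example path in the comment are hypothetical.

```python
import os


def overflow_cache(root, cache_bytes, chunk=64 * 1024):
    """Read files under root until more than cache_bytes have been read.

    Returns the number of bytes actually read, which may be less than
    cache_bytes if the tree under root is too small.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    while True:
                        data = handle.read(chunk)
                        if not data:
                            break
                        total += len(data)
            except OSError:
                continue  # skip files that cannot be opened or read
            if total > cache_bytes:
                return total
    return total


# Example (hypothetical path): a 256 Meg cache is 256000 blocks of 1K each.
# overflow_cache(r"\\afs\some.cell\apps", 256000 * 1024)
```

After a run like this, fs getcacheparms should show the cache at or near its full size, and that is the state in which we then timed ProE again.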
At this point I don't believe it has anything to do with the number of
handles or the amount of RAM in the machine. The problem appears to be
totally within the AFS service itself.
As I had previously stated in a recent thread, I also see the problem when
copying very large single files, files greater than 256 Meg at a time to
and from AFS.
I'm not sure if this is a cache problem, or a problem somewhere else in the
AFS code, but it sure seems to be losing track of some important buffered
information somewhere.
If anyone needs any more data I'll be happy to provide.
Help is appreciated,
Thanks again,
Rodney
Rodney M. Dyer
Windows Systems Programmer
Mosaic Computing Group
William States Lee College of Engineering
University of North Carolina at Charlotte
Email: rmdyer@uncc.edu
Web: http://www.coe.uncc.edu/~rmdyer
Phone (704)687-3518
Help Desk Line (704)687-3150
FAX (704)687-2352
Office 267 Smith Building