[OpenAFS] Windows cache rehashed...

Tue, 16 Dec 2003 02:02:02 -0500

Everyone,

I know we've discussed this already but I'd like to bring some new 
information to the table that I believe shows a definite bug instead of 
simply a "that's the way it works for now" issue.

(Jeffrey, thanks for adding an incident report for me.  I'm just adding new 
information here.)

At the end of our previous conversation in the thread...

"Windows cache problem revisited..."
https://lists.openafs.org/pipermail/openafs-info/2003-December/011370.html

...I had resigned myself to using a small AFS cache of 48 Meg instead of 
256 Meg.  This seemed to solve some of the issues that I was having.  But 
now, after further testing I find that I was wrong.

Here is our situation.  In our IT group, we actually use AFS to run 
applications from.  We have some applications installed locally, but others 
(many others) are installed and run from AFS.  Some of these applications 
are quite large and on execution require multiple megabytes of download 
(from AFS) to run.  We have one application in particular that is causing 
us grief because of its size.  The application is called "ProEngineer", or 
ProE.  Over the course of the last year we've had reports from the 
professors teaching the classes that it can take as long as 10 to 15 
minutes to start up.  We found it odd, but assumed it was because of network 
loading and the effect of everyone trying to run ProE at the same time.

To try and eliminate the problem we've thoroughly replicated the 
application to multiple servers in the same building, and set our AFS 
preferences so that loading would be minimized.  This hasn't had any 
effect.  Well, the end of the semester has arrived and we need to fix this 
problem because the professor is now saying we should install ProE 
locally.  We prefer not to install applications locally when they can be 
run from the net.

Since Friday, three members of our IT staff (including me) have been 
testing start-up times for the ProE application under various scenarios.  We've 
tested large AFS cache sizes and small.

Here is what we've found...

*  We tested on new Dell OptiPlex GX 270 P4 3.0 GHz machines with 1 Gig RAM 
and 100 MBit connections to our file servers.

*  We set the AFS cache to 256 Meg and 48 Meg.

*  With a fresh restart of AFS and an empty cache, fs getcacheparms returns...

      AFS using 100 of the cache's available 256000 1K byte blocks.

*  We started ProE.  The load time on average was 30 seconds.  This is 
on par with the load time on our Sun Solaris 9 Blade 150s.  The cache 
setting had little to no effect (as expected on first load).

*  The resultant fs getcacheparms after ProE is loaded is (for 256 Meg 
cache)...

      AFS using 57685 of the cache's available 256000 1K byte blocks.

*  Starting ProE again resulted in a load time of 10-15 
seconds...excellent, cache works.

*  Even with a 48 Meg cache, the load time was a decent average of 25 to 30 
seconds.
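
For anyone who wants to log these numbers over the course of a day, here is 
a minimal Python sketch that pulls the used and total block counts out of a 
line in the format shown above.  The function names are my own, and the 
pattern only assumes the exact wording printed above; other client versions 
may word the line differently.

```python
import re

# Matches the line format quoted above from fs getcacheparms.
_CACHE_LINE = re.compile(
    r"AFS using (\d+) of the cache's available (\d+) 1K byte blocks"
)


def parse_cacheparms(text):
    """Return (used_blocks, total_blocks) parsed from getcacheparms output."""
    match = _CACHE_LINE.search(text)
    if match is None:
        raise ValueError("unrecognized getcacheparms output")
    return int(match.group(1)), int(match.group(2))


def utilization(text):
    """Return the fraction of the cache in use, between 0.0 and 1.0."""
    used, total = parse_cacheparms(text)
    return used / total


if __name__ == "__main__":
    sample = "AFS using 57685 of the cache's available 256000 1K byte blocks."
    used, total = parse_cacheparms(sample)
    print(f"{used} of {total} blocks in use ({utilization(sample):.1%})")
```

With the figures above, 57685 of 256000 blocks is roughly 22.5% of the 
256 Meg cache after one ProE load.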

Now this is not what we observed when we walked into our labs and started 
our testing.  When we first sat down at the machines and ran ProE cold, it 
loaded in about 2 to 5 minutes.  So we thought something must be happening 
to the AFS client during the day that would cause it to go into "slow 
mode".  We always restart our AFS service at 4:00am and delete the cache 
(via a scheduled task script), so we are assured of a fresh cache in the 
morning.  The problem is that various students log on to the lab machines 
during the day, and that use appears to trigger the anomaly.

So we immediately thought to check and see what would happen if we 
overflowed the cache.  What I mean here is simply to set the cache to some 
size, then load many files from AFS, enough to read more data than the 
configured cache size so the cache is fully utilized.  When we did this our 
load time for ProE suddenly went down the drain.  Instead of loading 
quickly, or even in the average of 30 seconds, it was starting to take 
upwards of a minute.  This was true even after we had stopped ProE and 
restarted it again.  It is almost as 
if there is a "leak" somewhere in the service that is causing the service 
to slow to a crawl, using up all the CPU.
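
To make the overflow step repeatable, something like the following Python 
sketch could be used.  It is only an illustration of the idea, not the 
procedure we actually followed: it walks a directory tree and reads files 
until a given number of bytes has been pulled through the client.  The 
function name, the chunk size, and the command-line arguments are all 
assumptions of mine.

```python
import os
import sys


def read_until(root, limit_bytes, chunk_size=64 * 1024):
    """Read files under root until at least limit_bytes have been read.

    Returns the number of bytes actually read, which is less than
    limit_bytes if the tree does not hold that much data.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    while total < limit_bytes:
                        chunk = handle.read(chunk_size)
                        if not chunk:
                            break  # end of this file
                        total += len(chunk)
            except OSError:
                continue  # skip files we cannot open or read
            if total >= limit_bytes:
                return total
    return total


if __name__ == "__main__":
    # Usage (hypothetical): script.py <directory in AFS> <megabytes to read>
    root_dir = sys.argv[1]
    megabytes = int(sys.argv[2])
    done = read_until(root_dir, megabytes * 1024 * 1024)
    print(f"read {done} bytes from {root_dir}")
```

Pointing it at a large tree in AFS with a byte count above the configured 
cache size (for example, 300 against a 256 Meg cache) should force the 
cache to turn over, after which ProE can be started and timed again.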

At this point I don't believe it has anything to do with the number of 
handles or the amount of RAM in the machine.  The problem appears to be 
totally within the AFS service itself.

As I had previously stated in a recent thread, I also see the problem when 
copying very large single files, files greater than 256 Meg at a time to 
and from AFS.

I'm not sure if this is a cache problem, or a problem somewhere else in the 
AFS code, but it sure seems to be losing track of some important buffered 
information somewhere.

If anyone needs any more data I'll be happy to provide it.

Help is appreciated,

Thanks again,

Rodney

Rodney M. Dyer
Windows Systems Programmer
Mosaic Computing Group
William States Lee College of Engineering
University of North Carolina at Charlotte
Email: rmdyer@uncc.edu
Web: http://www.coe.uncc.edu/~rmdyer
Phone (704)687-3518
Help Desk Line (704)687-3150
FAX (704)687-2352
Office  267 Smith Building