[OpenAFS] Windows cache rehashed...

Rodney M Dyer rmdyer@uncc.edu
Fri, 19 Dec 2003 15:11:31 -0500


Jeffrey,

I suppose I must apologize for sending any incorrect data in my original 
post on 12/1.  At that point in my diagnosis I thought my physical RAM 
was being used up, causing lots of paging.  I was using a machine with 
only 256 Meg of physical RAM.  Based on your statements that followed 
about the AFS cache implementation, I changed to a smaller cache.  You 
would think this would have made the original problem of slow-starting 
apps disappear, but it did not.  The machine certainly is more 
responsive, since there is little or no paging going on, but the AFS 
cache still seems to degrade application startup times.  We confirmed 
this by testing on a machine with a Gig of RAM, where the Windows cache 
didn't even enter into the equation.  So I'm now working on another 
problem, which definitely seems to be a bug in the AFS cache manager.

Ok, I've tried to simplify this as much as possible.  My previous email 
documents the exact method to produce the bug, which takes only about 5 
minutes to reproduce.  You should see the same symptoms at your site, and 
you should be able to watch the handle count rise well above 256 
handles.  You should be able to obtain the same results as I did, more 
easily than I can gather it all into a detailed report for you (see data 
at bottom).  If you are not seeing the same results, just let me know; I 
would be curious why.

Here is the method again, using correct units, with added text for clarity...

To reproduce the problem, use the following settings...

      Windows XP SP1, 1 Gig RAM, P4 3.0 Gig, 100 MBit connectivity
      OpenAFS 1.2.10
      Cache size:  8192K  ( 8 Meg cache )

      Note:  For those who need exacting definitions, this is an 8 Meg 
cache, not 32 Meg, not 256 Meg, not 8 Gig...just a simple 8 Meg 
cache.  Units are checked.  Based on your information, the current Windows 
AFS cache implementation should handle this cache size easily without problems.

      Chunk size:  32K
      Status Entries:  1000
      Background Threads: 6
      Service Threads: 8
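
As a rough sanity check on the settings above, the cache size and chunk 
size together imply how many chunks the cache can hold at once.  The 
sketch below only does that arithmetic.  The function name is made up for 
illustration, and any link between the chunk count and the handle count 
is only a guess, not something the testing here establishes.

```python
def chunks_in_cache(cache_kb: int, chunk_kb: int) -> int:
    """Number of whole chunks that fit in the cache (hypothetical helper)."""
    if cache_kb <= 0 or chunk_kb <= 0:
        raise ValueError("sizes must be positive")
    return cache_kb // chunk_kb


# Settings from this report: 8192K cache, 32K chunks.
print(chunks_in_cache(8192, 32))  # 256
```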

1.  Make a temporary local directory to copy some files to...

      c:\>mkdir "c:\temp\test"

2.  Change into the temporary folder...

      c:\>cd "c:\temp\test"

3.  Make sure you start with a fresh cache...

      c:\>net stop "IBM AFS Client"

      c:\>del "c:\afscache"

           Note:  It may take some time here before the AFS service lets 
go of the cache; keep trying the delete until the file is gone.  (I'm not 
sure why it sometimes takes so long for AFS to shut down.  It's probably 
the same problem that manifests as the handle leak.)

      c:\>net start "IBM AFS Client"
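
The "keep trying the delete" part of step 3 can be scripted.  This is a 
minimal sketch, not part of the original procedure: the helper name, the 
retry count, and the delay are all assumptions, and it relies on Windows 
reporting a still-locked file as a PermissionError.

```python
import os
import time


def delete_when_released(path: str, attempts: int = 30,
                         delay_s: float = 2.0) -> bool:
    """Retry deleting `path`; True once it is gone, False if still locked."""
    for _ in range(attempts):
        try:
            os.remove(path)
            return True
        except FileNotFoundError:
            return True           # already gone, nothing left to do
        except PermissionError:
            time.sleep(delay_s)   # still held by the service; wait and retry
    return False


# Example call, using the cache file path from step 3:
#   delete_when_released(r"c:\afscache")
```

Run it after the "net stop" and before the "net start".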


4.  Now bring up the Task Manager and, using the View->Select Columns 
menu, add the columns (handles, etc.) you want to watch for 
"afsd_service.exe".

5.  Now, in the temporary directory at the command prompt, start a 
recursive copy of a large tree of files out of your cell's AFS space.  It 
doesn't matter what files...any files will do.

      c:\temp\test>xcopy "\\%computername%-afs\all\your-cell\dir1..."  /s /e /f /c

      The "/s /e /f /c" means...all subdirectories, even empty ones, show 
the files as they are being copied, and continue on errors.

      Again, any files will do.  You may need to copy a large number of 
files and/or some big files.  At our site I just started the copy on a very 
large tree and let it go.  For example, the following should work fine...

      c:\temp\test>xcopy "\\%computername%-afs\all\your-cell-name-here\*.*"  /s /e /f /c

      (Make sure you don't have any symbolic links in AFS that might create 
a recursive loop in whatever tree of files you are copying.  The xcopy.exe 
program will follow them if you do.)

      As the copy progresses and the handles start rising, keep 
watching.  After the count of handles rose into the thousands, I just 
pressed CTRL+C, or CTRL+Break.  Depending on your AFS permissions, you may 
need a token to do the copy.  Make sure the total size of the files being 
copied is well above the cache size of 8192K.
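
One way to check beforehand that a tree is bigger than the cache is to 
total up its file sizes.  The sketch below is only an estimate under 
stated assumptions: the helper names are hypothetical, unreadable files 
are silently skipped, and the 8192K figure is simply the cache size used 
in this report.

```python
import os

CACHE_BYTES = 8192 * 1024  # cache size from this report (8192K)


def tree_size_bytes(root: str) -> int:
    """Total size of the files under `root`, skipping unreadable entries."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # no permission or vanished file; fine for an estimate
    return total


def exceeds_cache(root: str, cache_bytes: int = CACHE_BYTES) -> bool:
    """True if the tree under `root` is larger than the cache."""
    return tree_size_bytes(root) > cache_bytes
```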

Now, if you watch the Task Manager's "afsd_service.exe" handle count, it 
will start out ok, but soon rise out of control.  Stopping the copy does 
not bring the handle count back down.
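
For anyone who would rather log the handle count than watch Task Manager, 
here is a minimal sketch of the same observation.  Everything in it is an 
assumption for illustration: the function names, the 256 ceiling (taken 
from the figure mentioned earlier), and the idea that the caller supplies 
a function returning the current handle count.  On Windows, psutil's 
Process.num_handles() is one possible source, but that library is not 
part of this setup.

```python
import time
from typing import Callable, List


def sample_counts(read_count: Callable[[], int], samples: int,
                  interval_s: float) -> List[int]:
    """Call read_count() `samples` times, `interval_s` seconds apart."""
    counts = []
    for i in range(samples):
        counts.append(read_count())
        if i < samples - 1:
            time.sleep(interval_s)
    return counts


def looks_like_leak(counts: List[int], ceiling: int = 256) -> bool:
    """True if the last sample is above `ceiling` and not below the first."""
    if not counts:
        return False
    return counts[-1] > ceiling and counts[-1] >= counts[0]


# read_count is supplied by the caller, for example (assumption, Windows only):
#   proc = psutil.Process(pid_of_afsd_service)
#   counts = sample_counts(proc.num_handles, samples=60, interval_s=5.0)
```

Sampling for a while after the copy has been stopped is the interesting 
part, since that is when the count should come back down and does not.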

Using the above method I was able to easily obtain the following numbers...

      Using the above config of 8 Meg cache with 32K chunks, after about 
987 Meg copied from AFS to the local "c:\temp\test" folder:
      http://www.coe.uncc.edu/~rmdyer/test_8MB_afscache.jpg

      Here's another, same scenario, just using the AFS client defaults for 
cache and chunk...
      http://www.coe.uncc.edu/~rmdyer/test_32MB_afscache.jpg

Is this enough information?  You said, "Please add this data to the 
Request (#2628)."  How do I do this?

Happy Holidays!  Sorry to be such a problem (an ass).

Rodney