[OpenAFS-devel] Patch to implement default tuning proposal
discussed a while ago
Jeffrey Hutzelman
jhutz@cmu.edu
Wed, 17 Aug 2005 11:37:07 -0400
On Tuesday, August 16, 2005 01:20:42 PM -0400 chas williams - CONTRACTOR
<chas@cmf.nrl.navy.mil> wrote:
> In message <20050816154428.C04501BAF3@citi.umich.edu>, Jim Rees writes:
>> I'm inclined to commit the new code and let Niklas or others work on
>> making it better for large caches later. Comments?
>
> How about choosing sqrt(cachesize/1024) as the "average file size"?
>
> cachesize   avg file size (KB)   #files
>
> 150M                 ~12          12500
> 350M                 ~18          19444
> 1G                   ~32          32768
> 10G                  ~98         102040
> 20G                 ~143         146653
>
> I chose sqrt() for no particular reason other than that the numbers seem
> to more closely match the original sizes for smaller caches; for larger
> caches it matches the newly suggested average of 32k, and for huge caches
> it "safely" limits the amount of kernel space required.
Careful here...
The value we really _want_ to use here is the average storage usage of a
chunk. If we estimate that value too high, then we won't create enough
files to make use of all the space in the cache.
In the previous thread, I made some gross assumptions which I expected
would work out pretty well. Basically, there are three kinds of chunks:
(1) Chunks holding an entire file that is smaller than the chunk size
(2) Chunks holding a non-final part of a file larger than the chunk size
(3) Chunks holding the final part of a file larger than the chunk size
The gross assumption I made was that instead of trying to predict the
average chunk size, we could make a guess based on the average file size.
I'm not claiming this is a perfect prediction of cache usage, just that
it's probably a pretty good first-order approximation. Assuming a smaller
file size just means we create more files; so I felt it was entirely
appropriate to use measurements showing average file sizes around 32K as
justification for raising the assumed average chunk size to 32K, from the
10K or whatever it was before.
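In other words, the default calculation under discussion is just the cache
size divided by an assumed average chunk usage. A sketch (the names and
the exact 32K constant here are illustrative, not lifted from afsd.c):

    /* Sketch: derive the default number of cache files from the cache
     * size and an assumed average storage usage per chunk. */
    #include <stdio.h>

    #define ASSUMED_AVG_CHUNK_KB 32   /* assumed average chunk usage, in KB */

    static long default_cache_files(long cachesize_kb)
    {
        /* Overestimating the average leaves too few files to fill the
         * cache; underestimating it just creates extra, mostly idle,
         * cache files. */
        return cachesize_kb / ASSUMED_AVG_CHUNK_KB;
    }

    int main(void)
    {
        printf("1G cache  -> %ld files\n", default_cache_files(1024L * 1024));
        printf("10G cache -> %ld files\n", default_cache_files(10L * 1024 * 1024));
        return 0;
    }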
It is important to note that in order to get maximum performance and
resource utilization out of the cache manager, it must be tuned to your
particular usage patterns. In this way it is similar to all manner of
database and file storage systems, not to mention the operating system
itself. The goal here is to provide defaults that will result in
reasonable performance for the majority of users who lack the skills or
experience to do their own tuning. I'd much rather optimize for those
cases and make people like Chas do their own tuning if the defaults aren't
good enough for them.
I think we should commit with the current logic, and if we discover that
the common case requires a different assumption about average chunk size,
we can easily change the parameters later.
I also think we should consider developing some better tools to analyze
cache usage. This should include the ability to estimate the working set
size in terms of storage, chunks, vnodes, and volumes, and to recommend
increases or decreases in parameters based on actual usage.
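For instance, such a tool might do something along these lines (just a
sketch of the kind of arithmetic meant; nothing here reads a real cache,
and the names are invented):

    /* Sketch of a cache-usage analysis: given observed per-chunk storage
     * usage, estimate the average and suggest a file count for a target
     * cache size.  A real tool would walk the cache directory (and look
     * at vnodes and volumes too) instead of using canned numbers. */
    #include <stdio.h>

    int main(void)
    {
        /* Stand-ins for measured per-chunk storage sizes, in bytes. */
        long chunk_bytes[] = { 4096, 65536, 65536, 12288, 30000, 65536, 800 };
        int nchunks = sizeof(chunk_bytes) / sizeof(chunk_bytes[0]);
        long cachesize_kb = 1024L * 1024;   /* tuning target: a 1G cache */
        long total = 0, avg, suggested_files;
        int i;

        for (i = 0; i < nchunks; i++)
            total += chunk_bytes[i];

        avg = total / nchunks;                  /* observed avg chunk usage */
        suggested_files = (cachesize_kb * 1024) / avg;

        printf("observed average chunk usage: %ld bytes\n", avg);
        printf("suggested file count for a %ldK cache: %ld\n",
               cachesize_kb, suggested_files);
        return 0;
    }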
-- Jeff