[OpenAFS-devel] ZFS cache usage tracking

Andrew Deason adeason@sinenomine.net
Fri, 4 Sep 2009 11:24:52 -0500


(Bcc bugs)

Recently, we've seen that the unix CM's cache tracking figures for a ZFS
cache can be very wrong. I know the tracked cache usage value (the 'fs
getcacheparms' value) was never 100% accurate, but with certain ZFS
configurations, it can be off by multiples of the cacheinfo size
itself. (Note that this is a different problem from the one fixed in
gerrit 338.)

The discrepancy we're hitting is easy to see by simply doing the
following on a ZFS filesystem with default settings:

dd if=/dev/urandom of=somefile bs=1024 count=1024
sleep 10
dd if=/dev/urandom of=somefile bs=1 count=1
sleep 10
stat somefile | grep Size
  Size: 1               Blocks: 261        IO Block: 131072 regular file

So a file that was 1M and was then truncated down to 1 byte still
takes up roughly 130k of disk space (261 512-byte blocks).

Now, when the CM truncates a 1M cache file to something under 130k,
the CM records that file as taking up its length rounded up to the
next kilobyte, which is far smaller than the space the file actually
occupies on disk.
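
To make the gap concrete, here is a minimal userland sketch (not the
actual CM code; the "round up to the next kilobyte" figure just mirrors
the tracking behavior described above) that prints both numbers for a
given file:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Print the size the tracking logic would effectively record for a
 * file (length rounded up to the next 1K) next to the space the file
 * really occupies on disk (st_blocks is in 512-byte units). */
int
main(int argc, char **argv)
{
    struct stat st;
    long long tracked_kb, actual_kb;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    if (stat(argv[1], &st) != 0) {
        perror("stat");
        return 1;
    }

    tracked_kb = ((long long)st.st_size + 1023) / 1024;
    actual_kb = ((long long)st.st_blocks * 512 + 1023) / 1024;

    printf("tracked: %lldK  actual: %lldK\n", tracked_kb, actual_kb);
    return 0;
}

For the 'somefile' above (1 byte, 261 blocks), that works out to 1K
tracked versus about 131K actual.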

In the absolute worst case, I think we could take up 5 times the
cacheinfo size on disk (128k for each cache file, cachesize/32k cache
files by default). While that's unlikely to hit, we have already seen
actual disk usage exceed the configured cache size by a gig or two on
a cache smaller than 4G.

Now, this is with a recordsize of 128k (the default, I believe).
Changing the recordsize to something smaller obviously makes smaller
files take up less space. With recordsize=1k, a 1 byte (or 1k) file
appears to take up only 5k. But this has the downside of larger files
causing more overhead (a 1M file takes up about 1122k).

I'm not sure what to do about this. Does anyone reading this know enough
about ZFS internals to shed some light on this? I've got a few potential
directions to go in, though:

(A) If someone can provide an equation that says "if a file is X bytes
long, and we have a recordsize of Y, then the file will take up at most
Z bytes on disk", we could make a special case in the cache-tracking
logic for ZFS. The recordsize appears to be obtainable via the statvfs
blocksize. (A rough sketch of where such a formula would plug in is
appended after this list.)

(B) If someone knows of an in-kernel way to tell ZFS not to treat a
file this way, we could make the appropriate call on the vnode.

For example, just creating a 1-byte file does not take up 130k; the
problem only appears when you make a large file and truncate it down,
so there may be a way to make those two cases equivalent size-wise.
The brute-force solution would be to unlink files instead of
truncating them in certain cases (also sketched after the list), but
that seems suboptimal.

(C) We could simply use the blocksize instead of the fragsize to
calculate afs_fsfragsize in the special ZFS case (also sketched after
the list). This still seems like it would result in incorrect cache
tracking, but maybe it's good enough?

(D) Force afs_fsfragsize to 128k-1 for ZFS, but that's obviously
horribly inefficient for most cases.

Or the obvious (E), tell people to not use ZFS disk caches.
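
To make (A), (B), and (C) a bit more concrete, here are some rough
sketches. These are userland-style illustrations under the assumptions
noted in the comments, not proposed patches; the function names, the
thresholds, and the bound in the first one are all invented by me.

For (A), where a formula would plug in:

#include <sys/statvfs.h>

/* Hypothetical upper bound on the on-disk size of a ZFS file given
 * its length and the recordsize.  The bound below (whole records,
 * plus one record of slop for the truncate-down behavior seen above)
 * is only a placeholder guess -- it is exactly the part I'd like
 * someone who knows ZFS internals to confirm or correct. */
static unsigned long long
zfs_max_ondisk(unsigned long long length, unsigned long long recordsize)
{
    unsigned long long records = (length + recordsize - 1) / recordsize;

    if (records == 0)
        records = 1;
    return (records + 1) * recordsize;  /* placeholder, not a ZFS fact */
}

/* The recordsize itself, taken from the statvfs blocksize. */
static unsigned long long
zfs_recordsize(const char *cachedir)
{
    struct statvfs sv;

    if (statvfs(cachedir, &sv) != 0)
        return 128 * 1024;      /* assume the 128k default on error */
    return sv.f_bsize;
}

For the brute-force variant of (B), recreating instead of truncating
when a file is being shrunk drastically, so ZFS never sees a "big file
truncated small" (the real CM works on vnodes in the kernel, and the
128k cutoff is invented):

#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

static int
shrink_cache_file(const char *path, off_t newlen)
{
    struct stat st;
    const off_t threshold = 128 * 1024;  /* invented cutoff */

    if (stat(path, &st) != 0)
        return -1;

    if (st.st_size > threshold && newlen < threshold) {
        int fd;

        /* throw the file away and start over with a fresh, small file */
        if (unlink(path) != 0)
            return -1;
        fd = open(path, O_CREAT | O_WRONLY, 0600);
        if (fd < 0)
            return -1;
        if (newlen > 0 && ftruncate(fd, newlen) != 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }

    /* ordinary shrink */
    return truncate(path, newlen);
}

And for (C), deriving afs_fsfragsize from the statvfs blocksize rather
than the fragsize when the cache lives on ZFS (the f_basetype check is
Solaris-specific; the "- 1" follows the existing size-minus-one
convention, as in (D)'s 128k-1):

#include <string.h>
#include <sys/statvfs.h>

static long
pick_fsfragsize(const char *cachedir)
{
    struct statvfs sv;

    if (statvfs(cachedir, &sv) != 0)
        return 1024 - 1;                 /* 1K-1 fallback */

    if (strcmp(sv.f_basetype, "zfs") == 0)
        return (long)sv.f_bsize - 1;     /* blocksize (recordsize) */

    return (long)sv.f_frsize - 1;        /* the usual fragment size */
}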

-- 
Andrew Deason
adeason@sinenomine.net