[OpenAFS] openafs on Fedora 12?

Rainer Toebbicke rtb@pclella.cern.ch
Fri, 11 Dec 2009 10:00:10 +0100

Chas Williams (CONTRACTOR) wrote:
> In message <4B20B344.5010101@pclella.cern.ch>,Rainer Toebbicke writes:
>> Chas Williams (CONTRACTOR) wrote:
>>> i still wonder if the cache manager shouldnt open a single file (sparse
>>> mode) and just seek/read/write.  this would solve a couple of potential
>>> problems with other filesystems as well.
>> There are some issues with the canonical approach of just using one file and 
>> seek to chunkno*chunksize:
>> 1. directories are read in total, regardless of chunk boundaries;
> ah.  i did indeed forget this point.  this is particularly annoying with
> regard to memcache (it causes a realloc of the chunk if the chunk is
> undersized).  for now, we could ensure that chunk sizes are 'sufficiently'
> large.

With the current "dir" package this means a chunk size of 2 MB. Assuming the 
unit of transfer is still "chunksize" and you do not intentionally fill chunks 
only partially, you'd give up a valuable tuning parameter.
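For reference, the canonical single-file layout under discussion addresses chunk n at byte offset n * chunksize. A minimal sketch (the names and the 256 kB chunk size are illustrative, not the actual OpenAFS cache manager code):

```c
#include <sys/types.h>

/* Illustrative chunk size; in practice this is the tuning parameter
 * discussed above (256 kB in our configuration, 2 MB if a chunk must
 * hold an entire directory). */
#define CHUNKSIZE ((off_t)(256 * 1024))

/* Byte offset of chunk 'chunkno' within a single sparse cache file. */
static off_t
chunk_offset(unsigned int chunkno)
{
    return (off_t)chunkno * CHUNKSIZE;
}
```

The cache manager would then lseek() to chunk_offset(n) and read or write up to CHUNKSIZE bytes, which is exactly why directories larger than a chunk break the scheme.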

>> 2. it is, to my knowledge and on a POSIX level, not possible to "free" parts 
>> of a file. Hence, if the number of chunks in the cache exceeds the size of 
>> /usr/vice/cache you run out of space;
> i dont ever wish to free parts of a file.  i just wanted to create the
> file quickly to avoid making the user wait while a 1GB is written.
> oversubscribing /usr/vice/cache is somewhat like asking the doctor why
> it hurts when you hit yourself with a hammer.

We typically create a 10 GiB AFS cache with ~100000 cache files, but a 
chunksize of 256 kB. What's wrong with that? The cache occupancy is measured 
in kiB anyway and the cache manager figures out which files to recycle. As 
bigger chunks have an increased probability of being only partially filled 
(because, after all, we also have "small" files), this all works out without 
the user seeing any adverse effect. With your 2 MB chunk size suggested above 
such a cache would have to be... 200 GB.
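The arithmetic behind that figure: in a single preallocated file every chunk slot costs a full chunk on disk, so the worst-case footprint is number-of-chunks times chunksize. A toy sketch (helper name is illustrative):

```c
/* Worst-case on-disk footprint when every one of 'nchunks' slots
 * occupies a full chunk, as in a single preallocated cache file. */
static unsigned long long
worst_case_bytes(unsigned long nchunks, unsigned long chunksize)
{
    return (unsigned long long)nchunks * chunksize;
}
```

With 100000 slots at 2 MB each that is roughly 200 GB, versus the 10 GiB the same cache occupies today, precisely because individual 256 kB cache files are mostly partially filled.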

BTW: on decent machines an individual 1 GiB write does not make the user wait: 
on write the data is first copied into the AFS file's mapping, later into 
the cache file's mapping (the former step can be avoided by writing into the 
chunk files directly).  On reads the reader is woken up on every RX packet, 
ensuring streaming to the user. Here again, the double copy can be avoided.

>> 3. unless done carefully, if you re-write parts of a file the system may end 
>> up reading it in first (partial blocks).
>> With individual cache files and well-placed truncate() calls these issues go 
>> away.
> i am not convinced that the well placed truncate calls have any meaning.
> the filesystems in question tend to just do what they want.

They do! They free the blocks used up by the cache file, just in case the 
chunk you're writing is smaller. They also make sure that while re-writing 
non-block/page-aligned parts, data does not have to be read in just to be 
thrown away on the next write.

So if you want to put the cache into one big file you'll at least have to 
think about space allocation and fragmentation. You'd also better ensure 
page-aligned writes.

Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985       Fax: +41 22 767 7155