[OpenAFS] sanity check please.

Jeffrey Hutzelman jhutz@cmu.edu
Mon, 05 Sep 2005 15:03:12 -0400


On Monday, September 05, 2005 08:24:37 -0700 Pucky Loucks
<ploucks@h2st.com> wrote:

> Thanks so much for your response, Lars. I've commented below.
>
> On 4-Sep-05, at 4:01 AM, Lars Schimmer wrote:
>>
>> | 1) Is this going to become a huge management issue?
>>
>> Depends.
>> If you use some nice scripts, it can be managed easily.
>> In the end it's mostly just "vos create volume", "fs mkmount path
>> volume", "fs setacl path rights" and "backup volume", plus keeping the
>> big overview ;-) As long as you manage your cell well enough, it's
>> easy. Please don't create one directory with all volumes mounted in it.
>>
> Am I correct that a volume isn't replicated until a "vos release" and
> "fs checkvolumes"? I.e., as I write new images I'll need to decide when
> I should run those commands?  I might just end up replicating at the
> end of the day, or every half day.

RO replicas are updated only when someone (or something) runs 'vos release'
on that volume.  Some sites have scripts that run every night to release
those volumes that have changed (where the RW is newer than the RO); others
release volumes only manually, and take advantage of this to release only
coherent sets of changes.
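
A minimal sketch of such a nightly script, assuming you keep a list of
replicated volumes in a file and that 'vos examine -format' output includes
an updateDate field (the list path and volume names are illustrative, not a
standard tool):

    #!/bin/sh
    # Hypothetical nightly release: release each listed volume whose RW
    # copy has been updated more recently than its RO clone.
    # Run with AFS admin tokens.
    while read vol; do
        rw=$(vos examine "$vol"          -format | awk '$1 == "updateDate" {print $2; exit}')
        ro=$(vos examine "$vol.readonly" -format | awk '$1 == "updateDate" {print $2; exit}')
        if [ "${rw:-0}" -gt "${ro:-0}" ]; then
            vos release "$vol"
        fi
    done < /usr/afs/local/release.list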

'fs checkv' is the command you run on your client to force it to discover
VLDB changes before the 2-hour cache times out.  Generally you'll only need
to run this when you have a volume that's just been released for the first
time, on clients that accessed the RW volume before it was replicated.
Even then, you only need it if you care about seeing the change _right
now_, instead of sometime in the next 2 hours.
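
Spelled out in full, that's simply the following, run on the client (no
special privileges needed):

    fs checkvolumes   # make the Cache Manager re-fetch volume/VLDB mappings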


>> You could create one volume per day, each holding 5000 images. But
>> think about it: 5000 images in one directory is a real mess to search
>> through. And if those are "big" images, the size of such a daily volume
>> grows fast, so replicating the volume takes far more time. Replication
>> runs at near line speed (as long as there aren't large numbers of small
>> files under 1 KB; but you're talking about images, so the files should
>> be larger), but e.g. 100 GB in a volume takes its time to replicate at
>> 100 Mbit.
>> I suggest 50 volumes of 100 images each per day, e.g. numbered
>> "day.001.01" or similar, so you can find a volume easily and can easily
>> replicate them with a script. And if you distribute these volumes over
>> 5-10 file servers, the replication spreads across the network and
>> finishes faster overall. Speed is a question of design.
>>
> This sounds like a good idea.  All of the logic of which file is located
> where is in my application's database, so I won't really need to search
> for a file from the command line.

Generally, you should create volumes based on purpose and function.  You
can create new volumes whenever you want, move volumes around at will
(transparently), and adjust quotas whenever you want.  So, you need to
think about logical organization of data, rather than physical
organization.  When deciding how to organize things, consider whether you
will be able to tell what is in a volume just from the volume name (not the
place it is mounted).  If so, there's a good chance you're on the right
track.
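
As a concrete illustration of the commands Lars mentioned (the cell name,
servers, partitions, quota, and volume names below are made up):

    # Create a volume named for what it holds (~5 GB quota, in KB),
    # mount it, set an ACL, add a replication site, and do the first
    # release.  Run with AFS admin tokens.
    vos create fs1.example.com /vicepa img.2005-09-05 -maxquota 5000000
    fs mkmount /afs/.example.com/images/2005-09-05 img.2005-09-05
    fs setacl  /afs/.example.com/images/2005-09-05 system:anyuser rl
    vos addsite fs2.example.com /vicepa img.2005-09-05
    vos release img.2005-09-05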

Depending on your application, breaking up large amounts of data into
volumes by time might not be a bad idea.  If you have a large number of
volumes which, once populated, rarely change, then you have fewer things to
release and fewer things to back up on any given day (if the data hasn't
changed in a year, then a backup you did 6 months ago is as good as one you
did yesterday).
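
If your volume names share a common prefix, the clone step of a daily
backup can be a one-liner; the "img." prefix here is invented:

    # Create or refresh .backup clones for every volume whose name
    # starts with "img."; a dump job can then read from the clones.
    vos backupsys -prefix img.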


>> | 3) What's the recommended max size for a volume?
>>
>> I once worked with 250 GB volumes. But replicating such big volumes
>> sucks.
>>
> Good to know that you used such a large volume, even if it was really
> slow to replicate.

Large volumes are certainly possible, and if you have large amounts of
data, may even be appropriate.  The thing to avoid is treating volumes as
if they were partitions on your disk.  Don't create a few giant volumes
containing lots of unrelated data.  With disk partitioning, you have to do
that, because you get a limited number of partitions and because
filesystems are hard to resize.  Volume quotas can be changed trivially,
and volumes can be moved around transparently if you discover that the
partition currently holding a volume doesn't have enough space.  And, to
exhaust the number of volumes the VLDB can handle, you'd have to create a
volume per second for something like 10 years.
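
For example, the quota and placement adjustments mentioned above are each a
single command (paths, servers, and names are illustrative):

    # Raise the volume's quota to ~20 GB (fs setquota takes KB).
    fs setquota /afs/.example.com/images/2005-09-05 -max 20000000
    # Move the volume to a partition with more free space; clients
    # keep working through the move.
    vos move img.2005-09-05 fs1.example.com /vicepa fs2.example.com /vicepb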


Also, the amount of time replication takes is _not_ based on the amount of
data in the volume being replicated.  After the first release, AFS does
volume replication using incremental dumps, so the time required depends on
the amount of _new_ data and on the number of files in the volume -- a
certain amount of work must be done for each file, but the only files whose
contents are transferred are those which have changed.

>> And there is a limit on files in a volume: a maximum of 64k files with
>> names shorter than 16 characters is allowed in one volume.
>> From a mailing-list entry:
>> The directory structure contains 64K slots.
>> Filenames under 16 chars occupy 1 slot.
>> Filenames between 16 and 32 chars occupy 2 slots.
>> Filenames between 33 and 48 chars occupy 3 slots, and so on.
>>
> This part confuses me; do you have the link to the original topic?

The issue Lars is describing here has to do with the total number of files
you can have _in one directory_.  If the filenames are all less than 16
characters, that's a little under 64K files per directory.  AFS directories
are hash tables, so lookups are fast even in large directories.

The maximum number of files in a _volume_ is much, much larger.  A
first-order theoretical maximum would be 2^30, or about 1 billion files
(vnode numbers are 32-bit integers; if you assume that something treats
them as signed, then the largest valid number would be 2^31-1, but only
even vnode numbers are used to refer to plain files).  The actual number is
likely to be a few (binary) orders of magnitude smaller, as I don't think
there's been much testing of the case where the vnode index file is larger
than 2GB.

-- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
   Sr. Research Systems Programmer
   School of Computer Science - Research Computing Facility
   Carnegie Mellon University - Pittsburgh, PA