[OpenAFS] sanity check please.

Pucky Loucks ploucks@h2st.com
Mon, 5 Sep 2005 18:36:12 -0700


Yes, I've seen that, but I've come to love AFS. It seems to me it scales
better.

On 5-Sep-05, at 5:03 PM, y f wrote:

> Maybe I'm off topic, but have you taken a look at MogileFS
> (http://www.danga.com/mogilefs/), which they built for their web image
> service?
>
> On 9/6/05, Jeffrey Hutzelman <jhutz@cmu.edu> wrote:
>
>
> On Monday, September 05, 2005 08:24:37 -0700 Pucky Loucks
> <ploucks@h2st.com> wrote:
>
> > Thanks so much for your response, Lars. I've commented below.
> >
> > On 4-Sep-05, at 4:01 AM, Lars Schimmer wrote:
> >>
> >> | 1) Is this going to become a huge management issue?
> >>
> >> Depends.
> >> If you use some nice scripts, it can be managed easily.
> >> In the end it's mostly just "vos create volume", "fs mkmount path
> >> volume", "fs setacl path rights" and "backup volume", plus keeping the
> >> big overview ;-) As long as you manage your cell well enough, it's easy.
> >> Please don't create one dir with all volumes in it.
> >>
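> >> A rough sketch of that per-volume workflow (hypothetical cell, server,
> >> partition, volume and group names; adjust for your site):
> >>
> >>   vos create fs1.example.com /vicepa img.2005-09-05      # new RW volume
> >>   fs mkmount /afs/example.com/images/2005-09-05 img.2005-09-05
> >>   fs setacl /afs/example.com/images/2005-09-05 webservers rl   # 'webservers' is a hypothetical PTS group
> >>   vos addsite fs1.example.com /vicepa img.2005-09-05     # define an RO site
> >>   vos release img.2005-09-05                             # push RW -> RO
> >>   vos backup img.2005-09-05                              # make the .backup clone
> >>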
> > Am I correct that a volume isn't replicated until a "vos release" and
> > "fs checkvolumes"? I.e. as I write new images I'll need to decide when I
> > should run those commands? I might just end up replicating at the end of
> > the day, or every half day.
>
> RO replicas are updated only when someone (or something) runs 'vos release'
> on that volume.  Some sites have scripts that run every night to release
> those volumes that have changed (where the RW is newer than the RO); others
> release volumes only manually, and take advantage of this to release only
> coherent sets of changes.
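>
> A minimal sketch of such a nightly script (untested; assumes the replicated
> volume names are listed one per line in a hypothetical file, and that the
> script runs with AFS admin tokens):
>
>   #!/bin/sh
>   # Release each replicated volume listed in the file.  A smarter version
>   # would first compare the 'Last Update' times shown by 'vos examine vol'
>   # and 'vos examine vol.readonly' and skip volumes that haven't changed.
>   while read vol; do
>       vos release "$vol" || echo "release of $vol failed" >&2
>   done < /etc/openafs/volumes-to-release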
>
> 'fs checkv' is the command you run on your client to force it to discover
> VLDB changes before the 2-hour cache times out.  Generally you'll only need
> to run this when you have a volume that's just been released for the first
> time, on clients that accessed the RW volume before it was replicated.
> Even then, you only need it if you care about seeing the change _right
> now_, instead of sometime in the next 2 hours.
>
>
> >> You can create one volume for each day, for each 5000 images. But think
> >> about it: 5000 images in one directory is a real mess to search through.
> >> And if those are "big" images, the size of this daily volume grows fast,
> >> so replicating the volume takes far more time. Replication is near line
> >> speed (as long as there are not large amounts of small files <1 kb; but
> >> you are talking about images, so those files should be larger), but e.g.
> >> 100 GB in a volume takes its time to replicate at 100 Mbit.
> >> I suggest 50 volumes of 100 images each per day, e.g. numbered
> >> "day.001.01" or similar, so you can find a volume easily and replicate
> >> them easily with a script. And if you distribute these volumes over 5-10
> >> file servers, the replication is spread over the network and finishes
> >> sooner overall. Speed is a question of design.
> >>
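> >> A rough sketch of that batch creation (hypothetical server and path
> >> names; 50 volumes for day 001, rotated over 5 file servers):
> >>
> >>   #!/bin/sh
> >>   day=001
> >>   servers="fs1 fs2 fs3 fs4 fs5"            # hypothetical file servers
> >>   mkdir -p /afs/example.com/images/$day    # parent dir for the mount points
> >>   i=0
> >>   for n in $(seq -w 1 50); do
> >>       srv=$(echo $servers | cut -d' ' -f$(( (i % 5) + 1 )))
> >>       vos create $srv.example.com /vicepa day.$day.$n
> >>       fs mkmount /afs/example.com/images/$day/$n day.$day.$n
> >>       i=$((i + 1))
> >>   done
> >>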
> > This sounds like a good idea.  All of the logic of which file is located
> > where is in my application's database, so I won't really need to search
> > from the command line for a file.
>
> Generally, you should create volumes based on purpose and function.  You
> can create new volumes whenever you want, move volumes around at will
> (transparently), and adjust quotas whenever you want.  So, you need to
> think about logical organization of data, rather than physical
> organization.  When deciding how to organize things, consider whether you
> will be able to tell what is in a volume just from the volume name (not the
> place it is mounted).  If so, there's a good chance you're on the right
> track.
>
> Depending on your application, breaking up large amounts of data into
> volumes by time might not be a bad idea.  If you have a large number of
> volumes which, once populated, rarely change, then you have fewer things to
> release and fewer things to back up on any given day (if the data hasn't
> changed in a year, then a backup you did 6 months ago is as good as one you
> did yesterday).
>
>
> >> | 3) what's the recommended max size for a volume?
> >>
> >> I once worked with 250 GB volumes. But replicating these big volumes
> >> sucks.
> >>
> > Good to know that you used such a large volume, even if it was really
> > slow to replicate.
>
> Large volumes are certainly possible, and if you have large amounts of
> data, may even be appropriate.  The thing to avoid is treating volumes as
> if they were partitions on your disk.  Don't create a few giant volumes
> containing lots of unrelated data.  With disk partitioning, you have to do
> that, because you get a limited number of partitions and because
> filesystems are hard to resize.  Volume quotas can be changed trivially,
> and volumes can be moved around transparently if you discover that the
> partition currently holding a volume doesn't have enough space.  And, to
> exhaust the number of volumes the VLDB can handle, you'd have to create a
> volume per second for something like 10 years.
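>
> For example (hypothetical volume, path, server and partition names):
>
>   fs setquota -path /afs/example.com/images/2005-09-05 -max 10000000   # quota in KB (~10 GB)
>   vos move img.2005-09-05 fs1.example.com /vicepa fs2.example.com /vicepb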
>
>
> Also, the amount of time replication takes is _not_ based on the amount of
> data in the volume being replicated.  After the first release, AFS does
> volume replication using incremental dumps, so the time required depends on
> the amount of _new_ data and on the number of files in the volume -- a
> certain amount of work must be done for each file, but the only files whose
> contents are transferred are those which have changed.
>
> >> And there is a limit on files in a volume: max. 64k files with names of
> >> fewer than 16 letters allowed in one volume.
> >> From a mailing-list entry:
> >> The directory structure contains 64K slots.
> >> Filenames under 16 chars occupy 1 slot.
> >> Filenames between 16 and 32 chars occupy 2 slots.
> >> Filenames between 33 and 48 chars occupy 3 slots, and so on.
> >>
> > This part confuses me. Do you have a link to the original thread?
>
> The issue Lars is describing here has to do with the total number of files
> you can have _in one directory_.  If the filenames are all less than 16
> characters, that's a little under 64K files per directory.  AFS directories
> are hash tables, so lookups are fast even in large directories.
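>
> For example, under that rule a directory of 20-character filenames (2 slots
> each) tops out around 65536 / 2 = 32768 entries, while with names under 16
> characters the limit is a little under 65536.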
>
> The maximum number of files in a _volume_ is much, much larger.  A
> first-order theoretical maximum would be 2^30, or about 1 billion files
> (vnode numbers are 32-bit integers; if you assume that something treats
> them as signed, then the largest valid number would be 2^31-1, but only
> even vnode numbers are used to refer to plain files).  The actual number is
> likely to be a few (binary) orders of magnitude smaller, as I don't think
> there's been much testing of the case where the vnode index file is larger
> than 2GB.
>
> -- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
>    Sr. Research Systems Programmer
>    School of Computer Science - Research Computing Facility
>    Carnegie Mellon University - Pittsburgh, PA
>
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info
>

