[OpenAFS] sanity check please.

y f yfttyfs@gmail.com
Tue, 6 Sep 2005 08:03:13 +0800


Maybe I'm off topic, but have you had a look at MogileFS
(http://www.danga.com/mogilefs/), which is designed for web image services?

On 9/6/05, Jeffrey Hutzelman <jhutz@cmu.edu> wrote:
>
> On Monday, September 05, 2005 08:24:37 -0700 Pucky Loucks
> <ploucks@h2st.com> wrote:
>
> > Thanks so much for your response Lars, I've commented below.
> >
> > On 4-Sep-05, at 4:01 AM, Lars Schimmer wrote:
> >>
> >> | 1) Is this going to become a huge management issue?
> >>
> >> Depends.
> >> If you use some nice scripts, it can be managed easily.
> >> Basically it's just "vos create volume", "fs mkmount path volume",
> >> "fs setacl path rights" and "backup volume". And above all, keep the
> >> big overview ;-) As long as you manage your cell well enough, it's
> >> easy. Please don't create one dir with all volumes in it.
> >>
> > Am I correct that a volume isn't replicated until I run "vos release"
> > and "fs checkvolumes"? I.e., as I write new images I'll need to decide
> > when I should run those commands? I might just end up replicating at
> > the end of the day, or every 1/2 day.
>
> RO replicas are updated only when someone (or something) runs 'vos release'
> on that volume. Some sites have scripts that run every night to release
> those volumes that have changed (where the RW is newer than the RO); others
> release volumes only manually, and take advantage of this to release only
> coherent sets of changes.
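
The nightly-release approach described above could be sketched roughly like
this. It is only an illustration: the VolumeInfo record, the volume names and
the timestamps are made up, and the step of actually obtaining the RW and RO
update times from your cell is left out. The script only builds and prints
the 'vos release' commands; it does not run them.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class VolumeInfo:
    # Hypothetical record; a real script would fill this in from the cell.
    name: str
    rw_updated: datetime  # last update time of the RW volume
    ro_updated: datetime  # last update time of the RO replica


def volumes_to_release(volumes):
    """Return names of volumes whose RW copy is newer than the RO replica."""
    return [v.name for v in volumes if v.rw_updated > v.ro_updated]


def release_commands(volumes):
    """Build (but do not run) one 'vos release' command per changed volume."""
    return [["vos", "release", name] for name in volumes_to_release(volumes)]


if __name__ == "__main__":
    last_release = datetime(2005, 9, 5, 2, 0)
    sample = [
        # Unchanged since the last release: skipped.
        VolumeInfo("img.2005.09.04", datetime(2005, 9, 4, 18, 0), last_release),
        # Written to after the last release: needs releasing.
        VolumeInfo("img.2005.09.05", datetime(2005, 9, 5, 17, 30), last_release),
    ]
    for cmd in release_commands(sample):
        print(" ".join(cmd))
```

Keeping the selection separate from the command execution makes it easy to
review what would be released before wiring it into cron.
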
>
> 'fs checkv' is the command you run on your client to force it to discover
> VLDB changes before the 2-hour cache times out. Generally you'll only need
> to run this when you have a volume that's just been released for the first
> time, on clients that accessed the RW volume before it was replicated.
> Even then, you only need it if you care about seeing the change _right
> now_, instead of sometime in the next 2 hours.
>
>
> >> You can create one volume for each day, for each 5000 images. But
> >> think about it: 5000 images in one directory is a real mess to search
> >> through. And if those are "big" images, the size of this daily volume
> >> grows fast, so replicating the volume takes far more time. Replication
> >> runs at near line speed (as long as there are no large numbers of small
> >> files <1 kB; but you're talking about images, so the files should be
> >> larger), but e.g. 100 GB in a volume takes its time to replicate at
> >> 100 Mbit/s.
> >> I suggest 50 volumes of 100 images each per day, e.g. named
> >> "day.001.01" or similar, so you can find the volumes easily and
> >> replicate them easily with a script. And if you distribute these
> >> volumes over 5-10 file servers, the replication load is spread over
> >> the network and finishes sooner. Speed is a question of design.
> >>
> > This sounds like a good idea. All of the logic of what file is located
> > where is in my application's database, so I won't really need to search
> > from the command line for a file.
>
> Generally, you should create volumes based on purpose and function. You
> can create new volumes whenever you want, move volumes around at will
> (transparently), and adjust quotas whenever you want. So, you need to
> think about logical organization of data, rather than physical
> organization. When deciding how to organize things, consider whether you
> will be able to tell what is in a volume just from the volume name (not
> the place it is mounted). If so, there's a good chance you're on the
> right track.
>
> Depending on your application, breaking up large amounts of data into
> volumes by time might not be a bad idea. If you have a large number of
> volumes which, once populated, rarely change, then you have fewer things
> to release and fewer things to back up on any given day (if the data
> hasn't changed in a year, then a backup you did 6 months ago is as good
> as one you did yesterday).
>
>
> >> | 3) What's the recommended max size for a volume?
> >>
> >> I once worked with 250 GB volumes. But replicating these big
> >> volumes sucks.
> >>
> > Good to know that you used such a large volume, even if it was really
> > slow for replication.
>
> Large volumes are certainly possible, and if you have large amounts of
> data, may even be appropriate. The thing to avoid is treating volumes as
> if they were partitions on your disk. Don't create a few giant volumes
> containing lots of unrelated data. With disk partitioning, you have to do
> that, because you get a limited number of partitions and because
> filesystems are hard to resize. Volume quotas can be changed trivially,
> and volumes can be moved around transparently if you discover that the
> partition currently holding a volume doesn't have enough space. And, to
> exhaust the number of volumes the VLDB can handle, you'd have to create a
> volume per second for something like 10 years.
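
To put a number on "a volume per second for something like 10 years", here
is a quick back-of-the-envelope calculation. It says nothing about the VLDB's
exact limit, which is not given in this thread; it only shows the order of
magnitude implied by that estimate. The function name is my own.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # average year, including leap days


def volumes_created(years, per_second=1.0):
    """Number of volumes created at a steady rate over the given time."""
    return int(years * SECONDS_PER_YEAR * per_second)


if __name__ == "__main__":
    total = volumes_created(10)
    print(f"{total:,} volumes, i.e. roughly {total / 1e6:.0f} million")
```

So the estimate corresponds to a few hundred million volumes, far more than
a scheme of 50 volumes per day would ever create.
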
>
> Also, the amount of time replication takes is _not_ based on the amount of
> data in the volume being replicated. After the first release, AFS does
> volume replication using incremental dumps, so the time required depends
> on the amount of _new_ data and on the number of files in the volume -- a
> certain amount of work must be done for each file, but the only files
> whose contents are transferred are those which have changed.
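
A rough cost model may make the difference between a first (full) release
and a later incremental one more concrete. This is a sketch under stated
assumptions, not a measurement: the per-file overhead of 1 ms, the file
counts and the image sizes are invented for illustration, and the link is
assumed to run at its full nominal speed.

```python
def transfer_seconds(num_bytes, link_bits_per_second):
    """Time to push num_bytes over a link at its nominal speed."""
    return num_bytes * 8 / link_bits_per_second


def release_seconds(new_bytes, num_files, link_bits_per_second, per_file_seconds):
    """Per-file work for every file, plus transfer of changed data only."""
    return num_files * per_file_seconds + transfer_seconds(new_bytes, link_bits_per_second)


if __name__ == "__main__":
    link = 100 * 10**6          # 100 Mbit/s, as in the example earlier in the thread
    per_file = 0.001            # assumed per-file overhead in seconds (made up)

    # First release: all 100 GB have to be transferred.
    full = release_seconds(100 * 10**9, 5000, link, per_file)

    # Later release: 5000 files in the volume, but only 50 new 2 MB images.
    incremental = release_seconds(50 * 2 * 10**6, 5000, link, per_file)

    print(f"first release: about {full / 3600:.1f} hours")
    print(f"incremental:   about {incremental:.0f} seconds")
```

The point of the model is only that the second term scales with the new
data, while the first scales with the file count, whatever the real
constants turn out to be.
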
>
> >> And there is a limit on files in a volume: max. 64k files with <16
> >> letters allowed in one volume.
> >> From a mailing list entry:
> >> The directory structure contains 64K slots.
> >> Filenames under 16 chars occupy 1 slot.
> >> Filenames between 16 and 32 chars occupy 2 slots.
> >> Filenames between 33 and 48 chars occupy 3 slots, and so on.
> > This part confuses me; do you have a link to the original topic?
>
> The issue Lars is describing here has to do with the total number of files
> you can have _in one directory_. If the filenames are all less than 16
> characters, that's a little under 64K files per directory. AFS directories
> are hash tables, so lookups are fast even in large directories.
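
The slot rule quoted earlier can be turned into a small calculator for the
per-directory limit. This encodes the rule of thumb exactly as it was quoted
in the thread (under 16 chars: 1 slot, 16 to 32: 2 slots, 33 to 48: 3 slots,
and so on); I have not checked it against the actual AFS directory format.
The result is an upper bound, since the real limit is "a little under" 64K.
The function names are my own.

```python
import math

TOTAL_SLOTS = 64 * 1024  # "64K slots" per directory, as quoted in the thread


def slots_for_name(name_length):
    """Slots used by one filename, following the rule quoted in the thread."""
    if name_length < 16:
        return 1
    # 16..32 -> 2, 33..48 -> 3, 49..64 -> 4, ...
    return max(2, math.ceil(name_length / 16))


def max_entries(name_length, total_slots=TOTAL_SLOTS):
    """Upper bound on files per directory if all names have this length."""
    return total_slots // slots_for_name(name_length)


if __name__ == "__main__":
    for length in (12, 20, 40):
        print(f"{length}-char names: at most {max_entries(length)} files per directory")
```

With image names kept under 16 characters, a directory holding 100 or even
5000 files stays far below the limit.
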
>
> The maximum number of files in a _volume_ is much, much larger. A
> first-order theoretical maximum would be 2^30, or about 1 billion files
> (vnode numbers are 32-bit integers; if you assume that something treats
> them as signed, then the largest valid number would be 2^31-1, but only
> even vnode numbers are used to refer to plain files). The actual number is
> likely to be a few (binary) orders of magnitude smaller, as I don't think
> there's been much testing of the case where the vnode index file is larger
> than 2GB.
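
The arithmetic behind the 2^30 figure, and behind "a few (binary) orders of
magnitude smaller", can be checked in a few lines. Note the assumption: the
64-byte size of one vnode index entry is my own guess and is not stated
anywhere in this thread, so treat the second number as illustrative only.

```python
import math

MAX_SIGNED_32 = 2**31 - 1

# Even numbers in 1..2^31-1 are 2, 4, ..., 2^31-2.
even_vnodes = MAX_SIGNED_32 // 2


def entries_in_index(index_bytes, entry_bytes):
    """How many fixed-size entries fit in a vnode index file of this size."""
    return index_bytes // entry_bytes


if __name__ == "__main__":
    ASSUMED_ENTRY_BYTES = 64  # assumption, not taken from the thread
    tested_limit = entries_in_index(2 * 2**30, ASSUMED_ENTRY_BYTES)
    gap = math.log2(2**30 / tested_limit)

    print(f"theoretical maximum: {even_vnodes} files (about 2^30)")
    print(f"entries in a 2GB index: {tested_limit}")
    print(f"difference: about {gap:.0f} binary orders of magnitude")
```

Even under that assumption, the well-tested range is in the tens of millions
of files per volume, far beyond the 64K per-directory figure.
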
>
> -- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
> Sr. Research Systems Programmer
> School of Computer Science - Research Computing Facility
> Carnegie Mellon University - Pittsburgh, PA
>
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info
>
