[OpenAFS] Re: Advice on a use case

Timothy Balcer timothy@telmate.com
Fri, 9 Nov 2012 11:43:11 -0800


On Fri, Nov 9, 2012 at 9:45 AM, Andrew Deason <adeason@sinenomine.net> wrote:

> On Thu, 8 Nov 2012 22:48:56 -0800
> Timothy Balcer <timothy@telmate.com> wrote:
>
> > Well, unless I am missing something seriously obvious, for example it
> > took 1.5 hours to rsync a subdirectory to an AFS volume that had not a
> > lot of content, but many directories.
>
> Creating lots of files is not fast. Due to the consistency guarantees of
> AFS, you have to wait for at least a network RTT for every single file
> you create. That is, a mkdir() call is going to take at least 50ms if
> the server is 50ms away. Most/all recursive copying tools will wait for
> that mkdir() to complete before doing anything else, so it's slow.
>
>
Yes, I understand that. I was commenting on the slowness compared to
rsyncing over NFS, for example, which takes 5 hours for the entire tree
when run from the top level. That tree contains 15 of the directories I
mentioned in my earlier post, so 15 * 24k directories; and to answer the
question, 232,974 small files in the one subdirectory in question.
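
As a back-of-envelope check on the round-trip argument (the 50ms below is
Andrew's example figure, not a measurement of our link):

    # Rough lower bound, assuming one serialized round trip per create:
    #   232,974 files * 0.050 s  =  ~11,650 s, i.e. ~3.2 hours of RTTs alone
    echo '232974 * 0.050 / 3600' | bc -l    # ~3.24
    # Our observed ~1.5 hours for that subtree would be consistent with an
    # effective per-create latency in the low tens of milliseconds.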


> Arguably we could maybe introduce something to 'fs storebehind' to do
> these operations asynchronously to the fileserver, but that has issues
> (as mentioned in the 'fs storebehind' manpage). And, well, it doesn't
> exist right now anyway, so that doesn't help you :)
>
> What can possibly make this faster, then, is copying/creating
> files/directories in parallel. <snip>
>

Yes, I routinely run hundreds of parallel transfers using a combination of
tar and rsync: tar gets the files over in raw form, and rsync mops up
behind it. The rsync pass corrects any problems with the tar copy and is
run twice against a fixed file list generated at transfer time. I have
found that, even compared with a tuned rsync process designed to improve
transfer speeds, many parallel tar/untar pipelines from local disk to
NFSv4, followed by a "local" rsync to the same destination, work better
for new files when timeliness is important.
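
The shape of that workflow is roughly the following; the paths, job count,
and list files are illustrative, not our actual scripts:

    # Push each top-level directory over in raw form with tar, 16 at a time...
    xargs -P 16 -I{} sh -c \
        'tar -C /src -cf - "{}" | tar -C /dst -xf -' < dirlist
    # ...then let rsync mop up anything tar missed, twice, against the same
    # fixed file list generated at transfer time.
    for pass in 1 2; do
        rsync -a --files-from=filelist /src/ /dst/
    done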

I also use rsync modules in other cases, for on-demand synchronization of
audio files to replica sites. That also works fairly well.

But ultimately I would like to replace all of this NFS and on-demand rsync
with AFS. :)



>
> Also, I was assuming you're rsync'ing to an empty destination in AFS;
> that is, just using rsync to copy stuff around. If you're actually
> trying to synchronize a dir tree in AFS that's at least partially
> populated, see Jeff's comments about stat caches and stuff.
>

Yes, I am copying to empty areas in volumes: creating the files, but using
rsync to do so. I probably should have mentioned that I could do this with
10,000 parallel rsync/tar/whatever processes, no problem... but I didn't
want to scale up to many parallel copies until I heard from you folks about
that possibility.


> > No, I am writing from a local audio/video server to a local repo,
> > which needs to be very fast in order to service live streaming in
> > parallel with write on a case by case basis.
>
> It seems like it could just write to /foo during the stream capture, and
> copy it to /afs/bar/baz when it's done. But if the union mount scheme
> makes it easier for you, then okay :)
>
> But I'm not sure I understand... above you discuss making these
> directory trees made up of a lot of directories or relatively small
> files. I would've thought that video captures of a live stream would not
> be particularly small... copying video to AFS sounds more like the
> "small number of large files" use case, which is much more manageable.
> Is this a lot of small video files or something?
>

I apologize; there are two use cases here that I was collapsing into one.
In one case I am making LOTS of directories and putting small image files
into their leaves; in the other case I am streaming video/audio. So you are
correct, and I was being obtuse, as I can be when I am typing at 100 miles
an hour. :) I'll try to be more precise!
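
(For the streaming case, the fallback I have in mind is essentially
Andrew's suggestion above; the capture command and filenames here are made
up for illustration:)

    # Record to fast local /foo during the stream, then move the finished
    # file into AFS, along the lines of the /foo -> /afs/bar/baz idea.
    some_capture_tool --output /foo/stream-$(date +%s).mkv && \
        mv /foo/stream-*.mkv /afs/bar/baz/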

> <snip>
> I'm not sure what scalability issues here you're expecting; making
> volumes smaller but more in number is typically something you do to
> improve scalability. We usually encourage more small volumes instead of
> fewer big volumes.
>
> What I would guess you may run into:
>
>  - The speed of creating the volumes. I'm not actually sure how fast
>    this goes, since creating a lot of volumes quickly isn't usually a
>    concern... so you'll have to try it :)
>

Not a big concern. It's fine to think of AFS as the archival area, joined
in purpose to the "live" area, so I can create volumes at will in the
background, regardless of conditions on the active servers, and transfer
files, make replicas, and release them without affecting operations. That's
the reason for using AUFS here: to present a single read-write store to the
applications for various purposes. So I can batch this operation fairly
easily.
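
Concretely, the batch I have in mind is nothing fancier than the loop
below; the server, partition, and volume names are invented for
illustration, and the commands are straight from the vos/fs man pages:

    # Create a batch of small volumes, mount them in the RW tree,
    # add a replica site, and release each one.
    for i in $(seq -w 1 100); do
        vos create afs1.example.com /vicepa img.batch.$i
        fs mkmount /afs/.example.com/img/batch$i img.batch.$i
        vos addsite afs2.example.com /vicepa img.batch.$i
        vos release img.batch.$i
    done
    # Afterwards, release whichever replicated volume holds the new mount
    # points so they show up in the read-only tree.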


>
>  - Fileserver startup/shutdown time for non-DAFS is somewhat heavily
>    influenced by the number of volumes on the server; this is a
>    significant issue when you start to have tens or hundreds of
>    thousands of volumes on a server.
>
> That second point is addressed by DAFS, which can handle at least a
> million or so volumes per server rather quickly (a few seconds for
> startup). I'm not sure if you know what DAFS is, but converting to using
> it should be straightforward. There is a section about DAFS and how to
> convert to using it in appendix C of the Quick Start Guide:
> <http://docs.openafs.org/QuickStartUnix/index.html#DAFS.html>.
>

Thanks so much for pointing me to this! DAFS, in combination with a solid
volume schema, could be the solution I have been looking for!
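
For the archives: as I read appendix C, the conversion amounts to replacing
the fs bnode with a dafs one, roughly as below; <server> is a placeholder
and the binary paths follow the traditional Transarc layout, so check them
against your installation:

    # Stop and remove the old fileserver instance, then create the DAFS one.
    bos stop <server> fs -localauth -wait
    bos delete <server> fs -localauth
    bos create <server> dafs dafs \
        /usr/afs/bin/dafileserver /usr/afs/bin/davolserver \
        /usr/afs/bin/salvageserver /usr/afs/bin/dasalvager -localauth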


> --
> Andrew Deason
> adeason@sinenomine.net
>
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info
>



-- 
Timothy Balcer / IT Services
Telmate / San Francisco, CA
Direct / (415) 300-4313
Customer Service / (800) 205-5510
