[OpenAFS] Re: Advice on a use case

Timothy Balcer timothy@telmate.com
Thu, 8 Nov 2012 22:48:56 -0800


On Tue, Nov 6, 2012 at 8:49 AM, Andrew Deason <adeason@sinenomine.net> wrote:

> On Tue, 6 Nov 2012 00:06:53 -0800
> Timothy Balcer <timothy@telmate.com> wrote:
>
> > I have a need to think about replicating large volumes (multigigabyte)
> > of large number (many terabytes of data total), to at least two other
> > servers besides the read write volume, and to perform these releases
> > relatively frequently (much more than once a day, preferably)
>
> How much more frequently? Hourly? Some people do 4 times hourly (and
> maybe more) successfully.
>

Well, unless I am missing something seriously obvious, I can't go very
frequently yet: for example, it took 1.5 hours to rsync a subdirectory into
an AFS volume that did not hold much content, but had many directories.

How frequently depends on usage, and on being able to release faster than the
data is being written. I don't have performance data on the writes yet, but
that will change anyway; we are going from 200+ clients to many more, which is
why I am working with AFS for this in the first place. The environment is a
write-once, read-many situation.

>
> > Also, these other two (or more) read-only volumes for each read write
> > volume will be remote volumes, transiting across relatively fat, but
> > less than gigabit, pipes (100+ megabits)
>
> Latency may matter more than bandwidth; do you know what it is?
>

Depending on the colo site, between 30 and 60 ms.


>
> > For the moment what I have decided to experiment with is a simple
> > system.  My initial idea is to work the afs read-only volume tree into
> > an AUFS union, with a local read write partition in the mix. This way,
> > writes will be local, but I can periodically "flush" writes to the AFS
> > tree, double check they have been written and released, and then
> > remove them from the local partition.. this should maintain integrity
> > and high availability for the up-to-the-moment recordings, given I
> > RAID the local volume. Obviously, this still introduces a single point
> > of failure... so I'd like to flush as frequently as possible.
> > Incidentally, it seems you can NFS export such a union system fairly
> > simply.
>
> I'm not sure I understand the purpose of this; are you trying to write
> new data from all of the 'remote' locations, and you need those writes
> to 'finish' quickly?
>

No. I am writing from a local audio/video server to a local repo, which needs
to be very fast in order to serve live streaming in parallel with the writes,
on a case-by-case basis.

That local repo would be the R/W branch above the AFS R/O branch, so:

 dirs=/Read-Write=rw:/afs/path/to/read-only=ro aufs /union

This way I can present the /union to the application server as a read/write
repo for all its needs, including archival use, but still have AFS
underneath for replication and distribution.
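
To be concrete, the mount I have in mind is roughly this (the paths are
placeholders, and I haven't settled on the exact aufs options yet):

  mount -t aufs -o dirs=/Read-Write=rw:/afs/path/to/read-only=ro none /union
  # new writes land on the local /Read-Write branch;
  # reads fall through to the AFS R/O tree underneath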

*sigh*

I wish OSD was primetime :)


> > But, I feel as if I am missing something... it has become clear that
> > releasing is a pretty intensive operation, and if we're talking about
> > multiple gigabytes per release, I can imagine it being extremely
> > difficult.  Is there a schema that i can use with OpenAFS that will
> > help alleviate this problem? Or perhaps another approach I am missing
> > that may solve it better?
>
> Eh, some people do that; it just reduces the benefit of the client-side
> caching. Every time you release a volume, the server tells clients that
> for all data in that volume, the client needs to check with the server
> to see if the cached data is different from what's actually in the
> volume. But that may not matter so much, especially for a small number
> of large files.
>

Well, that's the thing: this is a large number of small-to-medium-sized files
being written continuously, and on top of that the directory structure is
quite deep. I'm trying to get it flattened out to improve scaling, but at the
moment it is taking 1.5 hours to rsync a subdirectory containing about 5 GB of
data spread across 23,681 directories, for example.
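
For reference, the "flush" step I described earlier is basically this shape
(the cell and volume names here are made up for illustration):

  # copy the local writes into the R/W volume via the cell's R/W path...
  rsync -a /local/recordings/ /afs/.example.com/recordings/
  # ...then push the change out to the R/O replicas
  vos release recordings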

Releasing is a whole 'nuther animal... ;-)


> To improve things, you can maybe try to reduce the number of volumes
> that are changing. That is, if you are adding new data in batches, I
> don't know if it's feasible for you to add that 'batch' of data by
> creating a new volume instead of writing to existing volumes.
>

That's feasible... but what if, for example, vol1 is mounted at *
/afs/foo/home/bar* and contains a thousand directories, and the new content is
a thousand more directories at exactly the same level of the tree? How would I
handle that? As far as I can tell, OpenAFS only allows a volume to be mounted
on its very own directory, so you can't nest them together like that.
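
That is, as far as I understand it, each new top-level directory would itself
have to become a mount point, something like this per directory (server,
partition, and volume names here are hypothetical):

  vos create fs1.example.com /vicepa newbatch.0001
  fs mkmount /afs/foo/home/bar/newdir-0001 newbatch.0001
  # fs mkmount creates the directory itself as the mount point, so the
  # existing directories already inside vol1 can't just become volumes in place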

How unfeasible would it be to create N volumes, where N >= 500 per shot? I
would end up with many thousands of tiny volumes, which in itself I don't have
a problem with, but would that scale? Let's assume I have spread out the db
and file servers in such a way as to equalize the load.


>
>
> And, of course, the release process may not be fast enough to actually
> do releases as quickly as you want. There are maybe some ways to ship
> around volume dumps yourself to get around that, and some pending
> improvements to the volserver that would help, but I would only think
> about that after you try the releases yourself.
>

The idea of doing R/W "checkpoint" volumes that I only have to release once
in a while after the first release is very appealing... if you can suggest a
solution to the problem above, I am all ears!! :) I would be VERY happy to be
able to allocate space, quota, and location on the fly, in batchwise
operations.
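
Roughly, what I mean by batchwise is something like this per "shot" (all
server, partition, and volume names are hypothetical, quota is in KB):

  # create, mount, and replicate one volume per new directory in the batch
  while read d; do
      vos create fs1.example.com /vicepa "rec.$d" -maxquota 5000000
      fs mkmount "/afs/foo/home/bar/$d" "rec.$d"
      vos addsite fs2.example.com /vicepa "rec.$d"
      vos addsite fs3.example.com /vicepa "rec.$d"
      vos release "rec.$d"
  done < new-batch-dirs.txt
  # the parent volume would also need a release for the new
  # mount points to show up in its R/O replicas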


> --
> Andrew Deason
> adeason@sinenomine.net
>
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info
>



-- 
Timothy Balcer
