[OpenAFS] AFS design question: implementing AFS over a highly-distributed, low-bandwidth network

Steven Jenkins steven.jenkins@gmail.com
Thu, 15 Jan 2009 17:59:12 -0500

On Thu, Jan 15, 2009 at 2:03 PM, Chaz Chandler <clc31@inbox.com> wrote:
> Hello all!
> I am attempting to implement OpenAFS across a VPN with limited bandwidth between sites but relatively mobile users who expect to have their data available (and writable) at whichever site they are currently located.

This is a very interesting problem.  I'll make some observations, but
as I do not know all of the details or your infrastructure (or
applications or user base), some of these may not be relevant.

> The issue I am running up against is how to organize the AFS volume structure so that things like user home dirs, collaboration/group dirs, and writable system items (like a Windows profile, for instance) are optimally maintained in AFS.
> The set-up is:
> 1) Each site has an OpenAFS server (each with vl, pt, bu, vol, fileservers & salvager).  Currently, v1.4.8 on Linux 2.6.

You don't say so, but I'm assuming a single cell for the entire
infrastructure.  Is that correct?  Also, how many sites do you have,
and how often do you expect to grow/shrink the number of sites?

Furthermore, you don't specify your Kerberos infrastructure -- it
would be helpful to understand where that is placed, if you have
replicas in place, etc.

> 3) All sites are connected in a full mesh VPN (max of about 30KB/s for each link)

If your max is 30KB/s, what is your expected average and minimum, as
well as your expected latency?

Even if you have 30KB/s between sites, my first suggestion would be to
consider running multiple cells.  Putting the ubik-based servers in
each site (i.e., ptserver, vlserver, buserver) and attempting to run a
single cell across all sites would be very challenging, even ignoring
actual data access issues.  As Anne points out, quorum issues across
slow links can be difficult to deal with.

Also, as you mention, pulling a 1MB file across a 30KB/s link takes
the better part of a minute, which is not good from a user-experience
perspective.  That implies doing so needs to be a rare event, and you
need to architect around that.
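
To put rough numbers on that, here is a small back-of-the-envelope
sketch.  The 0.6 efficiency factor is purely an assumption on my part
(you have not given average throughput or latency), and the function
name is just for illustration.

```python
KB = 1024
MB = 1024 * KB


def transfer_seconds(size_bytes, link_bytes_per_sec, efficiency=1.0):
    """Return the seconds needed to move size_bytes over a link.

    efficiency is the assumed fraction of the nominal rate actually
    achieved (1.0 means the link runs flat out with no overhead).
    """
    if link_bytes_per_sec <= 0:
        raise ValueError("link rate must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return size_bytes / (link_bytes_per_sec * efficiency)


if __name__ == "__main__":
    link = 30 * KB  # the 30KB/s maximum quoted above
    for label, size in [("1MB file", 1 * MB), ("10MB of edits", 10 * MB)]:
        best = transfer_seconds(size, link)
        assumed = transfer_seconds(size, link, efficiency=0.6)
        print("%-14s best case %6.1fs, at 60%% of max %6.1fs"
              % (label, best, assumed))
```

Even in the best case a single 1MB fetch is roughly 34 seconds, so
anything that crosses the link on demand has to be small or rare.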

I would suggest you dig deeper into the working sets of files that
your users need -- it may be that cache tuning can help.  On the other
hand, I don't hold much hope that cache tuning will be a silver bullet
given your tight bandwidth.

Something that might work out a little better is the work being done
on disconnected operation.  That might suffice for some of your use
cases (assuming it is finished in time and the features it offers
suit your needs).

> I'm seeking recommendations on:
> 1) How others have set up a regular release schedule to keep a large amount of data synced over a slow network (custom scripts, I assume, but is there a repository of these things and what are the general mechanics and best practices here?)

I do not know of any set of documented best practices or scripts,
although you should be aware of

- Morgan Stanley's VMS (Volume Management System)
- Russ Allbery's volume management utilities

Based on your description, you might consider having each site be a
separate cell, and then use incremental dump and restore across cells
for certain cases.  That would remove ubik traffic from the
site-to-site links and free up the links for remote RW access, with
dumps and restores being done during off hours.  More details on the
dump/restore idea below.
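
As a minimal sketch of what such an off-hours job might look like,
assuming you hold admin tokens in both cells: the cell, server, and
file names below are made up, and you should check the `vos dump` and
`vos restore` options against the man pages for your version before
relying on this.  It only builds and (by default) prints the commands.

```python
import subprocess


def dump_cmd(volume, since, dump_file, cell):
    """Build a 'vos dump' command for changes made since the given date.

    since is passed to -time; "0" requests a full dump.
    """
    return ["vos", "dump", "-id", volume, "-time", since,
            "-file", dump_file, "-cell", cell]


def restore_cmd(volume, server, partition, dump_file, cell, incremental=True):
    """Build the matching 'vos restore' command for the target cell."""
    mode = "incremental" if incremental else "full"
    return ["vos", "restore", "-server", server, "-partition", partition,
            "-name", volume, "-file", dump_file,
            "-overwrite", mode, "-cell", cell]


def sync_volume(volume, since, src_cell, dst_cell, dst_server,
                dst_partition, dump_file, dry_run=True):
    """Dump from the source cell, then restore into the destination cell."""
    cmds = [
        dump_cmd(volume, since, dump_file, src_cell),
        restore_cmd(volume, dst_server, dst_partition, dump_file,
                    dst_cell, incremental=(since != "0")),
    ]
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))  # show what would be run
        else:
            subprocess.check_call(cmd)  # stop on the first failure
    return cmds


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    sync_volume("user.alice", "01/14/2009",
                "site-a.example.com", "site-b.example.com",
                "afs1.site-b.example.com", "/vicepa",
                "/tmp/user.alice.dump")
```

A real job would also have to record the date of the last successful
sync per volume and per site, so that the next incremental dump starts
from the right point.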

> 2) What sort of volume layout would one recommend, and how should frequently-updated data be stored?  Take, for instance, three examples:
> - A software repository: large volume with relatively static contents, occasionally has large additions or subtractions when a new piece of software is added or an old one removed.  Ideally, these updates should be able to be accomplished from any location.  Users don't need to write to it, but may need to read from it frequently at LAN speeds.

This sounds like it should be replicated, read-only data.
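
Within a single cell, that could be sketched roughly as follows: one
read-only site per location, followed by a release once the
read-write copy has been updated.  The volume and server names are
invented, and across separate cells you would need dump/restore
rather than `vos release`.

```python
def replication_plan(volume, ro_sites):
    """Return the vos commands to define RO sites and push a release.

    ro_sites is a list of (server, partition) pairs, one per location.
    """
    cmds = [["vos", "addsite", "-server", server,
             "-partition", partition, "-id", volume]
            for server, partition in ro_sites]
    # One release updates every read-only site defined for the volume.
    cmds.append(["vos", "release", "-id", volume])
    return cmds


if __name__ == "__main__":
    # Hypothetical servers, for illustration only.
    sites = [("afs1.site-a.example.com", "/vicepa"),
             ("afs1.site-b.example.com", "/vicepa")]
    for cmd in replication_plan("software", sites):
        print(" ".join(cmd))
```

The `vos addsite` commands only need to run once per site; after that
a scheduled job would just re-run the release during off hours, so
that the slow links carry the update when nobody is waiting on them.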

> - A collaboration dir: several users read and write a small amount (10s of MBs) on a daily basis from different locations simultaneously, but they expect LAN-type performance.

This sounds like it should be broken up into smaller pieces, so that
you can more easily determine where volumes go.  The data itself would
be RW, although you might architect something that will age data into
RO data (e.g., if people need to read reports generated a month ago,
but they don't need to change them, or they would be willing to do a
slightly different process for editing older data).

If your users were tech-savvy (e.g., developers), I'd also think
seriously about using a Version Control System instead of a networked
filesystem for this part of the problem.

Users in different sites wanting to read and write tens of MBs of
data over 30KB/s links simply may not be realistic given current
architectures and implementations.

More study of this case should be done: it's the really hard one.

> - A user dir: large amounts of data updated from a single location, but user may move to any other site at any time, potentially with up to a day of transit time in which a volume could be moved to the destination site.

I would consider building a system that would let me have an offline
copy of the user volumes in each location, and synchronize them on
some regular basis, depending on usage patterns.  You could then also
provide a utility like 'move to site X' that the users could run which
would find the current location of that home directory, take it
offline, do an incremental dump & restore, then bring the new volume
online at the new site.
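
The sequencing of such a 'move to site X' utility might be sketched
like this.  Everything here is hypothetical (the class, the site
names, the bookkeeping of last sync dates); it only works out the
order of steps and whether a full or incremental dump is needed, and
leaves the actual commands to whatever tooling you build.

```python
from dataclasses import dataclass, field


@dataclass
class HomeVolume:
    name: str
    active_site: str  # site currently holding the online copy
    # site -> date of the last sync to that site's offline copy
    last_sync: dict = field(default_factory=dict)


def plan_move(vol, target_site):
    """Return the ordered steps to move a home volume to target_site."""
    if target_site == vol.active_site:
        return []  # already there, nothing to do
    since = vol.last_sync.get(target_site)
    # With no earlier sync recorded, the whole volume must be sent.
    kind = "incremental since %s" % since if since else "full"
    return [
        ("offline", "take %s offline at %s" % (vol.name, vol.active_site)),
        ("dump", "%s dump of %s at %s" % (kind, vol.name, vol.active_site)),
        ("restore", "restore dump onto the offline copy at %s" % target_site),
        ("online", "bring %s online at %s" % (vol.name, target_site)),
    ]


if __name__ == "__main__":
    vol = HomeVolume("user.alice", "site-a", {"site-b": "01/14/2009"})
    for step, description in plan_move(vol, "site-b"):
        print("%-8s %s" % (step, description))
```

The more recent the last sync to the target site, the smaller the
incremental dump, which is what makes the move feasible over a slow
link within the user's transit time.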

An alternative to that would be disconnected operation: since I'm
guessing that your users will need their own data frequently, but
will seldom need each other's, it might work out that your users can
put their home volumes into the cache on their local system.  (This
would work best if the users have laptops that they carry from site
to site, and not so well if they use fixed systems at each site.)
You could then engineer something so that when they re-connect to the
network, the volume is automatically synced from their local system
to the local site, updating the various databases behind the scenes.

That assumes development work, however.  And I don't know if that
would meet your requirements.

Steven Jenkins
End Point Corporation