[OpenAFS] Some backup advice needed.
Fri, 21 Apr 2006 05:52:19 +0200
Steve Devine said the following on 2006-04-20 17:52:
> We use the native afs backup. We currently do a full backup of the whole
> cell weekly followed by daily incrementals until we cycle back to Monday
> midnight and start the next full.
> As the size of the cell (user vols ) grows our backup window is getting
> way too big. Often it stretches into late Wed before it completes.
> I am wondering how many are doing weekly fulls? I am considering a
> monthly full with incrementals the rest of the month but thats kinda
> scary if maybe your full is not good.
i hope this helps you. normally this would be 1000 euro + a few days of
onsite consultancy in a fancy suit :-)
if you can meet your organisation's SLOs for backup success and data
retention, and your random restore testing works (you do random restores,
don't you??), then you don't really have a problem. it just feels uncomfortable.
out of curiosity: how many hosts and how much data do you have? what backup
h/w do you use?
you have 3 variables you can control that are of interest:
#1 time window
#1 time window is easiest to cover: either backup more often, or less often.
less: move to a monthly full for your user volumes, all on 1 day -> 1/4 the
load, but still the same duration.
more: spread the full user volume backups out across the week -> 1/7 the
load each day. has the side-effect of reducing the per-day data volume,
which also helps with #2.
#2 backup less data. this is the cheapest solution. list the largest
volumes, or the ones with the oldest files (maybe a bit more effort in AFS,
but you can build/buy tools to help with this), or those with the smallest
volume of INCR backups per volume (i.e. the least changes), etc etc etc, then
aggressively target those files/people for migration to offline or near-line
storage.
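getting that list of largest volumes is a short pipeline. a sketch assuming
the default `vos listvol` output shape (name, id, type, size, "K", status);
the server/partition names are made up, and a stub stands in when no cell is
reachable:

```shell
#!/bin/sh
# top-20 volumes by size on one server/partition
listvols() {
  if command -v vos >/dev/null 2>&1; then
    vos listvol fs1.example.com /vicepa
  else
    # stand-in lines in the same shape, so the pipeline can be tried anywhere
    printf 'user.alice 536870912 RW 120000 K On-line\n'
    printf 'user.bob 536870913 RW 998877 K On-line\n'
  fi
}
top=$(listvols | awk '$5 == "K" { printf "%12d K  %s\n", $4, $1 }' |
      sort -rn | head -20)
printf '%s\n' "$top"
```

point the same pipeline at each server/partition in turn and you have your
hall-of-shame input.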
it helps to send a few pointed emails, remind people that storage doesn't
come cheap, & ideally provide the organisation mgrs with a "cost" for
keeping this stale data online. either move it to DVD/CD/tape or give them a
"holding" area that gets backed up only monthly (or whatever makes sense)
instead of weekly. e.g. change the volume names to "archive.whatever". for a
uni, a hall of shame by department, conspicuously posted by the chancellor's
office, is a start.
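if you do go for spreading the user fulls across the week (the 1/7 option
under #1), the native backup suite's volume sets can express it. a sketch
that only *prints* the commands for review -- the volset names, server,
partition and the user.* regexes are all assumptions, adjust for your cell:

```shell
#!/bin/sh
# print one volset + volentry pair per weekday; review, then feed to the
# backup suite. the a-c/d-f/... split is a made-up example.
cmds=""
day=1
for range in a-c d-f g-i j-l m-o p-s t-z; do
  cmds="${cmds}backup addvolset userfull$day
backup addvolentry userfull$day fs1.example.com /vicepa user\.[$range].*
"
  day=$((day + 1))
done
printf '%s' "$cmds"
```

then run `backup dump` against the matching volset from cron each night,
using whatever dump hierarchy you already have.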
#3 hardware. you have a few options here depending on cash. the goal of backup is to put
the bottleneck at the tape drive -- and still be capable of meeting your SLO
for restore. what's your bottleneck? where is it? good drives today can
handle 300GB/hour native, and I've had them up to 500 with a strong tail
wind. but if your data is flowing from a workstation, over 10Mb ethernet,
the drive will never achieve its stated throughput - not to mention restore
problems. you do do random restore testing, don't you?
how fast can you dump data to /dev/null? this is your upper limit. increase
this by trial & error: multiple streams (volumes) at the same time, or
spread the load out over multiple hosts, or get a bigger box.
there are a few free tools to test this at HP's website,
+ some pretty sound advice.
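a sketch of the measurement plus the GB/hour conversion -- here timing a
generic local stream with dd so the arithmetic is visible; against AFS you
would time e.g. `vos dump -id user.foo -time 0 -file /dev/null` instead
(volume name made up):

```shell
#!/bin/sh
# time a bulk stream and report GB/hour
start=$(date +%s)
bytes=$(dd if=/dev/zero bs=1M count=512 2>/dev/null | wc -c)
elapsed=$(( $(date +%s) - start ))
if [ "$elapsed" -lt 1 ]; then elapsed=1; fi   # don't divide by zero on a fast box
rate=$(awk -v b="$bytes" -v s="$elapsed" \
        'BEGIN { printf "%.1f", b / 1073741824 * 3600 / s }')
echo "$((bytes)) bytes in ${elapsed}s -> $rate GB/hour"
```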
use a separate backup LAN
spread backup servers across subnets to reduce cross-router traffic
use multiple backup servers in each subnet
use trunking (FEC, APA etc) to backup servers
connect them to CORE not EDGE switches
ethernet expected throughput ~60% of rated wire speed
LAN backups use less CPU than SCSI/FC
use 1 Gb NIC / subnet
100 Mb/s (single NIC)            = 25 GB/hour
200 Mb/s (dual NIC with FEC/APA) = 40 GB/hour
1 Gb/s (single NIC)              = 40 GB/hour
2 Gb/s (dual NIC with FEC/APA)   = 50 GB/hour
10 Gb/s (single NIC)             = 200 GB/hour
note that 3 NICs trunked on many platforms (solaris, hpux, windows) seem to
give less throughput than 2 of the same type. go figure.
100FDX = 100 Mbits/sec maximum
= 100/8 Mbytes/sec
= 100/(8*1024) Gbytes/s
= (100 * 3600) / (8 * 1024) Gbytes/hour
= 0.6 * (100 * 3600) / (8 * 1024) Gbytes/hour at expected ethernet capacity
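the same rule of thumb for a few common link speeds (these are wire-level
ceilings only -- the lower practical figures in the table above are host-
and drive-limited, not wire-limited):

```shell
#!/bin/sh
# GB/hour = rated Mbit/s * 0.6 * 3600 / (8 * 1024)
table=$(for mbps in 100 1000 10000; do
  awk -v m="$mbps" 'BEGIN {
    printf "%5d Mb/s -> %6.1f GB/hour\n", m, m * 0.6 * 3600 / (8 * 1024)
  }'
done)
echo "$table"
```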
DRIVE SPEEDS FOR FILESERVERS W/OUT BLOCK SIZE CHANGES (GB/hour)
drive   SCSI   SAN
lto1      80    70
lto2     120   100
sdlt2    100    80
sdlt1     70    55
dlt80     25    20
NB the attached primary storage makes a big difference -- above speeds were
measured in production, using U3 SCSI RAID arrays; commercial arrays
(EMC,HDS,XP) can provide much higher throughput (on quality server kit), and
also support multiple concurrent streams. block size can make a big
difference depending on the application.
my experience is that SAN tends to be a bit slower, depending on the type of
data being backed up. fiddling with block sizes can close the gap to within
about 5% of native SCSI speed.
NB the drives, the HBAs themselves & the PCI buses all have limits:
max 2 drives / FC HBA
max 1 drive / SCSI U3 HBA
max 4 drives / PCI bus
1 LTO1 drive uses ~400 MHz CPU & 64 MB RAM
1 LTO2 drive uses ~800 MHz CPU & 64 MB RAM
1 LTO3 drive uses ~1000 MHz CPU & 64 MB RAM
commercial backup software kicks arse. don't expect old tools like tar,
cpio, etc. ever to achieve the same throughput. if you can, demo a few. the
big ones, in no particular order: Legato, Symantec NetBackup, HP Data
Protector, IBM Tivoli -- all with wide platform support.
consider getting some secondary disks, spool your backups over the backup
LAN onto them, & then back up to tape directly 1x per week to a small
autoloader with a very fast drive. you can put this mezzanine host into a separate
site/building, and still meet DR requirements. if you already have a SAN +
tape drives, and have more cash, look into the new virtual tape libraries
offered by the usual suspects.
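the disk-staging idea in shell form -- a sketch only: the volume names,
spool path and weekly tape step are all assumptions, and a stub dump stands
in when no cell is reachable:

```shell
#!/bin/sh
# nightly: dump each volume's .backup clone into a per-weekday spool dir
spool="${TMPDIR:-/tmp}/afs-spool/$(date +%a)"
mkdir -p "$spool"
for vol in user.alice user.bob; do
  if command -v vos >/dev/null 2>&1; then
    vos dump -id "$vol.backup" -time 0 -file "$spool/$vol.dump"
  else
    echo "stub dump of $vol" > "$spool/$vol.dump"   # stand-in without a cell
  fi
done
# weekly, from cron on the quiet day, stream the whole spool to tape, e.g.:
#   tar -cf /dev/nst0 "${TMPDIR:-/tmp}/afs-spool"
```

the spool disks don't need to be fast or redundant -- they only have to
outrun the LAN, and the weekly tape pass then runs at full drive speed.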
i've not used amanda or bacula, but either of these open-source products
might be enough for you.
find the bottleneck. eliminate it. repeat. start off at the host end. what's
your max. throughput to /dev/null? use ftp between the 2 hosts - how fast
does it go? how is your LAN structured if you're doing LAN backups? test
your tape drive using something other than tar.
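for the tape-drive test, raw dd with a few block sizes is the usual trick.
a sketch -- the device name is an assumption (linux non-rewinding st;
solaris would be /dev/rmt/0n), it overwrites the tape's contents, and it
falls back to a scratch file when no drive is present:

```shell
#!/bin/sh
# raw write-speed test; dd's final status line shows the rate (GNU dd)
dev=/dev/nst0
if [ ! -c "$dev" ]; then
  dev=$(mktemp)            # no tape drive here: use a scratch file instead
fi
for bs in 64k 256k; do
  echo "block size $bs:"
  dd if=/dev/zero of="$dev" bs="$bs" count=100 2>&1 | tail -1
done
```

bump `count` up until the run lasts a minute or two, otherwise you are
measuring the drive's buffer rather than the tape.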
& you do random restores don't you? :-)
out of the frying pan and into the fire