[OpenAFS] OpenAFS on ZFS (Was: Salvaging user volumes)

milek@task.gda.pl
Fri, 21 Jun 2013 13:07:46 +0100



We have had a very positive experience with OpenAFS + ZFS. We run it on x86
hardware (from different vendors) on top of Solaris 11 x86. We've managed to
save a lot of money thanks to ZFS built-in compression, while improving
performance, reliability, manageability, etc.

See my slides at
http://conferences.inf.ed.ac.uk/eakc2012/slides/AFS_on_Solaris_ZFS.pdf
or at
http://www.ukoug.org/what-we-offer/library/openafs-on-solaris-11-x86-robert-milkowski/afs-on-solaris-zfs-losug.pdf


Some quick comments/recommendations when running OpenAFS on ZFS:

	- enable compression (example commands below)

		- LZJB - you should usually get up to a 3x compression ratio
		  with no performance impact, and often performance will
		  actually improve

		- GZIP - much better compression; on modern 2-socket x86
		  servers you can still easily saturate more than 2Gbps when
		  writing, and much more when reading

		- set the record size to 1MB (Solaris 11 only) or leave it at
		  128KB. This usually improves the compression ratio and often
		  improves performance as well, unless you have a workload with
		  lots of small random reads or writes and a working set much
		  bigger than the amount of RAM available

			- make sure you have the patch so AFS doesn't sync its
			  special files after each meta-data update but rather
			  at the end of the operation (volume restores, etc.);
			  this has been integrated for some time now, although
			  I'm not sure in which OpenAFS release
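
	  As a minimal sketch, assuming your pool is simply named "pool"
	  (adjust the pool name and the algorithm to your environment),
	  enabling compression and a larger record size looks like:

		zfs set compression=lzjb pool    # or compression=gzip for a better ratio
		zfs set recordsize=1M pool       # Solaris 11 only; older releases max out at 128K
		zfs get compression,recordsize pool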

	- do RAID in ZFS if possible

	  This allows ZFS not only to detect data corruption (yes, it does
	  happen) but also to fix it, in a way that is entirely transparent
	  to OpenAFS.

	  For really important data and RW volumes, perhaps do RAID on a disk
	  array and then mirror in ZFS across two disk arrays (see the sketch
	  below).

		- if you want to use one of the RAID-Z levels in ZFS, and you
		  do lots of random physical reads to lots of files, then make
		  sure you run Solaris 11 (pool version 29 or newer, with the
		  RAID-Z/mirror hybrid allocator), or consider a different RAID
		  level, or HW RAID-5 with ZFS on top of it
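
	  A sketch of mirroring in ZFS across two arrays; the pool name and
	  the cXtYdZ device names below are made up - substitute the LUNs
	  actually presented by your arrays:

		# pair one LUN from each array per mirror vdev
		zpool create afspool \
			mirror c1t0d0 c2t0d0 \
			mirror c1t1d0 c2t1d0
		zpool status afspool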

	- disable access time updates

	  OpenAFS doesn't rely on them, so you will save some unnecessary I/O.
	  You can disable them for the entire pool, and by default all file
	  systems within the pool will inherit the setting:

		  zfs set atime=off pool
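
	  To confirm the setting is inherited by every file system in the
	  pool (again assuming a pool named "pool"):

		  zfs get -r atime pool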

	- create multiple vicep partitions in each ZFS pool

	  This is due to poor OpenAFS scalability (a single thread per vicep
	  partition for the initial pre-attachment, etc.).
	  Having multiple partitions allows you to better saturate the
	  available I/O; this is especially true if your underlying storage
	  is fast.
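
	  Each vicep partition can simply be a separate ZFS file system in
	  the same pool; a sketch, with the pool and mount point names just
	  examples:

		  zfs create -o mountpoint=/vicepa pool/vicepa
		  zfs create -o mountpoint=/vicepb pool/vicepb
		  zfs create -o mountpoint=/vicepc pool/vicepc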

	- ZFS on disk arrays

	  By default ZFS sends a SCSI command to flush the cache when it
	  closes a transaction group (with a special bit set, so a disk array
	  should flush its cache only if it is not currently protected).
	  Unfortunately, some disk arrays will flush the cache every time,
	  regardless of whether the bit is set or not, which can affect
	  performance *very* badly. In most cases, when running ZFS on top of
	  a disk array it makes sense to disable sending the SCSI flush
	  command entirely - in Solaris you can do this either per LUN or
	  globally per host. Most disk arrays will automatically go into
	  pass-thru mode if the cache is not protected (dead battery, broken
	  mirroring, etc.).
	  Depending on your workload, disabling cache flushes can
	  dramatically improve performance.

	  Add to /etc/system (and reboot the server, or change it live via mdb):

		set zfs:zfs_nocacheflush = 1
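
	  To change it on a running server without a reboot, the usual mdb
	  incantation is something like the following - verify it on your
	  Solaris release first:

		echo zfs_nocacheflush/W0t1 | mdb -kw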


	- increase the DNLC size

	  If you store millions of files in AFS, then increase the DNLC size
	  on Solaris by adjusting the ncsize tunable in /etc/system (requires
	  a reboot).
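
	  For example, add to /etc/system (pick a value that makes sense for
	  your server):

		set ncsize = 4000000

	  and you can check how the DNLC is doing with:

		kstat -n dnlcstats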

	- put in as much RAM as you can

	  This is an inexpensive way of greatly improving performance, often
	  due to:

		- caching the entire working set in RAM - you will essentially
		  only see writes to your disks. Not only does this make reads
		  much faster, it also makes writes faster, as there is no I/O
		  needed to read data back in
		- ZFS always stores uncompressed data in RAM, so serving the
		  most frequently used data doesn't require any CPU cycles to
		  decompress it
		- ZFS compresses data asynchronously when closing its
		  transaction group (by default every 5s). OpenAFS writes data
		  in async mode to all underlying files (except for special
		  files), so if there is enough RAM to cache all of the writes
		  when restoring a volume or writing some data to AFS, then
		  from a client perspective there is no performance penalty at
		  all from using compression, especially if your workload is
		  bursty. In most cases you are more likely to hit a bottleneck
		  on your network than on your CPUs anyway
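
	  To see how much of that RAM ZFS is actually using for its read
	  cache (the ARC), you can look at the arcstats kstats, e.g.:

		kstat -p zfs:0:arcstats:size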

	- if running OpenAFS 1.6, disable the sync thread (the one which
	  syncs data every 10s).
	  It is pointless on ZFS (and most other file systems), and all it
	  usually does is negatively impact your performance; ZFS will sync
	  all data every 5s anyway.

	  There is a patch to OpenAFS 1.6.x to make this tunable. I don't
	  remember which release it is in.



All the tuning described above comes down to:

	Create more than one vicep partition in each ZFS pool
	zfs set compression=lzjb pool   (or gzip)
	zfs set atime=off pool
	zfs set recordsize=1M pool      (Solaris 11 only)
	Add to /etc/system:	set zfs:zfs_nocacheflush = 1
				set ncsize = 4000000 (or whatever value makes
				sense for your server)



The above suggestions might not be the best for you, as it all depends on
your specific workload.
The compression ratios of course depend on your data, but I suggest at least
trying it on a sample of your data, as you might be nicely surprised.
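
One way to try it, assuming a pool named "pool" and a hypothetical sample
directory, is to create a throwaway compressed file system, copy some
representative data into it, and look at the resulting ratio:

	zfs create -o compression=gzip pool/comptest
	cp -rp /path/to/sample/data /pool/comptest/    # substitute a real sample path
	zfs get compressratio pool/comptest
	zfs destroy pool/comptest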


Depending on your workload and configuration you may also benefit from using
SSDs with ZFS.
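
For example, an SSD can be added as a separate log device (to speed up
synchronous writes) or as an L2ARC cache device (to extend the read cache);
the device names below are made up:

	zpool add pool log c3t0d0        # SSD as a dedicated ZIL/slog device
	zpool add pool cache c3t1d0      # SSD as an L2ARC read cache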



-- 
Robert Milkowski
http://milek.blogspot.com



> -----Original Message-----
> From: openafs-info-admin@openafs.org [mailto:openafs-info-
> admin@openafs.org] On Behalf Of Douglas E. Engert
> Sent: 17 June 2013 14:56
> To: openafs-info@openafs.org
> Subject: Re: [OpenAFS] OpenAFS on ZFS (Was: Salvaging user volumes)
> 
> In June of 2010, we were running Solaris AFS file servers on Solaris
> with ZFS for partitions on a SATAbeast.
> 
> AFS reported I/O errors from read() that were ZFS checksum errors.
> 
> Turned out the hardware logs on that SATAbeast were reporting problems
> but would continue to serve up the bad data.
> 
> Since ZFS is doing checksums when it writes and then again when it
> reads, ZFS was catching intermittent errors which other systems might
> not catch.
> 
> Here is a nice explanation of how and why ZFS does checksum.
> It also points out other source of corruption that can occur on a SAN.
> 
> http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data
> 
> And this one that sounds a lot like our problem!!
> 
> http://blogs.sun.com/elowe/entry/zfs_saves_the_day_ta
> 
> >> And this is one of the reasons why ZFS is so cool :)
> 
> Yes it is cool!
> 
> >>
> > _______________________________________________
> > OpenAFS-info mailing list
> > OpenAFS-info@openafs.org
> > https://lists.openafs.org/mailman/listinfo/openafs-info
> >
> 
> --
> 
>   Douglas E. Engert  <DEEngert@anl.gov>
>   Argonne National Laboratory
>   9700 South Cass Avenue
>   Argonne, Illinois  60439
>   (630) 252-5444
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info