[OpenAFS-devel] vldb version 5

Andrew Deason adeason@sinenomine.net
Mon, 9 Dec 2019 16:47:16 -0600


On Mon, 9 Dec 2019 16:41:18 -0600
Andrew Deason <adeason@sinenomine.net> wrote:

> Motivation
> --
> 
> The immediate motivation for this work is a cell that will probably
> exceed the 32-bit file offset limit in ubik in the not-too-distant
> future. There are a few other things that also need to be changed to
> fix that (ubik protocols, APIs), but the vldb4 disk format itself also
> uses 32-bit offsets to refer to database entries inside hash chains,
> etc. The naive way to fix _that_ means changing all of the relevant
> fields to 64-bits and doing a full database conversion.
> 
> So, if we're doing that anyway, we might as well try to fix some of the
> other limitations in vldb4 at the same time by defining a completely new
> format. The main other benefits of this are to lift the various
> hard-coded limits on various structures (e.g. replication sites), and to
> make it easier to introduce new types of data into the database (e.g.
> ipv6 addresses, encryption keys).

So this raises the question of why we are defining another new format at
all. Ideally, we'd use some existing db like LMDB or SQLite. The problem
with these is that we cannot easily integrate them with ubik. Most
modern local disk db formats really want to interact directly with the
underlying backing files, in order to improve performance with various
caching/mmap()ing tricks, and to guarantee consistency with fcntl locks
and such.

I did look for something that separates the database logic from the
underlying storage, but I couldn't really find anything. This is maybe
possible with some simple/old dbm-style databases, but those seem
simplistic to the point of not being very helpful.

The other option is to avoid ubik entirely, and use some other embedded
distributed database (etcd), or rely on a site's existing sql/nosql
infrastructure. That's maybe possible, but it's a much larger project
that seems out of scope for this, and it potentially brings in a whole
new set of problems.

The way I tend to see that direction evolving is to allow different
configurable backends or something, so an administrator can choose
between creating a ubik-backed vldb, a sql-backed vldb, etc. Doing that
still requires a local disk format as an option (especially for sites
that only need one dbserver), so this "vldb5" stuff still makes sense
even if that's done.

[...]
> Basic structure
> --
> 
> Data types are encoded in XDR (so, big-endian).

I'm not sure if plain XDR is a good idea. I'm trying to use XDR just so
we can use our existing XDR code for serialization, and maybe it's kind
of nice that we can describe the db structures using RPC-L.

But XDR is space-inefficient, especially for smaller bits of data, since
everything is encoded with 4-byte atoms at a minimum. All of our lengths
and such are 32-bit ints, even though pretty much everything fits in a
"record", and so almost everything will be at most 256 bytes (or
certainly under 64k bytes). On the other hand, space-efficiency doesn't
seem like a big problem or priority; I think flexibility with the db
types is more important.
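
As a rough illustration of the overhead, here's a standalone sketch
using the system (or libtirpc) XDR routines rather than our own xdr
code; the point is the encoded sizes, not the particular API:

    #include <stdio.h>
    #include <rpc/xdr.h>    /* libtirpc on newer glibc systems */

    int
    main(void)
    {
        char buf[64];
        char *name = "root.cell";   /* 9 bytes of actual data */
        XDR xdrs;

        xdrmem_create(&xdrs, buf, sizeof(buf), XDR_ENCODE);

        /* xdr_string encodes a 4-byte length, then the bytes padded to
         * a 4-byte boundary: 4 + 12 = 16 bytes for a 9-byte name. */
        xdr_string(&xdrs, &name, 255);

        printf("encoded size: %u bytes\n", xdr_getpos(&xdrs));
        return 0;
    }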

Also, space-inefficiency here doesn't directly mean a larger database.
Because of the record-based structure of the db, the db size depends
more on the chosen recsize; how space-efficient the structures are
dictates how cramped that recsize is, and how often we need to chain
together continuation records.

Defining a new format that is big-endian also maybe sounds crazy or
outdated to some people. I don't think it matters too much, but it would
be pretty easy to define a little-endian variant of XDR in our tree, and
still use most of the existing XDR code (iirc this can be done just by
defining a new xdr backend called e.g. xdrmemle).
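
To make the idea concrete, the actual difference is tiny; something like
the following pair of routines would be the core of such a backend (the
xdrmemle name comes from the example above; the surrounding xdr_ops
plumbing is omitted, and nothing here exists today):

    #include <stdint.h>

    /* Store/fetch a 32-bit value little-endian instead of big-endian;
     * everything else in the backend would be identical to xdrmem. */

    static void
    le_putint32(unsigned char *buf, uint32_t val)
    {
        buf[0] = val & 0xff;
        buf[1] = (val >> 8) & 0xff;
        buf[2] = (val >> 16) & 0xff;
        buf[3] = (val >> 24) & 0xff;
    }

    static uint32_t
    le_getint32(const unsigned char *buf)
    {
        return (uint32_t)buf[0] | ((uint32_t)buf[1] << 8)
               | ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
    }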

It maybe feels more aesthetically pleasing to use a different approach,
like variable-length length encodings or something like protobuf
itself... but just using XDR everywhere sounds simpler to me.

This also assumes using XDR does actually help keep things simpler.
Dealing with so many XDR types tends to mean a lot of boilerplate, which
is cumbersome, but I assume is still better than working with flat
structs.

[...]
> The size of each record is specified by a value in the root/header
> record; the "recsize". The recsize must be a power of two, as it's
> currently specified as log2 of the actual recsize. I'm currently
> guessing the typical value to be somewhere between 256 bytes and
> 4k.

The recsize chosen, of course, has a big impact on how big the db is.
The size of the db can be important, since a bigger db makes ubik resync
events take longer. Using something like 1k or 4k feels like it's
future-proofing for more data to be added to records, but even just 256
bytes doesn't feel terribly small for existing volume records. That's
larger than the current pseudo-record-size for vldb4's volumes with flat
structs (148 bytes iirc); flat structs are usually more space-efficient,
but since our site lists etc are variable-length in vldb5, entries can
actually be smaller for common volumes that don't have many sites.
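
One nice property of storing the recsize as log2 is that record
addressing is just shifts; a minimal sketch (the struct and function
names here are made up):

    #include <stdint.h>

    struct vl5_header {
        uint8_t log2_recsize;   /* e.g. 8 -> 256-byte records, 12 -> 4k */
    };

    /* Convert between record numbers and byte offsets in the ubik file. */
    static uint64_t
    recno_to_offset(const struct vl5_header *hdr, uint64_t recno)
    {
        return recno << hdr->log2_recsize;
    }

    static uint64_t
    offset_to_recno(const struct vl5_header *hdr, uint64_t offset)
    {
        return offset >> hdr->log2_recsize;
    }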

[...]
> Fundamental info
> --
> 
> The first record in the db (recno 0) is the root record, VL5_TAG_ROOT_BE
> (0x05) in big-endian, so the first 4 bytes in the stream are 0x00 0x00
> 0x00 0x05.  For big-endian encoding, the first value in that record must
> be VL5_TAG_FUNINFO_BE (0x55555555), which contains information that's
> fundamental to interpreting the rest of the db: the recsize, and the
> encoding style / endianness.

The endianness field is sorta redundant, since you need to understand
the endianness to interpret the value anyway. But it's maybe a nice extra
sanity check that you're parsing things correctly; and the space is
"free" since we do at least need the recsize, and the recsize only takes
1 byte.

I don't know if all this fiddling about with the endianness is really
helpful; if we don't think we'll ever really care about changing
endianness, maybe this is pointless effort. But it's not terribly
complicated or hard to do.
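
For instance, the initial check when opening the db is just something
like this (a sketch; VL5_TAG_ROOT_BE is from the draft, the function
name is made up):

    #include <stdint.h>

    #define VL5_TAG_ROOT_BE 0x05

    /* Decode the first 4 bytes of the db as a big-endian tag and make
     * sure it's the root record tag.  A little-endian variant would
     * presumably be detected the same way, with the bytes in the other
     * order. */
    static int
    vl5_check_root_tag(const unsigned char *buf)
    {
        uint32_t tag = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16)
                       | ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
        return tag == VL5_TAG_ROOT_BE;
    }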

[...]
> Volume entry lookups
> --
> 
> To look up volume entries, the proposed vldb5 uses b+ trees to find
> entries. There is one tree to lookup by 64-bit volume id, and one tree
> to lookup by volume name.

Why b+ trees, and not a simple hash table? vldb4 uses plain hash
tables, but they have at least a couple of obvious problems:

- The hash table size is fixed, at 8191 buckets, so the hash chains can
  grow very long for large databases.

- The hash chains are chained through the volume entries themselves.
  So if you need to traverse 10 entries on a chain, that means reading
  10 different volume entries (which are almost always not right next
  to each other).

Fixing the issues with chaining volume entries together is easy: we just
store the chain information somewhere else in a dedicated structure. And
fixing the hash table size is easy: we can just allow for a configurable
fixed size, or use an extendible hashing scheme to allow the table to be
resized without needing to rehash everything.

However, storing the hash table itself is trickier with the block-based
records. As soon as the number of hash buckets exceeds the size of a
record, we need to read multiple records just to index into the hash
table. Simple approaches tend to involve O(n) reads, which is back to
the issues with the vldb4 chains.

I'm fairly certain there are ways to make this work better, but doing
that sounds like it requires some clever tricks or other slightly
complex approaches. The b+ tree approach is more of a standard
implementation of the relevant algorithms, and doesn't need anything
special.
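
To give an idea of what the lookup path looks like, here is a rough
sketch of descending the id tree (the structures and names are made up
for illustration, not the actual proposed encoding; read_node() stands
in for "read one ubik record and decode it"):

    #include <stdint.h>

    struct vl5_btnode {
        int is_leaf;
        int nkeys;
        uint64_t keys[64];       /* volume ids (or name hashes) */
        uint64_t children[65];   /* recnos: child nodes, or volume
                                  * records when this is a leaf */
    };

    struct vl5_btnode *read_node(uint64_t recno);

    /* Returns the recno of the volume record, or 0 if not found. */
    static uint64_t
    vl5_bt_lookup(uint64_t root_recno, uint64_t volid)
    {
        struct vl5_btnode *node = read_node(root_recno);
        int i;

        while (!node->is_leaf) {
            /* Find the first key greater than volid, and descend into
             * the child to its left.  (A real implementation would do a
             * binary search within the node.) */
            i = 0;
            while (i < node->nkeys && volid >= node->keys[i])
                i++;
            node = read_node(node->children[i]);
        }
        for (i = 0; i < node->nkeys; i++) {
            if (node->keys[i] == volid)
                return node->children[i];
        }
        return 0;
    }

The number of records read is just the height of the tree, which stays
small even for very large databases, and each read is a single record.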

A hash table approach sounds easier to me if we could use multiple file
streams in ubik; then the hash table would be a separate file that could
be easily indexed into. The concept of multiple file streams does exist
in ubik, but we've never used anything besides file 0 (and file -1 for
the log), and various pieces of code assume file 0 is the only one. I
assume allowing multiple file streams again would involve a lot of work
in ironing out annoying quirks and bugs, so it doesn't sound appealing.

Comparing to other software: most local databases use a normal hashing
scheme for data like this, but the index/table is stored separately. B+
trees are of course popular, but are used primarily for their
sortability, which we don't care about.

If we assume a single file, then local filesystems are maybe a more
relevant comparison. In that space, I think b+ trees are more common for
modern filesystems (XFS, btrfs, JFS, HFS+, ext[234] with -O dir_index).
ZFS is the only example I am aware of that uses an extendible hashing
scheme.


Another alternative is to use plain b-trees (where the non-leaf nodes
also contain values for exact matches). This has the potential advantage
that some entries can be looked up with fewer disk reads, since we don't
need to descend all the way to the leaves. Theoretically this means we
could arrange the tree such that frequently-accessed volumes get stored
in nodes closer to the tree root.

However, this requires tracking access counts for such volumes, and
causes additional writes to the db to rearrange the tree. That doesn't
sound great (at least when done automatically), especially since
different vlservers may have different sets of volumes that are accessed
frequently. Deleting nodes in b-trees is also a little more complicated,
just in general.

So any such benefits don't seem worth the added complexity; it seems
easier to improve performance in this area via application-level caching
instead.

[...]
> For the name tree, the keys are the crc32c "hash" for the name. If there

Instead of hashing the name strings, it is possible to use the name
strings themselves as keys in the tree. But using a hash as the key has
a couple of advantages:

- The hash value is smaller than our strings, so we can store more
  entries per b+ tree node and get better fanout.

- The structure layout is a bit simpler, since the hash value is just an
  integer. (But for XDR this maybe doesn't matter so much.)

As for the specific hash algorithm, crc32c ("crc32-castagnoli") is a
popular choice in other projects, for a few reasons:

- Because we are using the full hash value (not truncating it), we don't
  care about the distribution of the function and avalanche effects,
  etc. We only need different strings to have different hash values,
  even if they only differ by a single bit. crc32c is actually better at
  collision avoidance than many other algorithms for common strings,
  even though it may have much worse distribution.

- There are no security ramifications of collisions.

- CRC hardware acceleration exists. This is more likely the "real"
  reason it's used for other projects; admittedly it doesn't matter so
  much for us. But more generally:

- It's boring: a common, well-known and well-understood algorithm.
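
For reference, this is the variant I mean; a minimal bitwise
implementation (real code would presumably use a table-driven or
hardware-accelerated version, this just pins down the polynomial):

    #include <stddef.h>
    #include <stdint.h>

    /* crc32c ("Castagnoli"), reflected polynomial 0x82F63B78. */
    static uint32_t
    crc32c(const void *data, size_t len)
    {
        const unsigned char *p = data;
        uint32_t crc = 0xffffffff;
        int bit;

        while (len-- > 0) {
            crc ^= *p++;
            for (bit = 0; bit < 8; bit++)
                crc = (crc >> 1) ^ (0x82f63b78 & -(crc & 1));
        }
        return crc ^ 0xffffffff;
    }

The key stored in the name tree for a volume would then just be the
crc32c of the name bytes.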

We could use the opr jhash (the Jenkins lookup3 function), but I'm not
sure that's really what it's for. jhash lives more in the family of hash
functions that are about cpu speed for in-memory hash tables and such.
For example, a newer Jenkins hash, SpookyHash, apparently produces
different results on CPUs of different endianness; other algorithms like
CityHash are clearly targeted at x86 (and are much more complex!).

Comparing to other software: btrfs, iSCSI, and SCTP use crc32c. XFS and
vldb4 use some internal custom hash, and ext[234] with dir_index can use
TEA, md4, or some custom thing.

HFS+ and JFS use the actual key strings as keys, for directory entry
filename trees. JFS uses some kind of compression technique to reduce
the size of the keys that actually need to be stored; I didn't see
evidence of HFS+ doing this, but I didn't look very hard.

Another option would be a secure hash, like a SHA variant. As far as I
can tell, the only downside to this is cpu speed, which doesn't matter
so much for us. But crc32c seems good enough, and is somewhat standard,
so I don't see a reason to use anything else.

Another option is a crc64 variant (or anything with 64-bit results);
that would reduce collisions at the cost of increased key size. I assume
that's not really worth it now (even with ~4 billion volumes, crc32c
should have relatively few full collisions), but changing the hash does
require redoing the whole tree, so maybe it's better to do this up-front
to avoid a later migration?

> is no collision for the name, the leaf points directly to the volume's
> record. If there are collisions, the leaf points to a "collision record"
> (VL5_TAG_VOL_NAMEHTREE_COLL_REC, 0x2C), which contains a simple array
> containing (volume name, recno) pairs, sorted by name.

Specifically, if there are collisions, the leaf points to a "collision
record" (VL5_TAG_VOL_NAMEHTREE_COLL_REC, 0x2C), which contains the
colliding (volume name, recno) pairs, sorted by name.
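
Once the collision record has been read and decoded, finding the right
entry is trivial; something like this (a sketch operating on a
hypothetical decoded form of the record, with made-up names):

    #include <stdint.h>
    #include <string.h>

    struct vl5_coll_ent {
        char name[256];     /* volume name */
        uint64_t recno;     /* recno of the volume's record */
    };

    /* Binary search the sorted (name, recno) pairs; returns the recno,
     * or 0 if there is no exact match. */
    static uint64_t
    vl5_coll_lookup(const struct vl5_coll_ent *ents, int nents,
                    const char *name)
    {
        int lo = 0, hi = nents - 1;

        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int cmp = strcmp(name, ents[mid].name);
            if (cmp == 0)
                return ents[mid].recno;
            if (cmp < 0)
                hi = mid - 1;
            else
                lo = mid + 1;
        }
        return 0;
    }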

If the number of collisions becomes large, this could become a problem,
because then we're just traversing a long hash chain in a flat list
(which can involve O(n) disk reads if we exceed the size of a record).
I'm assuming that's not a big concern, since that only happens when the
hash value fully collides. But another possibility here is to store the
collisions in another b+ tree, using the name strings themselves as
keys instead.

And of course, the other solution here is to not care so much about the
performance of such a lookup operation, since we can solve that by
caching the results in memory at the application level.

> Fileserver/partition indirection
> --
> 
> Volume records contain, among other things, the list of sites where that
> volume resides. We indicate a volume site's location by "partition id",
> an internal id number that gets mapped to vicepX, fileserver Y in a
> global table. And of course "fileserver Y" refers to a fileserver id
> that maps to a fileserver entry that contains the fileserver's address
> list, uuid, and other information.

This is just an extra layer of indirection, which makes it easier to
"move" a partition from one fileserver to another. The VL RPCs don't
really accommodate doing that, but we can at least make it so it doesn't
require O(n) database updates.
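
In rough terms, the mapping looks something like this (sketched as C
structs with made-up field names; the real thing would be XDR-defined
records):

    #include <stdint.h>

    struct vl5_partition {
        uint64_t partid;    /* the id stored in volume site lists */
        uint64_t fsid;      /* which fileserver entry it lives on */
        uint32_t vicep;     /* 0 -> /vicepa, 1 -> /vicepb, ... */
    };

    struct vl5_fileserver {
        uint64_t fsid;
        /* uuid, address list, etc. */
    };

Moving a partition to a different fileserver then only means rewriting
the one partition entry, instead of touching every volume record that
references that partid.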


Also, some general notes on tradeoffs and performance:

The main priority with this design is future flexibility, which
sometimes comes at the cost of i/o performance, cpu performance, and db
size.

Currently, i/o performance should be better than vldb4 (by which I mean,
how many ubik reads we need to do to satisfy a request), but I don't
think it's worth spending too much effort on this. Any problems I'm
aware of with reading too much data from ubik should be solvable by
in-memory caches at the application level; so I'm just trying to keep
things easily cacheable.

Earlier on, I considered keeping some cache generation counters inside
the db to make it easier to detect when a record changes, to allow for
easier caching. However, after discussing this with others, I feel like
it's simpler to just do this at the level of the application and ubik
itself. For example: if we want to cache the mapping of fileservers to
addresses, how does the application know when that data changes, so it
can invalidate the cache? What we can do is let the application see when
a ubik address range is written to, and invalidate the relevant cache
whenever the fileserver record containing that information sees a
write.
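
A rough sketch of what I mean (all of the names here are made up): the
application remembers which record backed each cached item, and a hook
on ubik writes throws away anything whose record overlaps the written
range.

    #include <stdint.h>

    struct vl5_cache_ent {
        uint64_t recno;     /* record this cached data was decoded from */
        int valid;
        /* ...cached addresses, etc... */
    };

    /* Called when the application sees a ubik write of 'length' bytes
     * (assumed > 0) at 'offset'; invalidate any overlapping entries. */
    static void
    vl5_cache_note_write(struct vl5_cache_ent *ents, int nents,
                         uint64_t offset, uint64_t length,
                         uint32_t log2_recsize)
    {
        uint64_t first = offset >> log2_recsize;
        uint64_t last = (offset + length - 1) >> log2_recsize;
        int i;

        for (i = 0; i < nents; i++) {
            if (ents[i].valid
                && ents[i].recno >= first && ents[i].recno <= last)
                ents[i].valid = 0;
        }
    }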

As for CPU performance, that's primarily XDR serialization, endian
fixing, etc. Since this is done outside of ubik and so is completely
parallelizable, I don't think this is a big concern.

Db size could be an issue; but as noted above, it's somewhat tunable via
the recsize. I feel like issues in this area can be mitigated by fixing
other layers (e.g. just making ubik resyncs less noticeable), so maybe
it doesn't really matter if the db gets much larger.

-- 
Andrew Deason
adeason@sinenomine.net