[OpenAFS-devel] vldb version 5

Mon, 9 Dec 2019 16:41:18 -0600

I've recently been working on a redesign of the vldb on-disk format
(version 5, "vldb5"), together with some others at SNA. This is still in
the early stages, but I wanted to provide a rough description of what
I'm working with so far, to solicit feedback and give others a chance to
raise objections.

In this email, I'm just trying to stick to describing the more
interesting aspects of the new format; followup email will explain a
little more about the relevant motivations, and possible concerns I
have. But I'm not trying to provide a full spec for the format here;
this is just informally describing the design and various features.

This is also not intended to cover other practical matters, like how the
vlserver will deal with db format upgrades/downgrades. This is just
about the new db format itself.

Feedback is appreciated.

Motivation
--

The immediate motivation for this work is a cell that will probably
exceed the 32-bit file offset limit in ubik in the not-too-distant
future. There are a few other things that also need to be changed to fix
that (ubik protocols, APIs), but the vldb4 disk format itself also uses
32-bit offsets to refer to database entries inside hash chains, etc. The
naive way to fix _that_ means changing all of the relevant fields to
64-bits and doing a full database conversion.

So, if we're doing that anyway, we might as well try to fix some of the
other limitations in vldb4 at the same time by defining a completely new
format. The main other benefits of this are to lift the various
hard-coded limits on various structures (e.g. replication sites), and to
make it easier to introduce new types of data into the database (e.g.
ipv6 addresses, encryption keys).

Code
--

To follow along with concrete code examples, I have a rough Python
script here
<https://github.com/adeason/afs-tools/blob/adeason/vldb5/debug/vldbconv>
for converting between vldb4, vldb5, and yaml. No actual vlserver
implementation or any C yet.

I also have a format description for vldb5 in the Kaitai Struct language
here
<https://github.com/adeason/afs-tools/blob/adeason/vldb5/debug/ubik_vldb5.ksy>.
This can be used with <https://ide.kaitai.io> to get a kind of
interactive explorer for the binary format. (It's pretty low-level, just
to decode the structs, no knowledge of what the data actually
represents.)

Basic structure
--

Data types are encoded in XDR (so, big-endian).

The database consists of a sequence of fixed-size blocks that I call
"record"s (VL5_Record in the above script). Each record starts with a
4-byte tag, indicating what the record is for (a volume, a fileserver,
etc). The tag is then followed by the record's payload which consists of
a sequence of "value" objects. Each "value" consists of a 4-byte tag, a
4-byte length, and 'length' bytes of payload. Because every value has an
explicit length, most values with an unknown tag can be ignored by the
vlserver, while still allowing for the remaining values to be parsed.

Records and values share the same tag namespace, and there are no
record-specific tags or anything like that. Overloading the constants
for tags is not allowed; we have 2^32 of them after all, which is more
than enough.

The size of each record is specified by a value in the root/header
record; the "recsize". The recsize must be a power of two, as it's
currently specified as log2 of the actual recsize. I'm currently
guessing the typical value to be somewhere around between 256 bytes and
4k.

Anything directly pointing to another item in the db does so by
referencing that item's record number/recno as 64-bit int (not the file
offset); recno N is simply at offset N*recsize. So theoretically,
increasing the recsize can be done just by moving some data around,
without needing to actually interpret any of the actual data inside
various structures beyond the header.

A small partial example:

Say we have a record for a volume. The record's tag is VL5_TAG_VOLUME
(0x1D), and contained in that record is value VL5_TAG_VOL_NAME (0x1E)
for the volume's name. So we have the data (in hex):

00 00 00 1D # VL5_TAG_VOLUME
00 00 00 1E # VL5_TAG_VOL_NAME
00 00 00 0C # value payload length (12 bytes)
00 00 00 07 # xdr-encoded string "vol.foo"
76 6F 6C 2E
66 6F 6F 00

And then other values for the volume would immediately follow, until the
end of the record is reached, or we encounter a value tag of 0x00
(VL5_TAG_NULL).

Fundamental info
--

The first record in the db (recno 0) is the root record, VL5_TAG_ROOT_BE
(0x05) in big-endian, so the first 4 bytes in the stream are 0x00 0x00
0x00 0x05.  For big-endian encoding, the first value in that record must
be VL5_TAG_FUNINFO_BE (0x55555555), which contains information that's
fundamental to interpreting the rest of the db: the recsize, and the
encoding style / endianness.

To indicate little-endian encoding, a different tag is used:
VL5_TAG_FUNINFO_LE (0x66666666), and the root record tag is instead
defined as VL5_TAG_ROOT_LE (0x05000000). (The code examples provided
don't deal with little-endian, though.)

If a vlserver fails to understand the fundamental info inside one of
those values, or if the first value isn't one of those _FUNINFO_ values,
the vlserver bails out with an error. Unlike most other values, we can't
continue to parse the database if we fail to understand this first one,
since it's needed to understand anything else.

Continuation records
--

If a record needs more data than fits in a single database record, we
can link together multiple records, using a VL5_TAG_CONTINUE_PTR (0x07)
value. The idea is that when something is parsing a record, if it comes
across such a value, we read the indicated continuation record(s), and
just concatenate the payload to the payload of the current record
(ignoring VL5_TAG_NULL values), and parsing continues.

The continuation record that gets pointed to still has a record header
(tagged with VL5_TAG_CONTINUE_REC, 0x08). There are no "extent"-style
records that e.g. claim the next 3 records; all records still have a tag
as the first 4 bytes.

Example:

recno 0:
00 00 00 05 # VL5_TAG_ROOT_BE
55 55 55 55 # VL5_TAG_FUNINFO_BE
00 00 00 04 # length 4
08 01 00 00 # recsize=2^8=256, big-endian
00 00 00 07 # VL5_TAG_CONTINUE_PTR
00 00 00 0C # length 12
00 00 00 02 # xdr-encoded array [1,2]
00 00 00 01
00 00 00 02
00 00 00 00 # VL5_TAG_NULL
...

recno 1:
00 00 00 08 # VL5_TAG_CONTINUE_REC
00 00 00 09 # VL5_TAG_CONTINUE_HDR
00 00 00 04 # length 4
00 00 00 00 # previous recno 0
XX XX XX XX # continuation data
...
00 00 00 00 # VL5_TAG_NULL

recno 2:
00 00 00 08 # VL5_TAG_CONTINUE_REC
00 00 00 09 # VL5_TAG_CONTINUE_HDR
00 00 00 04 # length 4
00 00 00 01 # previous recno 1
YY YY YY YY # continuation data
...
00 00 00 00 # VL5_TAG_NULL

When parsing recno 0, when the VL5_TAG_CONTINUE_PTR value is
encountered, the specified continuation records are read, and appended
to the payload of recno 0. So then the logic view of recno 0 looks like
this:

00 00 00 05 # VL5_TAG_ROOT_BE
55 55 55 55 # VL5_TAG_FUNINFO_BE
00 00 00 04 # length 4
08 01 00 00 # recsize=2^8=256, big-endian
XX XX XX XX # payload from recno 1
...
YY YY YY YY # payload from recno 2
...
00 00 00 00 # VL5_TAG_NULL

Volume entry lookups
--

To look up volume entries, the proposed vldb5 uses b+ trees to find
entries. There is one tree to lookup by 64-bit volume id, and one tree
to lookup by volume name.

For the id tree, the volume ids themselves are the keys in the trees,
and the leaves point directly to the volume's record.

For the name tree, the keys are the crc32c "hash" for the name. If there
is no collision for the name, the leaf points directly to the volume's
record. If there are collisions, the leaf points to a "collision record"
(VL5_TAG_VOL_NAMEHTREE_COLL_REC, 0x2C), which contains a simple array
containing (volume name, recno) pairs, sorted by name.

Each node in the tree is a db record (e.g. VL5_TAG_VOL_IDTREE_REC,
0x24), and we just store as many keys/children per node as we can fit in
a record. The root node in the name tree also contains a small value
(VL5_TAG_VOL_NAMEHTREE_ROOTINFO, 0x2B) to indicate some metadata
(currently just which algorithm is being used for the tree key hash
values).

Volume entries don't point/link to other volume entries in any way. To
list all volumes in the db, we just look through every record, looking
for volume records.

Fileserver/partition indirection
--

Volume records contain, among other things, the list of sites where that
volume resides. We indicate a volume site's location by "partition id",
an internal id number that gets mapped to vicepX, fileserver Y in a
global table. And of course "fileserver Y" refers to a fileserver id
that maps to a fileserver entry that contains the fileserver's address
list, uuid, and other information.

The mappings of partition/fileserver IDs to the relevant data are stored
in simple arrays referenced by the root record:
VL5_TAG_FILESERVER_LIST_PTR (0x14), VL5_TAG_PARTITION_LIST_PTR (0x17).
This data is intended to be cached by the application in memory (since
it rarely changes), so the layout of these structures is not so
important; no trees or hashes, etc, are needed.

General-purpose notes
--

One feature that is completely new to the db is the ability to add basic
arbitrary user-defined "note"s to objects in the database, such as
volumes or fileservers. A volume record, for instance, can contain a
VL5_TAG_NOTE_PTR (0x1A) value, which points to a VL5_TAG_NOTE_REC (0x1B)
record that just contains a string set by a user.

This is intended to either be an ad-hoc human readable string set by
administrators (e.g. "let user@example.com know before moving this
volume --adeason"), or could contain yaml/json/etc-encoded data for use
by automation tools on top of OpenAFS.

-- 
Andrew Deason
adeason@sinenomine.net