0040: olap survey, lobster, feldera, innovation, wizard papers, umbra papers, olap papers

Published 2023-09-29

I published a shallow survey of OLAP and HTAP query engines.

The last 2/3rds or so of this post contains all the supporting notes. Also a lot of papers on strategies for low-latency compilation.

lobster

A surprisingly neat little language, exploring a lot of ideas that I've been pondering.

Performance is apparently somewhat better than lua atm but the language design seems like it should allow ocaml-ish performance.

Heavy use of non-escaping closures, with a cute pythonic syntax:

def sierpinski(depth) -> void:
    if depth:
        gl_scale 0.5:
            for(directions) d:
                gl_translate d:
                    sierpinski(depth - 1)
    else:
        gl_polygon(directions)

Those closures can return to functions other than themselves. Return defaults to the nearest named block but can also specify a function name to return to. It's not clear how that actually works. But in the implementation of exceptions it kind of looks like it can scan the callstack at runtime to find the nearest try function to return to. I guess since closures can't escape the stack, and their usage is monomorphized, you can always figure out at compile time if the return is valid?

Also monomorphizes on both type and ownership (to improve refcount elision), which doesn't seem to hurt the compilation time much. Typechecks after monomorphization, like zig.

Memory model seems similar to julia - structs declared as immutable can be allocated inline / on the stack.

Includes a graphical debugger, which is wild.

Also worth watching this talk about their use of lobster in their game engine.

feldera

I somehow failed to notice that DBSP became a startup earlier this year.

I poked around a bit inside the code but there is a lot of it.

> scc
-------------------------------------------------------------------------------
Language                 Files     Lines   Blanks  Comments     Code Complexity
-------------------------------------------------------------------------------
Java                       409     49833     4616     11228    33989       2914
Rust                       373    154141    16441     20910   116790       6914
TypeScript                 251     17924     1448      1910    14566       1629
Python                     113     11168     1558      1525     8085        340
Markdown                    55      6145     1352         0     4793          0
LaTeX                       29     11674     1442       360     9872          0
Shell                       21      1495      125       173     1197        102
Plain Text                  20     10176        0         0    10176          0
YAML                        19      1144      139       293      712          0
TOML                        18       743       87        12      644          8
gitignore                   15       155       28        29       98          0
CSV                         11     27254        0         0    27254          0
SQL                         11       498       32        52      414         15
JSON                        10      1430        6         0     1424          0
SVG                          9         9        0         0        9          0
Dockerfile                   4       263       42        48      173         29
JavaScript                   4       488        5        45      438         18
TypeScript Typings           4        37        6         9       22          1
BASH                         3       151       19        33       99         17
License                      3        78       16         0       62          0
Makefile                     3        85       14        22       49          0
Patch                        3       105       11         0       94          0
CSS                          2        38        4         0       34          0
Freemarker Template          2       319       20        16      283          0
HTML                         2        23        0         0       23          0
Docker ignore                1        10        0         1        9          0
Jupyter                      1       454        0         0      454          0
Visual Basic for Ap.         1      2922        0         0     2922          1
XML                          1       238        5        16      217          0
-------------------------------------------------------------------------------
Total                     1398    299000    27416     36682   234902      11988

The java seems to be from calcite. I can see why they wouldn't want to boil that ocean (although in this talk they mention difficulties getting calcite to be compatible with other sql dialects, or to share semantics with feldera for constant evaluation). Having a whole separate java runtime hanging around is a bit clunky but seems fine if the intent is to be cloud-native from the start.

It looks like they're also jitting the query plans - https://github.com/feldera/feldera/blob/main/crates/dataflow-jit/. That is an ocean-boiling project but I can see the motivation - materialize pays a lot of interpreter overhead. But I suspect that query planning should be the first hill to die on - compilation won't save you from bad join ordering decisions, and we don't have a good theory for query planning for incremental maintenance yet.

No mention of storage, which I think was the most common feature demanded for materialize.

innovation

https://newsletter.mollywhite.net/p/the-venture-capitalists-dilemma

For all the talk of unbridled innovation, venture capital services only very specific types of innovation: those that stand to produce large exits for investors, and with relatively low risk, regardless of whether the business itself holds much promise or provides any societal benefit.

https://slate.com/technology/2023/03/silicon-valley-bank-rescue-venture-capital-calacanis-sacks-ackman-tantrum.html

For the past 10 years venture capitalists have had near-perfect laboratory conditions to create a lot of money and make the world a much better place. And yet, some of their proudest accomplishments that have attracted some of the most eye-watering sums have been: 1) chasing the dream of zeroing out labor costs while monopolizing a sector to charge the highest price possible (A.I. and the gig economy); 2) creating infrastructure for speculating on digital assets that will be used to commodify more and more of our daily lives (cryptocurrency and the metaverse); and 3) militarizing public space, or helping bolster police and military operations.

A fast in-place interpreter for WebAssembly

Wasm wasn't designed with interpreting in mind. Other wasm interpreters, including wasm3, first translate wasm to some internal bytecode that is easier to interpret. For large wasm modules this can hurt startup time and use a lot of extra memory. The interpreter in the paper instead augments wasm with some small metadata to speed up the interpreter.

Just eyeballing the graphs, the result is 2-3x slower than wasm3 but ~10x faster to start up and has ~5x less space overhead.

Branching instructions in wasm encode their destination in terms of relative nesting depth. They don't include information about where the next instruction is or how many values to pop from the stack. But that information is computed during validation, so the interpreter just needs to save it somewhere.

The lookup for that metadata needs to be fast and not use too much space, so neither a hashtable nor an array (keyed by instruction pointer) is suitable. The solution is neat: For branches that aren't taken, the next branch's metadata is always in the next slot. For branches that are taken, the metadata includes the relative offset for the next metadata.
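
A minimal sketch of how an interpreter loop might consume such a side table (the entry fields and their layout are my guesses, not the paper's):

// Hypothetical side-table entry; the real metadata layout differs.
#[derive(Clone, Copy)]
struct BranchEntry {
    target_ip: usize, // bytecode offset to continue from if the branch is taken
    pop_count: u32,   // values to discard from the value stack
    arity: u32,       // values to carry over to the branch target
    next_delta: i32,  // relative offset to the next relevant side-table entry if taken
}

struct SideTable<'a> {
    entries: &'a [BranchEntry],
    cursor: usize, // metadata slot for the next branch the interpreter will reach
}

impl<'a> SideTable<'a> {
    // Called whenever the interpreter executes a branch instruction.
    fn on_branch(&mut self, taken: bool) -> Option<BranchEntry> {
        let entry = self.entries[self.cursor];
        if taken {
            // Jump sideways in the side table by the recorded relative offset.
            self.cursor = (self.cursor as i64 + entry.next_delta as i64) as usize;
            Some(entry)
        } else {
            // Not taken: the next branch's metadata is simply the next slot.
            self.cursor += 1;
            None
        }
    }
}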

Wasm is stack-oriented. For compilers this is usually just an encoding detail - the stack effects are all statically known so they're easy to map into dataflow. But an interpreter has to actually model the stack directly. Values are all stored unboxed in the stack. For the sake of the garbage collector a 1 byte type tag is added to distinguish gced pointers (precomputing stack maps would be too slow). Eliminating tags (eg if the gc extension is disabled) provided an 8% speedup. Stack overflow is detected by a guard page.

The interpreter is written in assembly and reserves a dedicated register for important variables.

Instructions are dispatched via 256-entry dispatch tables. Instructions are LEB-encoded, so any opcode that starts with a 1 dispatches to a special handler that decodes the LEB and then jumps back into a dispatch table, providing a 5% speedup over doing the decoding inline.

They implemented both direct and threaded dispatch. The latter provided a 14% speedup.

For profiling and setting breakpoints, a wasm opcode can be overwritten with a special instruction that causes the interpreter to execute a handler loaded from the probe table before continuing with the original bytecode. For debugging, a global flag can be set which causes the probe table to be used for every instruction.

Whose baseline (compiler) is it anyway?

An overview of baseline compilers for wasm, followed by a comparison to the author's new baseline compiler in Wizard.

(Reports that, anecdotally, the simplest baseline compilers often provide 3-10x runtime perf over an interpreter.)

Wasm is designed for single-pass compilation. Can maintain an abstract representation of the value stack, tracking register allocations and doing basic constant propagation (not just for constant-folding, but for better instruction selection eg using immediate operands).

The abstract representation for a value on the abstract stack is subtle. My first thought would have been to write:

union(enum) { 
  constant: Value, 
  register: Register, 
  stack,
}

But consider a sequence like (local.get 0) (local.set 1) (local.get 1). If local 0 is loaded into eax and then written back into the stack location for local 1, then the next (local.get 1) can just read from eax instead of issuing a load. So we actually want:

union(enum) { 
  constant: Value, 
  register: Register,
  // Has been saved to the stack but may also still be hanging around in a register
  stack: ?Register, 
}

This seems like a minor point, but it provided 25% speedup on some benchmarks!
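
To make the difference concrete, here's a sketch of the bookkeeping in Rust rather than the Zig-ish pseudocode above (the shape of the compiler state is my own):

#[derive(Clone, Copy, Debug, PartialEq)]
enum AbstractValue {
    Constant(i64),
    Register(u8),                    // lives only in this register
    Stack { cached_in: Option<u8> }, // spilled to its stack slot, maybe still in a register
}

struct State {
    locals: Vec<AbstractValue>,
}

impl State {
    // local.set: the value is stored to the local's stack slot, but if it was
    // in a register we remember that the register still holds a copy.
    fn local_set(&mut self, local: usize, value: AbstractValue) {
        let cached_in = match value {
            AbstractValue::Register(r) => Some(r),
            AbstractValue::Stack { cached_in } => cached_in,
            AbstractValue::Constant(_) => None,
        };
        self.locals[local] = AbstractValue::Stack { cached_in };
    }

    // local.get: if the spilled value is still cached in a register, hand the
    // register back instead of emitting a load from the stack slot.
    fn local_get(&self, local: usize) -> AbstractValue {
        match self.locals[local] {
            AbstractValue::Stack { cached_in: Some(r) } => AbstractValue::Register(r),
            other => other,
        }
    }
}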

Loops are tricky. When we branch back to the start of the loop we're going to need to restore the current abstract environment. There is some tradeoff between how much we get to assume at the start of the loop vs how much work we have to do to restore that state. And because we're going single-pass, we don't even know which locals the loop body might use or change. To make life harder, wasm compilers also have to deal with adversarial input so it's important that all the merging and shuffling calculations don't have any expensive edge cases.

Supporting debugging is tricky. Most of the compilers studied simply don't. Wizard uses the same stack representation for both the interpreter and the compiler, so that when switching to the interpreter for debugging it's easy to display the values of variables.

Wizard uses value tagging for gc. Other compilers either don't support the wasm gc extension, or reuse the stack map support from a js engine. To reduce the overhead of writing tags, wizard only stores them at yield points. The overhead vs no tags at all varied between benchmarks, 0.9-4.9%. Whereas always storing tags could be up to 3x slower - this seems surprising, but I suppose it means a lot more stores compared to the values themselves which often stay in registers.

Harmonizing Classes, Functions, Tuples, and Type Parameters in Virgil III

The language that Wizard is written in.

Brief and tantalizing mention of staged compilation.

Primitives, arrays, tuples, functions, classes. Single inheritance. No interfaces.

Type cast operator for conversions and downcasts.

Tuples are immutable and so are covariant.

Classes and functions can take type parameters eg List<T>. Type parameters are stored for use in casts, rather than erased.

Object methods are bound to the object. Interfaces can be emulated by returning an object of bound methods. Since the interface object can be parameterized by types, this also emulates abstract datatypes.

Closed adhoc polymorphism can be emulated with type tests/casts. Open adhoc polymorphism can be emulated by a dynamic list of dispatch functions.

Sum types can be emulated by subtyping and type tests/casts (at the cost of not checking exhaustiveness).

Classes are invariant in type parameters (because the representation varies) but function variance can be used as a workaround.

Virgil doesn't distinguish between the types of f(a: int, b: int) and g(a: (int, int)) which can lead to tricky calling conventions. Workaround is to make the calling convention unpacked scalars for all functions. A generalization of SROA. This requires monomorphization.

In their experience, monomorphization has not led to too much code bloat. (But only ~100kloc of Virgil have ever been written).

Runtime and gc are also written in Virgil. (Presumably this relies on the absence of implicit heap allocation).

Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask

The authors implemented a query compiler and vectorized interpreter in the same (toy) database system for an apples-to-apples comparison on (a subset of) TPC-H and SSB.

They verified that each implementation, despite being a toy, performs similarly to production-grade equivalents with the same strategy.

Overall performance is not that different. I wanted to compare codesize but it looks like they just manually typed out c++ that a compiler would have emitted, rather than actually writing the compiler. Also they only did this for 5 queries. (This makes sense as a cheap way to decide which strategy to use for their database, but wasn't super obvious from the paper itself).

The vectorized version:

In microbenchmarks, using simd for selection (picking the rows for which there is a corresponding 1 in the selection bitvector) led to speedups of 8.4x. But in TPC-H Q6 they observe only 1.4x speedup. With multiple filters, all filters after the first are operating on very sparse vectors. When the input vector is only 50% full, both simd and scalar versions are dominated by cache misses and perform the same.

For computing hashes, simd delivers 2.3x speedup. But for looking up hashes in the hashtable cache misses dominate again. By the time input data reaches 10s of mb, the scalar and simd implementations perform the same.

These simd experiments were done by hand. When they relied on compiler auto-vectorization instead they saw no speedups.

They note that simd may be more useful for lightweight compression.

Both systems scaled equally well to 20 threads.

The above numbers are all in-memory. Reading data from an ssd instead reduced the performance difference between the two.

Aside from the experiments above, they note that:

The approaches can be mixed. HyPer compiles queries but uses pre-compiled vectorized functions for scanning compressed columns. Peloton introduces materialization boundaries in compiled code to allow use of prefetching.

They don't give much attention to the problem of compilation latency, but in the next paper they ended up building their own low-latency compiler.

Tidy Tuples and Flying Start: fast compilation and fast execution of relational queries in Umbra

Their previous database HyPer, now used in Tableau, sometimes hits multi-second compile times for queries when using the LLVM backend.

Their new compiler has lower compile times than their previous bytecode interpreter but has runtime performance almost as good as their LLVM-based compiler. On TPC-H it outperforms postgres/duckdb/monetdb even on small datasets (albeit using 20 threads), and outperforms their previous compiler except on the largest datasets.

On TPC-H vs LLVM O3 they trade 108x faster compilation for 1.2x slower runtime. Compile times get worse with the size of the query - on a 2000 table join they get ~0.03s vs ~150s for LLVM.

'Tidy tuples' takes query plans to IR. 'Flying start' takes IR to native code.

Tidy tuples is written in abstract interpreter style. While it's built up in many layers, each layer is push-based - there is no materialization of intermediate IR. The api reflects ir/sql types into the cpp type system.

The IR is SSA. Its generation is mostly append-only per-block. The representation uses only three contiguous allocations, listing functions, basic blocks and instructions respectively. Instruction format is: opcode, result-type, operands (32 bit offsets).
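
A rough sketch of that layout (what exactly the 32-bit offsets index, and the field widths, are my assumptions):

#[derive(Clone, Copy)]
enum Opcode { ConstI64, Add, CmpLt, Branch, Return }

#[derive(Clone, Copy)]
struct Instruction {
    opcode: Opcode,
    result_type: u8,    // index into a small table of IR types
    operands: [u32; 2], // 32-bit offsets of earlier instructions (SSA values)
}

#[derive(Clone, Copy)]
struct Block {
    first_inst: u32,
    inst_count: u32,
}

#[derive(Clone, Copy)]
struct Function {
    first_block: u32,
    block_count: u32,
}

// The whole program is three contiguous allocations.
struct Program {
    functions: Vec<Function>,
    blocks: Vec<Block>,
    instructions: Vec<Instruction>,
}

impl Program {
    // The common path: append an instruction to the block currently being built.
    fn push_inst(&mut self, block: u32, inst: Instruction) -> u32 {
        let id = self.instructions.len() as u32;
        self.instructions.push(inst);
        self.blocks[block as usize].inst_count += 1;
        id
    }
}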

The IR building does constant-folding and -deduplication at build time. A later pass does dead code elimination.

Checked arithmetic, array/struct access, null checks and a few other db-specific operations get dedicated IR instructions.

Flying start takes one pass to analyze value lifetimes (using the algorithm described here) and one pass to generate code.

Register allocation prefers to give registers to values in the mostly deeply nested loops. Replacing this heuristic with Linear Scan hurt compile times more than it helped runtimes.

The lifetime analysis supporting the register allocation heuristic costs 45% of their runtime. (I wonder how this would compare to the greedy strategy used in wizard above, which is always willing to spill old values for new.)

The getelementptr instruction is folded into load/store instructions to produce a single mov. Comparison and branch instructions are fused where possible eg if (x < y) ... can become a cmp followed by jl.

They use asmjit to emit native code.

LOC:

I'm not surprised that the sql translation layer is where most of the effort went.

Bringing Compiling Databases to RISC Architectures

Adding an arm backend to the above compiler.

The IR doesn't support aggregate types. Instructions which need to return multiple values are paired with 'ghost instructions' to retrieve values other than the first. This produces fewer instructions overall and makes isel easier later.

Notes that the linear scan lifetime analysis works because they avoid ever producing irreducible loops.

Register allocation is easier on arm: more general purpose registers, more 3-operand instructions.

Had to tweak IR to deal with weaker memory ordering and aligned accesses.

Isel has to fuse IR instructions to take advantage of arm instructions like multiply-add.

The arm backend adds another 8 kloc. Compare to their LLVM backend at 2.3 kloc or their IR interpreter at 2.6 kloc. Not a huge amount of extra effort for a much better point in the compile-time runtime tradeoff space.

Interesting that their interpreter isn't terrible either though - seems well within 5x the runtime of the llvm backend.

They post some impressive benchmarks vs duckdb, but don't mention how many threads they're using to outperform single-threaded duckdb.

On Another Level: How to Debug Compiling Query Engines

A simple and clever way to debug a compiler. When compiling the code they take a recording using rr. When debugging the generated code in gdb, every time the ip changes they ask rr to jump to the point in the compiler recording where that ip was emitted.

Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation

The in-memory format for HyPer. Trying to push both OLTP and OLAP performance.

Data is divided into blocks. Hot blocks are mutable. Cold blocks are compressed and immutable (except for setting 'deleted' flags).

Value layout is columnar-per-block.

The compression method chosen can vary per block, between:

Basic range scans can be evaluated without decompression. Values remain byte-addressable (no bit-weaving) so external indexes can point into the compressed block. This lightweight compression uses ~25% more space than the aggressive compression in vectorwise.

Each block records min/max values for each attribute and a 'Positional Small Materialized Aggregate' - a lookup table mapping the most-significant non-zero byte of a value to the range in the block where it might be found. So for 8/16/32 bit values the PSMA takes up 2/4/8 kb.

Column scans use the psma to reduce the range they have to look at. But in the worst case it's still a vectorized scan of the whole block - not like tree indexes where the worst case is much slower than a full scan.
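
A sketch of a PSMA for 32-bit values, assuming each entry is a (start, end) range of row positions within the block:

#[derive(Clone, Copy, Default)]
struct Range {
    start: u32,
    end: u32, // exclusive; start == end means "no rows with this prefix"
}

struct Psma {
    ranges: Vec<Range>, // 4 * 256 entries * 8 bytes = 8kb for 32-bit values
}

// Index by the most-significant non-zero byte and its position.
fn slot(value: u32) -> usize {
    if value == 0 {
        return 0;
    }
    let byte_pos = 3 - (value.leading_zeros() / 8) as usize;
    let byte = (value >> (8 * byte_pos)) as usize & 0xff;
    byte_pos * 256 + byte
}

impl Psma {
    fn build(values: &[u32]) -> Psma {
        let mut ranges = vec![Range::default(); 4 * 256];
        for (i, &v) in values.iter().enumerate() {
            let r = &mut ranges[slot(v)];
            if r.start == r.end {
                // first row seen with this prefix byte
                r.start = i as u32;
            }
            r.end = i as u32 + 1;
        }
        Psma { ranges }
    }

    // Rows satisfying `column = v` can only occur inside this range, so an
    // equality scan only has to look there (worst case: the whole block).
    fn candidate_range(&self, v: u32) -> Range {
        self.ranges[slot(v)]
    }
}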

Scans are pre-compiled rather than inlined into the compiled query - the different compression schemes would generate too many code paths. Scans generate a vector of matching tuples which the compiled query then loops over tuple-at-a-time.

Within each scan, each predicate is evaluated using simd instructions. The first predicate produces a bitmask, which is turned into a list of absolute positions using an 8-bit lookup table. Successive predicates operate only on the tuples identified by the absolute positions and the resulting bitmask is turned into a shuffle on the list using the same 8-bit lookup table. (The 8-bit lookup table is 8kb which fits in L1). The speedup on a single scan from simd varies between 5x on 8 bit values to ~1.5x on 64 bit values (but bear in mind that compression can reduce the size of values). The speedup for successive scans is much lower - cost is dominated by random access to the matching values.
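
The bitmask-to-positions step, sketched in scalar code (the real table rows are sized so they can be loaded with a single simd instruction; the exact entry layout here is a guess):

// For each possible byte of the bitmask, precompute the positions of its set
// bits (padded) and how many there are.
struct PositionTable {
    positions: [[u8; 8]; 256],
    counts: [u8; 256],
}

impl PositionTable {
    fn new() -> Self {
        let mut positions = [[0u8; 8]; 256];
        let mut counts = [0u8; 256];
        for byte in 0..256usize {
            let mut n = 0;
            for bit in 0..8u8 {
                if (byte & (1 << bit)) != 0 {
                    positions[byte][n] = bit;
                    n += 1;
                }
            }
            counts[byte] = n as u8;
        }
        PositionTable { positions, counts }
    }

    // Expand a predicate bitmask over a block into absolute row positions.
    fn expand(&self, bitmask: &[u8], out: &mut Vec<u32>) {
        for (i, &byte) in bitmask.iter().enumerate() {
            let base = (i * 8) as u32;
            for j in 0..self.counts[byte as usize] as usize {
                out.push(base + self.positions[byte as usize][j] as u32);
            }
        }
    }
}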

On a dataset of 15m records, a scan could cost as much as 7000x more than a point index lookup. But if the data happened to be sorted by the attribute in question this drops to 17x using the min/max values alone, or 4x using the PSMAs.

Seamless Integration of Parquet Files into Data Processing

Want to run queries directly over remotely-stored parquet files without any conversion step.

The parquet format is flexible enough that writing the same data with different libraries can produce wildly different files eg varying between 20-33gb for the same data.

Reuse the db buffer manager to cache requests against the remote file.

Parquet files can contain useful metadata, but these are optional so the db needs a good fallback. Every time they touch a chunk of a parquet file, they update SMAs (for faster column scans) and hyperloglog sketches (for better query planning). So initial queries are typically whole-table scans but future queries can be smarter.

Mainlining databases: Supporting fast transactional workloads on universal columnar data file formats

Authors wanted to build an OLTP database with close-to-zero-cost interop with other tools that produce and consume Arrow.

(Why? They try exporting a 60m tuple table from postgres to pandas and it takes 284s. In their Arrow-based db it takes 7-70s depending on the mutation rate.)

So just use Arrow internally? Problem is that Arrow is not well suited to mutable data. And also doesn't have anywhere to store row versions.

Cold blocks are just Arrow.

For hot blocks, they store pointers to version chains in an extra Arrow column which doesn't get exported. Each transaction has an undo buffer - the version chain contains pointers into this buffer. The buffer stores values row-wise so when querying hot blocks the query engine has to materialize everything into rows.

While hot, pointers to variable length data (eg strings) are allowed to point outside the block itself. Some additional metadata is needed to mark the gaps left by old data in the original block. When a hot block is converted to cold all the latest values are repacked in contiguous memory.

Compacting hot blocks is expensive, so where possible they fill gaps by moving individual tuples in a transaction.

Compaction is potentially racy so it's modelled as a transaction itself.

Cold blocks can be exported directly. Exporting hot blocks requires materialization.

A Deep Dive into Common Open Formats for Analytical DBMSs

Comparing Arrow(/Feather), Parquet and ORC for internal database format (as opposed to converting at ingestion time).

Arrow and ORC require data to be fully loaded before queried. Parquet can stream data.

Encoding (without further compression): Parquet produces substantially better compression ratios on all data types. Results vary depending on the dataset, so allowing multiple compression methods like Data Blocks should dominate. Parquet does better than ORC/Feather on strings mostly because of its larger chunk size.

Applying compression to the encoded format mostly reduces the differences. Feather is the exception because 'it lacks encoding support for integers' - not sure what that means.

Serde perf: They measure using Arrow as the in-memory format, which seems like it gives Feather an unfair advantage. Deserialization is fastest for uncompressed Feather, since there's nothing to do, but this is totally defeated by having to read more data from the disk.

Projection: Feather does poorly because it has to read the entire chunk before projecting a single column. Except for string columns, where the cost is outweighed by the advantage of not having to dictionary-decode. Feather also spends way too much time in locking.

Selection: Only Parquet can skip individual rows, but the performance advantage is outweighed by its slow deserialization.

They get substantial improvements on query perf (~100x) from modifying the access apis to allow querying data without decoding.

I'd have liked to see the various design choices compared directly, without the massive complication of different compression strategies obscuring all the results.

Lessons:

Basically just read the Data Blocks paper.

Cloud-Native Transactions and Analytics in SingleStore

Trying to keep a bunch of specialized databases in sync is a pain in the ass. Would be nicer to have a single database that handles many workloads well.

Queries compiled to bytecode and then llvm. Starts out interpreting bytecode and swaps to compiled code when ready.

In-memory data in a row-oriented skiplist. MVCC for readers. Row locks for writers. Transaction log and background snapshots.

On-disk data in a column store. Batches of rows. Allows different encodings per segment. Metadata (including tombstones and SMAs) stored in-memory. Storing tombstones in-memory avoids the need to merge keys when reading from the lsm, which would suck for OLAP-style scans. Encodings all allow positional access (except LZ4, presumably).

On-disk segments form an LSM-tree with the in-memory store as level 0.

Optionally, cold data is periodically moved to a blob store. Since blob storage is cheap, they can typically keep months of snapshots.

In cloud setting, default is to not fsync - losing the node usually means losing the disk too.

Some query operators can be evaluated on encoded data.

Secondary indexes are two-level:

  1. A per-segment index maps values to a list of offsets.
  2. A global index maps hashes of values to the entry in the per-segment index. (To guard against hash collisions they check the value in the per-segment index).

The global index is itself an LSM-tree where each level is a hash-table. When segments are deleted, readers skip them. The lsm merge eventually removes references to deleted segments. This takes secondary index maintenance out of the critical path for writes.
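
A toy sketch of the two-level lookup, with hash maps and sorted vectors standing in for the real hash-table LSM and per-segment structures:

use std::collections::HashMap;

type SegmentId = u32;
type RowOffset = u32;

// Per-segment index: sorted (value, offsets-within-segment) entries.
struct SegmentIndex {
    entries: Vec<(i64, Vec<RowOffset>)>,
}

// Global index: hash(value) -> candidate entries in per-segment indexes.
struct GlobalIndex {
    by_hash: HashMap<u64, Vec<(SegmentId, usize)>>,
}

fn hash(v: i64) -> u64 {
    // stand-in hash for the sketch
    (v as u64).wrapping_mul(0x9E3779B97F4A7C15)
}

fn lookup(
    global: &GlobalIndex,
    segments: &HashMap<SegmentId, SegmentIndex>,
    value: i64,
) -> Vec<(SegmentId, RowOffset)> {
    let mut out = Vec::new();
    let Some(candidates) = global.by_hash.get(&hash(value)) else { return out };
    for &(segment_id, entry_idx) in candidates {
        // Deleted segments are simply skipped until the LSM merge removes them.
        let Some(segment) = segments.get(&segment_id) else { continue };
        let (stored_value, offsets) = &segment.entries[entry_idx];
        // Check the actual value in the per-segment index to guard against
        // hash collisions.
        if *stored_value == value {
            out.extend(offsets.iter().map(|&o| (segment_id, o)));
        }
    }
    out
}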

Since lookups in the global index just produce a list of per-segment index entries, lookups against multiple secondary indexes can be combined easily even before looking at the segments themselves.

Multi-column indexes can be built on top of single-column indexes with another layer of indirection.

Secondary indexes can be used to enforce uniqueness constraints.

To update or delete a row, first it's moved in-memory. While the row-lock is held it's safe to mutate the corresponding tombstone bit in the metadata. (This design requires that all ongoing transaction data fits in memory. The transaction system couldn't be used for migrations as in the papers below.)

Query planning is adaptive. If probing the secondary indexes produces too many results then it will fall back to checking each segment. Filters are planned per-segment by timing the results on a small sample. Also the filters whose sampled selectivity is lowest are run first.

On both TPC-C and TPC-H they outperform their anonymous peers.

(Skipping the details of replication).

BullFrog: Online Schema Evolution via Lazy Evaluation

Represent the result of the migration as a view over the old database. While the migration is progressing, queries over the new table are converted into queries over the old table and the relevant tuples are migrated immediately. Any writes can then be performed against the migrated tuples.

I didn't go further into the implementation details because it seems really hard to predict which queries will be efficiently answerable through the view ie for which the set of relevant tuples in the old table isn't enormous.

Online Schema Evolution is (Almost) Free for Snapshot Databases

The basic idea is to leverage MVCC. Store the schema in tables, and then schema changes can be done via regular transactions.

Done naively this would trash performance. A migration of a single table would conflict with any writes to that table. It's bad enough if the write gets aborted, but aborting and retrying an entire migration could be incredibly expensive. Also the write set for a migration would be huge.

Usually in MVCC each table has an indirection array mapping record ids to their version chain. In Tesseract, each schema version has its own indirection array. While the migration is underway, other writes operate against the previous schema version. When the migration completes, the transaction log is scanned to find any writes that it missed.

To ensure that the log scanning doesn't trail behind forever, the migration first publishes a tentative version of the new schema and indirection array. Then it only has to process log entries between when migration started and when it finished with the old indirection array. (Concurrent transactions must check both the old and new indirection arrays to make sure they don't read rows which have been modified in the old but not migrated to the new).

Rather than maintaining a write set, the migration always reads the latest versions it can find and then copies their timestamp into the new indirection array.

Microbenchmarks on a toy implementation look reasonable.

ZNG Specification

The row format for zed.

Roughly a superset of json, with first-class types.

The binary format includes an inline encoding of types. Later values of the same type can just refer to the type id.

A stream is a sequence of frames. Each frame contains either types, values or control data.

Frame starts with:

If the compression bit is set, there is a compression header:

Types frames contain a list of type definitions. Type ids are variable length integers. Primitive types have predefined ids.

Values frames contain, uh, values. Each value starts with a type id.

Compound values are encoded as a tree of primitives. This tree can be decoded without knowing the type: Each value starts with a variable-length integer. Either 0 for null or length+1 for other values, where length is the number of nested values.

Sets are sorted in byte-order. No mention of sorting for maps though.

Control frames are application dependent - the zng library passes them through directly.

Ids defined in a types frame are valid until end of stream. Chopping a large stream into smaller streams offers more opportunities for resynchronization, at the cost of sending redundant type data.

First-class types embedded in data must be encoded in place - they cannot refer to type ids in the context or serialization would be context-dependent. (But it already is? The ids in the frame will vary depending on the context).

VNG Specification

The columnar format for zed.

Encodes a stream of heterogeneous values, each of which has a concrete type.

Already have ZNG, the row-oriented binary format. VNG is itself encoded in terms of ZNG.

To convert to ZNG, scan backwards from the end until finding a record with type:

{
  magic:string,
  type:string,
  version:int64,
  sections:[int64],
  meta:{
    skew_thresh:int64,
    segment_thresh:int64,
  },
}

(How do we scan backwards? Is ZNG self-synchronizing?)

This is the trailer. It provides the sizes of the data and reassembly sections.

The data section is divided into segments. Each segment is a stream of primitive (ie non-compound) ZNG values.

There are no row groups. Different columns can cut their segments off at different lengths. This is intended to make scans of low-selectivity types efficient.

The reassembly section lists the unique concrete types of the values (types are first-class in ZNG).

For each type there is a 'reassembly map' of type <any_column> where:

<any_column> = ((
    <record_column>,
    <array_column>,
    <map_column>,
    <union_column>,
    <primitive_column>,
));

<record_column> = |{
  <field_name>, 
  {
    column:<any_column>,
    // RLE bitvector indicating null values
    presence:<segmap>,
  },
}|;

<array_column> = {
  values:<any_column>,
  lengths:<segmap>,
};

<map_column> = {
  key:<any_column>,
  value:<any_column>,
};

<union_column> = {
  columns:[<any_column>],
  tags:<segmap>,
};

<primitive_column> = <segmap>;

<segmap> = [
  {
    offset:uint64,
    length:uint32,
    mem_length:uint32,
    compression_format:uint8,
  }
];

(Why does only record_column have a presence vector? Can arrays and maps not contain nulls?)

No details given on compression formats, but in the code the only option is LZ4.

Finally there is a single list of integers in [0,unique_type_count) determining the type of each value in the stream.

Thoughts:

Heterogeneous data is an interesting challenge. Optimizing for materialization and parallel scans makes us want to group by rows, but optimizing for space and for type-specific scans makes us want to group by type.

Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores

Storage layer for Databricks lakehouse product. (Lakehouse = structured query pipelines over unstructured data lake. Alternative to separate ETL pipeline.)

Structured layer over blob store eg S3.

Even for OLAP need writes for eg GDPR compliance. But blob stores typically provide no cross-key consistency guarantees. Early Databricks tried to get away without ACID but consistency and performance problems were half of their support workload.

Basic idea: Store WAL and metadata checkpoints in the blob store too. Background process compacts the log into immutable data files.

For best perf on blob stores, need large sequential io and lots of parallelism.

Data stored in parquet. WAL is json but checkpoints are parquet.

For (possibly stale) reads:

  1. Read last checkpoint.
  2. List journal entries since last checkpoint.
  3. Read journal entries from last checkpoint to most recent entry in list.
  4. Read from data objects.

Due to eventual consistency, some reads in 3 and 4 might have to wait until underlying objects become visible.

For writes, first write all data to new files. Then atomically CAS a new journal entry (or retry transaction on failure). Some blob stores provide a suitable atomic primitive, but on S3 they have to use a separate coordinator. Transaction throughput is poor, but good enough for OLAP.
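
A sketch of the optimistic commit loop this implies; the LogStore trait is a hypothetical stand-in, not the actual Delta Lake API:

// A hypothetical interface over the blob store's journal directory.
trait LogStore {
    // Atomically create the journal entry for `version` iff it doesn't exist yet
    // (the put-if-absent / CAS primitive, or the separate coordinator on S3).
    fn try_put_if_absent(&self, version: u64, entry: &str) -> bool;
    fn latest_version(&self) -> u64;
}

struct CommitConflict;

// Optimistic commit: write data objects first (not shown), then race to claim
// the next journal version, re-checking for conflicts on each retry.
fn commit(
    log: &dyn LogStore,
    mut make_entry: impl FnMut(u64) -> Option<String>,
) -> Result<u64, CommitConflict> {
    loop {
        let version = log.latest_version() + 1;
        // `make_entry` re-derives the journal entry against the latest state;
        // returning None models "this transaction now conflicts and must abort".
        let Some(entry) = make_entry(version) else { return Err(CommitConflict) };
        if log.try_put_if_absent(version, &entry) {
            return Ok(version);
        }
        // Someone else claimed this version number; loop and retry.
    }
}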

They also reuse the WAL as a (high-latency) message bus.

Some schema migrations (eg adding a column) are just a WAL entry and can avoid rewriting old files.

WAL is per-table, so no cross-table transactions.

No mention of garbage collection. Seems tricky.

Photon: A Fast Query Engine for Lakehouse Systems

Query engine on top of Delta Lake.

Written in C++ because they hit limits with GC performance and JIT deopts in the JVM.

Chose vectorized over compiled. Easier to build. Easier debugging/profiling. Better support for adaptive execution. Can still get some benefits of specialization by having manually fused operators for common combinations.

Columnar data grouped by row. Pool of batches for reusing allocations. Variable-length data in a per-batch arena.

Operators switch implementation per-batch, depending on eg whether the batch contains nulls, whether the rows are sparse or dense, whether all strings are ascii-only.

Velox: Meta's Unified Execution Engine

Facebook has too many custom data systems. Typically the frontend and storage engine vary but the execution layers all look similar. Velox is a reusable execution layer.

Fairly standard type system. Serialization format that covers all types. Type system is extensible by clients.

Vectors are an extension of Arrow. More encoding options. Support random writes. Uses Umbra-style strings.

Lazy vectors to avoid eager materialization. (Can't find any details on their representation).

Decoded vector - uniform representation that can be cheaply built from any encoded vector (same as unified vector in duckdb?). Avoids having to specialize all operations over all encodings.

Adaptive execution: Reorder AND/OR to put most efficient filter first. Switch to ascii-only version of string operations.

Experimental support for compiling queries.

Wrapper that takes scalar function and flags for determinism and null behaviour. Produces an efficient vectorized function.

duckdb

Couldn't find much written info so relying on these two talks:

Single-file embedded OLAP database for data-science. In addition to SQL exposes eg r/python dataframe api.

Arrow-based data in-memory. Co-designed with Velox team.

Encodings include a sequence type - useful for auto-increment keys.

Unified vector - values and selection vector - can be produced from any other vector with minimal copying. (Doesn't work for sequence vector?) Vectorized filters produce selection vectors anyway, so this is nice.

Struct/list/map types. Maps stored as list of kv structs. (How do these work with unified vector? Do only individual columns map to unified?)

Switched from pull-based to push-based: Easier to multi-thread (morsel-driven). Can coalesce small vectors between operations. Can share scans between operators (although complicated by projection/filter pushdown). Easier to suspend execution eg for async io or query cancellation.

Most operators not parallel-aware, only sinks and sources.

Sink interface:

Operators return one of NEED_MORE_INPUT, HAVE_MORE_OUTPUT or FINISHED. In the case of NEED_MORE_INPUT the intermediate state is stored locally in the operator.
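
Roughly this contract, sketched in Rust (the real interface is C++ and also threads through global and thread-local operator state):

struct DataChunk; // stand-in for a batch of column vectors

enum OperatorResult {
    NeedMoreInput,
    HaveMoreOutput,
    Finished,
}

trait Operator {
    // Called with one input chunk at a time. Anything that must survive across
    // calls (say, the partially-built hash table of a join) lives in the
    // operator's own state rather than on the call stack.
    fn execute(&mut self, input: &DataChunk, output: &mut DataChunk) -> OperatorResult;
}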

Implement union operator by having two pipelines with the same sink. Schedule Finalize after both pipelines finish. Similarly for full outer join - schedule hash-table probe for nulls before scan of other side.

Two copies of header in data file for atomic writes. Separate WAL. MVCC in-memory only (like singlestore?).

Tables partitioned in row groups. Unit of checkpoints and execution parallelism.

More encoding options on disk. Encoding is per-column per-row-group.

No compaction yet.

Less need for UDFs because of zero-copy between duckdb and r/python/whatever.

Small core. Most functionality in extensions loaded at runtime. Extensions tied to internals, so must be correct database version.

Support for spilling to disk during execution. Aim to degrade gracefully rather than just switching algorithms. Eg radix-partitioned hash-joins.

snowflake

Again not much written info. Relying on https://www.youtube.com/watch?v=bveqnSk15JQ.

Columnar. Vectorized. Push-based.

Codegen via LLVM for serde between workers.

'Semi-structured' data. Dynamically typed data stored in binary blobs. Infers schemas where possible and maps to columnar views.

SMAs for pruning.

'Unistore' - similar to singlestore where transactions work against in-memory rows which are compacted to columnar files.

Amazon Redshift Re-invented

Columnar data. SMAs.

Push-based vectorized scans. Remainder of query is compiled to pull-based C++.

Before hash-table operators, builds a small buffer of tuples that fits in L1. Issue prefetches for probes when tuples enter the buffer so hopefully no cache miss when popping from the buffer.
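
A sketch of the buffering scheme (the toy hash table, buffer size and hashing are made up; the prefetch itself would be something like core::arch::x86_64::_mm_prefetch):

use std::collections::VecDeque;

const BUFFER: usize = 64; // small enough that the buffered tuples stay in L1

fn probe_all(keys: &[u64], table: &[Option<(u64, u64)>], out: &mut Vec<u64>) {
    let mut pending: VecDeque<(u64, usize)> = VecDeque::with_capacity(BUFFER);
    for &key in keys {
        let slot = (key as usize) % table.len();
        // Issue a hardware prefetch for &table[slot] here, as the tuple enters
        // the buffer...
        pending.push_back((key, slot));
        if pending.len() == BUFFER {
            probe_one(table, pending.pop_front().unwrap(), out);
        }
    }
    while let Some(p) = pending.pop_front() {
        probe_one(table, p, out);
    }
}

// ...so that by the time the tuple is popped, its slot is hopefully already in cache.
fn probe_one(table: &[Option<(u64, u64)>], (key, slot): (u64, usize), out: &mut Vec<u64>) {
    if let Some((k, v)) = table[slot] {
        if k == key {
            out.push(v);
        }
    }
}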

Adaptive execution eg choosing bloom filter size at runtime based on amount of data seen.

BtrBlocks: Efficient Columnar Compression for Data Lakes

Everyone is trying to make data warehouse work on open formats, but as we've seen in the papers above open formats have some design issues.

Parquet and ORC have poor lightweight compression, so they require heavyweight compression of entire files, which tanks decompression bandwidth.

BtrBlocks is a format for cold data (eg in S3, rather than for inside the execution engine). Unlike DataBlocks, doesn't try to maintain random access or evaluate queries directly over encoded data.

Uses only lightweight compression. Achieves compression ratios almost as good as parquet/orc+zstd, but with >4x better decompression bandwidth and similar compression bandwidth. Outperforms parquet/orc without zstd on all fronts.

Leaves metadata/statistics/indices to a separate format.

Groups of 64k rows. Encoding can vary per group. Groups are the unit of parallelism.

Supported encodings:

These encodings can be stacked eg start with frequency, using roaring for the most-common bitmap and FSST for the non-most-common values.

To choose encoding:

  1. Collect statistics (min, max, unique count, average run length)
  2. Filter out non-viable encodings.
  3. For each viable encoding, estimate compression ratio using sample of 10 runs of 64 values.
  4. Pick the encoding with the best estimate.
  5. If the output can be encoded further, go back to 1.

Choosing process takes 1.2% of total compression time. By default, the maximum number of recursive encodings is 3.
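
A toy version of that selection loop with a deliberately tiny encoding set (raw / RLE / dictionary) and crude size estimates; the real scheme samples 10 runs of 64 values and has many more encodings:

use std::collections::{HashMap, HashSet};

#[derive(Clone, Copy, Debug, PartialEq)]
enum Encoding { Raw, Rle, Dictionary }

// Crude stand-in for "10 runs of 64 values": up to ~640 evenly spaced values.
fn sample(column: &[i64]) -> Vec<i64> {
    let step = (column.len() / 640).max(1);
    column.iter().step_by(step).copied().collect()
}

// Estimate the compressed size of the sample under each encoding.
fn estimated_bytes(enc: Encoding, sample: &[i64]) -> usize {
    match enc {
        Encoding::Raw => sample.len() * 8,
        Encoding::Rle => {
            let runs = 1 + sample.windows(2).filter(|w| w[0] != w[1]).count();
            runs * 16 // one (value, run length) pair per run
        }
        Encoding::Dictionary => {
            let unique: HashSet<_> = sample.iter().collect();
            unique.len() * 8 + sample.len() * 2 // dictionary plus a small code per value
        }
    }
}

// Greedy cascading choice, capped at 3 levels like the default above.
fn choose_encodings(column: &[i64]) -> Vec<Encoding> {
    let mut chosen = Vec::new();
    let mut current = column.to_vec();
    for _ in 0..3 {
        let s = sample(&current);
        let best = [Encoding::Raw, Encoding::Rle, Encoding::Dictionary]
            .into_iter()
            .min_by_key(|&e| estimated_bytes(e, &s))
            .unwrap();
        chosen.push(best);
        // If the chosen encoding produces an output stream that could itself be
        // encoded further (here: only dictionary codes), loop and choose again.
        if best != Encoding::Dictionary {
            break;
        }
        let mut dict: HashMap<i64, i64> = HashMap::new();
        let codes: Vec<i64> = current
            .iter()
            .map(|v| {
                let next = dict.len() as i64;
                *dict.entry(*v).or_insert(next)
            })
            .collect();
        current = codes;
    }
    chosen
}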

This encoding set was tailored to the Public BI benchmark but also performs well out-of-sample on TPC-H.

The pseudodecimal scheme is novel. Previous work ignored compression of floats since most datasets used decimals, but data science and machine learning workloads tend to use floats.

Discusses various optimizations for decompression.

Notes that testing compression algorithms on synthetic datasets like TPC-H is dubious. Demonstrates that PBI has much more structure and compresses better. Eg contains urls instead of random strings, is less normalized so has fewer foreign key ids.

Their sampling strategy produced results within 2% of optimal for 77% of blocks, and average +3.3% vs optimal overall.

Outperforms several unnamed proprietary databases on compression ratio, but wasn't able to compare compression/decompression bandwidth.

Notes that decompression bandwidth is usually measured in terms of uncompressed bytes per second. But for reading over the network, compressed bytes per second is also relevant. For parquet+zstd this was 25gbps in their experiment, preventing them from saturating the 100gbps network.

Should they just extend parquet? The simd algorithms only explain 17% of their 147% improvement over parquet in decompression, so the cascading encodings are the main improvement. Adding these to parquet would effectively be an entirely new format.

Code is at https://github.com/maxi-k/btrblocks

Filter Representation in Vectorized Query Execution

The output of a filter scan can be represented by either a bitmap or a selection vector.

Selection vectors tend to be cheaper when the selectivity is very low, because iterating over a mostly-empty bitmap is wasteful.

For very cheap operations after a filter, it might be cheaper to ignore the filter and apply the operation to every tuple anyway.

For simd-friendly operations, there are decent gains from combining operations with the filter using either masking or gather ops.

For selection vectors, branching implementations are faster when branch misprediction is <10%.

Overall, selection vectors with partial operations dominated for any operation that doesn't have an efficient simd implementation. For those that do, using bitmaps and ignoring the input filter was sometimes better. But it doesn't look like they compared with eager materialization using simd, which would allow for simd operations without selection logic but at the cost of extra memory traffic.
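
The two representations, and the partial-vs-full choice, in miniature (scalar code standing in for the simd versions):

enum Filter {
    Bitmap(Vec<bool>),   // stand-in for a real bitmask
    Selection(Vec<u32>), // indices of the selected rows
}

// Partial computation with a selection vector: only touch selected rows.
fn add_one_selected(input: &[i64], output: &mut [i64], selection: &[u32]) {
    for &i in selection {
        output[i as usize] = input[i as usize] + 1;
    }
}

// Full computation: ignore the filter and compute every row. Wasted work, but
// branch-free and trivially vectorizable, so it can win when the operation is
// cheap or when most rows pass the filter anyway.
fn add_one_all(input: &[i64], output: &mut [i64]) {
    for (o, &v) in output.iter_mut().zip(input) {
        *o = v + 1;
    }
}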

Dremel: Interactive Analysis of Web-Scale Datasets

Columnar storage of nested types. Each field gets its own column which contains nullable values. Each value in the column also has a repetition level and a definition level.

This isn't great for random access - to evaluate a single path .links.backwards[3] we have to stream through all the records in a block.

During evaluation we can start with scans on the nested columns and then aggregate the rep/def levels into the output of each expression in the query. So while the format is weird it has this closure property - operations on columnar nested data also produce columnar nested data.

The nulls seem crucial here. Eg if we want select count(.links.backwards) we want to be able to produce 0 for records with no backwards links, and we have to keep track of which record those zeroes belong to.
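
A sketch of how those counts fall out of the levels directly, assuming the usual encoding where repetition level 0 marks the start of a new record and only entries at the maximum definition level are actual values:

// Count .links.backwards per record straight from the column's levels,
// without reassembling any records.
fn counts_per_record(rep_levels: &[u8], def_levels: &[u8], max_def: u8) -> Vec<u64> {
    let mut counts = Vec::new();
    for (&rep, &def) in rep_levels.iter().zip(def_levels) {
        if rep == 0 {
            // A new record starts here, even if it turns out to have no
            // backwards links at all (the null entry keeps its place).
            counts.push(0);
        }
        if def == max_def {
            *counts.last_mut().unwrap() += 1;
        }
    }
    counts
}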

CORES: Towards Scan-Optimized Columnar Storage for Nested Records

Want to combine filters on nested columns without record reassembly.

Similar layout to duckdb/velox.

Maintain a bitset for some level. Can efficiently move up and down levels using rollup/drilldown operators.

Can do boolean operations on different levels using rollup/drilldown first, but worth specializing the combination.