0032: undroppable tombstones, forest fuzzer, manifest race, hash_log, zig coercions, zig pointer hops, zig object notation, domain knowledge, built from broken, database internals, papers

at tigerbeetle
- undroppable tombstones
- forest fuzzer
- manifest race
- hash_log
musing
- zig coercions
- zig pointer hops
- zig object notation
- domain knowledge
reading
- built from broken
- database internals
- papers...

I restarted my habit of reading a paper every morning and I'm just lazily dumping all my notes into the end of each devlog. My reading queue is already running low, so I'd love it if people sent me their favourite systems papers :)

undroppable tombstones

I noticed when writing up my notes last month that tombstones shouldn't have been able to accumulate anyway, because in that test there was only one level in the tree.

So I read the code that decides whether or not to drop tombstones and found that due to a one character thinko it always returns false. I printed out the return value in the fuzzer to be sure, but it felt pretty fun to catch a bug just by reading.

I tried in #396 to add some asserts to catch tombstones that make it to the bottom of the tree, but it's surprisingly hard to get right. The tree is allowed to have tombstones in the bottom level, if they were written there when it wasn't the bottom level and then later the tree shrank. I tried tracking the maximum size of the tree but preserving this across crashes and restarts was more effort than it was worth.

forest fuzzer

DJ did some work last month to allow testing different configs. In particular, using smaller table sizes in fuzzers makes it much easier to exercise compaction and resizing logic across all the levels of tree.

The forest fuzzer immediately started crashing, but the surface issues were all bugs in the fuzzer itself.

The first was forgetting to prefetch accounts before modifying them (#370). Space is reserved in prefetch and then used in put, so without prefetch the forest complains that there isn't enough space available in put.

The second was mutating fields of the account that are not allowed to be mutated (#391). The error for this one was less helpful. The in-memory buffers for the lsm have enough capacity allocated for all the ways they could be used by the accounting state machine. Since the state machine can't mutate some fields of accounts it doesn't allocate enough capacity for tombstones in the secondary indexes for those fields. But it took me a while when working backwards from the capacity error and double-checking the capacity calculations to realize what was going on. In (#410) I added some extra asserts that would catch this bug every time, instead of only when enough unique keys are in the workload.

manifest race

With the fuzzer working again we found an actual bug (#390).

Each tree has a manifest - a list of all the tables in that tree, what level each table is on and where it's index block is stored on disk. The manifests themselves are stored in memory for efficient lookups, but they also have to be written to disk for crash recovery.

The on-disk storage is just a log. Changes are appended at the head and the tail is lazily garbage collected. The garbage collector reads a block from the tail and checks each of the entries in that block against the in-memory manifest to check if that entry has been overwritten by a more recent entry. If so the entry is thrown away, otherwise it's added back to the head of the log.

Rather than each tree having it's own in-memory manifest, they all share a single hashtable. Which is fine because it's keyed by index block address and two trees can't be using the same index block address (picture a dark cloud ominously covering the sun as I say this).

The crash is happening during garbage collection. The garbage collector finds a entry and looks up the index block address in the in-memory manifest to see if this entry is the latest version, but finds that index block address isn't in the manifest at all.

thread 11595 panic: attempt to use null value
/home/jamie/tigerbeetle/src/vsr/superblock_manifest.zig:0:0: 0x3128e8 in vsr.superblock_manifest.Manifest.remove_table_extent (fuzz_lsm_forest)

(The function is called remove_table_extent but the garbage collector actually removes the value from the in-memory manifest iff it's the latest version. It's a weird api. But that's not really relevant to the story.)

The fuzzer runs in release mode so that it can check more seeds, so the first thing I do with a fuzzer crash is run the same seed in debug mode to get a proper stack trace. But in debug mode it doesn't crash. I test both release and debug a few more times just to verify that it isn't non-deterministic - it always crashes in release mode and never crashes in debug mode.

So this is going to suck.

I want to figure out where release and debug first diverge, so I enable logging in both and diff the logs. The first difference is really early on - line 1677 out of ~50m. Way earlier than the actual crash, so this might be two separate bugs.

Rather than jumping right into debugging I try to bisect the git history. This is kind of a pain because a) I have to backport all the fuzzer fixes above and b) changes to config or io scheduling change which seeds crash, so I have to run lots of seeds at each commit.

After a whole day I bisect it as far back as DJs config changes. Backporting those would be way too much work so I'm stuck there. All I know is that the bug has existed for a while but wasn't caught until the improvements to fuzzer coverage, which is not super helpful.

So println debugging on the release build seems like the next best option. I spend a lot of time reading the code to learn all the things I explained above and then start printing out all the modifications to the in-memory manifest for index block address 28704 (the one that is missing at the time of the crash).

...
update(258985583951513099105654245986647289974, 28704, 29035, 37)
update(258985583951513099105654245986647289974, 28704, 34187, 14)
update(258985583951513099105654245986647289974, 28704, 34187, 18)
remove(258985583951513099105654245986647289974, 28704, 29035, 37)
update(43895868926173851389892452333004948983, 28704, 24175, 30)
update(43895868926173851389892452333004948983, 28704, 24175, 44)
update(43895868926173851389892452333004948983, 28704, 24175, 47)
remove(43895868926173851389892452333004948983, 28704, 24175, 30)
remove(43895868926173851389892452333004948983, 28704, 24175, 44)
remove(43895868926173851389892452333004948983, 28704, 24175, 47)
remove(258985583951513099105654245986647289974, 28704, 34187, 14)

The above is operation(tree_hash, index_block_address, manifest_address, entry_index). If we group these by tree it's clearer what's happening:

update(treeA, 28704, 24175, 30)
update(treeA, 28704, 24175, 44)
update(treeA, 28704, 24175, 47)
remove(treeA, 28704, 24175, 30)
remove(treeA, 28704, 24175, 44)
remove(treeA, 28704, 24175, 47)

update(treeB, 28704, 29035, 37)
update(treeB, 28704, 34187, 14)
update(treeB, 28704, 34187, 18)
remove(treeB, 28704, 29035, 37)
remove(treeB, 28704, 34187, 14)

For each tree you see some updates, later followed by removes for the same manifest_address/entry_index pairs as garbage collection catches up. But the original ordering is messed up:

Tree B inserts 28704 into the manifest, uses it for a while and then releases it.
The grid garbage collects 28704 and reuses the address in tree A.
Tree A inserts 28704 into the manifest, uses it for a while and then releases it.
Tree A garbage collects the old log entry for 28704 and removes it from the manifest.
Tree B garbage collects the old log entry for 28704 and tries to remove it from the manifest, but it's already been removed.

The problem is that garbage collection of the logs is very lazy, and can lag behind address reuse in the grid.

I came up with a couple of ways to fix this, but in the end the least invasive was to change the manifest key from index_block_address to (tree_hash, index_block_address) (in #407). I'm still not super happy about the whole flow here but it's not on the top of my list of things to refactor.

hash_log

This still doesn't explain why the same seed doesn't crash in debug mode. The fuzzer is single-threaded, doesn't do any io and use fixed seeds for all random decisions. There are a few places in the code where we do different things depending on the build mode, but none of them are reachable from this test. So the most likely candidates are undefined behavior or miscompilation, neither of which I want to leave uninvestigated.

Tracking down the exact point where release and debug behavior diverges seems hard. I can't just compare coredumps because they'll have different pointers returned from allocation, and the compiler is allowed to change the order of memory writes as long as it isn't observable by the code itself.

So first I hope to just get lucky and find the bug by searching for likely candidates:

I don't find any code that iterates over a hashtable (order could be non-deterministic due to random hash seed).
I don't find any code that compares or sorts anything containing pointers or functions (order could be non-deterministic due to virtual memory mappings from the os or aslr when loading code into memory)
The latency in test/storage is decided randomly using some floating point math. For a while I think that this differs between builds, but it turns out to be caused by different order of calls to the rng.
Pending reads/writes in test/storage are queued in a std.PriorityQueue instead of our own priority queue which backports some fixes from zig master. But fixing this does not change the scheduling.
The manifest log briefly exposes uninitialized grid blocks, but they doesn't seem to be read and hiding them does not change anything.

So my hope-to-get-lucky search results in a bunch of miscellaneous improvements, but does not fix my bug.

The only approach I have left is grinding. I've already used logging to find an upper bound on the divergence point, so I try adding more logging to bisect the time dimension.

Doing this diffing by hand is kind of tedious and repetitive though, so after a few hours I decide to automate it. I wrote a tool called hash_log that allows emitting hashes anywhere in the code, and depending on the setting either write those hashes to a log, or asserts that they match the hashes in an existing log. So I record the hashes in release mode and check them in debug mode, or vice verse. This lets me crash both builds at the first point that the hashes diverge and then I can compare the state in gdb.

To begin with I emit hashes of every read/write start/completion, including the grid block and all the arguments. This gives me a really early divergence point in grid.write_block during compaction.

# release
>>> p zone
$2 = grid
>>> p offset_in_zone
$3 = 32768
>>> p buffer
$1 = {
  ptr = 0x7f2451bc4000 "\266\035T\353Ai\301B9\222qa\225\314ک\005\356\bF\300\205\341\247\031F\"\326\fJ\fb",
  len = 4096
}

# debug
>>> p zone
$2 = grid
>>> p offset_in_zone
$3 = 32768
>>> p buffer
$1 = {
  ptr = 0x7f9ce5101000 "?\267C\003|\211{\333&\032c\301\204$\340\216\204\260b\351Hy\341ӵ\337\351\255A7\304l",
  len = 4096
}

Both are writing to the same address, but the data they are writing differs.

I print out all the compaction and merge iterator state and both match exactly. So the compacted data in the buffer should match.

Printing out the block type tells us that this is a bloom filter block. According to the docs the layout of the buffer should look like this:

/// Filter block schema:
/// │ vsr.Header │ operation=BlockType.filter
/// │ […]u8      │ A split-block Bloom filter, "containing" every key from as many as
/// │            │   `filter_data_block_count_max` data blocks.

After much googling I manage to get the cast syntax correct in gdb and print out the headers:

# release
>>> p (struct 'vsr.Header') *buffer.ptr
$17 = {
  checksum = 225775601449524079601983905606000713142,
  checksum_body = 130328153064544020489453123034442952197,
  parent = 0,
  client = 0,
  context = 0,
  request = 0,
  cluster = 32,
  epoch = 0,
  view = 0,
  op = 9,
  commit = 0,
  timestamp = 0,
  size = 4096,
  replica = 0 '\000',
  command = block,
  operation = (root | register),
  version = 0 '\000'
}

# debug
>>> p (struct 'vsr.Header') *buffer.ptr
$6 = {
  checksum = 189914190582483452734140919832484493119,
  checksum_body = 144575434465226147614775280795209871492,
  parent = 0,
  client = 0,
  context = 0,
  request = 0,
  cluster = 32,
  epoch = 0,
  view = 0,
  op = 9,
  commit = 0,
  timestamp = 0,
  size = 4096,
  replica = 0 '\000',
  command = block,
  operation = (root | register),
  version = 0 '\000'
}

Differing checksums, but the rest of the header matches. The checksums are set right before writing and I don't see how that code could be wrong, so the problem is probably the data itself.

# release
>>> p (char [1024]) *(buffer.ptr + sizeof(struct 'vsr.Header'))
$19 = '\252' <repeats 15 times>, "\256\252\256\252\252\252\252\252\252\252\252\252\272\252\252\252\252\252\252\256", '\252' <repeats 11 times>, "ꪪ\252\252\252\252\256", '\252' <repeats 47 times>, "\272\252\252\252\252\252\252\253\252\252\252\252\253\252\252\272\252\252\252\252\252\252\272\252\252\252\256", '\252' <repeats 420 times>...

# debug
>>> p (char [1024]) *(buffer.ptr + sizeof(struct 'vsr.Header'))
$7 = " \000\000\000\200\000\000\000\000\002\000\000\000\000\000\004\000\004\000\000\000\000 \000\000\000\000\020\000\000\002\000 \000\004\000 \000\200\000\202\000\000\000\000\000@\002\240\000\000\000\000\004\000\200\000\002\b\000\000\000\202", '\000' <repeats 34 times>, "\b\000\000\000\020\000\000\000\000\000\002\001\000\000\000\000\001\000\000\020\000\000\000\000\000\000\020\000\000\000\004", '\000' <repeats 417 times>...

The debug data is what I expect from a low-occupancy bloom filter - mostly zeros with some bits set. The release data looks like junk.

So I look at the code that creates these bloom filter blocks and I can see where it sets the bits, but not where it initializes the block to 0.

I did recently make the grid initialize blocks before handing them to writers, but only during tests.

if (constants.verify) {
    std.mem.set(u8, completed_write.block.*, undefined);
}

Usually in debug and release-safe builds setting a variable to undefined writes 0xcc to all the bytes of that variable, which is rarely a valid value of anything and leads to crashes quickly if you try to use an undefined value. But it seems like my usage here with mem.set gets optimized out in release builds. Replacing undefined with 0 immediately fixes the divergence.

So the bug was:

In release mode, the grid does not initialize blocks before handing them to writers.
For most block types we overwrite the entire block, but for filter blocks we weren't initializing the filter or writing padding afterwards.
This meant that filter blocks in release mode contained too many set bits, leading to more false positives, which changes the number of reads required in tree lookups, which disturbs all the downstream io scheduling.

This is a dangerous and hard-to-catch bug, so rather than fix just the filters I decided to systematically zero-initialize all blocks on init and at every point of reuse, even in production builds. This costs a lot of memory bandwidth in a zero-copy system, but luckily we're still io-bound at the moment so there wasn't a noticeable perf impact. If I ever have to undo this I'll probably replace the zero-init with some kind of type wrapper that forces overwriting the entire block before allowing reading from the block.

In #412 I also added a hash_log test of release vs debug to CI, to catch any similar bugs in future, and added a lot more state asserts to compaction to double-check that it wasn't ever reusing blocks without going through the grid's zero-init.

I'm pleasantly surprised by how well the combination of hash_log and gdb worked out. I expected this bug to be a lot harder to solve.

zig coercions

Zig has lots of implicit coercions, especially from anonymous literals to named types. Eg in the following code we coerce from the anonymous literal .foo to the FooBar type.

fn foo_bar(b: bool) FooBar {
    if (b)
        return .foo
    else
        return .{ .bar = 0 };
}

But what if we changed the return type of foo_bar to !FooBar (either a FooBar or an error). We can coerce .foo to FooBar and FooBar to !FooBar, but these coercions can't be chained.

./test.zig:7:39: error: type '@typeInfo(@typeInfo(@TypeOf(foo_bar)).Fn.return_type.?).ErrorUnion.error_set!?FooBar' does not support struct initialization syntax
        return .{ .bar = 0 };
                  ^

To make this compile in zig 0.9 we'd have to switch to the fully explicit syntax.

fn foo_bar(b: bool) FooBar {
    if (b)
        return FooBar{ .foo = {} }
    else
        return FooBar{ .bar = 0 };
}

A ton of these coercion papercuts were fixed in zig master in the last few weeks (eg).

zig pointer hops

If you see a.b in zig, the equivalent in c could be either a.b or a->b.

For high-performance code I'm finding that I prefer the c notation. When I see a.b.c.d in zig, I don't know if that's 1 addition (<1ns) or 3 cache misses (300ns?).

zig object notation

In #14390 the zig devs decided to use a subset of zig syntax as notation for config files.

This is the kind of brain-saving notation I was talking about in the shape of data. When you're editing Zig Object Notation you can use the same editor mode, same shortcuts, same string escaping, same number notations etc as zig. They might also support reading it at comptime using @import("data.zon"), in which case working with the config data will feel exactly like working with native structs and enums - no need to learn a separate api.

domain knowledge

This has been on my mind a lot lately.

https://mastodon.social/@danluu/109641478157726177

...having a respect for domain knowledge, domain complexity, meticulously looking at data, etc., is a much more reliable way to find wins as an outsider than being super smart and reasoning from little to no knowledge.

built from broken

Many people start suffering from joint pain in their middle age and are told it's due to ageing or just general wear and tear.

The author points out that:

Joint injuries are anti-correlated with exercise volume in the general population.
In atheletes, injuries are strongly predicted by the ratio of chronic workload (eg last 4 weeks) to acute workload (eg last week).
Non-industrialized populations (eg Hadza) don't suffer from degenerative joint injuries in middle age despite spending many hours per day on their feet as late as 80yo.
In healthy animals, tendons are much stronger than muscles and so tendon injuries are very rare.

The understanding of tendinopathy has also changed in the last decade or so:

While muscles get stronger from a wide range of workloads, tendons only respond well to high, repetitive, long-duration loads.
The supercompensation curve for muscles peaks at 1-2 days, but for tendons peaks might be as long as 3-4 days.
When tendons repair, the collagen fibers measure mechanical stress to know which way to align. No stress = poorly repaired tendons.
Synovial fluid soaks into surrounding tissues over time and is forced back into the joint my mechanical stress. No stress = unlubricated joints.
Continually overloading tendons leads to degenerative tendinopathy. Cells die and holes form.

This suggests a different picture than 'wear and tear'. Combining a mostly sedentary lifestyle with a few weekly bouts of high-intensity exercise leads to long-term tendon degeneration. Eventually (ie in your 30s and 40s) tendons are weaker than muscles and start to tear.

The author recommends:

Don't go more than an hour during the day without moving and stretching.
Aim for at minimum a few hours of low intensity movement (eg walking) per day.
Use strength and mobility exercises to correct postural problems caused by too much sitting.
Use heavy slow resistance training to build tendon strength.
Use high-rep endurance training to prevent injuries due to postural collapse when exhausted.

I couldn't find citations for all of the claims but as far as I can tell the actually recommendations are very much in line with the current consensus in sports science.

There's a set of workout plans at the end of the book whic we've been following for a few weeks. We also gave away our sofa, chairs and dining table in favor of a large foam mat and (eventually) some low folding tables.

It's too early to know how well the strength exercises will work, but just eliminating chairs for a month has already lead to a surprising reduction in knee and shoulder pain and a noticable increase in hip flexion.

database internals

I finally finished.

It feels more useful as a survey rather than a textbook. Whenever I hit a topic I didn't know about, I didn't feel like the chapter actually helped me understand it. But having a rough map of research and options is useful for starting a reading list.

Viewstamped Replication Revisited

For such a short paper this was a very long read.

Papers about consensus algorithms are traditionally impractical and full of bugs. This paper is better than most. I've been told of two bugs - one in replica recovery and client recovery (unfortunately no citations for either). But the core algorithms have stood up to a lot of fuzzing so far.

Normal processing is basically 2PC (assuming that replicas are processing log entries only after commit) but the leader can reply to the client after the 1st phase.

Dueling leaders are avoided in a neat way. The leader is determined by view_number % replica_count. If two replicas on the same view both initiate a view change they will agree who the new leader is. If two replicas on diferent views both initiate a view change, the later view will win. On the downside, there is no guarantee that the new leader is up, so we may have to keep initiating view changes until the number increments to a working node.

The replica recovery process requires transferring either the state or the logs. There is a throwaway mention that this optimized by recovering from disk or via out-of-band backups if the state is too large. The recovery process as described also can't handle a whole-cluster shutdown. Handling both of these turns out to be very subtle in practice, especially in the presence of disk errors (eg).

Reconfiguration is handled by a special log entry that, when executed, switches to the new configuration. There is some debate within tigerbeetle whether this is the best solution, most of which I haven't read yet.

A Generalised Solution to Distributed Consensus

Recasts consensus as an append-only algorithm summarized by a lattice. The main contribution here is actually allowing the acceptable quorums to vary per round to produce various new and existing variations of paxos.

The lattice part seems under-used. Suggests some ideas that I'll explore in the next few months.

Dissecting, Designing, and Optimizing LSM-based Data Stores

Very thorough introduction. Too dense to summarize.

Constructing and Analyzing the LSM Compaction Design Space

Breaks compaction policies down on 4 independent axes. Shows that these choices hugely impact throughput and latency, and that there isn't a single best choice across all workloads.

Tiering has particularly poor tail latency.

SILK: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores

Demonstrates that lsms have high tail latency when memory buffers fill up and the flush gets queued behind other compaction work.

Prioritizing flushes and level 0 compactions in the io scheduler massively reduces tail latency and increases how long they can survive workload spikes without producing latency spikes.

Don't stack your Log on my Log

Many databases are log-structured. Many filesystems are log-structured. SSDs internally are log-structured (to hide the fact that they can't erase individual blocks). These logs interact poorly.

Garbage in a higher log still looks like valid data to a lower log, so eg your ssd garbage collection might waste time moving data that your filesystem is waiting to throw away.

Garbage collection requires some space to be reserved for moving. If each layer is reserving eg 50% overhead then the total overhead is 238% (1.5 ^ 3 = 3.375).

If the block sizes for two logs don't match this leads to fragmentation in the lower log, causing space amplification and write amplification.

The desired benefit of log-structured systems is sequential writes. But if the level below is also log-structured it might scatter those writes into available open blocks.

Apart from matching the block size there isn't much we can do as database developers until richer device interfaces appear - the core problem is that ssds are pretending to be 4kb block devices for backwards compatibility. But getting consensus on new standards will take a long time.

An Analysis of Data Corruption in the Storage Stack and An Analysis of Latent Sector Errors in Disk Drives

Two studies of disk errors in a 1.5m disk deployment. These are disks that are inspected and tested before deployment, so presumably better than average.

Checksum mismatches affected 0.042% of enterprise disks per year. Latent sector errors affected 1.4%.

Errors can be permanent or transient.

Errors show high temporal and spatial locality. This implies that when storing multiple copies of an important data-structure, the writes should be spread out in space and time to avoid all being eaten by the same error.

We are aware of actual cases when the disk firmware replied successfully to a write that was never written to stable media.

These are studies of non-ssd, non-nvme disks. I have some more recent studies queued up.

Redundancy Does Not Imply Fault Tolerance

Distributed data systems store many redundant copies of data. So if there is a fault on one drive, the system should be able to recover the data from other nodes. Right?

Right?

The authors test many systems and find that in most of them a single disk error in the right place can cause data loss or whole-cluster crashes.

Redis doesn't checksum data at all. If there is an error on the leader it will be replicated to followers rather than repaired. Since corruption in metadata can lead to crashes, replicated corruption can cause all replicas to crash.

ZooKeeper detects most corruptions and responds by crashing and repairing from the new leader. But some corruptions can cause a partial crash in the leader - the cluster stays up but never accepts new writes.

Cassandra only detects corruptions if compression is enabled. Corrupted data can be replicated to other nodes. Corruption can cause a variety of crashes and silent errors.

Kafka leaders incorrectly truncate corrupted logs. Their followers try to do the same, but hit an assertion and die.

RethinkDB sometimes silently returns corrupted data and will never recover it.

MongoDB detects corruption and crashes but doesn't always attempt to repair. In one weird edge case it returns incorrect query results.

LogCabin detects corruption, crashes and repairs from the new leader.

CockroachDB mostly detects corruption and crashes, but doesn't attempt to repair. But sometimes corruptions cause data loss, incorrect query results, whole-cluster crashes etc. 'Overall, we found that CockroachDB has many problems in error handling'.

The lesson: you have to test disk faults. You will not get it right by just sprinkling some checksums and hoping for the best.

Protocol-Aware Recovery for Consensus-Based Storage

A nice counterpart to the above. Demonstrates how to write a replicated state machine which recover from disk errors gracefully.

Separates persistent state into:

Log (operations for state machine)
Snapshots (of state machine state)
Metainfo (data unique to this node)

Errors in each are handled differently:

Metainfo - small enough to just store multiple copies on the node.
Snapshots - store hashes elsewhere, repair corruption by asking other nodes.
Logs - store small entries that are checksummed and also point to full entries elsewhere.

Errors in the log are tricky. A mismatched checksum could mean that the entry was corrupted, or that the node crashed while writing it. These two case can be distinguished because the small entries fit in a single block (no torn writes).

Corrupted entries are trickier yet. If the entry was committed, we must recover it. But this can be slow - quorum intersection guarantees that at least one node has the entry, but that node might be us and we corrupted it! If the entry was not commited, we would prefer to discard it. So there are three cases:

We hear from one node that has the uncorrupted entry -> repair.
We hear from a quorum that have not seen that entry, not even corrupted -> discard.
We hear from all nodes and neither of the above is satisfied -> cluster down.

Snapshots are synchronized in a similar way to VSR reconfiguration. Old snapshot chunks are eventually garbage-collected, so in the worse case recovery requires transferring an entire snapshot.

The algorithm was tested with both fuzzing and several model-checker implementations.

Simple Testing Can Prevent Most Critical Failures

Sampled 198 failures from Cassandra, HBaase, HDFS, MapReduce and Redis. Classified 48 of them as catastrophic. Of those 48:

92% were caused by incorrect error-handling
58% would have been caught by full statement coverage
35% were caused by error handlers that did nothing, aborted the entire process or contained comments like 'TODO' or 'FIXME"

Also 43% could be reproduced on one node, 86% on two nodes and 98% on three nodes. 74% were deterministic.

I think much of this can be explained by the fact that developers are typically trained on ephemeral, mostly stateless, single threaded systems which may crash in response to external faults. Moving to long-lived, stateful, distributed systems which are expected to tolerate faults requires very different techniques. But these are becoming more widely known - I expect that an analysis of bugs for eg foundationdb would not find the same pattern.

XFT: Practical Fault Tolerance Beyond Crashes

Crash Fault Tolerant systems assume that disks, memory and the network never corrupt data. Byzantine Fault Tolerant systems handle data corruption but also arbitrary adversarial control of the network, and as a result are too expensive to operate.

Cross Fault Tolerance requires safety and liveness so long as a quorum of nodes are non-byzantine and synchronous. Strictly stronger than CFT. Neither weaker nor stronger than BFT.

Their XFT Paxos implementation has the each member of the quorum independently calculate the new log head during view change, rather than trusting the leader.

They don't mention fuzzing or model-checking, so I assume it's incorrect.

Practical Hardening of Crash-Tolerant Systems

Data corruption in the control plane can be very dangerous - often propagated instead of detected.

Mentions that ECC errors are higher than previously expected, but from a brief read their citations (1, 2) suggest that these errors are almost always detected and so don't contribute to non-crash failures.

Their wrapper library protects against memory errors by storing two copies of all in-memory data, executing every operation at least twice and inserting additional checks of control-flow integrity. This seems wildly expensive and also wouldn't protect against most of the example failures they use as motivation, which are largely operator error and disk failures.

Can Applications Recover from fsync Failures?

TLDR no.

Most filesystems respond to fsync failures by marking dirty pages in the os cache as clean, even though they may not have been written to disk.

Filesystem behavior differs. Depending on the filesystem:

The fsync call which raises an error might not be the one that failed (ext4 in data mode).
The data you wrote may be thrown away or may persist in the cache.
Files may be left with holes, which may or may not be detectable by fsck.

Naturally, none of the databases tested can handle this menagarie. Redis, ever my favorite, doesn't even check the fsync return code.

Crashing and restarting is not a sufficient fix because on restart you're still reading from the cache, which may not reflect the actual contents of the disk. May be sufficient if combined with using direct io for everything, on top of a filesystem that handles errors well.

Given the choice, btrfs is the most well-behaved of the filesystems tested. Ext4 in data mode should be avoided.

Published in 2020, so details may have changed since.

The Case for Determinism in Database Systems

Typically in a replicated database, the leader executes a transaction and ships the resulting changes to the database to the followers. These changes can be significantly larger than the transaction itself.

If your transactions were deterministic then you could just ship the transaction itself to the followers and have them all execute it independently. This is easy if you just single-thread everything, but then you lose the benefits of concurrent execution (eg tolerating long-running transactions without blocking the whole database).

The authors propose a scheme for concurrent execution that is deterministic - every node will commit transactions in the same order.

The benchmarks compare against their own implementation of 2pc and there is no single-threaded baseline.

Evolution of Financial Exchange Architectures

New architecture - deterministic state machine, replicated by some consensus algorithm.

Tuned for low p100 latency and high throughput. Custom data-structures. Binary codecs, readable in place, zero-copy.

Spectre and meltdown mitigations make syscalls, page faults and context switches much more expensive. NVME SSDs and io_uring help. Other new hardware improves latency but not bandwidth, so need much more batching and concurrency.

How I Learned to Stop Worrying and Trust the Database

Testing FoundationDB.

Deterministic code. No threading. Network and disk can be mocked. Fuzzer schedules network/disk from rng with fixed seed. So any crashing seed can be reproduced on any developers machine.

Good for safety bugs. Hard to catch performance or availability bugs this way - rely on chaos testing instead.

See also buggify - bias fuzzing towards the scary part of the state space by occasionally throwing away safety margins, injecting recoverable errors, adding extra latency, randomizing config etc.

Kangaroo: Theory and Practice of Caching Billions of Tiny Objects on Flash

Facebook wants to cache lots of small objects (100s-1000s bytes) on flash because it's much cheaper than dram.

Bitcask-style logs use too much dram for the index. Set-associative caches have too high write amplification.

Put a small log in front of a large set-associative cache to act as a write buffer.

On evicting from the log, evict all keys that match the same set and write them all at once. If there aren't enough keys to make it worth writing to the set, either reinsert into the log or just throw the key away.

Use a clock cache within each set to avoid having to store eviction metadata in dram.

The log itself is segmented to amortize pointer size. The dram index for the log exploits deliberate hash collisions to efficiently retrieve all keys of the same set.

The result has strictly better miss ratio and write amplification than existing designs, especially when the ratio of dram to flash is low - facebooks hardware is moving in this direction.