0034: perf handover, compaction unchained, crash harder, sketching the query engine, focus catchup, android update policies, legopunk, a world without email, nobody cares, segcache, bloomRF, existential consistency, ssd parameters, fantastic ssd internals

Published 2023-03-31

The talks from systems distributed will be aired live over the next month or so. You can see the schedule here.

perf handover

Pretty soon I'm going to have to switch from working on performance to designing the query engine. I figured the highest leverage thing I could do with my last few weeks is to make it easier for whoever is working on performance next to focus on the most important problems.

I started by massively expanding tracing in #552 and #562. We now track queue depth and read/write latency throughout each part of the io pipeline, hit/miss counts for all caches and filters, and spans for every async task.

Having all of these on the same timeline, with easy searching, filtering and aggregation, makes it so much easier to understand the behaviour of the system by correlating different events. I spent a couple of days just browsing around in tracy and spotted a bunch of promising problems.

Between tracy and perf I assembled fairly convincing proof that we're not remotely io-bound. I don't have a solid enough understanding of cpu architecture to distinguish between being cpu-bound and memory-bound, but whether I sample by cpu cycles or by cache misses in perf we find the same bottlenecks - blake3 (hash-chaining to detect disk errors), wyhash (hashtables and bloom filters), sorting (converting in-memory buffers to on-disk data blocks). (Merging blocks during compaction was also on this list initially, but see below.)

It's difficult to disable the blake3 hashing because we rely on it for ids, but switching from the zig implementation of blake3 (~600mb/s) to the highly-tuned c/asm implementation (~6000mb/s) produced a 17% throughput boost! We probably won't merge that because it complicates cross-platform builds, but it definitely demonstrates that we have a lot to gain from improving the zig implementation.

I also tried increasing the default disk cache size to the size we expect to use in production, so that our other performance work isn't confounded by poor cache performance, but this is currently in limbo pending some decision about how the non-default cache size will be configured.

Lastly, I noticed in benchmarks with low load that the batch latency never falls below 5ms. Turns out that during my earlier changes to the benchmark I also introduced a minimum 5ms sleep between batches. Embarrassing. Fixing this required converting the entire benchmark to continuation-passing style, inverting most of the logic and adding ~100 lines of boilerplate. A really clear demonstration of why async/await is such a popular feature for programming languages.

compaction unchained

There are two major tasks in tigerbeetle. In the foreground, incoming operations are executed against a snapshot of the database and mutations are accumulated in new in-memory buffers. In the background, we flush the old in-memory buffers to disk and compact the lsm trees on disk. Every constants.lsm_batch_multiple operations these two tasks synchronize - we pick a new snapshot and swap the old and new buffers.

But both of these tasks are sharing the same thread. To prevent compaction from causing latency spikes, compaction work was divided into fixed-size ticks which were supposed to be divided evenly in the gaps between operations. But in practice, it's actually pretty hard to estimate how many ticks of work we need to do. It depends on how much compaction work is needed for each tree, how much overlap there is between tables on different levels, how many tombstones will be dropped during compaction etc. Also the tick interface was readiness-based - when you call tick it checks to see if enough io has completed that it can do any work. So the number of ticks required per operation wasn't even deterministic.

The overall effect of this was that all the compaction work tended to pile up after the last operation, just before the two tasks synchronize.

My first step in fixing this was to switch from a readiness-based interface to a completion-based interface. Rather than repeatedly calling tick to move compaction work forwards, we just let them run at their own pace. I had planned to add some prioritization to the io system to ensure that operations didn't get stuck behind compaction, but our queues tend to be pretty shallow and quickly cleared so this ended up being unnecessary at the moment.
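Roughly, the difference between the two interface styles looks like this (a python toy with invented names and structure - not tigerbeetle's actual api):

```python
# Toy sketch of readiness-based vs completion-based scheduling.
# Names are invented for illustration - not tigerbeetle's api.

class ReadinessCompaction:
    """Caller polls tick(); each tick makes progress only if io happens to
    have completed. The caller has to guess how many ticks to interleave."""

    def __init__(self, io_steps):
        self.io_steps = io_steps  # pretend each entry is an io completion
        self.done = False

    def tick(self):
        if self.io_steps:
            self.io_steps.pop()
        self.done = not self.io_steps


class CompletionCompaction:
    """Compaction drives itself: each io completion schedules the next step,
    and a callback fires when all the work is finished."""

    def __init__(self, io_steps):
        self.io_steps = io_steps

    def run(self, on_done):
        if self.io_steps:
            self.io_steps.pop()
            self.run(on_done)  # in real code this continuation runs on io completion
        else:
            on_done()


# Readiness: the caller owns the loop and must keep polling.
r = ReadinessCompaction([1, 2, 3])
while not r.done:
    r.tick()

# Completion: the work runs at its own pace and reports back.
finished = []
CompletionCompaction([1, 2, 3]).run(lambda: finished.append(True))
```

With the completion-based shape there is no tick budget to estimate - compaction consumes io completions as they arrive and simply announces when it's done.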

As a bonus, I was able to convert all the iterators and the merging to operate over large batches of values instead of individual values, resulting in some really tight loops. Merging completely vanished from the perf profile.

I was frustrated again though by the lack of async/await. Take table_data_iterator for example. 120 lines of code to do this:

for (addresses, checksums) |address, checksum| {
  const block = await grid.read(address, checksum, .data);
  yield block;
}

A lot more work and a lot harder to read. The latter really bites - I spent two days debugging a crash that turned out to be caused by misplacing a single line while converting a complicated state machine from readiness to completion. Without all the cps noise the code might have fit on a single screen and I would be far less likely to misunderstand the control flow.
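For contrast, here is the rough shape of the continuation-passing version, sketched in python with invented names (the real zig code also has to hand-roll buffer management and error handling, which is where most of the 120 lines go):

```python
# The continuation-passing shape of a hypothetical table_data_iterator.
# All names are invented for illustration; the real code is zig.

def read_block(address, checksum, callback):
    # Stand-in for grid.read: pretend the block is just its address.
    callback(f"block@{address}")

def table_data_iterator(addresses, checksums, on_block, on_done):
    state = {"index": 0}  # the loop variable becomes explicit state

    def read_next():
        if state["index"] == len(addresses):
            on_done()
            return
        i = state["index"]
        read_block(addresses[i], checksums[i], on_read)

    def on_read(block):
        state["index"] += 1
        on_block(block)
        read_next()  # explicitly schedule the next step of the state machine

    read_next()

blocks = []
table_data_iterator([1, 2], [0xA, 0xB], blocks.append, lambda: blocks.append("done"))
```

The control flow that async/await would express as a single loop is smeared across three callbacks and a mutable state record - exactly the kind of code where misplacing one line goes unnoticed.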

crash harder

While trying to debug the above I realized that the forest fuzzer only ever tests crashing and restarting the database (eg as if the power failed) when the io queue is empty, which is one of the less scary scenarios. I tweaked the fuzzer in #586 to be able to crash in the middle of an operation. This didn't catch any new bugs - the vopr already had similar power - but it will give me more confidence as I continue to rework compaction scheduling.

sketching the query engine

I started sketching out the design of the query engine.

The constraints are severe - for every query we need to guarantee:

Threading that needle is going to be an interesting challenge.

focus catchup

I've been delaying doing any more work on focus until I upgrade zig, and I've been delaying doing that until I had time to figure out why it causes a weird rendering bug.

But jakubvf kindly sent me a pr for all the upgrades, so I finally got around to looking at the bug. It turned out to be a simple change in alignment/padding which was easy to track down and fix.

I still don't expect any big features in the near future, but with that blocker out of the way I'm back to my old steady stream of tiny tweaks and fixes.

android update policies

My phone stopped getting security updates two years ago, and stopped getting even lineage updates this year (not to mention that doing major version updates on lineage is painful). So I begrudgingly have to throw away a perfectly good piece of hardware.

I found AndroidAuthority has a complete list of the update policies for each major manufacturer. I was able to figure out that the best security-updates-for-money seems to be the pixel 6a - it will get updates until at least July 2027 and open-box models can be found for 300 CAD.

legopunk

In other news, framework announced a ton of new hardware, including a 16 inch laptop with upgradeable gpu and hot-pluggable input devices.

I was initially jealous that the new 13 inch laptop comes with a matte screen, but then I realized that this is framework we're talking about - I can just upgrade to the new screen!

What a contrast to my phone.

a world without email

I'm maybe the wrong audience for this. Cal spends most of the book on arguments that unstructured email/chat make for poor prioritization and fragmented concentration, which I'm already sold on. What time he spends on actual alternatives is mostly devoted to talking about existing practices in software engineering (project boards, issue trackers, customer-support ticket systems etc).

Also misses what I think is one of the stickier contributors to use of chat - workplaces (especially remote) have stripped most social interaction out of the day. Chat is a meagre substitute, but it's the main option. I don't think it's a coincidence that methodologies like xp that focus the most on structured workflows also lean heavily on pair programming.

This was at least a nudge to update my leechblock filters to block slack and gmail for most of the day, as well as the notifications page on github (I still want to be able to review code that's in my queue, I just don't want to be distracted by seeing more stuff arriving in the queue when I haven't emptied it yet).

Idly fantasizing about an iron-like tool for task management, that works offline and only syncs overnight.

nobody cares about our concurrency control research

Academic work on concurrency control focuses on serializable execution of stored procedures. Andy Pavlo surveys DBAs and finds that most rarely use stored procedures or serializable transactions (the most common level seems to be read-committed).

segcache

(Plus blog posts and videos at pelikan.io)

Fun to compare to kangaroo. Despite seemingly similar problems, different priorities lead to wildly different designs.

Segcache focuses on timely expiration and memory efficiency. Achieving both is hard - tracking which keys are about to expire usually requires some data-structure with entries per-key, which means high memory overhead when keys themselves are small. Memcached needs 56 bytes of metadata per key and doesn't guarantee timely expiration. Segcache needs 5 bytes (amortized) of metadata per key, always expires keys by their ttl and achieves a much lower miss ratio in production.

Segcache divides all possible ttls into 'buckets'. Within each bucket, keys are stored in a linked list of fixed-size append-only 'segments', sorted by creation time. This completely prevents internal fragmentation. To expire keys, segcache only needs to check the front of the linked list for each bucket.

When out of space, segcache picks a ttl bucket, takes the first n segments and merges them into 1 segment in a single pass, stochastically dropping the keys with lowest read frequency.
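The bucket/segment layout can be sketched like so (a heavily simplified python toy - real segcache stores raw bytes in fixed-size segments and tracks much more per segment):

```python
from collections import deque

SEGMENT_SIZE = 4  # keys per segment; real segments are fixed-size byte buffers

class TtlBucket:
    """One ttl bucket: a queue of append-only segments, oldest first.
    All keys in a bucket share (approximately) the same ttl, so segments
    are sorted by expiry time as well as creation time."""

    def __init__(self):
        self.segments = deque([[]])

    def append(self, key, expires_at):
        if len(self.segments[-1]) == SEGMENT_SIZE:
            self.segments.append([])  # segments are append-only: open a new one
        self.segments[-1].append((key, expires_at))

    def expire(self, now):
        # Only the front segment can contain expired keys, so expiration
        # never scans the whole cache - just the head of each bucket.
        expired = []
        while self.segments[0] and self.segments[0][-1][1] <= now:
            expired.extend(key for key, _ in self.segments.popleft())
            if not self.segments:
                self.segments.append([])
        return expired

bucket = TtlBucket()
for key in ["a", "b", "c", "d"]:
    bucket.append(key, expires_at=10)
bucket.append("e", expires_at=20)
expired = bucket.expire(now=11)  # pops the whole front segment at once
```

Because expiration only ever pops whole segments off the front of a list, it costs O(1) per bucket rather than anything per-key.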

Keys are indexed in a bulk-chained hashtable. Locking is per-chain (<= 6 keys). For each key the segment id+offset, frequency counter and the first 12 bits of the hash are packed into a 64 bit entry.

The frequency counter is a novel approximation which is accurate for counts < 16 and increasingly less accurate for larger counts. Keys with large counts are always going to be retained anyway, so accuracy is most important when comparing keys which might be evicted. To avoid cache pollution by request spikes the counter is reset to 0 when segments are merged (to approximate a windowed count). To reduce locking costs, the counter is rate-limited by a per-chain last-access-timestamp.
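The entry packing and the approximate counter can be sketched like this (in python; the field widths and increment schedule here are my guesses for illustration, not the exact layout from the paper):

```python
import random

# Pack one hashtable entry into 64 bits: hash tag, frequency counter,
# segment id and offset. Field widths below are invented for illustration;
# the paper packs a 12-bit hash tag plus segment id+offset and a counter.
TAG_BITS, FREQ_BITS, SEG_BITS, OFFSET_BITS = 12, 8, 24, 20  # 12+8+24+20 = 64

def pack(tag, freq, seg_id, offset):
    return (tag << 52) | (freq << 44) | (seg_id << 20) | offset

def unpack(entry):
    tag = (entry >> 52) & ((1 << TAG_BITS) - 1)
    freq = (entry >> 44) & ((1 << FREQ_BITS) - 1)
    seg_id = (entry >> 20) & ((1 << SEG_BITS) - 1)
    offset = entry & ((1 << OFFSET_BITS) - 1)
    return tag, freq, seg_id, offset

def approx_increment(freq, rng=random.random):
    # Exact below 16, then increment with probability 1/freq: large counts
    # get coarser, but accuracy is preserved where eviction decisions happen.
    if freq < 16 or rng() < 1.0 / freq:
        return min(freq + 1, (1 << FREQ_BITS) - 1)
    return freq

entry = pack(tag=0xABC, freq=5, seg_id=12345, offset=678)
```

Keeping the whole entry in one 64-bit word is what makes the per-chain locking and atomic updates cheap.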

bloomRF

Proposes a bloom filter variant that performs well for range queries across a larger range of query parameters than previous designs, and that can be built online.

A regular bloom filter takes k hashes of the key and sets for (0..k) |i| filter[hash(i, key)] = true.

bloomRF instead has each hash(i, key) hash only a prefix of the key. So we might choose eg hash(i, key) = hash(first_n_bits(i * 8, key)). That means we can look up a dyadic range like [0xCC0,0xCCF] by checking whether filter[hash(0xCC)] is set. Arbitrary ranges can be built out of O(k) dyadic ranges.

That's still O(k) cache misses though, so to tame the constant factors we can tweak the hash function. When hashing the first eg 16 bits, we'll actually only hash the first 10 bits and then concatenate the last 6 bits. This means that sequences of 64 contiguous ranges will end up with hashes all in the same 64-bit word, which we can check with a single instruction. So the actual number of cache accesses ends up being something like k+1.

So far pretty simple. But getting good performance out of this requires some amount of tweaking of number of hashes, size of prefixes and size of contiguous ranges that I didn't really follow on the first pass.
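To make the core prefix-hashing idea concrete, here's a toy version in python (using an exact set in place of the bit array so the sketch is deterministic, ignoring the word-packing trick, and with invented key width and prefix lengths):

```python
W = 16                    # key width in bits (toy choice)
LEVELS = [4, 8, 12, 16]   # prefix lengths that get hashed (invented choice)

class ToyBloomRF:
    def __init__(self):
        # A real filter hashes (level, prefix) into a bit array; an exact
        # set keeps this toy deterministic (no false positives).
        self.bits = set()

    def insert(self, key):
        for plen in LEVELS:
            self.bits.add((plen, key >> (W - plen)))

    def may_contain_range(self, lo, hi):
        # Cover [lo, hi] with dyadic intervals, walking coarse -> fine.
        def check(level, prefix):
            plen = LEVELS[level]
            shift = W - plen
            p_lo, p_hi = prefix << shift, ((prefix + 1) << shift) - 1
            if p_hi < lo or p_lo > hi:
                return False  # interval doesn't intersect the query
            if (plen, prefix) not in self.bits:
                return False  # no inserted key has this prefix
            if lo <= p_lo and p_hi <= hi:
                return True   # interval fully inside the query: possible hit
            # Partial overlap: refine into the 16 child prefixes.
            return any(check(level + 1, (prefix << 4) | c) for c in range(16))
        return any(check(0, p) for p in range(1 << LEVELS[0]))

f = ToyBloomRF()
f.insert(0xCC07)
```

Point queries fall out for free as ranges of size one, which is how a single structure serves both lookup kinds.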

The performance evaluation in rocksdb seems promising, with better latency and false-positive ratio for ranges up to 2^36. I'd like to have seen what happens to the graph beyond that range though, since performance looks like it might just fall off a cliff.

existential consistency

Facebook uses a graph database which provides sequential and read-after-write consistency within a single cache and eventual consistency across caches. How much could they reduce errors by making that database linearizable? (They don't consider any stronger consistency model because they can't measure non-local consistency experimentally).

The first experiment samples all reads and writes for a subset of keys. They report a very low percentage of read anomalies.

But they note that the ratio of reads to writes is 450:1 and that changing their error margin for clock skew can increase the number of anomalies by 10x (switching from false negatives to false positives, so the real number is probably bounded somewhere in between). If we combine those, up to 0.00039% * 10 * 450 = 1.755% of writes could cause violations of linearizable consistency. They also note that writes aren't evenly distributed, so it kinda sounds like the actual result here is that conversations on hot posts/photos see many read anomalies but read-only traffic on older posts sees none.

They also note that systems which are sensitive to read anomalies either set flags requiring stronger consistency or use other databases, so what we're measuring here is only those systems which either tolerate anomalies well or just happen to have workloads which don't cause many anomalies. So it's hard to generalize from these results.

The second experiment actively probes agreement between different caches. Disagreement can be caused by latency alone, so the actual percentage of disagreement is not very meaningful. But sudden spikes in disagreement are very strong alarm signals and usually indicate some operational failure.

Both experiments are very cool, even if I'm somewhat skeptical of the interpretation of the first. It's very common to see people handwave about eventual consistency, and very rare to see anyone actually try to measure the impact, let alone systematically monitor it over time.

parameter-aware io management for ssds

SSDs pretend to be simple block devices but are much more complicated internally. You need to understand their internal layout to get the best performance, but manufacturers are secretive about it.

Shows how to experimentally discover internal parameters:

Using these discovered parameters to tune the linux block layer and io scheduler leads to dramatic performance improvements on standard io benchmarks.

This paper is from 2010 and uses SATA though, so specific results may not still hold.

fantastic ssd internals

A more recent paper in the same vein as the above. They publish a tool that automatically infers parameters.

Highlighted findings:

The authors were not able to validate any of their conclusions because manufacturers point-blank refuse to provide any information at all.