0007: yet more internal consistency, re: how safe is zig, async performance, local-first software, fuzzers and emulators, deterministic hardware counters, zig goto

Published 2021-04-03

Yet more internal consistency!

I settled on a simple example that stresses most of the failure modes I'm aware of. The final view, total, should always contain a single row with the number 0. This makes it super easy to spot consistency violations.
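
The harness itself isn't shown here, but the invariant is easy to sketch. Assuming the workload is random transfers between accounts (my guess, not stated above): every transfer debits one account and credits another by the same amount, so money is conserved and any consistent snapshot of the total is exactly 0. A non-zero total means the system exposed a state where only half a transfer was visible.

```python
import random

# Hypothetical sketch of the invariant, not the real harness: apply random
# balanced transfers and check that the total is 0 after every step. In the
# real test the sum lives in a view maintained by the system under test, and
# the interesting failures are snapshots where only half a transfer is visible.

def run_check(num_accounts=10, num_transfers=1000, seed=0):
    rng = random.Random(seed)
    balances = [0] * num_accounts
    for _ in range(num_transfers):
        src = rng.randrange(num_accounts)
        dst = rng.randrange(num_accounts)
        amount = rng.randint(1, 100)
        balances[src] -= amount   # debit one account...
        balances[dst] += amount   # ...credit another by the same amount
        assert sum(balances) == 0, "consistency violation"
    return sum(balances)
```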

I have reproducible examples for some systems.

Next steps:


At the moment many users are reporting issues getting joins to work out-of-the-box.

This is the only detailed description I've found of how the kafka streams join works. The Kafka devs are strongly opposed to using watermarks to track progress, so how do they know when to emit output from joins and when to wait for more input?

The answer? Wall-clock timeouts.
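
I haven't read the Kafka Streams code, so this is a hedged sketch of the shape of that answer rather than their implementation (all names here are made up): buffer each side of the join, match incoming records against the other side, and throw away buffered records once a wall-clock grace period has expired. A `now` parameter stands in for the wall clock to keep the sketch deterministic.

```python
# Hypothetical sketch of a wall-clock-timeout stream-stream join, not
# Kafka Streams' actual code. Records wait in a per-side buffer and are
# dropped once `timeout` wall-clock units have passed since they arrived.

class TimeoutJoin:
    def __init__(self, timeout):
        self.timeout = timeout
        self.left = {}    # key -> (value, arrival time)
        self.right = {}

    def _expire(self, buffer, now):
        expired = [k for k, (_, t) in buffer.items() if now - t >= self.timeout]
        for k in expired:
            del buffer[k]

    def push(self, side, key, value, now):
        """Feed one record; return any join outputs produced."""
        mine, other = (self.left, self.right) if side == "left" else (self.right, self.left)
        self._expire(self.left, now)
        self._expire(self.right, now)
        out = []
        if key in other:
            out.append((key, value, other[key][0]))
        mine[key] = (value, now)
        return out
```

The consistency hazard: a matching record that arrives after the grace period silently produces no output, so how much join output you get depends on how the wall clock happened to advance between records, not on the data.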

The author of zig responded to my How safe is zig? post. Given that the General Purpose Allocator currently has no benchmarks and that there are reasonable concerns about it ever being performant, I don't think it's accurate to call it a solved problem. But it was remiss of me not to at least put an asterisk in the table pointing to future plans, given that the HN collective memory is likely to compress the entire post into "lol zig sux".

I was disappointed that the vast majority of the discussion never made it past the table, and I didn't see anything at all about the relative frequency of vulnerability root causes, which was the main thing I wanted to learn more about and the whole inspiration for the post.

I did learn that putting language-flamewar-adjacent material above the fold guarantees that no one will read below it.

There's been some noise recently about whether async language additions are worth the complexity. I'm intrigued by this conversation where Tyler Neely argues that most of the case for async is based on out-of-date benchmarks, and that later improvements to the linux kernel largely removed the performance difference.

There are a few places in materialize that use async, solely because all of the available libraries switched to async, not because we actually expect to have eg 100k simultaneous postgres client connections. It's by far the most difficult code I've ever had to touch, and it would have been simple with threads. So naturally I'm interested in any argument that implies I shouldn't have to deal with that again.

Peter van Hardenberg gave a talk on local-first software. There isn't much new in there if you've already read the original post, but he made a comment in the Q&A that stuck with me:

SaaS software is routinely destroyed by market forces

I've been musing lately that progress on many consumer fronts is being held back by this. Rather than ratcheting improvements, we get an ebb and flow as some company produces a new tool that starts out very limited, improves over time, and finally is acquired and shuttered. Then we start again from scratch.

As a programmer I largely work on my own machine using software that I choose. I can automate my interactions via bash or sometimes even X. I can write my own tools that integrate with existing ones (eg because anki stores data in sqlite I was able to write a tool that extracts cards from my notes and syncs them with anki). I could give those tools to other people, maybe even make a business out of them. Much of my important data has outlasted the software that originally produced it (eg my website started with a homespun erlang app, then octopress, then plain jekyll, then zola). I can pin versions and upgrade when I wish (eg I have two zola sites on my machine with incompatible versions and I can't be bothered to change either). Even if the author goes on an incredible journey I can keep using the software for a long time.

None of this is the case for SaaS software. You can't interact with it except via approved apis. Automation via bots is usually against the terms of service. It upgrades itself just before your big presentation. It disappears as soon as google starts to view it as a potential competitor, and it takes all your data with it.

This is commonly presented as an economic or social problem - SaaS is just more profitable. But PvH makes a good case that this is a technical problem, and one that could be solved.

This is the first video in a series where the author live-codes a fuzzer that can outperform AFL. I didn't watch the whole 30+ hours, but from skimming at 2x speed I picked up the gist.

The end result is that while their test cases per second are somewhat lower than AFL on a single core, they get linear speedup with 172 threads on 96 cores.
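
For concreteness, here's a minimal sketch of the coverage-guided loop that both this fuzzer and AFL are built around (toy code, nothing from the video): mutate an input from the corpus, run the target, and keep the mutant only if it reaches coverage we haven't seen before.

```python
# Toy coverage-guided fuzzing loop. The target function stands in for
# "run the program and report which branches it hit" - in a real fuzzer
# that comes from instrumentation or, as in the video, an emulator.

import random

def fuzz(target, seed_input, iterations=2000, rng=None):
    rng = rng or random.Random(0)
    corpus = [seed_input]
    seen = set()          # union of all coverage observed so far

    def mutate(data):
        data = bytearray(data)
        if data:
            data[rng.randrange(len(data))] = rng.randrange(256)
        return bytes(data)

    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus))
        coverage = target(candidate)   # set of branches this input hit
        if not coverage <= seen:       # any new coverage? keep the input
            seen |= coverage
            corpus.append(candidate)
    return corpus, seen

# A toy target whose "coverage" is the set of branches it takes.
def toy_target(data):
    cov = {"start"}
    if data and data[0] == ord("b"):
        cov.add("saw_b")
    return frozenset(cov)
```

In the parallel version each thread presumably runs its own copy of this loop, sharing only the corpus and the coverage set, which is how you'd get the near-linear scaling mentioned above.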

To avoid performance regressions in complex software it's useful to have benchmark tests in CI. But this only works if the noise in the measurements is sufficiently small, otherwise it's hard to figure out exactly which commit is responsible for the regression. I've read before about sqlite using cachegrind to get deterministic results but I recently stumbled across a discussion of the heroic effort to get deterministic hardware counter measurements for rustc.

Zig doesn't have goto. The strongest case for adding it is for instruction dispatch in interpreters, where having only a single dispatch point makes branch prediction harder. This issue has a fascinating discussion on whether llvm can be reliably nudged into generating the correct code, and whether it even matters as much as it did 10 years ago.
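
For concreteness, here's the dispatch structure in question as a toy stack machine (a Python sketch, nothing from the linked issue). Every opcode funnels through the single indirect branch at the top of the loop; the computed-goto version that C allows instead ends each opcode handler with its own jump, giving the branch predictor separate history per opcode.

```python
# A toy interpreter with a single dispatch point. All opcodes funnel
# through the one branch at the top of the loop, so the predictor sees
# one hard-to-predict target. With computed goto (in C: `goto *table[op];`
# at the end of every handler), each handler gets its own dispatch branch
# and the predictor can learn per-opcode patterns.

def run(program):
    stack = []
    pc = 0
    while pc < len(program):
        op, arg = program[pc]   # the single dispatch point
        pc += 1
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "ret":
            return stack.pop()
    return None
```

For example, `run([("push", 2), ("push", 3), ("add", None), ("ret", None)])` evaluates 2 + 3.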

Relatedly, I heard 3rd hand of someone trying and failing to reproduce classic demonstrations of BTB pollution on recent intel cpus. So it may be that this just isn't an issue any more.