Yet more internal consistency!
I settled on a simple example that stresses most of the failure modes I'm aware of. The final view, `total`, should always contain a single row with the number 0. This makes it super easy to spot consistency violations.
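To show the shape of the idea, here is a minimal sketch (my own simplification, not the exact queries used in these tests) of a self-cancelling workload with that invariant:

```python
# Minimal sketch (my own simplification, not the exact setup used in
# these tests) of a self-cancelling workload: every transaction inserts
# a value and its negation, so any consistent view of the running total
# contains exactly 0.

import random

def transactions(n, seed=0):
    """Yield 2n events: each transaction contributes +x and -x."""
    rng = random.Random(seed)
    for txn_id in range(n):
        x = rng.randint(1, 100)
        yield (txn_id, +x)
        yield (txn_id, -x)

def total(events):
    return sum(value for _, value in events)

events = list(transactions(1000))

# A consistent system only ever reports totals over whole transactions:
assert total(events) == 0

# An eventually consistent system may report a total over a prefix that
# splits a transaction in half, which is visibly nonzero:
assert total(events[:501]) != 0
```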
I have reproducible examples for some systems.
- Materialize works fine, as expected, although getting event-time data in and out is a pain so I had to do something hacky with the new temporal filters feature. The other option is using the CDC protocol encoded in avro messages in kafka, which also seems like it's going to be a pain.
- In flink, using the table api, the value in `total` oscillates back and forth and typically only ~0.04% of the outputs are correct. This is roughly what I expect to see in an eventually consistent system. What I didn't expect is that the size of the oscillations seems to increase linearly wrt the number of inputs. Also ~13% of the outputs are the number 439 (the exact number varies, but every run has a large frequency spike on a single number). I have no explanation for this. I also found a bug in left joins.
- In kafka streams, I found data loss just streaming data in and out without any computation. I'm still unable to run any more complicated computation. It must be something weird about my environment or configuration, because people do use kafka streams, but I was able to reproduce it on a fresh ubuntu vm too. Neither the slack nor the mailing list has been able to help. So far about two thirds of the time on this project has been spent trying to persuade kafka streams to do anything, so unless someone else figures out the problem I'm not going to spend any more time on it.
- Test differential dataflow. I don't expect this to fail, but it will be a nicer example than the hacky stuff I did for materialize.
- Test the flink datastream api. This is a bit tricky, because it doesn't have the unrestricted joins that I need for the running example. I could either use an interval join with a really big interval, or do the same computation with a union and groupby. I'm not sure yet which one is the fairest comparison.
- Test spark structured streaming.
- Try ksqldb, using the official docker image provided by confluent. Hopefully this avoids whatever weird failure I'm having with kafka streams.
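For the datastream option above, the union + groupby reformulation of a join might look something like this (a hedged sketch in plain python just to show the shape of the computation, not flink code; `join_via_union_groupby` is my own name):

```python
# Sketch (my own, not from the post) of expressing an unrestricted inner
# join as a union followed by a group-by: tag each record with its side,
# union the two streams, group by key, and pair left values with right
# values inside each group.

from collections import defaultdict

def join_via_union_groupby(left, right):
    """left, right: iterables of (key, value). Returns (key, lval, rval) triples."""
    # union: tag each record with its source
    union = [(k, "L", v) for k, v in left] + [(k, "R", v) for k, v in right]
    # group by key
    groups = defaultdict(lambda: {"L": [], "R": []})
    for k, side, v in union:
        groups[k][side].append(v)
    # within each group, emit the cross product of the two sides
    return [(k, lv, rv)
            for k, g in groups.items()
            for lv in g["L"]
            for rv in g["R"]]

print(join_via_union_groupby([(1, "a"), (2, "b")], [(1, "x"), (1, "y")]))
# -> [(1, 'a', 'x'), (1, 'a', 'y')]
```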
At the moment many users are reporting issues getting joins to work out-of-the-box.
This is the only detailed description I've found of how the kafka streams join works. The Kafka devs are strongly opposed to using watermarks to track progress, so how do they know when to emit output from joins and when to wait for more input?
The answer? Wall-clock timeouts.
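To illustrate why that's worrying, here is a toy model (my own, not the actual kafka streams implementation) of a left join that gives up after a wall-clock timeout, so the output depends on arrival timing rather than on the data:

```python
# Hypothetical model (not the actual kafka streams internals) of a left
# join that emits unmatched output after a wall-clock timeout. The same
# logical input produces different results depending on how quickly the
# right side happens to arrive.

def left_join_wall_clock(left, right, timeout):
    """left, right: lists of (arrival_time, key, value).
    Each left record waits up to `timeout` seconds of wall-clock time
    for a matching key on the right, then gives up and emits None."""
    out = []
    for lt, lk, lv in left:
        match = next((rv for rt, rk, rv in right
                      if rk == lk and rt <= lt + timeout), None)
        out.append((lk, lv, match))
    return out

left = [(0.0, "k", "a")]
fast_right = [(0.5, "k", "b")]   # arrives within the timeout
slow_right = [(5.0, "k", "b")]   # same data, delivered a little later

print(left_join_wall_clock(left, fast_right, timeout=1.0))  # [('k', 'a', 'b')]
print(left_join_wall_clock(left, slow_right, timeout=1.0))  # [('k', 'a', None)]
```

The two runs see identical logical inputs but produce different outputs, which is exactly the kind of nondeterminism that watermarks are designed to avoid.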
The author of zig responded to my How safe is zig? post. Given that the General Purpose Allocator currently has no benchmarks and that there are reasonable concerns about it ever being performant, I don't think it's accurate to call it a solved problem. But it was remiss of me not to at least put an asterisk in the table pointing to future plans, given that the HN collective memory is likely to compress the entire post into "lol zig sux".
I was disappointed that the vast majority of the discussion never made it past the table, and I didn't see anything at all about the relative frequency of vulnerability root causes, which was the main thing I wanted to learn more about and the whole inspiration for the post.
I did learn that putting language-flamewar-adjacent material above the fold guarantees that no one will read below it.
There's been some noise recently about whether async language additions are worth the complexity. I'm intrigued by this conversation where Tyler Neely argues that most of the case for async is based on out-of-date benchmarks, and that later improvements to the linux kernel largely removed the performance difference.
There are a few places in materialize that use async, solely because all the available libraries switched to async, not because we actually expect to have eg 100k simultaneous postgres client connections. It's by far the most difficult code I've ever had to touch, and it would have been simple with threads. So naturally I'm interested in any argument that implies I shouldn't have to deal with that again.
SaaS software is routinely destroyed by market forces
I've been musing lately that progress on many consumer fronts is being held back by this. Rather than ratcheting improvements, we get an ebb and flow as some company produces a new tool that starts out very limited, improves over time, and finally is acquired and shuttered. Then we start again from scratch.
As a programmer I largely work on my own machine using software that I choose. I can automate my interactions via bash or sometimes even X. I can write my own tools that integrate with existing ones (eg because anki stores data in sqlite I was able to write a tool that extracts cards from my notes and syncs them with anki). I could give those tools to other people, maybe even make a business out of them. Much of my important data has outlasted the software that originally produced it (eg my website started with a homespun erlang app, then octopress, then plain jekyll, then zola). I can pin versions and upgrade when I wish (eg I have two zola sites on my machine with incompatible versions and I can't be bothered to change either). Even if the author goes on an incredible journey I can keep using the software for a long time.
None of this is the case for SaaS software. You can't interact with it except via approved apis. Automation via bots is usually against the terms of service. It upgrades itself just before your big presentation. It disappears as soon as google starts to view it as a potential competitor, and it takes all your data with it.
This is commonly presented as an economic or social problem - SaaS is just more profitable. But PvH makes a good case that this is a technical problem, and one that could be solved.
- AFL and similar fuzzers don't scale well wrt the number of cores because of:
- kernel contention in the fuzzer itself (eg writing corpus to disk)
- shared resources in multiple copies of the program being tested eg lockfiles, ports
- So instead:
- keep all fuzzer state in-memory
- write an emulator instead of running code directly
- (they mention writing x86 and arm emulators at their day-job, but for the live-stream they use risc-v)
- fake syscalls to avoid kernel contention
- jit to c to reduce emulator overhead
- insert additional checks in the emulator for eg byte-level memory sanitizer
- run the program to be fuzzed until it first interacts with the input, then snapshot the state of memory and restart from there
- track unique cases via the last few values of the program counter - anything more complex tends to lead to counting the same bug many times
- track branch coverage via a hash of program counter before and after every jump instruction
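That branch-coverage scheme might look roughly like this (my own AFL-flavoured approximation, not their code; the hash function and bitmap size are made up for illustration):

```python
# Sketch (my own, loosely AFL-style) of tracking branch coverage via a
# hash of the program counter before and after each jump, as described
# above. A previously unseen (from_pc, to_pc) edge counts as new coverage.

COVERAGE_BITS = 16
coverage = bytearray(1 << COVERAGE_BITS)  # coverage bitmap, kept in memory

def edge_hash(from_pc, to_pc):
    # mix the two program counters into a bitmap index (illustrative hash)
    return ((from_pc >> 1) ^ to_pc) & ((1 << COVERAGE_BITS) - 1)

def record_jump(from_pc, to_pc):
    """Called by the emulator on every jump. Returns True if the edge is new."""
    idx = edge_hash(from_pc, to_pc)
    new = coverage[idx] == 0
    coverage[idx] = 1
    return new

# A hypothetical trace: revisiting the same edge is not new coverage.
assert record_jump(0x4000, 0x4abc) is True
assert record_jump(0x4000, 0x4abc) is False
```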
The end result is that while their test cases per second are somewhat lower than AFL on a single core, they get linear speedup with 172 threads on 96 cores.
To avoid performance regressions in complex software it's useful to have benchmark tests in CI. But this only works if the noise in the measurements is sufficiently small, otherwise it's hard to figure out exactly which commit is responsible for the regression. I've read before about sqlite using cachegrind to get deterministic results but I recently stumbled across a discussion of the heroic effort to get deterministic hardware counter measurements for rustc.
Zig doesn't have goto. The strongest case for adding it is for instruction dispatch in interpreters, where having only a single dispatch point makes branch prediction harder. This issue has a fascinating discussion on whether llvm can be reliably nudged into generating the correct code, and whether it even matters as much as it did 10 years ago.
Relatedly, I heard third-hand of someone trying and failing to reproduce classic demonstrations of BTB pollution on recent intel cpus. So it may be that this just isn't an issue any more.