I spent some time thinking about how to benchmark streaming systems. Though I didn't go through with the project, the notes might still be useful for others.
What is the point of benchmarking?
- Decision making and capacity planning. I have X queries over Y data with Z SLO - what software should I use and what hardware should I buy?
- Focus engineering effort on bottlenecks.
- Inform the design of future systems by understanding design tradeoffs.
Output of a good benchmark:
- Latency histograms at various throughputs (see the sketch after this list).
- Behavior over time, eg how long it takes to reach steady state and the distribution of pauses.
- Find the throughput at which the system can't reach a steady state.
- Figure out which hardware or software resource is the current bottleneck.
- Find the least expensive hardware configuration at which the system outperforms some hand-coded baseline (see COST).
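As a rough sketch of the first item (plain Python, nothing here is from a real harness - the samples and names are placeholders), the summary we want per (throughput, query set, data set, hardware config) is just percentiles over recorded per-query latencies:

```python
# Nearest-rank percentiles over recorded per-query latencies.
# This is a sketch - a real harness would use something like HdrHistogram.

def latency_summary(latencies_s):
    xs = sorted(latencies_s)
    def pct(p):
        return xs[min(len(xs) - 1, int(p / 100.0 * len(xs)))]
    return {"count": len(xs), "p50": pct(50), "p99": pct(99),
            "p99.9": pct(99.9), "max": xs[-1]}

if __name__ == "__main__":
    # placeholder samples, just to show the shape of the output
    print(latency_summary([0.003, 0.004, 0.004, 0.005, 0.250]))
```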
The above output is per hardware configuration, per query set, per data set. Far too many variables to explore exhaustively. We need a model of performance.
- What are the bottlenecks? Where are the walls?
- Enable back-of-the-envelope math (see the example after this list).
- Use experiments to try to falsify the model.
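The kind of back-of-the-envelope math I mean, with made-up numbers (10 Gbit NIC, 100 byte events - assumptions for illustration, not measurements):

```python
# If the model says ingest is NIC-bound, the ceiling on event rate is just
# bandwidth / event size. Numbers below are assumptions for illustration.

nic_bytes_per_s = 10e9 / 8   # 10 Gbit/s NIC
event_bytes = 100            # assumed average serialized event size
max_events_per_s = nic_bytes_per_s / event_bytes
print(f"NIC-bound ceiling: {max_events_per_s:,.0f} events/s")  # ~12,500,000

# An experiment that pushes well past this rate without saturating the NIC
# would falsify the 'NIC-bound' part of the model.
```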
Existing public benchmarks do none of this. They usually contain one (simple) query, one data set, one hardware config. They're only useful for marketing.
- The Yahoo Streaming Benchmark is overly simplistic and mostly measures JSON parsing.
- LDBC is reasonably well designed but only targets graph/RDF databases and as a result has little uptake.
- Various academic benchmarks keep emerging - it's a good way to get published - but they typically focus on incredibly shallow computations like word counts or graph motifs.
Benchmarking practices:
- Validate measurements.
- eg test loadgen vs a dummy server with precomputed answers to ensure that loadgen is not a bottleneck
- eg suspend the system under test and check that the reported latencies reflect the pause - this guards against coordinated omission (see the loadgen sketch after this list)
- Include good baselines.
- eg test vs simple single-threaded hand-coded program
- eg test vs repeated queries on a good batch system (like MemSQL)
- Separate loadgen from test machine to avoid interference.
- (somewhat tricky because it brings cloud networking back into play)
- Measure ingest separately from compute.
- Use the best available serialization - otherwise the benchmark mostly measures JSON parsing.
- Test that system output is correct.
- Have multiple queries in flight, not just a single synchronous loop, to test queue management.
- Use bare metal hardware to avoid interference from other cloud hosts and to get access to performance counters.
- Make sure benchmarks are reproducible with a single command, down to the hardware configuration and software versions.
- Separate results for consistent systems and inconsistent systems - don't incentivize trading correctness for performance.
- Seek review from system authors before publishing.
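For the loadgen points above, a minimal open-loop sketch (send_query is a hypothetical stand-in for issuing one query, or for hitting a dummy server with precomputed answers when validating the generator itself):

```python
import time

# Open-loop load generator sketch. Latency is measured from the *intended*
# send time, so if the system under test stalls (eg while suspended), the
# queueing delay shows up in the results instead of being silently omitted.

def run_open_loop(send_query, rate_per_s, duration_s):
    interval = 1.0 / rate_per_s
    start = time.monotonic()
    latencies = []
    i = 0
    while True:
        intended = start + i * interval
        if intended - start >= duration_s:
            break
        delay = intended - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        send_query()                                   # blocks until the answer arrives
        latencies.append(time.monotonic() - intended)  # includes any queueing delay
        i += 1
    return latencies

if __name__ == "__main__":
    # smoke test against a no-op "server"
    print(len(run_open_loop(lambda: None, rate_per_s=1000, duration_s=1.0)))
```

A real harness would keep several of these loops running concurrently so that multiple queries are in flight at once; the point here is only measuring from intended send times.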
Focus on benchmarking vertical scaling first:
- Most people's problems can be solved on one big machine.
- eg AWS i3en.metal: 96 CPUs, 768GB RAM, 8 x 7.5TB NVMe SSD, 10.848 USD/hour
- eg Scaleway HM-BM1-XL: 36 cores, 768GB RAM, 8 x 1TB NVMe SSD, 1.72 EUR/hour
- eg Fasthosts: 12 cores, 192GB RAM, 1 x 1TB NVMe SSD, 0.38 GBP/hour
- Have to understand one machine before understanding many.
- Cheaper to test, which means more tests and better understanding of the space of options.
- Easier to reproduce - cloud networking is notoriously variable.
Micro benchmarks:
- Identity pipeline (stresses de/serialization and networking - should be able to saturate a regular NIC).
- Data skew (both pre-shuffle and post-shuffle).
- Join skew.
- Stragglers (if I keep pausing one worker, will the rest of the system wait on it? see the sketch after this list).
- Install a new query and see whether the old query hiccups.
- Run multiple queries in parallel. Are they scheduled fairly? Does contention decrease total throughput as the number of queries goes up?
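A straggler-injection sketch for the point above (worker_pid is a hypothetical pid of one worker in the system under test; run it alongside the normal load and compare results with and without it):

```python
import os, signal, time

# Repeatedly freeze and resume one worker and watch whether the rest of the
# system waits on it. Pause/resume durations are arbitrary assumptions.

def inject_straggler(worker_pid, pause_s=1.0, resume_s=4.0, rounds=10):
    for _ in range(rounds):
        os.kill(worker_pid, signal.SIGSTOP)  # freeze the worker
        time.sleep(pause_s)
        os.kill(worker_pid, signal.SIGCONT)  # let it catch up
        time.sleep(resume_s)
```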
Macro benchmarks:
- OLTP, maybe TPC-C or LDBC social network? (stresses atemporal joins)
- OLAP, maybe TPC-H? (stresses grouping, aggregation)
- Graph stuff, maybe LDBC? (stresses atemporal joins, iterative computation)
- Summaries over user sessions (stresses temporal joins, event time processing, late arriving data, eviction of old state - see the generator sketch after this list).
- Test restarts / failure for one of the above (how big does the dataset have to be before the overhead of fault tolerance pays for itself?).
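For the session-summary workload, a sketch of the kind of generator I have in mind - event time can lag emit time, so the system has to handle late arrivals and eventually evict old session state (all parameters are made up):

```python
import random

# Generate user events whose event time sometimes lags the time they are
# emitted, to stress event-time processing, late-arriving data, and state
# eviction. Rates and lateness bounds are illustrative assumptions.

def session_events(n_users=1000, n_events=100_000, late_fraction=0.05,
                   max_lateness_s=300.0, seed=0):
    rng = random.Random(seed)
    emit_time = 0.0
    for _ in range(n_events):
        emit_time += rng.expovariate(1000.0)  # ~1000 events/s offered load
        event_time = emit_time
        if rng.random() < late_fraction:
            event_time -= rng.uniform(0.0, max_lateness_s)  # late arrival
        yield {"user": rng.randrange(n_users),
               "event_time": event_time,
               "emit_time": emit_time}

if __name__ == "__main__":
    for event in session_events(n_events=5):
        print(event)
```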