Miscellaneous ideas

Published 2024-03-23

a better sql

As much as I complain about sql, it's historically been too entrenched to be worth trying to compete. Building a database is really hard, and adopting a new database is really risky, so both vendors and customers have strong incentives to be risk-averse. ORMs, BI tools, database vizualizers etc all add to the incredibly strong network effects.

The unbundling of OLAP databases changes the tradeoffs. With more ecosystems moving to open data formats like Arrow/Parquet/ORC and libraries for query execution and optimization like Velox/DataFusion/Polars, it's possible to adopt a new query language without also adopting a whole new database.

Last week I helped a friend who was struggling to write a relatively simple query against some sparse data in hive. Despite understanding exactly why the query was awkward in sql, my first attempt was also wrong. Even with the correct query, hive's query planner duplicated the CTEs so that each reference to a CTE in the query ended up recomputing the CTE from scratch. This kind of interactive data exploration and analysis seems like a good crack in which to insert a non-sql wedge: the query planning and execution is not as high-quality as in the database world, the data is already accesible to 3rd party tooks, queries are being written by humans rather than BI tools, and queries are written often and interactively so the pain is repeatedly felt.

There are a scattering of alternative query languages around. Roughly three categories:

I've sketched rough design principles elsewhere, and recently I've been thinking about exactly what limitations are necessary on a language to make query planning tractable. Languages aimed at replacing sql mostly seem to starting from the same original foundation of first-order relations over scalar types. But sql was relentessly pushed towards nested relations (subqueries, lateral joins), complex types (rows, arrays, json), functions, and embedded procedural languages (pl/pgsql, t-sql etc). Might as well design those features in from the start rather than paste them awkwardly to the sides afterwards. I think it's possible to push much closer to existing functional/imperative languages without losing query planning.

The business model is trickier though. If I don't want to build an entire database then I'm shipping a library, which means no SaaS pricing model. Consulting and support seems like the most likely outcome, although it might also be possible to make a thin-crust open-source model work.

in-browser database

There's been some small demand for this for a while (eg datascript, pouchdb). Recently, embedding sqlite.wasm is also becoming popular (eg electric sql, sqlsync, riffle, powersync).

It's usually pragmatic to just add a layer to an existing thing rather than make a new thing. But every now and then we have to look at the giant wobbling pile of layers and wonder if it's time to start from scratch. We're shipping ~1m of wasm and js in order to run a bytecode interpreter inside a wasm jit, writing btrees to a virtual file-system on top of localStorage (which in firefox is itself built on top of sqlite). Paying for a transaction log and concurrency control in a system that probably doesn't have any meaningful concurrency (eg a typical react app would only mutate persistent state in event handlers, and only query state while updating the virtual dom). And on top of all that you still have to glue sql strings together in javascript.

What is sqlite actually providing here? I think the core requirement is managing the tension between server<->client sync and client<->dom sync: existing backend databases prefer normalized data, but frontend frameworks want tree-shaped input data. Sql isn't ideal for traversing that divide, since support for working with tree-shaped data (eg json) was such a late addition to the language. There's almost an advantage to sharing a query language with the backend database, but most of the projects above are using postgres on the backend so dialect differences become a problem anyway.

Perhaps sqlite.wasm could be supplanted by a designed-for-the-browser database providing:

The tricky part is sync and conflict resolution - there are still many competing models and it's not immediately obvious whether it's possible to support them separately or whether one of them has to be baked into the core of the database. Similarly, if sqlite on the backend becomes more popular then it might have an advantage from being able to share queries between frontend and backend.

The options for business models look much like the last idea.

wasm/wasi tooling

A coverage-guided fuzzer needs to analyze code to insert instrumentation at branches, and to be able to quickly snapshot the state of execution. A reversible debugger like rr needs to be able to capture all interactions with the outside world and all sources of non-determinism. Deterministic simulation testing needs the same, plus the ability to virtualize the outside world.

Wasm and wasi really lend themselves to this kind of tooling. Wasm is easy to parse and analyze, is almost completely deterministic (barring NaN representation and optional support for threads), is sandboxed to a single region of memory, and can only interact with the outside world via imported functions. Wasi is explicitly designed to be easily virtualized. The stage is set for wasm ecosystems to have amazing testing tooling.

What exists right now is not amazing. The only fuzzer I can find is from this 2021 paper hooking afl++ up to wavm. For debugging, the best bet is to either compile to native code and use gdb/lldb/rr or to load the program in chrome.

A high quality open-source testing tool would lend itself to a SaaS partner business, doing continuous fuzzing from CI builds and reporting errors by dropping you into a time-travel debugger at the failure point. As the wasm component model matures it might even be possible to exercise libraries automatically using their wit definitions.

The risk is that wasm-specific tools aren't sufficiently better/faster/cheaper than compiling the wasm to native code and using native tools. Clusterfuzz and antithesis are steep competition.

streaming systems

I spent some time in 2021 testing existing streaming systems. I actually wanted to test fault handling, but almost all of the streaming systems I tested produced nonsense results even without faults. Occasionally I even saw violations of eventual consistency, where the incorrect results did not correct themself once I stop supplying input data.

I haven't seen much impact from this work. New streaming systems are still appearing and none of the marketing mentions consistency at all.

If I tested this again I'd focus only on sql systems. It's arguable that in streaming systems where the programmer is directly building topologies out of the underlying primitives then the programmer should understand all the ways in which those primitives can fail to produce the expected results. But when the programmer is just supplying a sql query then there is no way for them to reason about the underlying behaviour. If a streaming sql query produces results that would be impossible to see when running the same query in a batch system on any subset of the input data, then there is no way for the programmer to reason about the system at all.

My totally uninformed guesses would be:

Based on Kyle Kingsbury's experiences with Jepsen, I expect that shifting the industry towards more correct systems would be a grinding uphill battle. I don't think I have that kind of grit or calm professionalism. But there is potential for someone to do incredibly valuable work here. If you want to be the Kyle Kingsbury of streaming systems, I will happily help.