Many groups are working on running sql queries in incremental/streaming systems. Query planning in this context is not a well understood problem. This post is a quick pointer towards the many open problems. It's not intended to be complete or authoritative, because I haven't engaged with the problem much beyond noticing that it's hard and deciding not to attack it yet.
Pretty much every sql database does cost-based planning. Here is a very high-level view of the process:
- Turn the sql query into a query plan. That might look something like this:
project ├── group-by │ ├── left-join (hash) │ │ ├── scan users │ │ ├── scan posts │ │ └── filters │ │ └── user_id = users.id │ └── aggregations │ └── count │ └── user_id └── projections └── count_rows
- Use a list of rewrite rules to generate all possible equivalent plans.
- Collect statistics about the distribution of data in the database eg row counts, histograms of values in each column. (This usually happens in the background.)
- Use these statistics to estimate the size (cardinality) of the intermediate results at each point in the plan.
- Combine these cardinality estimates with heuristics about the costs of various operations to estimate the cost of each plan.
- Run the cheapest plan.
(If you want to learn more about query planning I recommend starting with CMU 15-721.)
Current streaming systems
Materialize and ksqldb (as of this Nov 2020 talk) both use heuristic planners. They work like this:
- Turn the sql query into a query plan.
- Apply a fixed list of rewrite rules repeatedly until the plan stops changing.
- Run the resulting plan.
Heuristic planners are really easy to write, which is why most databases start there. But because they know nothing about the data they are limited to only including rules that are beneficial in almost all cases.
Figure 11 of Query Optimization Through the Looking Glass shows how different choices of join ordering affect the runtime of some simple queries. In some cases the difference between the best plan and the worst plan is 5 orders of magnitude, and very few of the plans are within even 1 order of magnitude of the best. So that's what we might be leaving on the table with a heuristic planner.
Flink has a cost-based planner (source) which is a step in the right direction. I'm still puzzling over the code but afaict it uses a batch cost model (most of the stream nodes don't override the inherited cost model, but in the ones that do only one mentions update rate and that is hardcoded). It looks like alibaba are working on improving planning in streaming settings though.
As far as I know that's pretty much the state of the art in the open-source world.
Why can't we just use the same techniques for streaming queries as we do for batch queries?
Queries are long lived. The distribution of data might change over time, and we might not even have any data when we first install the query. A plan that was optimal yesterday might be a disaster today.
Queries are stateful. A faster plan might also require more memory. Switching to a different plan while running can incur expensive work to build the state required for the new plan. In the worst case, if historical input isn't still available then we might not be able to compute the new state at all.
Optimization has multiple objectives. In the classic database model the only thing we are optimizing for is total runtime. In a streaming system we can tradeoff between throughput, latency, memory usage, bandwidth, installation time, failure tolerance etc. In elastic systems this is even more complicated because our resources aren't fixed - maybe the faster plan also costs more money.
There are multiple queries. Queries can contend with each other for resources eg a shuffle-heavy plan might cause delays for other queries on the same network. At the same time, we can also consider sharing resources like indexes or even entire sub-graphs between queries. To best take advantage of this we need to optimize across multiple queries simultaneously. It might sometimes make sense to choose plans that are individually suboptimal if it allows more sharing between them. But this means that the cost estimate is no longer linear - the total cost of two plans is not just the sum of the costs of each plan - which makes the search problem much harder.
Data statistics are more complicated. It's not enough to just know the distribution of data at any one time. We also need to know how it evolves over time. Eg consider a table that grows continuously vs a table that stays small but has a high rate of change - they may have the same number of incoming updates but we should plan very differently for these cases.
Plans may get very deep. It's increasingly common to stack views on views on views. Unfortunately combining cost estimates exactly is currently intractable so we rely on simplifying assumptions (eg no correlation between different predicates) that introduce some degree of error. The deeper the plans get, the more these errors multiply.
I haven't done any kind of systematic literature review recently, so the directions below are kind of a random sampling of things I happen to be have stumbled over.
Exploring the plan space seems like a first step. It seems likely to behave differently than in a batch setting. For example, in a batch query when you're joining a big relation against a small relation, you want to build an index on the small relation and scan the big relation against it, rather than the other way around. But in a streaming setting you have to build indexes on both relations anyway so this entire dimension of decision making is taken off the table. For another example, for join ordering in a batch setting we want to minimize the size of the intermediate results, but in a streaming setting we also want to minimize the change rate of the intermediate results which might be achieved by an ordering with larger intermediate results. So we need to figure out which are the most consequential decisions for streaming plans.
Replanning queries has a rich academic literature in the context of long-running batch systems. Likely this could be adapted to incremental/streaming settings. I'm somewhat bearish on this approach because of the sheer operational terror of having your cluster suddenly decide to shift itself into a different and not-yet-tested configuration. But maybe we can find some control theory -like framework that bounds the costs of transitions.
There is also a lot of academic work on multi-query optimization. I don't really know anything about it yet other than that it exists, but it seems like an obvious seam to mine.
I've seen some work into robust cardinality estimation (to oversimplify - propagate error bars in estimates) but less into robust operators. If we build plans from operators that work well across large ranges of input distributions and that don't have catastrophic failure modes, then we won't be as reliant on accurate statistics. The example that came up in Query Optimization Through the Looking Glass is the nested loop operator in postgres which is faster in a small part of the input space but disastrously slower in the rest of it, making it a risky choice if your estimates might be inaccurate. There's also some folk wisdom that worst-case optimal joins are more robust to poor join orders than traditional binary joins, but I haven't seen this systematically tested. I think this is on the whole an under-explored direction.
I've also seen some work on worst-case cardinality estimation (eg). Rather than take a point estimate of the input data and emit a point estimate of the output cardinality, the approach is to take bounds on the possible input distributions (in the form of statements about the relative entropy of different subsets) and calculate a tight upper bound on the largest possible output cardinality across the whole space of possible inputs. I like this approach because it works better the more information it has but it doesn't make disastrous decisions when it has little information. It's effectiveness though is going to depend on whether we care more about finding the best plans or avoiding the worst plans, which in turn will depend on the distribution of costs across the space of plans, which brings us back to exploring the plan space.
Finally, I've done some tentative exploration into avoiding query planning entirely. If the space of query plans is simple enough then it's feasible to allow the programmer to choose the plan, either by providing hints or in my case by having a direct mapping from the query language to the plan. For the Join Order Benchmark it seemed that having a human spend 30s per query making the obvious calls is sufficient to get pretty good plans compared to postgres. But there were too many confounding factors in that test to be sure and I haven't revisited the work since to find out how the chosen plans compare to the best possible plans in the same execution engine.