Unexplanations: query optimization works because sql is declarative

Published 2024-02-21

Here is a simple sql query.

    select sum(posts.likes)
    from posts
    where posts.user_id = users.id
  ) as post_count
from users;

If we read this query literally, we're asking to perform two nested loops - the first over users and the second over posts. Executed naively this would take O(|users| * |posts|) time, but a query planner can apply clever transformations to turn this query into a hashjoin and groupby/aggregate, for a total cost of O(|users| + |posts|) time.

postgres> explain select users.id, (select sum(posts.likes) from posts where posts.user_id = users.id) from users;

                             QUERY PLAN
 Seq Scan on users  (cost=0.00..90649.75 rows=2550 width=12)
   SubPlan 1
     ->  Aggregate  (cost=35.52..35.53 rows=1 width=8)
           ->  Seq Scan on posts  (cost=0.00..35.50 rows=10 width=4)
                 Filter: (user_id = users.id)

Ok, postgres actually does perform two nested loops - it's not very good at subquery optimization yet. Let's try a more aggressive query planner.

cockroach> explain select users.id, (select sum(posts.likes) from posts where posts.user_id = users.id) from users;

            tree           |       field       |   description
                           | distributed       | true
                           | vectorized        | false
  render                   |                   |
   └── group               |                   |
        │                  | aggregate 0       | id
        │                  | aggregate 1       | sum(likes)
        │                  | group by          | id
        └── render         |                   |
             └── hash-join |                   |
                  │        | type              | left outer
                  │        | equality          | (id) = (user_id)
                  │        | left cols are key |
                  ├── scan |                   |
                  │        | table             | users@primary
                  │        | spans             | FULL SCAN
                  └── scan |                   |
                           | table             | posts@primary
                           | spans             | FULL SCAN

That's better.

Compilers for most programming languages can't perform such dramatic transformations, but compilers for sql can. If we ask why, most people answer that it's because sql is declarative.

Declarative programming is a non-imperative style of programming in which programs describe their desired results without explicitly listing commands or steps that must be performed.

-- wikipedia

If we rewrote the query above in an imperative language like python, the structure would be totally different.

# extra whitespace for clarity
      for post in posts 
      if post.user_id == user.id
  for user in users

Huh, that kinda looks the same as the sql query. Maybe list comprehensions are also declarative?

Let's try again.

result = []
for user in users:
  total = 0
  for post in posts:
    if user.id == post.user_id:
      total += post.likes
  result.append((user.id, total))

Now that looks imperative! Although all we've done is desugar the list comprehensions and inline the definition of sum, which from the compilers point of view isn't that much of a change.

But in the imperative version the python compiler can't optimize away the nested loops! Whereas in the declarative version the python compiler, uh, still can't optimize away the nested loops.

Maybe the problem here is that python is just too loosey-goosey. Wikipedia says that haskell is a declarative language - let's try that instead.

    id user, 
    sum [
      likes post
      | post <- posts,
      user_id post == id user
  | user <- users

Time that sucker - it's still quadratic. But ghc is a much more sophisticated compiler than most sql optimizers. Why can't it do the same transformations?

It looks like "query optimization works because sql is declarative" is just getting us confused.

I like to call this kind of answer an unexplanation. It sounds like an explanation. It's satisfying enough that many people will accept it and repeat it. And if an answer is repeated often enough then everyone knows that the answer is correct.

But if you try to use that answer to actually do anything then it falls apart. It's not wrong, exactly, but it's not precise enough to make useful predictions.

Instead of waving around vague adjectives, let's imagine we wrote compiler that can take these nested list comprehensions...

result = (
      for post in posts 
      if post.user_id == user.id
  for user in users

...and turn them into this efficient join:

totals = defaultdict(int)
for post in posts:
  totals[post.user_id] += post.likes
result = []
for user in users:
  result.append((user.id, totals[user.id]))

Is this transformation correct?

Before you answer, let's look at the rest of the code.

class User:
  def __init__(self, id):
    self._id = id

  def id(self):
    self._id += 1
    return self._id

Getting user.id also adds one to the id.

class Likes:
  def __init__(self, count):
    self.count = count

  def __add__(self, likes):
    return Likes(self.count + (2 * likes.count))

  def __radd__(self, int):
    return Likes(self.count + (2 * int))

  def __sub__(self, likes):
    return Likes(self.count - (2 * likes.count))

Addition of Likes is not commutative or associative.

class Post:
  def __init__(self, user_id, likes):
    self.user_id = user_id
    self._likes = likes
  def likes(self):
    Likes.__add__ = Likes.__sub__
    return self._likes

Getting post.likes changes how addition behaves for the Likes class in future calls.

class Posts:
  def __init__(self, posts):
    self.posts = posts

  def __iter__(self):
    return self.posts.__iter__()

Iterating over posts randomly shuffles the contents of users.

class Id:
  def __init__(self, int):
    self._id = int

  def __hash__(self):
    return id(self)

  def __eq__(self, other):
    return self._id == other._id

The Id class hashes by identity but compares by value, so the groups in totals will get split up in some non-deterministic way.

users = (User(i) for i in itertools.count())

Lastly, there are an infinite number of users. The original query returns a generator that can be consumed lazily, but the transformed query never returns.

Now that we've seen the rest of the code, it's clear that in this case the transformed query won't produce the same results as the original query.

Since python is dynamically typed and late bound, the compiler can't know in advance if we're going to call code as unreasonable as above, so it can't possibly reason about whether the transformation is valid or not.

So our minimum requirements for interesting program transformations include:

  1. Static typing (so that we know what values look like).
  2. Static dispatch (so that we know which functions will be called).
  3. Some way to reason about the properties of functions (eg that a == b implies hash(a) == hash(b) or that addition is commutative).
  4. Some way to reason about which values are finite and which functions terminate.
  5. Some way to reason about the time and space cost of functions (so that we can predict if a given transformation is worthwhile).
  6. Some way to reason about the effects caused by calling functions (eg by only allowing pure functions, or by tracking effects and aliasing in the type system).

Python fails on all of these requirements.

Haskell satisfies 1-3 pretty well, but we can still write tricky code that makes 4 and 5 difficult (eg space leaks are notoriously hard for humans to reason about, let alone for the compiler which doesn't understand the intent of the code). And haskell's pure functions mostly satisfy 6, but they can still throw exceptions.

user_id (Post uid likes) = if (uid == 0) then (error "Oh no") else uid

Although, to be fair, sql functions can also throw exceptions. That does mean that many optimizations can change the result of the query, but we try not to think about it.

sqlite> select count(foo.bar/0) from (select 1 as bar) as foo where foo.bar = 0;

postgres> select count(foo.bar/0) from (select 1 as bar) as foo where foo.bar = 0;
ERROR:  division by zero

So on the one hand, despite being described as declarative, haskell makes it very difficult to automatically perform these sql-like optimizations. And on the other hand if we designed an imperative language that satisfied 1-6 (perhaps using effect types or mutable value semantics), then that language probably would be amenable to sql-like optimizations.

Drawing a broader lesson, the original 'explanation' seemed quite reasonable on the surface but the vagueness of the term 'declarative' was doing a lot of work. Operationalizing the question by imagining how we would try to implement a query optimizer for various different languages forced us to deal with the concrete details. The more concrete we get, the harder it is for fuzzy thinking to hide.

This is important not because we actually want to run a query optimizer on python, but because it helps us think clearly about the design of future query languages and programming languages.

Find more unexplanations here.