Imp: decorrelation

Published 2020-02-02

This is a part of a series - start at the beginning.

In SQL, when a subquery references a variable from the surrounding query, this is called a correlated subquery.

If you were feeling a bit strange, you could use a correlated subquery to express a join. These two queries are identical (as long as bar.b is a primary key):

select bar.b
from foo left join bar
on foo.a = bar.b

select (
  select bar.b from bar where bar.b = foo.a
) from foo

The natural way to think about the semantics of a subquery is as a nested loop - "for each row in foo, run select bar.b from bar where bar.b = foo.a". But this is typically a bad way to execute a subquery, because much of the performance of a relational database comes from being able to execute operations in bulk to amortize interpreter overhead and maximize cache efficiency. So most databases try to transform correlated subqueries into an equivalent non-correlated query.
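
To make the bulk-vs-per-row point concrete, here is a minimal sketch in plain Rust (not imp; foo and bar are hypothetical in-memory columns) of the two strategies for the join above:

use std::collections::HashSet;

// Per-row: "for each row in foo, run the subquery against bar".
fn nested_loop(foo: &[i64], bar: &[i64]) -> Vec<Option<i64>> {
    foo.iter()
        .map(|a| bar.iter().find(|b| *b == a).copied())
        .collect()
}

// Bulk: build an index over bar once, then probe it for every row of foo.
fn bulk(foo: &[i64], bar: &[i64]) -> Vec<Option<i64>> {
    let index: HashSet<i64> = bar.iter().copied().collect();
    foo.iter()
        .map(|a| if index.contains(a) { Some(*a) } else { None })
        .collect()
}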

This process is called decorrelation and there are a lot of papers about how to do it not-too-terribly most of the time. Most databases do ok with simple queries but still have many cases where they resort to nested loops. Probably the most complete approach is the one used in HyPer, but in my experience at Materialize this approach often blocks later optimizations by producing diamond patterns in the plan DAG that are hard for the optimizer to reason about.

It's a pretty hard problem to solve, but heavy use of subqueries is not actually that common in SQL so it isn't that important to do a perfect job.

Correlated imp

In imp, we can express a join using ->, which despite mimicking a closure has semantics very similar to a SQL subquery.

foo (a -> bar a)

But unlike with SQL, in imp this is the only way to express a join so we had better do a good job with it. And nested loops aren't compatible with my plans for incremental maintenance, so there isn't even an escape hatch. We have to decorrelate everything, all the time.

I have an approach that decorrelates every imp expression, never falls back to nested loops, and shouldn't do asymptotically more work than the naive interpreter.

I haven't yet paid much attention to plan quality and I don't yet know if this approach can be adapted to SQL, so this is just the first step.

I'll focus on just the core ideas. The surrounding details are fiddly but not surprising.

Idea 1 - logical intermediate representation

There is a subset of imp that strongly resembles datalog or first-order logic: -> is used only to define top-level values, and the bodies use only expressions that return none or some.

Eg the example above could have been written as:

(a ->
  (foo a)
  &
  (bar a))

This subset is called the Logical Intermediate Representation. I'll write it using the regular imp syntax, but internally it's defined by:

// A definition of a value
pub struct ValueLir {
    pub name: Name,
    pub args: Vec<Name>,
    pub body: BooleanLir,
}

// A set of constraints, written as an expression which returns None or Some
pub enum BooleanLir {
    None,
    Some,
    Union(Box<BooleanLir>, Box<BooleanLir>),
    Intersect(Box<BooleanLir>, Box<BooleanLir>),
    ScalarEqual(ScalarRef, ScalarRef),
    Negate(Box<BooleanLir>),
    Apply(ValueRef, Vec<ScalarRef>),
}

// A reference to a value - a set of tuples of scalars
pub enum ValueRef {
    Name(Name), // something bound by `let`
    Native(Native),
}

// A reference to a scalar
pub enum ScalarRef {
    Name(Name), // something bound by `->`
    Scalar(Scalar),
}

Notably, in this subset -> appears only at the top level of a definition and is never applied to anything, so anything in this form must already be decorrelated.
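
As a concrete (if hand-written) illustration, and assuming Name is just a String-like alias, the LIR example above would be represented roughly like this:

// The LIR for `a -> (foo a) & (bar a)`, built by hand.
// `foo` and `bar` are values bound by `let` elsewhere.
let example = ValueLir {
    name: "example".to_string(),
    args: vec!["a".to_string()],
    body: BooleanLir::Intersect(
        // (foo a)
        Box::new(BooleanLir::Apply(
            ValueRef::Name("foo".to_string()),
            vec![ScalarRef::Name("a".to_string())],
        )),
        // (bar a)
        Box::new(BooleanLir::Apply(
            ValueRef::Name("bar".to_string()),
            vec![ScalarRef::Name("a".to_string())],
        )),
    ),
};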

We can transform almost any imp expression into LIR. Given an expression e with arity a, first generate a fresh variables (one per column) and rewrite e as vars... -> e vars.... Then recursively apply algebraic simplifications that are valid whenever an expression is applied to some scalar variables:

fn vlir(expr: Expression) -> ValueLir {
  let names = make_fresh_names(arity(expr));
  let body = blir(expr, names);
  expr!(names... -> body)
}

fn blir(expr: Expression, args: Vec<Name>) -> BooleanLir {
  assert!(arity(expr) == len(args));
  match expr {
    expr!(none) => expr!(none),
    expr!(some) => expr!(some),
    expr!(s) if is_scalar(s) => {
      let arg0 = args[0];
      expr!(s = arg0)
    }
    expr!(a | b) => {
      let a = blir(a, args);
      let b = blir(b, args);
      expr!(a | b)
    }
    expr!(a & b) => {
      let a = blir(a, args);
      let b = blir(b, args);
      expr!(a & b)
    }
    expr!(a x b) => {
      let a = blir(a, args[..arity(a)]);
      let b = blir(b, args[arity(a)..]);
      expr!(a & b) // not a typo!
    }
    expr!(!a) => {
      let names = make_fresh_names(arity(a));
      let a = blir(a, names);
      expr!(!!(names... -> !a))
    }
    expr!(n -> a) => {
      let arg0 = args[0];
      let a = blir(a, args[1..]);
      expr!((arg0 = n) & a)
    }
    expr!(a b) => {
      let min_arity = min(arity(a), arity(b));
      let names = make_fresh_names(min_arity);
      let (a_names, b_names) = if arity(a) > arity(b) {
        (names ++ args, names)
      } else {
        (names, names ++ args)
      };
      let a = blir(a, a_names);
      let b = blir(b, b_names);
      expr!(!!(names... -> a & b))
    }
  }
}

This is pretty easy to prove correct from the denotational semantics.

Unfortunately, the pseudocode above is missing a definition for let. (And for =, but it's a similar problem so let's focus on let.) The LIR only contains globally defined values, but imp code might have locally defined values. (This is something like using a CTE inside a subquery in SQL.) How do we get rid of them?

Idea 2 - abstract environments

Here is some code with a locally defined value bar.

foo (a ->
  let bar = ... in
  when (bar = 42)
    a)

We could just inline it, but that will lead to exponential explosions in code size in more complicated programs. Besides, we don't want to have to evaluate it from scratch each time we refer to it.

A better idea is to hoist it out into a global definition, by computing the value of bar for every value of a that we will need. Then when we refer to it, we can just grab the correct value for the current value of a in scope.

let bar = every_value_of_a (a -> a x ...) in
foo (a ->
  when ((bar a) = 42)
    a)

But how do we calculate in advance which values of a we will need later on?

If we ran the naive interpreter on the example, it would reach the a -> expression multiple times. Each time it would have a different scalar value bound to a. The set of all scalars bound at that point is the minimal set of scalars for which we must calculate bar.

One way to think about this is that where the naive interpreter executes code depth-first with a traditional function stack, we want to transform the code so that we can execute it breadth-first - representing every reachable loop iteration as a single set of bindings.

This creates a natural cost model - the eventual output of all these transformations shouldn't ever do asymptotically more work than the naive interpreter, because it traverses the exact same set of bindings, just in a different order.

But how do we represent these breadth-first environments? How would we combine them? How do we transform them into code?

Idea 3 - interleaving

After many, many dead ends, I finally figured out that the ideal representation is the BooleanLir from above. Any environment can be defined by a boolean constraint on the possible values of the free variables. Then constraints can easily be composed with &.

Eg the example above becomes:

let bar = (a b -> (foo a) & (... b)) in
foo (a ->
  when ((bar a) = 42)
    a)

Where (foo a) is the environment.

But the whole point of creating these abstract environments is that we need them in order to generate the BooleanLir representation for let. It seems like a circular dependency, but it turns out that the two can be interleaved quite nicely:

fn blir(expr: Expression, args: Vec<Name>, env: BooleanLir) -> BooleanLir {
  assert!(arity(expr) == len(args));
  match expr {
    expr!(none) => expr!(none),
    expr!(some) => expr!(some),
    expr!(s) if is_scalar(s) => {
      let arg0 = args[0];
      expr!(s = arg0)
    }
    expr!(a | b) => {
      let a = blir(a, args, env);
      let b = blir(b, args, env);
      expr!(a | b)
    }
    expr!(a & b) => {
      let a = blir(a, args, env);
      // can only reach b if a is not empty
      let env = expr!(env & a);
      let b = blir(b, args, env);
      expr!(a & b)
    }
    expr!(a x b) => {
      let a = blir(a, args[..arity(a)], env);
      // can only reach b if a is not empty
      let env = expr!(env & a);
      let b = blir(b, args[arity(a)..], env);
      expr!(a & b) // not a typo!
    }
    expr!(!a) => {
      let names = make_fresh_names(arity(a));
      let a = blir(a, names, env);
      expr!(!!(names... -> !a))
    }
    expr!(n -> a) => {
      let arg0 = args[0];
      // it would normally be painful to figure out what values n takes
      // but here n is bound to arg0, which carries the value across from whatever this expr is applied to
      let env = expr!(env & (arg0 = n));
      let a = blir(a, args[1..], env);
      expr!((arg0 = n) & a)
    }
    expr!(a b) => {
      // make sure `a` has finite type
      let (a, b) = if is_function(get_type(a)) {
        (b, a)
      } else {
        (a, b)
      };
      let min_arity = min(arity(a), arity(b));
      let names = make_fresh_names(min_arity);
      let (a_names, b_names) = if arity(a) > arity(b) {
        (names ++ args, names)
      } else {
        (names, names ++ args)
      };
      let a = blir(a, a_names, env);
      let b = blir(b, b_names, expr!(env & a));
      expr!(!!(names... -> a & b))
    }
    expr!(let n = a in b) => {
      let a_names = make_fresh_names(arity(a));
      let a = blir(a, a_names, env);
      // export a into some global data-structure
      globally_define(n, expr!(a_names... -> env & a));
      let b = replace_in(b, expr!(n), expr!(n args...));
      blir(b, args, env)
    }
  }
}

This is only a sketch and it handwaves away some tricky bookkeeping, but hopefully the idea is clear.

I still have a lot of open questions about the details of the transformation eg is it worth materializing shared parts of the environments to avoid redundant computation? But it's hard to make progress on these sorts of questions until I finish enough of the plumbing to create some decent benchmarks, so they'll have to wait.

Testing

Going from this Logical IR to a Physical IR that we can actually execute is another complicated problem, but luckily one that is already pretty well understood. For now I've implemented the simplest possible PIR and interpreter just for the sake of unlocking testing, but there will definitely be posts on this in the future when I get around to doing something reasonably clever.

Here is a repl that shows the terrible PIR.

Speaking of testing, now that we have both a naive interpreter and a PIR interpreter, we can compare them. libfuzzer is a fantastic tool that uses coverage information from the program binary to generate inputs that explore the state space. Using it via cargo-fuzz is as simple as writing a function that takes some input bytes and maybe crashes.

pub fn fuzz_eval(data: &[u8]) {
    if let Some(expr) = build_expr(data) {
        if let Ok((_typ, Value::Set(set1))) = eval_looped(expr.clone()) {
            let (_typ, set2) = eval_flat(expr).unwrap();
            assert_eq!(set1, set2);
        }
    }
}
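
For completeness, the cargo-fuzz target itself is just a thin wrapper around this function - something like the sketch below (the crate and target names here are made up):

// fuzz/fuzz_targets/eval.rs
#![no_main]
use libfuzzer_sys::fuzz_target;

// Hypothetical crate name - substitute whatever the library crate is actually called.
use imp::fuzz_eval;

fuzz_target!(|data: &[u8]| {
    fuzz_eval(data);
});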

The trick is all in how the bytes are interpreted. If treated as a string, we can drive imp all the way from the parser, but I find the fuzzer works better if we isolate the portion under testing as much as possible. So here is an idiom that seems to be pretty effective:

fn build_expr(data: &[u8]) -> Option<Expression> {
    let mut stack = vec![];
    let mut bytes = data.into_iter().peekable();
    while let Some(byte) = bytes.next() {
        let expr = match byte {
            0 => Expression::None,
            1 => Expression::Some,
            2 if bytes.peek().is_some() => {
                Expression::Scalar(Scalar::Number(*bytes.next().unwrap() as i64))
            }
            3 if bytes.peek().is_some() => {
                Expression::Scalar(Scalar::String(format!("str{}", bytes.next().unwrap())))
            }
            4 if stack.len() >= 2 => {
                Expression::Union(box stack.pop().unwrap(), box stack.pop().unwrap())
            }
            5 if bytes.peek().is_some() => {
                Expression::Name(format!("var{}", bytes.next().unwrap()))
            }
            // etc...
            _ => return None,
        };
        stack.push(expr);
    }
    stack.pop()
}
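
For example (assuming Expression derives PartialEq and Debug), the hypothetical input [0, 1, 4] pushes None, pushes Some, and then pops both to build their union:

assert_eq!(
    build_expr(&[0, 1, 4]),
    Some(Expression::Union(box Expression::Some, box Expression::None))
);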

That's it for this post. Next post will probably be on (carefully) adding loops to the language.