Imp: boxes

This is a part of a series - start at the beginning.

Previous posts had fancy interactive examples. Keeping these in sync with constant language changes was taking a lot of time, so I'm abandoning them for now. They'll return when the language has settled enough to write docs.

This is the thorniest part of imp. So far we've been dealing with first-order relations - sets of tuples of scalars. These can all be compiled down into a static dataflow graph and then incrementally maintained.

I want to also be able to express higher-order relations - relations which contain other relations. Depending on your point of view you can think of this feature as providing nested relations and first-class functions, or as providing the ability to abstract over common patterns in the dataflow graph.

So let's say we add some new syntax. [expr] evaluates the expr and then puts the resulting relation in a 'box' - an object which contains a relation but otherwise behaves like a scalar value ie we can put it inside other relations. The counterpart is [arg] -> expr which behaves like the regular arg -> expr but unboxes arg, restoring the original relation.

So. Job done, time for tea, right? Right?

Problem the first

If we write:

let id1 = a -> a in
let id2 = b -> b in
[id1] | [id2]

How do we decide whether the result contains one box or two?

We could say that two boxes are equal whenever the relations they contain are equal, but equality of Turing-complete functions is not computable. Restricting boxes to containing known-finite sets would makes this work but would mean that we lose first-class functions.

The other extreme would be to say that boxes are only ever equal to themselves. But what does that even mean in a language that doesn't have an evaluation order? We can't just say that a new box is created every time we evaluate the box expression because we don't define when the expression gets evaluated. And if we did it would make incremental maintenance much harder.

So we need to be careful about where the identity of a box comes from.

First, we'll pretend that let n = a in b was syntax sugar for [a] ([n] -> b) all along! Now the value of every variable in the core language is always a scalar and we can reconstruct the value of any expression at any point in time knowing only the values of the scalars in scope.

When we make a box, what we actually store is not the relation itself but:

The location of the expression in the source code
The value of every scalar that is used in that expression (some of these scalars might themselves be boxes, so we get a tree of values)

Two boxes are equal when they were created by the same location in the code and when they close over the same scalar values.

A side effect of this is that every relation stores how it was originally created. This will turn out to be crucial later when we get to incremental maintenance and we need to know how to flow diffs through the dataflow graph. I think this might actually be the core requirement for efficient incremental maintenance - being able to compactly store and reconstruct the entire state of the program at any point.

Problem the second

The imp type system rules out programs that are too hard to evaluate. The type of an expression tracks the arity of the resulting relation, and also whether that relation is provably finite.

So, what is the arity of ([a] -> a)? This is a function that takes a box and returns the contents of the box, which might have any arity.

We could insist that every such unboxing is accompanied by a type annotation like ([a :: (number . text)] -> a). That would solve the problem.

But it's actually very useful to be able to write functions that operate on inputs of arbitrary types, so I'm also experimenting with type-checking such expressions lazily. Without type annotations ([a] -> a) would be given the type type_of(<expr at line 1 col 1>) and only further type-checked when applied to another expression, at which point we know the type of a. (This is very similar to the way the var type works in Zig.)

This requires also tracking the type of the relation inside each box, but that's useful for incremental maintenance anyway - if we know the origin of a box at it's unboxing point then we can statically determine the resulting dependency edge in the dataflow graph.

Problem the third

Russell's paradox is "does the set which contains all sets that don't contain themselves, contain itself?".

let rp = [a] -> !(a ([b] -> a == b)) in
rp [rp]

The root of the problem is that unrestricted set comprehension is too powerful and leads to contradictions. If we denote ([a] -> ...) naively as ranging over all possible relations then the denotational semantics will be inconsistent.

The thing is, we don't care about all possible relations. We only care about ones that the program in question can actually produce. In the process of checking lazy types the type-checker enumerates all the box types that each unbox could actually be applied to. Each of the box types denotes a set of well-founded sets, so we can just retroactively declare that that the range of the unbox was the union of all those box types.

That resolves the theoretical problem, and I'll save for a future rainy day the practical problem of turning non-well-founded programs into a helpful error message instead of a type-checker stack overflow.

Boxes are a surprisingly useful primitive. We've already seen them produce nested relations and first-class functions while preserving incremental maintenance, but with a little more work we might also get product types and first-class modules for free:

let module = [
  "add" . [a b -> a + b] |
  "addmul" . [a b c -> a + (b * c)]
] in
module:add 1 2

The idea here is that a:b is new syntax that would desugar to a ([a] -> a) "b" ([b] -> b). Since each box has a unique type and we track those types, we should be able to constant-fold it away at compile-time and insert a static call to the correct function.

Similarly:

let complex = i j -> ["i" . [i] | "j" . [j]] in
let add = a b -> complex (a:i + b:i) (a:j + b:j) in
add (complex 2 1) (complex 2 0)

Since the box only stores the box id and the closed-over scalars this actually implies a pretty decent physical representation, while still preserving the illusion that it's a relation.

There's still some design work left to do around how to name box types, infer the results of dereferencing and surface good errors, but I think the idea is fundamentally sound.