Imp: heterogenous types

This is a part of a series - start at the beginning.

Unlike sql, the denotational semantics for imp are carefully written to allow a set to contain rows of different lengths. The reason for that is that we can use such sets to model structs, tagged unions, first-class modules etc.

// defining a module
my_module:
| "inclinations", (
    | "alice", "reasonable"
    | "bob", "silent"
    | "eve", "evil"
)
| "exprs", (
    // tagged unions
    | 0, "constant", "bob"
    | 1, "constant", "alice"
    | 2, "union", 0, 1
)
| "plus_one", ?a a+1;

// usage

my_module "inclinations" "eve"
// = "evil"

my_module "exprs" 0 "constant"
// = "bob"

my_module "exprs" 2 "constant"
// = none

my_module "plus_one" 2
// = 3

The type system doesn't currently allow this, because one of the goals for imp is to have the type system catch silly mistakes. Accidentally mixing rows of different lengths is a mistake I make often. So I want to find some way to allow intentional uses of mixed row lengths and types, but still catch as many accidental uses as possible.

Take 1

My original plan was to have special syntax that desugared to the code above but would be treated specially by the type system:

// defining a module
my_module: (
  inclinations:
    | "alice", "reasonable"
    | "bob", "silent"
    | "eve", "evil"
  exprs:
    // tagged unions
    | 0, (constant: "bob")
    | 1, (constant: "alice")
    | 2, (union: 0, 1)
)
// =
// | "inclinations", "alice", "reasonable"
// | "inclinations", "bob", "silent"
// | "inclinations", "eve", "evil"
// | "exprs", 0, "constant", "bob"
// | "exprs", 1, "constant", "alice"
// | "exprs", 2, "union", 0, 1
//
// has type:
// (
//   inclinations: text, text
//   exprs: number, (
//     constant: text 
//     union: number, number
//   ) 
// )

// usage 

my_module :inclinations "eve"
// = "evil"

my_module :exprs 0 :constant
// = "bob"

my_module :exprs 2 :constant
// = none

my_module :plus_one 2
// = 3

I like that this syntax piggybacks on the existing def syntax and that there is a nice visual symmetry between creating heterogenous values and unpacking them.

Having separate syntax for dealing with heterogenous values vs homogenous values also makes it much easier to catch mistakes at compile-time because there is a clear indication of intent. "foo", 1 | "bar", "quux" will produce a type error but foo: 1; bar: "quux" will produce a heterogenous value, even though the denotational semantics for both are the same.

In this version the syntax for dereferencing is expr :name - the code :plus_one by itself isn't a valid expression. This works fine, but it means that if we want to write code that is generic over it's input then the keys are just string and there is nothing for the type-checker to specialize on.

But the biggest problem is that the types here are determined by syntax rather than by the values. That's fine in a programming language, but in imp I have a database which can sync changes between peers. With this design if I hand you an arbitrary database over the wire then there isn't an obvious way to assign a type to it, because it's just a pile of strings.

I could require the database to also have an attached schema, but then I have to deal with syncing incompatible schema changes between different peers. Depending on how flexible the schema language is, there might not even be a compatible schema eg if Alice adds (foo: 42) and Bob adds (foo: "yo") then there is no way to express the result in the type system above. I also think it's really powerful to be able to write type-checked queries over arbitrary input data without having to first define a schema for it. So compulsory schemas are definitely not an attractive solution.

Take 2

Rel has a different approach to this. First, it allows arbitrary type unions. Second, the syntax :foo produces a value with type :foo, exactly like Julia's value types. Rel calls these RelNames, but in the imp code they're currently called tags.

If imp worked this way the example above would look like:

my_module:
  | :inclinations, (
    | "alice", "reasonable"
    | "bob", "silent"
    | "eve", "evil"
  )
  | :exprs, (
    | 0, :constant, "bob"
    | 1, :constant, "alice"
    | 2, :union, 0, 1
  )
)
// =
// | :inclinations, "alice", "reasonable"
// | :inclinations, "bob", "silent"
// | :inclinations, "eve", "evil"
// | :exprs, 0, :constant, "bob"
// | :exprs, 1, :constant, "alice"
// | :exprs, 2, :union, 0, 1
//
// has type:
// | :inclinations, text, text
// | :exprs, number, :constant, text
// | :exprs, number, :union, number, number

my_module :inclinations "eve"
// = "evil"

my_module :exprs 0 :constant
// = "bob"

my_module :exprs 2 :constant
// = none

my_module :plus_one 2
// = 3

(We could still keep the special syntax for both definitions and types, but I wrote it without in these examples for maximum clarity.)

This solves both of the problems in take 1. The compiler can specialize generic code on the type of the tags, and we can assign a type to any arbitary set by just unioning the types of each row.

But now catching mistakes becomes very hard. Rel does very little static checking at the moment, likely because it's hard to discern whether a type mismatch is intentional or a mistake. Here are some examples translated into imp:

db:
  | :foo, 42
  | :bar, "yo";
inc: ?a a+1;
inc_foo: ?k ?v
  k :foo then :foo, v+1 else
  k :bar then :bar, v;

inc "foo"
// = none
// Can infer at compile time that this will never produce output.
// This code is almost certainly a mistake.

inc (1 | "foo")
// = 2
// Can infer at compile time that the string input will be discarded, but that the number might be used.
// This code is probably a mistake.

inc_foo db
// = :foo, 43 | :bar, "yo"
// `inc_foo` is specialized separately for :foo and :bar.
// In the :foo specialization can infer at compile time that the `k :bar` branch will never produce output.
// In the :bar specialization can infer at compile time that the `k :foo` branch will never produce output.
// This code is probably intentional.

inc_foo (db | :foo, "foo")
// = :foo, 43 | :bar, "yo"
// As above, but...
// In the :foo,text specialization can infer at compile time that the `v+1` will never produce output.
// This code is probably a mistake.

inc_foo (db | :quux)
// = :foo, 43 | :bar, "yo"
// As above, but...
// In the :quux specialization can infer at compile time that inc_foo will never produce output.
// This might be a mistake, or we might be deliberately discarding the :quux rows.

db :foo
// = 42
// Can infer at compile time that the :bar part of db will be discarded, but that the :foo part of db will be returned.
// This code is almost certainly intentional.

db :quux
// = none
// Can infer at compile time that this will never produce output.
// Might be a mistake, or might just be that the :quux rows in db haven't yet been inserted, or have all been recently deleted.

I implemented this and played around with it for a while. The flexibility is great, but I find it easy to make mistakes and very hard to debug them.

The problem is that some of the intention has been lost. Narrowing a type might be deliberate or might be an oversight. A expression that never produces output might a mistake or might just be a branch that isn't taken in this specialization or with the current database.

Think

I thrashed around on this for a full week and tried a lot of different mechanisms. It's easy to get focused on one set of problems, get excited about a mechanism that solves those particular problems and then forget to check that it solves all the others.

So let's actually write down all the constraints:

We want to be able to work with arbitrary sets of rows without requiring an upfront schema.
We want a guarantee that if two peers individually make valid changes to their dbs and then sync their changes, the combined change is also valid. This strongly limits what kinds of validation we can have at this layer. (See I-confluence).
We want to support short-lived interactive usage with minimal programmer overhead.
We want to be able to catch as many errors as possible in long-lived code.
From the point of view of the type-checker, it's hard to distinguish correct code from mistakes without additional information about intent.

This implies that a good solution would:

Be able to give a type to any valid data (1).
Allow arbitrary type unions, so that we can at least refer to and correct conflicts (2).
Only emit errors when the program is completely unrunnable (3).
Emit warnings for any code that looks remotely suspicious (4).
Allow adding additional information about intent to silence warnings (4 and 5).

Take 42

This is where I ended up.

We start with the bare bones of take 2:

The type of any expression is a union of row types eg :constant, text | :union, number, number.
The type of a tag is its value eg the type of :foo is :foo.

We only produce errors for code that we cannot execute eg testing the equality of two maybe-infinite sets.

We produce warnings if:

The type of an = or & is empty and neither input had empty type.
In an apply expression:
- The left input type is finite and non-empty, the right input type is finite and non-empty and the result type is empty
- The left input type is finite, the right type is non-finite and one of the row types in the left type has no matching type in the right type.
- The right input type is finite, the left type is non-finite and one of the row types in the right type has no matching type in the left type.
The condition of a then expression has a row type with length > 0.
The input types to native functions (eg +) have the wrong types.

Of the examples above, only db :foo doesn't produce warnings under these rules. The other examples are all suspicious.

To quell the compilers suspicions we can add extra information using some new syntax:

a is? b tests whether a is a subset of b. It returns (a & b) = a but can often be inferred to be none at compile time. It never produces warnings. This is useful for branching on type.
a is! b asserts that a ia a subset of b. It returns the value of a but produces a runtime error if a is not a subset of b. Produces a compile-time warning if the inferred type of a is not a subset of the inferred type of b. This is useful for the programmer to sanity-check their expectations.
a as b returns a & b but has the inferred type of b. This is useful for taking typed views of the database so that downstream code will not be broken by changes to the inferred schema over time.

To express types we also add some new infinite sets: number and text.

Let's see how we can rewrite some of the examples above to silence the warnings:

db:
  | :foo, 42
  | :bar, "yo";
inc: ?a a+1;
inc_foo: ?k ?v
  // use `is?` to test type
  k is? :foo then :foo, v+1 else
  k is? :bar then :bar, v;
  
inc ((1 | "foo") as number)
// = 2
// Used `as number` to explicitly discard the text row, so no warning.

inc_foo db
// = :foo, 43 | :bar, "yo"
// has type: :foo, number | :bar, text
// The type test using `is` does not produce warnings

inc_foo (:foo, 42)
// = :foo, 43
// has type: :foo, number
// Correctly infers that the :bar branch of inc_foo can't be reached with this input

inc_foo ((db | :foo, "foo") as (:foo, number | :bar, text)
// = :foo, 43 | :bar, "yo"
// The `as` filters out the `:foo, "foo"` row.

inc_foo ((db | :quux) as (:foo, number | :bar, text)
// = :foo, 43 | :bar, "yo"
// The `as` filters out the `:quux` row.

inc_foo db is! (:foo, number)
// Produces a runtime error because `:bar, "yo"` is not contained in `:foo, number`
// Produces a compile time warning, because the inferred type of `inc_foo db` is `:foo, number | :bar, text`

db: db as (:foo, number | :bar, text | :quux, number, text);
db :quux
// = none
// has type: number, text
// Uses `as` to tell the compiler that we expect that db might contain rows of type `:quux, number, text` in the future

I'm really excited with how this turned out. We get a schemaless relational database with sound type inference, accurate warnings in interactive usage and gradual typing for long-lived code. Plus an expressive query language that can emulate strongly typed first-class modules and algebraic types but can still easily work with unstructured data.

As a bonus, I also allow tags to be numbers. This is useful for referring to columns eg:

db:
| :foo, 42
| :bar, "yo";

db@ columns (:1, :0)@
// = 
// | 42, :foo
// | "yo", :bar
//
// has type: 
// | number, :foo
// | text, :bar

db@ columns (:1 | :0)@
// = 
// | :foo
// | :bar
// | 42
// | "yo"
//
// has type: 
// | :foo
// | :bar
// | number
// | text

db@ pivot
// (type number, row number, column number, value)
// =
// | :0, 0, :0, :foo
// | :0, 0, :1, 42
// | :1, 0, :0, :bar
// | :1, 0, :1, "yo"
// 
// has type:
// | :0, number, :0, :foo
// | :0, number, :1, number
// | :1, number, :0, :bar
// | :1, number, :1, text