Zest: notation and representation

(This is part of a series on the design of zest. See the list of posts here.)

I want to be able to describe all data with a consistent notation when writing code, printing values or in graphical inspectors (see the shape of data).

On top of this notation, I have two goals that are in tension. I want to have some degree of control over the memory layout of values, rather than adopting the pointer-heavy universal layouts used in most garbage-collected languages. But I also want to be able to easily move data between processes, and also seamlessly live-reload code without erasing the heap (see there are no strings on me).

nominal vs structural

Layouts in most systems languages are dictated by nominal types eg a struct in c/rust/zig. Nominal types don't play well with the goal of a universal notation eg what happens if I copy a printed value from one repl to another repl that doesn't have a definition for that type? Or what if it has a conflicting definition? Or a definition with the same structure but different invariants?

One of the fundamental characteristics of a nominal type is that two separate definitions are not equal, even if they have the same name and structure. This means that the entire concept of a nominal type is tied to a compilation model and to a process lifetime. If I compile a program once and run two copies of the executable, should type foo from the first copy be considered equal to type foo from the second copy? What if I compile the same code at two different optimization levels? Or for two different platforms?

So usually communication in nominally typed languages is managed by defining a mapping in and out of some other notation like json, where the mapping from json to the nominal type is responsible for checking and enforcing the invariants of that type. The identity of the type only exists within the boundary of a single process, so it's invariants have to also be enforced at that boundary.

Relying on serialization is tricky in interactive contexts too. If I read in the notation for a foo but I depend on two different versions of the foo package, which one should I construct? In code this is usually solved by specifying a type in the deserialization call. But in a repl, inspector or debugger this is less straightforward - I might copy some data from one tool and paste it into another to analyze, only to find out that what I pasted can't be deserialized in a different context.

By contrast, structural types have a notion of equality that is not tied to process lifetime. If I take two totally different programs I can still meaningfully ask if type foo in one program is the same as type bar in the other program. Better still if I can serialize the type foo and send it along with the data!

So in zest I've ended up with two layers: every value has a json-like notation paired with a structurally-typed representation that dictates layout.

notation

The notation is limited to numbers, strings and objects.

Numbers are sequences of one or more digits, optionally with a decimal point and/or minus sign.

Strings are something something unicode...

'Hello, world'

Objects contain zero or more key-value pairs. Order doesn't matter. Keys may not be repeated. Trailing commas are optional.

[
  'name': 'Alice',
  'age': 23,
  'job': 'Radio operator',
]

If a key is a string which is also a valid identifier ([a-z][a-z0-9\-]*) then the quotes can be omitted.

[
  name: 'Alice',
  age: 23,
  job: 'Radio operator',
]

Keys are not limited to strings.

[
   0: 'zero',
   1: 'one',
   [0: 'a', 1: 'b']: 'a and b',
]

If a key is omitted, it will be assigned an integer value starting at 0. So the above can be written as:

[
   'zero',
   'one',
   ['a', 'b']: 'a and b',
]

All omitted keys must come before all explicit keys, so the below is invalid.

[
   'zero',
   'one',
   ['a', 'b']: 'a and b',
   'illegal!'
]

representation

Each representation maps from some subset of all possible notations to a binary representation in memory. For example, numbers could be represented as integers or as floats:

// 64 bit signed integer
i64[2]

// 64 bit float 
f64[2]

Every notation has a default representation which doesn't need to be explicitly written:

Numbers are f64 if they have a decimal point and i64 otherwise.
Strings are string (utf8 bytes).
Objects are structs.

So 2 is equal to i64[2] but 2.0 is equal to f64[2].

Structs have a fixed set of keys which are laid out contiguously in memory in some implementation-defined order.

// ok
struct[
  name: string,
  age: i64,
  job: string,
][
  name: 'Alice',
  age: 23,
  job: 'Radio operator',
]

// error: cannot convert 23.5 to i64
struct[
  name: string,
  age: i64,
  job: string,
][
  name: 'Alice',
  age: 23.5,
  job: 'Radio operator',
]

// error: key 'job' is missing from [name: 'Alice', age: 23]
struct[
  name: string,
  age: i64,
  job: string,
][
  name: 'Alice',
  age: 23,
]

// error: key 'height' is missing from struct[name: string, age: i64, job: string]
struct[
  name: string,
  age: i64,
  job: string,
][
  name: 'Alice',
  age: 23,
  job: 'Radio operator',
  height: 152
]

// ok
struct[0: f64, 1: f64][0: 3.5, 1: 2.7]

// ok
struct[f64, f64][3.5, 2.7]

Lists may only contain keys that are consecutive integers starting at 0 and are laid out contiguously in memory.

// ok
list[string][]

// ok
list[string]['a', 'b', 'c']

// error: cannot convert 3 to string
list[string]['a', 'b', 3]

// ok
list[string][0: 'a', 1: 'b', 2: 'c']

// error: list would have a gap at key 2
list[string][0: 'a', 1: 'b', 3: 'c']

Maps only constrain the representation of their keys and values.

// ok
map[i64, string][]

// ok
map[i64, string][0: 'zero', 1: 'one']

// ok
map[i64, string]['zero', 'one']

// error: cannot convert 0 to string
map[string, string][0: 'zero', 1: 'one']

// error: cannot convert 0 to string
map[string, string]['zero', 'one']

A union represents one of a finite number of single-field objects, represented in memory by an integer tag.

// ok
union[strings: string, nums: i64][strings: 'hello']

// error: cannot convert 'hello' to i64
union[strings: string, nums: i64][nums: 'hello']

// error: union[strings: string, nums: i64] is missing key floats:
union[strings: string, nums: i64][floats: 3.14]

That's probably not all the reprs I'll ever need, but it's a good start.

equality and equivalence

To say that objects may not have repeated keys we have to define equality. So:

Two values are equal only when they have equal reprs (this makes equality much cheaper - important in hashmaps).
Two numbers (with the same repr) are equal when they have the same digits in their representation, ignoring leading and trailing zeroes (insert worrying floating-point handwave here).
Two strings (with the same repr) are equal when that have the same sequence of codepoints (ie ignoring normalization).
Two objects (with the same repr) are equal when they have their sets of key-value pairs are equal.
Two reprs are equal when they begin with the same keyword and, if parameterized, their parameters are equal objects.
- i64 != f64
- list[string] == list[string]
- list[string] != list[i64]
- struct[string, i64] == struct[0: string, 1: i64]
- struct[name: string, age: i64] == struct[age: i64, name: string]

I'd like to also have a notion of equivalence which considers only notation. I think that it can be defined as a ~= b iff convert(a, repr(b)) == [some: b] and that this should be symmetric. The intuition is that if the conversion fails then the notation for a can't be represented by repr(b) so they can't possibly have the same notation.

nominal-ish types

Sometimes in languages like erlang, which has a universal notation and no nominal types, I'll run across some data and have no idea how to interpret it. Notation alone is not enough.

324142245123

We can provide extra context without any extra memory cost by using a single-key struct.

[java-util-Instant: 324142245123]

Now we know how this is intended to be interpreted, and we can't accidentally use it somewhere that expects an integer.

It would be nicer if we could the use the above as the in-memory representation, but have a different representation for the human reader:

[java-util-Instant: '1980-04-09T15:30:45.123Z']

This would effectively provide a mechanism for extending the (currently fixed) set of representations. But suppose we copy that into a repl - would it be represented as-is or would the repl somehow know to convert it to the millis representation internally. We're back where we started with nominal types - either we have to have a global namespace with only one version of each type, or we risk having different interpretations in different contexts (eg repl vs debugger).

elsewhere

The separation of representation and notation was inspired by zig's anonymous literals. There is a kind of structurally-typed universal notation lurking within zig, as evidenced by zig object notation. But it can't be fully realized because a systems language has to deal with pesky annoyances like circular data structures and pointers to unmapped memory.

The idea of a universal notation is most well known through json in javascript, but there it struggles when mixed with classes, which are effectively a kind of nominal type.

Nominal types are handled better in clojure, whose tagged literals are the inspiration for the above thoughts on nominal-ish types. Clojure accepts the context problem - being based on the jvm it needs to have a way to interact with nominal java types anyway.

Unison identifies that the root problem is the mapping from names to types. If we instead give each type a unique id and serialize the id, we can effectively have a global namespace without having to worry about name collisions. This does then require a certain amount of tooling to be able to write code - noone is memorizing all the unique ids. But it effectively moves the context problem out of the language and into the auto-completer, where it seems more manageable.

The authors of zed have written about the difficulties caused by needing context in order to work with data at all. For example, you might have a big pile of json data that you want analyze using sql. But in order to load it into a sql database you first have to specify a schema. How can you figure out what schema would fit your data? If only you had some sort of tool you could use to analyze your data...

One useful feature of nominal types is that you can use private constructors to enforce invariants. Being able to prove an invariant in a tiny section of your codebase and then assume it everywhere else is really powerful. But if you change the code you might change the invariant, making this fundamentally incompatible with live code reloading. An alternative model is found in databases. If we squint at sql constraints in the right light we can see that they associate invariants with a variable rather than with a type. Accordingly, sql databases are quite happy to change constraints without deleting and recreating all their data. What might that look like in a programming language?