Zest: syntax

Published 2024-04-16

(This is part of a series on the design of a language. See the list of posts here.)

Popular advice for designing a language is to focus on semantics and worry about syntax later. So it might seem ill-advised to write about syntax before writing about semantics. But a) I think syntax design is underrated - it has a huge impact on the subjective feel of working with a language, and b) it's hard to write about semantics without first explaining the syntax.


What do I want from syntax?

Familiarity and easy skimming are somewhat in tension, because Algol-family languages use each symbol for a wide range of different roles.

I'd like to make these categories more thematically aligned but that often conflicts with muscle-memory from other languages. I've mostly ended up with fairly sensible categories without introducing too much novelty. The exception is the ambiguity between field access and decimal points: using . for field access was too ingrained in my muscle memory to give up.


Numbers are as expected. I haven't made decisions yet about syntax for binary, octal, hexadecimal, scientific notation etc.

42
3.14

' is for strings. Slightly faster to type than ".

'Hello world'

Strings can't contain literal newlines.

// This is a syntax error!
'Hello
world'

This means that no tokens in the language can span a newline, making incremental syntax highlighting much easier.
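
A rough Python sketch of why this helps (not Zest's actual highlighter; the token categories and the `tokenize_line` and `Highlighter` names are made up for the example). If no token can span a newline, each line can be tokenized with no state carried over from the line before, so an editor only needs to re-tokenize the lines that were edited.

```python
import re

# Hypothetical token categories, just enough for the examples in this post.
TOKEN_RE = re.compile(r"""
    (?P<comment>//.*)
  | (?P<string>'[^'\n]*')
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<name>[a-z][a-z0-9]*(?:-[a-z0-9]+)*)
  | (?P<space>\s+)
  | (?P<symbol>.)
""", re.VERBOSE)


def tokenize_line(line):
    """Tokenize one line in isolation; no state is carried between lines."""
    return [(m.lastgroup, m.group())
            for m in TOKEN_RE.finditer(line)
            if m.lastgroup != "space"]


class Highlighter:
    def __init__(self, text):
        self.lines = text.split("\n")
        self.tokens = [tokenize_line(line) for line in self.lines]

    def edit_line(self, index, new_line):
        # Only the edited line is re-tokenized; the tokens of every
        # other line are still valid because no token crosses a newline.
        self.lines[index] = new_line
        self.tokens[index] = tokenize_line(new_line)
```

With multi-line strings or block comments, an edit on one line could change how every later line is tokenized, so the highlighter would have to track state across lines.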

I've been convinced that it's valuable to allow arbitrary non-utf8 byte-strings using integer escapes but I haven't decided on the syntax yet. (The default string type would still be utf8.)


Names are lower-case and kebab-case.

Arithmetic operators must be surrounded by whitespace, so a-b is a single name while a - b is a subtraction.

Snake-case would have avoided the ambiguity, but is a little slower to type.
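
A minimal tokenizer sketch in Python showing how the whitespace rule can separate kebab-case names from subtraction. The exact lexical rules here (which characters a name may contain, which operators exist) are assumptions for illustration, not taken from the Zest implementation.

```python
import re

# Assumed lexical rules: a name is lower-case words joined by single hyphens.
NAME_RE = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")
OPERATORS = "+-*"


def tokenize(src):
    tokens, pos = [], 0
    while pos < len(src):
        ch = src[pos]
        if ch.isspace():
            pos += 1
            continue
        m = NAME_RE.match(src, pos) or NUMBER_RE.match(src, pos)
        if m:
            kind = "name" if ch.isalpha() else "number"
            tokens.append((kind, m.group()))
            pos = m.end()
        elif ch in OPERATORS:
            # An operator only counts if it has whitespace on both sides.
            before = pos > 0 and src[pos - 1].isspace()
            after = pos + 1 < len(src) and src[pos + 1].isspace()
            if not (before and after):
                raise SyntaxError(
                    f"operator {ch!r} at offset {pos} needs whitespace on both sides")
            tokens.append(("op", ch))
            pos += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r} at offset {pos}")
    return tokens
```

Here `tokenize("a-b")` yields a single name token, `tokenize("a - b")` yields three tokens, and `tokenize("a -b")` is rejected.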


= is for binding names to values.

x = 1
y = x + 1

I considered reusing : for binding as well as for struct keys (below) but:


[] is for compound literals, which are all built out of structs.

This is a struct:

['name': 'Alice', 'role': 'Example person']

If a key is a string which is also a valid name, then the ' can be omitted.

[name: 'Alice', role: 'Example person']

If we actually want to use a variable for a key we can disambiguate.

k = 'name'
[{k}: 'Alice', role: 'Example person']

Struct keys can be any value, not just strings.

[0: 'zero', 'one': 1, ['three', 'four']: 34]

If no key is given for an element, then it defaults to a consecutive integer.

['a', 'b', 'c'] == [0: 'a', 1: 'b', 2: 'c']

This avoids needing different syntax for structs vs tuples - they're not fundamentally different.
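
The default-key rule can be modelled in a few lines of Python. This is only a sketch: `Keyed` and `make_struct` are hypothetical names, the counter is assumed to advance only for elements without an explicit key, and rejecting duplicate keys is also an assumption. Python dicts also require hashable keys, so struct-valued keys are out of scope here.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Keyed:
    """An element written as `key: value` in a struct literal."""
    key: Any
    value: Any


def make_struct(*elements):
    struct, next_index = {}, 0
    for element in elements:
        if isinstance(element, Keyed):
            key, value = element.key, element.value
        else:
            # No key given: use the next consecutive integer.
            key, value = next_index, element
            next_index += 1
        if key in struct:
            raise KeyError(f"duplicate key {key!r}")
        struct[key] = value
    return struct
```

Under these assumptions `make_struct('a', 'b', 'c')` and `make_struct(Keyed(0, 'a'), Keyed(1, 'b'), Keyed(2, 'c'))` build the same value, which is the sense in which tuples are just structs.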

I also added some sugar for the common case where the key and value are the same variable.

[:foo] == [foo: foo]

And of course I allow trailing commas.

[
  0,
  1,
  2,
]

Types are first-class and created by a few keywords.

// types
i32
f32
string

// type constructors
struct
union
list
map

Combining a type constructor and a struct produces a type.

struct[name: string, role: string]

struct[i32, i32] == struct[0: i32, 1: i32]

list[i32]

map[i32, string]

union[nums: list[i32], strings: list[string]]

Combining a type and a value attempts to cast the value to the type. This is how all values other than numbers/strings/structs are constructed.

list[string]['a','b','c']

map[string, i32]['one': 1, 'two': 2]

result = union[ok: i32, error: string]
result[ok: 4]
result[error: 'oh no!']

// type error: expected i32, found string
result[ok: 'oh no!']

This is also how values are printed!

print(result[ok: 4])
// prints:
// union[ok: i32, error: string][ok: 4]

The printing is smart enough to avoid printing redundant types.

results = list[result]
print(results[[ok: 4], [error: 'oh no!']])
// prints:
// list[union[ok: i32, error: string]][[ok: 4], [error: 'oh no!']]

For any value that doesn't contain a mutable reference or a function, you can paste the printout back into your code and get a value that compares equal to the original.


Struct fields are accessed with a dot (.).

example = [name: 'Alice', role: 'Example person']
example.'name' == 'Alice'

tuple = ['a', 'b', 'c']
tuple.0 == 'a'

The syntax sugar from structs also applies here.

example = [name: 'Alice', role: 'Example person']
example.name == 'Alice'

tuple = ['a', 'b', 'c']
i = 0
tuple.i // error: key 'i' not found in ['a', 'b', 'c']
tuple.{i} == 'a'

This produces an ambiguity with decimal points which I manually resolve in the grammar.

x.0.1 == {x.0}.1
x.0.1 != x.{0.1}
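
One way to resolve this in a hand-written parser, sketched in Python. The rule assumed here is that a decimal literal is only recognized at the start of an expression, while digits directly after a . are always an integer key. The AST shape and the `parse_path` name are invented for the example, and the real grammar may resolve the ambiguity differently.

```python
import re

FLOAT_RE = re.compile(r"\d+\.\d+")
INT_RE = re.compile(r"\d+")
NAME_RE = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")


def parse_path(src):
    """Parse `atom(.key)*` into nested ("get", object, key) tuples."""
    pos = 0

    def expect(char):
        nonlocal pos
        if not src.startswith(char, pos):
            raise SyntaxError(f"expected {char!r} at offset {pos}")
        pos += 1

    def parse_group():
        expect("{")
        inner = parse_expr()
        expect("}")
        return inner

    def parse_first(rules):
        nonlocal pos
        for regex, build in rules:
            m = regex.match(src, pos)
            if m:
                pos = m.end()
                return build(m.group())
        raise SyntaxError(f"unexpected input at offset {pos}")

    def parse_atom():
        if src.startswith("{", pos):
            return parse_group()
        # A decimal point is only tried at the start of an expression.
        return parse_first([(FLOAT_RE, float), (INT_RE, int),
                            (NAME_RE, lambda name: ("var", name))])

    def parse_key():
        if src.startswith("{", pos):
            return parse_group()
        # After a `.`, digits are an integer key and a bare name is a string key.
        return parse_first([(INT_RE, int), (NAME_RE, str)])

    def parse_expr():
        nonlocal pos
        expr = parse_atom()
        while src.startswith(".", pos):
            pos += 1
            expr = ("get", expr, parse_key())
        return expr

    expr = parse_expr()
    if pos != len(src):
        raise SyntaxError(f"unexpected input at offset {pos}")
    return expr
```

With this rule `parse_path("x.0.1")` gives the same tree as `parse_path("{x.0}.1")`, and a float key has to be written explicitly as `x.{0.1}`.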

I originally used / for both fields and field access.

example = [/name 'Alice', /role 'Example person']
example/name == 'Alice'

This avoids the ambiguity with decimal points and I also think it worked better visually. Plus the similarity to filesystem paths provides a natural interpretation. But I kept typing example.name anyway. Muscle memory is a bitch.


This is a function call:

push(list, elem)

Function arguments use the exact same syntax as structs. So we get named arguments for free, and function arguments can be represented by structs.

get(salaries, 'Alice', default: 0)

// Equivalently:
call(get, [salaries, 'Alice', default: 0])

This is a function definition:

get = (obj, key, :default) if has(obj, key) obj.{key} else default

Function parameters use the same syntax as function arguments and structs, but that syntax is interpreted as a pattern to match against (with what I'm just going to claim are the obvious semantics).

first = ([a, b]) a
first([1,2]) == 1

Patterns are also allowed in bindings:

[a, b] = [1, 2]
a == 1
b == 2
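
The post leaves the pattern semantics unstated, so the following Python sketch is only one plausible reading: a struct pattern matches a struct with exactly the same keys, a name binds whatever value sits at that key, and anything else must compare equal. `Var` and `match` are hypothetical names.

```python
class Var:
    """A name appearing in a pattern, to be bound to the matched value."""
    def __init__(self, name):
        self.name = name


def match(pattern, value, bindings=None):
    bindings = {} if bindings is None else bindings
    if isinstance(pattern, Var):
        if pattern.name in bindings:
            raise ValueError(f"name {pattern.name} is already bound")
        bindings[pattern.name] = value
    elif isinstance(pattern, dict):
        # Assumption: the value must have exactly the keys the pattern names.
        if not isinstance(value, dict) or pattern.keys() != value.keys():
            raise ValueError("struct keys do not match the pattern")
        for key, sub_pattern in pattern.items():
            match(sub_pattern, value[key], bindings)
    elif pattern != value:
        raise ValueError(f"expected {pattern!r}, found {value!r}")
    return bindings
```

Under this reading, the binding `[a, b] = [1, 2]` corresponds to `match({0: Var("a"), 1: Var("b")}, {0: 1, 1: 2})`, and a parameter list like `(obj, key, :default)` is the same kind of pattern applied to the struct of arguments.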

In many languages functions can also be associated with types - an expression like foo.bar() would look for the function bar associated with type-of(foo).

This doesn't really work nicely in a structurally-typed language so I instead added an operator / for uniform function call syntax.

salaries/get('Alice', default: 0) == get(salaries, 'Alice', default: 0)

Using / rather than reusing . avoids the obnoxious ambiguity between calling a function in UFCS-style vs calling a function stored in a field.

confusing = [get: (key, :default) 42]

confusing/get('Alice', default: [error: 'not found']) == [error: 'not found']

confusing.get('Alice', default: [error: 'not found']) == 42
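
The same distinction, modelled in Python with dicts standing in for structs. The `ufcs` helper is made up for the example; `get` and `has` mirror the definitions used earlier in this post.

```python
def has(obj, key):
    return key in obj


def get(obj, key, default):
    return obj[key] if has(obj, key) else default


def ufcs(receiver, function, *args, **named):
    """Model of `receiver/function(args)`: the receiver becomes the first argument."""
    return function(receiver, *args, **named)


# A struct that happens to store a function under the key 'get'.
confusing = {"get": lambda key, default: 42}

# confusing/get(...) calls the free function `get` with `confusing` as its first argument.
via_slash = ufcs(confusing, get, "Alice", default={"error": "not found"})

# confusing.get(...) looks up the field 'get' and calls whatever is stored there.
via_dot = confusing["get"]("Alice", default={"error": "not found"})
```

Here `via_slash` is the default value and `via_dot` is 42, because the two spellings never compete for the same lookup.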

When serializing a large value as text, we want the type to come first to allow efficient parsing.

list[i32][0,1,2,...]

But in patterns we want the type to come after the variable for better readability.

To support both, I allow using types as functions.

list[i32][0,1,2] == list[i32]([0,1,2]) == [0,1,2]/list[i32]

In patterns, applying a type to a parameter acts as a test that the parameter's value can be cast to that type.

first = (x/list[i32]) x/get(0)

first([1,2,3]) == 1
first(['a','b','c']) // type error: expected i32, found string

{} is used for blocks. Multiple expressions within a block are separated either by newlines or ;.

do-twice = (f) {
  f()
  f()
}

do-twice = (f) {f(); f()}

A block with a single expression can be used for grouping.

x - {y + z}

The value of a block is the value of the last expression.

The empty block {} returns the empty struct [].


'Expressions separated by newlines' sounds awfully like JavaScript's disastrous semicolon insertion.

I avoid ambiguities by starting with the rule that expressions can't span newlines, and then adding exceptions only in places where it is obvious that the current expression hasn't finished:

// ok
inc = (x) x+1

// syntax error
inc = (x)
  x+1

// ok
inc = (x) {
  x+1
}

// ok
foo/bar()/quux()

// syntax error
foo
  /bar()
  /quux()

// ok
foo/
  bar()/
  quux()

// syntax error
foo/ bar()/ quux()

This does make function chaining look less nice, but it feels worth it to avoid both semicolons and semicolon-insertion.
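
A simplified Python sketch of the rule, under two assumed exceptions: a newline ends the current expression unless a bracket is still open or the line ends with the chaining operator /. It ignores strings, comments and ; separators, and `split_expressions` is a made-up name, so treat it as an illustration rather than the real parser.

```python
OPENERS, CLOSERS = "([{", ")]}"


def split_expressions(source):
    """Group lines into expressions; each expression is a list of its lines."""
    expressions, current, depth = [], [], 0
    for line in source.split("\n"):
        text = line.strip()
        if not text:
            continue  # blank lines neither end nor extend an expression here
        current.append(text)
        depth += sum(ch in OPENERS for ch in text)
        depth -= sum(ch in CLOSERS for ch in text)
        if depth < 0:
            raise SyntaxError("unmatched closing bracket")
        if depth > 0 or text.endswith("/"):
            continue  # obviously unfinished: keep reading lines
        expressions.append(current)
        current = []
    if current:
        raise SyntaxError("input ended in the middle of an expression")
    return expressions
```

A leading / on the next line is not a continuation under this rule: `split_expressions("foo\n  /bar()")` returns two expressions, and the second one starts with an operator, which a parser would then reject.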


@ is (tentatively) for mutable references.

// A mutable reference initialized to the value [x: 0, y: 1]
vec @= [x: 0, y: 1]

// An assignment
@vec.x = vec.x + 1

// Equivalently:
@vec.x += 1

// Equivalently:
inc = (@x) @x += 1
inc(@vec.x)

Mutable references in Zest don't behave like mutable variables in Java-like languages or pointers in C-like languages (more on that in later posts). So I deliberately picked a foreign syntax to avoid false familiarity.

I don't allow shadowing names, so forgetting the syntax for assignment doesn't cause bugs:

x @= 0

// ok
@x = y

// error: name x is already bound
x = y 

There are only a few groups for precedence rules:

  1. Binding (=).
  2. Binary operators (==, + etc).
  3. Chaining operators (., /, function calls etc).
  4. Mutable paths.

All are left-associative. Binary operators must be separated by spaces and can't be mixed. Chaining operators must not be separated by spaces and can be mixed.

x = y.z + 1
// parses as
x = {{y.z} + 1}

a + b + c
// parses as
{a + b} + c

// syntax error: ambiguous precedence between + and -
a + b - c

a/f().b[c]
// parses as
{{f(a)}.b}[c]
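
The binary-operator rules are small enough to sketch in Python. The assumptions are that operands contain no spaces (so a chained expression like y.z is already a single operand) and that the operator set below stands in for the real one; `parse_binary` is a hypothetical name.

```python
BINARY_OPS = {"==", "+", "-", "*"}  # assumed set, for illustration only


def parse_binary(src):
    # Binary operators must be surrounded by spaces, so split() separates
    # operands from operators; chained operands contain no spaces.
    tokens = src.split()
    if len(tokens) % 2 == 0:
        raise SyntaxError("expected operands separated by binary operators")
    if tokens[0] in BINARY_OPS:
        raise SyntaxError(f"expected an operand, found {tokens[0]}")
    expr, first_op = tokens[0], None
    for i in range(1, len(tokens), 2):
        op, operand = tokens[i], tokens[i + 1]
        if op not in BINARY_OPS or operand in BINARY_OPS:
            raise SyntaxError(f"unexpected token near {op}")
        if first_op is None:
            first_op = op
        elif op != first_op:
            raise SyntaxError(f"ambiguous precedence between {first_op} and {op}")
        expr = (op, expr, operand)  # left-associative
    return expr
```

`parse_binary("a + b + c")` nests to the left, and `parse_binary("a + b - c")` raises the ambiguous-precedence error instead of picking a winner.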

The last group, mutable paths, is a slightly awkward hack to make @ and / interact nicely:

@foo.bar/push(thing)
// parses as
push(@foo.bar, thing)
// rather than
@push(foo.bar, thing)

I'm tempted to not have short-circuiting boolean operators at all. The syntax for closures isn't too noisy:

// more binary operators?
{w == x} and {{x == y} or {y == z}}

// or just use functions?
and(w == x, () or(x == y, () y == z))

The syntax is practically LL(k). The AST is produced by a recursive descent parser with only one token of lookahead (arguably two if you count space/newline tokens). But patterns are parsed as if they were expressions, even though only a subset of the expression syntax is supported in patterns. Invalid patterns get reported during semantic analysis instead of parsing, but arguably should be considered part of the grammar.

// error: field access (.) may not be used in a pattern
x.y = 42

// error: `get` is a function and may not be used in a pattern
get(a,b) = 42

I think this layered approach is still pretty tooling-friendly though. An LL(k) parser is good enough for most IDE interactions. The additional rules are easy to check with a single pass over the AST.
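
A sketch of what such a pass could look like in Python, over a made-up tuple-based AST (the node kinds "name", "literal", "struct", "get" and "call" are assumptions, not Zest's real AST). It ignores type tests like x/list[i32], which a real checker would have to let through.

```python
def check_pattern(node):
    """Walk a parsed expression once and reject forms not allowed in patterns."""
    kind = node[0]
    if kind in ("name", "literal"):
        return
    if kind == "struct":
        # node[1] is a list of (key, value_node) pairs.
        for _key, value in node[1]:
            check_pattern(value)
        return
    if kind == "get":
        raise SyntaxError("field access (.) may not be used in a pattern")
    if kind == "call":
        raise SyntaxError("function calls may not be used in a pattern")
    raise SyntaxError(f"{kind} may not be used in a pattern")
```

The parser stays a plain expression parser; the left-hand side of a binding, or a parameter list, is handed to `check_pattern` afterwards.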

It's also pretty human-friendly. The amount of grammar you need to remember is much less than if patterns and paths had separate syntax. That makes it easier to visually parse code with a quick glance.