Baby's second wasm compiler

(This is part of a series on the design of zest. See the list of posts here.)

The zest compiler today is ~4500 loc with no dependencies except the zig standard library.

It generates wasm of similar quality to an llvm at -O0, although I have some ideas to try later that I hope will push the output closer towards -O1.

There are still some low-hanging optimizations to pick off in the compiler, but within the next year or so I'm expecting compile times to be similar to zig's new non-llvm wasm backend - somewhere around ~10x faster than an llvm release build and ~3x faster than an llvm debug build.

The language itself is far from complete, but the codegen is now basically done. It supports integers, structs, basic control flow, arithmetic, comparisons, mutable references (2nd class, interior), functions (non-recursive), closures, destructuring, type inference, compile-time evaluation, function specialization etc.

The plan for the rest of the core language features - allocation, strings, list, maps, etc - to be mostly implemented in zest itself. This is standard for low-level languages but less common for a high-level language - the only examples I know of are virgil and the various java-in-java research projects (EDIT matklad pointed out the go runtime). In my first compiler I linked to a runtime written in zig, but the linking was a pita and having to pull in a entirely different language just for the runtime made my minimalist heart ache.

Writing the runtime in the same language will also allow optimization across the runtime boundary, so I can type-specialize runtime functions and eliminate dead runtime code eg if you write a zest program that doesn't use heap allocation then the resulting wasm won't contain allocator code.

The compiler pipeline looks like:

parse
- LL(1)
- produces sir (syntax intermediate representation)
desugar
- resolves names
- converts closures into regular functions that take an extra argument
- desugars destructuring into imperative code
- replaces mutable variables with ref values
- creates wrappers for each function that can be called dynamically
- inserts staging annotations on struct keys
- inserts asserts to prevent mutable references from escaping
- produces dir (dynamic intermediate representation)
eval
- an interpreter for the dynamically-typed dialect of zest
infer
- infers types of each expression
- specializes functions to their argument types
- does compile-time evaluation by calling into eval
- replaces struct keys with static offsets
- produces tir (typed intermediate representation)
generate
- inlines function wrappers
- propagates constants
- eliminates dead branches and loops
- chooses how to represent values (eg on shadow stack vs in a wasm local)
- tracks the wasm stack to avoid unnecessary locals
- produces wasm

Rather than a linear ir, I stuck to a tree of expressions with explicit variables, like wasm. This makes it very easy during codegen to distinguish between an expression (consumed exactly once, in stack order) and variables (consumed zero or more times, at any point in the scope). I'll explain later why that's useful.

Rather than use pointers or indices to refer to child expressions, I store the tree as a pre-and-post-order traversal, so each branch has both a begin and end tag. It looks like this:

--- SOURCE ---
a = 3
a + 1

--- TOKENS ---
zest.TokenData.name
zest.TokenData.space
zest.TokenData.=
zest.TokenData.space
zest.TokenData.number
zest.TokenData.newline
zest.TokenData.name
zest.TokenData.space
zest.TokenData.+
zest.TokenData.space
zest.TokenData.number
zest.TokenData.eof

--- SIR ---
block_begin
  let_begin
    i32 3
    indirect
    name a mut=false
  let_end
  block_last
  indirect
  call_builtin_begin
    indirect
    name a mut=false
    i32 1
  call_builtin_end zest.Builtin.add
block_end

--- DIR ---
main = f0
f0 = (closure)
  local l0
  return_begin
    assert_has_no_ref_begin
      block_begin
        local_let_begin
          assert_has_no_ref_visible_begin
            i32 3
          assert_has_no_ref_visible_end
        local_let_end l0
        block_last
        call_builtin_begin
          nop_begin
            local_get l0
          nop_end
          i32 1
        call_builtin_end zest.Builtin.add
      block_end
    assert_has_no_ref_end
  return_end

--- TIR ---
main = f0
f0 = (closure)
  local l0 /i32
  return_begin
    block_begin
      local_let_begin
        i32 3
      local_let_end l0
      block_last
      call_builtin_begin
        local_get l0
        i32 1
      call_builtin_end tir.BuiltinTyped.add_i32
    block_end
  return_end

--- WAT ---
(module
  (type (;0;) (func (result i32)))
  (func (;0;) (type 0) (result i32)
    (i32.add
      (i32.const 3)
      (i32.const 1)))
  (memory (;0;) 128)
  (global (;0;) (mut i32) (i32.const 8388608))
  (export "main" (func 0))
  (export "memory" (memory 0)))

This is very compact (or will be when I get around to removing padding and interning strings/types). Each expr needs 1-2 bytes of tags and 0-4 bytes of data. That data includes types, which I only need to store on a few tir exprs.

I'm not sure how this compares to zig. Each zig instruction has a 1 byte tag and 8 bytes of data. Instructions like call that take a variable number of operands pay an additional 4 bytes per operand. But in zig's later irs there are no explicit variables, whereas I have to pay 5 bytes for each local_let and local_get.

Compared to most other compilers though, both zest and zig have tiny irs.

Cranelift instructions are 16 bytes, plus optional extra operands, plus 16 bytes for their control flow data.

As for llvm, I don't have enough c++ knowledge to put a size on this class hierarchy but the next/prev/parent pointers alone must cost 24 bytes. If those are individually heap-allocated too then there is likely additional overhead from fragmentation.

This is kind of a fork in the road for ir design. If you want to do any kind of code motion or peephole optimization then you need an ir that allows cheap insertion and deletion, which usually means 2+ indices/pointers per instruction for the control-flow graph, and maybe additional indices/pointers to manage the operand lists of instructions like call that take a variable number of operands.

But if you want to go produce code really fast, you want to touch as little memory as possible, and touch it in prefetcher-friendly access patterns rather than jumping around linked lists.

Every book I could find on compilers was about going down the optimization fork. The other fork is less well travelled. I found just a smattering of useful papers:

(If you know other good resources, let me know!)

The big downside of my tree encoding is that I have to produce the tree in order. In most passes this is fine but during parsing it's a problem, so in sir I have a pay-as-you-go indirection via Expr{.indirect = offset}. This tells the consuming code to jump to the offset, consume a sub-tree, and then jump back to the next expr after the indirect.

Parsing a + 1 looks like this:

Consume a. Put Expr{.name = "a"} in the buffer.
Consume +. Move the contents of the buffer to the tree. Put Expr.call_builtin_begin, Expr{.indirect = 0} in the buffer.
Consume 1. Put Expr{.i32 = 1}, Expr{.call_builtin_end = .add} in the buffer.
Consume eof. Move the contents of the buffer to the tree. Put Expr{.indirect = 1} in the buffer.

And produces:

call_builtin_begin
  indirect
  name a mut=false
  i32 1
call_builtin_end zest.Builtin.add

This adds 5 bytes to the ir for every infix/postfix expression in the source, which is a little painful, and it also means that I touch every byte in sir twice during parsing. But it still produces fairly significant space savings vs using indices everywhere, and it also only applies to sir - dir and tir don't have an indirect expr.

Each pass is roughly linear in complexity and (aside from a few edge cases in desugar) touches each expr in its input only once.

Eval uses the tree like a stack machine, ignoring begin tags and accumulating values on a stack. Currently some of the control flow requires scanning forward over subtrees - I might want to store offsets in if_then, if_else and while_body to avoid that.

Type inference happens after function specialization, like zig, and is a simple bidirectional system that conservatively approximates the behaviour of the dynamic type checks in eval. It only deals with concrete types and there is no unification.

Stack overflows in compilers are a nuisance (eg), but transformations of trees of expressions are naturally recursive. Hand-converting them to a manual stack produces hideous code.

My compromise at the moment is that eval and infer use a manual stack, because both could overflow from errors in reasonable programs and I want to provide good errors in that case.

The other passses are recursive but their call stack is bounded by the nesting depth of the source. They only overflow if you are a honggfuzz and keep generating one million nested parentheses instead of finding real bugs. This is annoying but the workarounds are worse.

I did write a version of generate that ran in constant stack space but it was significantly harder to read - when I rewrote it to be recursive I immediately spotted three bugs!

In the long run I think the nicest solution is to self-host the compiler and then use wasmfx or similar to manually segment the stack in the recursive passes when the recursion depth hits some fixed limit.

In wasm you get one big heap of memory and some mutable local variables. Locals can only hold primitive values (integers, floats, etc) and can't be pointed to. So for languages that can create pointers to stack-allocated values the usual solution is to designate some fixed portion of the heap as the 'shadow stack' and then use a global variable to track the stack pointer.

For each value the compiler has to decide how to represent that value. Currently I have a very conservative strategy. Zest integers are represented directly by wasm integers. All other values are stored on the shadow stack. Mutable variables are always stored on the stack, even if they are integers.

My calling convetion is similar. Integer arguments to functions are passed directly. Other arguments are passed by reference. Functions that return integers do so directly. Functions that return other values take an extra argument pointing to where they should store their result.

This is definitely suboptimal, but it's simple and it's easy to change later. Most of the codegen is unaware of these rules - it's all encoded in:

fn wasmRepr(repr: Repr) WasmRepr {
    return switch (repr) {
        .i32, .ref => .{ .primitive = wasmAbi(repr) },
        .string, .@"struct", .@"union", .fun, .only, .repr => .heap,
    };
}

Here's an example:

// zest
f = (a) {
  b = 1
  c = a + b
  d = [c]
  [d]
}

;; wasm
;; param 0 is `a`
;; param 1 is the result pointer
(func (;1;) (type 1) (param i32 i32)
  (local i32 i32)

  ;; allocate 4 bytes on the shadow stack
  (local.set 3
    (i32.sub
      (global.get 0)
      (i32.const 4)))

  ;; `b` gets constant folded away
  
  ;; `c` is stored in a local
  (local.set 2
    (i32.add
      (local.get 0)
      (i32.const 1)))

  ;; `d` is stored on the shadow stack
  (i32.store align=1
    (local.get 3)
    (local.get 2))

  ;; `d` is copied into the result pointer
  (i32.store align=1
    (local.get 1)
    (i32.load align=1
      (local.get 3))))

The codegen revolves around this function:

fn genExpr(
    c: *Compiler,
    f: *wir.FunData,
    tir_f: tir.FunData,
    dest: wir.Destination,
) error{GenerateError}!wir.Walue {
    // ...
}

It consumes an expression tree, emits some wasm code to compute the value of that tree, and returns an abstract value - a description of where and how the result of that expr can be accessed. I call these abstract values Walues and I say it to myself in a wario voice.

pub const Walue = union(enum) {
    stack: wasm.Valtype,

    // local variables
    closure, 
    arg: tir.Arg,
    @"return", // value of the result pointer argument
    local: Local,

    // global variables
    shadow, // value of the shadow stack pointer global

    // literals
    i32: i32,
    @"struct": struct {
        repr: ReprStruct,
        values: []Walue, // may not contain .stack
    },
    fun: struct {
        repr: ReprFun,
        closure: *Walue, // may not contain .stack
    },

    // heap addresses
    value_at: struct {
        ptr: *Walue,
        repr: Repr,
    },
    add: struct {
        walue: *Walue,
        offset: u32,
    },
};

These walues allow the codegen to handle literals lazily. Eg if I call genExpr on a struct expression, it just returns a struct walue instead of actually emitting wasm to create the struct. If it turns out later that I actually do need to allocate the struct, I can do it then. If not, then I saved an allocation.

// zest
f = (options: [:length, :width]) length + width
f(
  // It's pointless to allocate this struct just to destructure it again
  options: [length: 4, width: 7],
)

;; wasm
;; func 0 is `main`
(func (;0;) (type 0) (result i32)
  ;; no stack allocations here - directly call `f` with constant arguments
  (call 1
    (i32.const 4)
    (i32.const 7)))

;; func 1 is `f`
(func (;1;) (type 1) (param i32 i32) (result i32)
  (i32.add
    (local.get 0)
    (local.get 1)))

This is where the distinction I mentioned earlier between expressions and variables becomes important. Generating walues lazily can cause more work if I do it more than once, or inside a loop. So a useful heuristic is to be lazy when generating expressions but eager when generating variables.

// zest
f = (options: [:length, :width]) length * width
// Should create options on the shadow stack here...
options = [length: 4, width: 7]
i mut = 0
while {i < 10} {
   // ...rather than lazily here!
   i@ = f(:options)
}

;; wat
(func (;0;) (type 0)
  (local i32 i32 i32 i32)

  ;; push stack frame
  (global.set 0
    (local.tee 0
      (i32.sub
        (global.get 0)
        (i32.const 12))))

  ;; copy `options` to the shadow stack
  (i32.store align=1
    (local.get 0)
    (i32.const 4))
  (i32.store offset=4 align=1
    (local.get 0)
    (i32.const 7))

  ;; initialize `i` on the shadow stack
  (i32.store offset=8 align=1
    (local.get 0)
    (i32.const 0))

  ;; while
  (block  ;; label = @1
    (loop  ;; label = @2

      ;; if not(i < 10) break
      (br_if 1 (;@1;)
        (i32.eqz
          (i32.lt_s
            (i32.load offset=8 align=1
              (local.get 0))
            (i32.const 10))))

      ;; tmp = f(:options)
      (local.set 1
        (i32.load align=1
          (local.get 0)))
      (local.set 2
        (i32.load offset=4 align=1
          (local.get 0)))
      (local.set 3
        (call 1
          (local.get 1)
          (local.get 2)))

      ;; i = tmp
      (i32.store offset=8 align=1
        (local.get 0)
        (local.get 3))

      ;; continue
      (br 0 (;@2;))))

  ;; pop stack frame
  (global.set 0
    (i32.add
      (local.get 0)
      (i32.const 12))))

Of course being eager can be wasteful if a variable is used only once, or only parts of it are used.

// Copying this to the stack was a waste.
options = [length: 4, width: 7]
f(:options)

// Didn't even need this struct - could have just constant-folded the whole thing.
options = [length: 4, width: 7]
area = options.length * options.width

But because I'm doing all this in a single pass I can't look ahead to see how a variable is used. I could record that information in a previous pass, but the constant-propagation and dead-code elimination that I do during generate can change how variables are used! Near the end of the post I have some ideas for how to improve this.

The codegen for the example above may have been surprising.

// zest
f = (options: [:length, :width]) length + width
f(options: [length: 4, width: 7])

;; wasm
;; func 0 is `main`
(func (;0;) (type 0) (result i32)
  ;; no stack allocations here - directly call `f` with constant arguments
  (call 1
    (i32.const 4)
    (i32.const 7)))

;; func 1 is `f`
(func (;1;) (type 1) (param i32 i32) (result i32)
  (i32.add
    (local.get 0)
    (local.get 1)))

f takes one argument, which is a struct. But the generate code takes two arguments which are both integers. This is because of the way I handle destructuring of function arguments. In the dir the example above actually gets rewritten to something like:

f = (length, width) length + width
f-wrapper = (args) {
  assert-is-struct(args)
  assert(args.len == 1)
  options = args.options
  assert-is-struct(options)
  assert(options.len == 2)
  length = options.length
  width = options.width
  f(length, width)
}
f-wrapper([options: [length: 4, width: 7]])

This means that the interpreter doesn't have to know about destructuring.

During type-checking all of the assertions are checked statically - either producing a type error or disappearing from the code entirely. So the tir looks like:

f = (length, width) length + width
f-wrapper = (args) {
  options = args.options
  length = options.length
  width = options.width
  f(length, width)
}
f-wrapper([options: [length: 4, width: 7]])

Finally, during codegen all wrapper functions are inline. Constant propagation fuses the creatioin and destructuring of structs, producing:

f = (length, width) length + width
f(4, 7)

Right now this is just a cutesy, but I'm planning to add functions to destructuring too, allowing for type casts, default arguments etc.

f = (options: [
  :length /i32 /else(0), 
  :width /i32 /else(0),
]) {
  length + width
}

These will get optimized away similarly. Type-checking will either generate a compile-time error or insert the cast code in the function wrapper. Then after inlining the cast ends up in the caller. So we can call f with lots of different integer types, as long as they can be cast to i32 safely, but we'll still only generate one specialization for f because all the casts happen in the caller.

When generating a variable declaration like a = 1, I record the walue and then return it later from any uses of a. This provides some basic constant-propagation through variables.

// zest
a = 1
a

;; wasm
(i32.const 1)

But for mutable variables the walue represents their address rather than their value.

// zest
a mut = 1
a

;; wasm
(i32.store align=1
  (local.get 0)
  (i32.const 1))
(i32.load align=1
  (local.get 0))

In fact, I only ever use walues to represent constant values. This prevents the mapping from variable to walue ever needing to change.

If I actually tracked the value of mutable variables, even abstractly, then I would have to be really careful around assignments, branches, and loops to track different versions of the same variable. This is the problem that static single assignment solves, but constructing ssa is expensive and most of the analyses over it require approximating fixpoints to deal with loops. I want to see how far I can get using purely linear passes.

(Also maybe using mutable value semantics will somewhat reduce the impact of not using ssa, since it encourages making most values deeply immutable?)

The other key part of genExpr is the Destination argument.

pub const Destination = union(enum) {
    nowhere,
    anywhere,
    stack,
    value_at: *Walue,
};

Rather than generating a walue from a child expr and then copying the walue somewhere else, we just tell the child what destination we're want it's result to end up at and then rely on the child to do the copying. This massively reduces the creation of intermediate values.

Most child exprs don't actually do the copying themselves though - they just return a lazy walue and rely on this catch-all handler at the end of genExpr:

switch (dest) {
    .nowhere => {
        return dropStack(c, f, result);
    },
    .anywhere => {
        return result;
    },
    .stack => {
        const repr = walueRepr(c, f, result);
        switch (wasmRepr(repr)) {
            .primitive => |valtype| {
                load(c, f, result);
                return .{ .stack = valtype };
            },
            .heap => {
                loadPtrTo(c, f, result);
                return .{ .value_at = .{
                    .ptr = c.box(wir.Walue{ .stack = .i32 }),
                    .repr = repr,
                } };
            },
        }
    },
    .value_at => |ptr| {
        store(c, f, result, ptr.*);
        return .{ .value_at = .{ .ptr = ptr, .repr = walueRepr(c, f, result) } };
    },
}

But some children can use the destination to avoid intermediate values. For example, when dereferencing a mutable reference we need to copy the contents somewhere to avoid introducing aliasing. But often we can copy to the destination directly instead of allocating stack space.

.ref_deref_begin => {
    const ref = try genExpr(c, f, tir_f, if (dest == .nowhere) .nowhere else .anywhere);
    const repr = take(c, tir_f).ref_deref_end;
    const value = wir.Walue{ .value_at = .{ .ptr = c.box(ref), .repr = repr } };
    switch (dest) {
        .nowhere, .stack, .value_at => {
            // Let the catch-all copy this into the destination.
            return value;
        },
        .anywhere => {
            // Must make a copy into a fresh stack slot to avoid passing around an aliased walue.
            return copy(c, f, value.value_at);
        },
    }
},

Places where destinations are created:

Assignments to mutable variables pass the variables address.
Returning from a function passes the result pointer.
Every expression in a block except the last passes .nowhere.

Places where destinations are used to avoid copies:

Function calls pass the destination as the result pointer.
Dereferencing mutable references copies into the destination.

Several exprs also check if the dest is .nowhere and, if so, avoid generating wasm.

The combination of walues and destinations also works well for chains of field accesses like thing.x.y.z. Generating thing returns a walue representing it's address, then generating thing.x just adds an offset to that address, and so on. No actual code gets generated until the walue meets a destination, at which point we can produce a single load or copy.

a mut = 0
b = [w: 1, x: [y: 2, z: 3]]
a@ = b.x.z

;; a mut = 0
(i32.store align=1
  (local.get 0)
  (i32.const 0))

;; b = [w: 1, x: [y: 2, z: 3]]
(i32.store offset=4 align=1
  (local.get 0)
  (i32.const 1))
(i32.store offset=8 align=1
  (local.get 0)
  (i32.const 2))
(i32.store offset=12 align=1
  (local.get 0)
  (i32.const 3))

;; a@ = b.x.z
(i32.store align=1
  (local.get 0)
  ;; a single load for b.x.z
  (i32.load offset=12 align=1
    (local.get 0)))

Zig has a similar approach to codegen. Their equivalent to walues is here. Their equivalent to destinations is result location semantics, which are actually part of the language semantics and so are handled much earlier in the compiler. By the time zig reaches air I don't think there are any struct literals remaining - they get decomposed into writes to the result location. Something like:

// source
x = .{.a = 1, .b = 2}

// air
x = undefined
x.a = 1
x.b = 2

This can cause some problems though:

var p = Pair{ .a = .{ 1, 2 }, .b = .{ 3, 4 } };
p = .{ .a = p.b, .b = p.a };
std.debug.print("{}\n", .{p});
// prints:
// test.Pair{ .a = { 3, 4 }, .b = { 3, 4 } }

The rewrite effectively did this:

// source 
p = .{ .a = p.b, .b = p.a };

// air
p.a = p.b
p.b = p.a

This is actually part of the language semantics, so they can just say don't do that. But as Martin Wickham explains in attack of the killer features when this is combined with other optimizations like converting pass-by-value to pass-by-reference or passing result locations into functions as result pointers, this kind of aliasing can appear in surprising places. An example like:

pub fn flip(p: Pair) Pair {
    return .{ .a = p.b, .b = p.a };
}

pub fn main() !void {
    var p = Pair{ .a = .{ 1, 2 }, .b = .{ 3, 4 } };
    p = flip(p);
    std.debug.print("{}\n", .{p});
}

Used to get rewritten to something like:

pub fn flip(p: *const Pair, result: *Pair) void {
    result.* = .{ .a = p.b, .b = p.a };
}

pub fn main() !void {
    var p: Pair = undefined;
    p[0][0] = 1; p[0][1] = 2; p[1][0] = 3; p[1][1] = 4;
    flip(&p, &p); // surprise aliasing!
    std.debug.print("{}\n", .{p});
}

But if I play around with the different backends now I get two different versions of the body of main:

// llvm-debug doesn't propagate result location to flip's result pointer
var p = Pair{ .a = .{ 1, 2 }, .b = .{ 3, 4 } };
var tmp: Pair = undefined;
flip(&p, &tmp);
p = tmp;

// zig-debug doesn't pass p by reference
var p = Pair{ .a = .{ 1, 2 }, .b = .{ 3, 4 } };
const tmp = p;
flip(&tmp, &p);

So the surprises seem to be limited to places in the source code where the aliasing is directly visible like p = .{ .a = p.b, .b = p.a }.

For zest I don't want these surprises even when they're directly visible. So I have a different solution - I don't propagate destinations through struct literals. So in zest the above example goes something like this:

// original
flip = (p) [p.1, p.0]
p mut = [[1,2],[3,4]]
p@ = flip(p)

// rewritten to:
flip = (p, result mut) {
  tmp0 = p.1
  tmp1 = p.0
  result.0@ = tmp0
  result.1@ = tmp1
}
p mut = undefined
{ p.0.0@ = 1; p.0.1@ = 2; p.1.0@ = 3; p.1.1@ = 4 }
flip(p@, p@)

In this case it results in the same amount of copying as the zig example. But if we pick an example with no functions:

// original
q@ = [p.1, p.0]

// rewritten to:
tmp0 = p.1 // pointless copy!
tmp1 = p.0 // pointless copy!
q.0@ = tmp0
q.1@ = tmp1

Now we have two pointless copies for something that would have been totally safe in zig.

In the long run I want to take advantage of the fact that mutable value semantics gives me very cheap alias analysis to detect when a copy is actually needed. That would produce something like:

// original
q@ = [p.1, p.0]
p@ = [p.1, p.0]

// rewritten to:
q.0@ = p.1
q.1@ = p.0
tmp = p.0 // copy before over-writing!
p.0@ = p.1
p.1@ = tmp

I actually got this working in the generate pass because it's easy to check if two walues could alias. But I realized that implementation wouldn't generalize once I added heap allocation and had to take reachability into account (eg the walues for some-list and some-list.{i} are distinct memory locations, but calling f(@some-list) can still mutate some-list.{i}). So I ripped that out and plan to reintroduce it in an earlier layer where I still have complete aliasing information.

In the meantime, the copying is not as pervasive as it might initially seem. Because walues can be struct literals, we only end up producing extra copies when a struct literal contains a dereference of a mutable variable or a call to a function that returns a struct.

p = [[1,2],[3,4]] // literals don't require copies
p2 = [p.1, p.0] // reads from non-mutable variables don't require copies
p3 = [p.0, [p.1.0, foo(p.1.1)]] // functions that return primitives don't require copies

Wasm loads and store require specifying an alignment. But the spec also says that this is just a hint - the backend is not allowed to trap if the address is not correctly aligned. I don't understand how this is useful to the backend. Would it test the alignment and switch between a fast path and a slow path? That doesn't seem worthwhile for what would usually be a single instruction?

For now I've just specified 1-byte alignment on everything, and I'm working on x86_64 where unaligned accesses are allowed and are barely slower than aligned accesses. But what happens if I run this on arm where unaligned accesses are still expensive? If I specify larger alignments will the backend actually be able to make use of that hint? If you know, please email me!

In my first compiler I used binaryen to 'save time' but in the end I spent a ton of time working around gaps in the binaryen api.

This time I just generated wasm myself by hand. It's easy! I only have ~200loc dedicated to producing wasm.

Debugging encoding mistakes is tedious, but luckily I don't mess up the encoding too often. Better tooling would help - gor example wasm2wat won't print any output if there is an encoding mistake anywhere in the file, even if I pass --no-check. I want a tool that prints output until it hits an encoding error and then prints out the next 20 or so bytes so I can see what nonsense I put in there.

I have a small pile of hand-written tests which are good for specifying behaviour, but for flushing out codegen bugs I get much more value from fuzzing. The interpreter that I use for compile-time evaluation can also run entire programs, so my fuzz test runs both the interpreter and compiler on the same program and asserts that if the program type-checks then the compiled program should produce the same result as the interpreter.

My compiler is written in zig which currently doesn't have any support for coverage instrumentation. I offered some cpu-days to black-box fuzzers but didn't find even the obvious bugs that I added to make sure fuzzing was working. So what I'm doing now instead is compiling zig to c, then compiling that with clang to get coverage instrumentation, and then doing coverage-guide fuzzing with honggfuzz.

This flushes out all kinds of weird bugs. My favourite so far was that I accidentally used the wrong integer encoding (unsigned instead of signed) for the size of the shadow stack frame. The encodings only differ for numbers over 128, so the fuzzer had to pack a lot of structs onto the stack to trigger that.

For running wasm programs I use deno, because the wasmtime cli can't yet pass or return values from entry points and zest doesn't have printing yet. For debugging I use chrome, because firefox's wasm debugger doesn't show the value stack.

I don't have any real zest programs to actually measure yet, but I can get a rough feel for the point on the compile-time/runtime tradeoff curve that I'm steering towards by comparing to zig.

I made this half-assed benchmark using the tokenizer from the zig standard library to count the number of tokens in a 10mloc zig file.

	compile time (ms)	runtime (ms)
llvm-release-x86_64	458	1850
llvm-debug-x86_64	149	4306
zig-debug-x86_64	65	19633
llvm-release-x86	645	1837
llvm-debug-x86	313	5129
zig-debug-x86	-	-
llvm-release-wasm32 + wasmtime	300 + 15	2456
llvm-debug-wasm32 + wasmtime	230 + 22	4063
zig-debug-wasm32 + wasmtime	37 + 19	12202

The numbers in bold are what I'm aiming towards. Somewhere in the vicinity of ~10x better compile times than llvm-release at the cost of ~2x worse runtime.

Why does zig-debug have such poor runtime compared to llvm-debug? Neither of them do any optimization before codegen, and wasmtime can see through many codegen mistakes. I haven't run any actual experiments, but picking through the output I see a couple of likely suspects.

The first suspect is slow copies. Here is a function that takes an array and just returns it:

const n = 2;
pub fn pass(p: [n]i32) [n]i32 {
    return p;
}

;; llvm-debug
(func $test.pass (type 1) (param i32 i32)
  (local i64)
  (local.set 2
    (i64.load align=4
      (local.get 1)))
  (i64.store align=4
    (local.get 0)
    (local.get 2))
  (return))

;; zig-debug
(func $test.pass (type 0) (param i32 i32)
  (memory.copy
    (local.get 0)
    (local.get 1)
    (i32.const 8))
  (return))

Llvm-debug generates an aligned 4-byte load and store, whereas zig-debug generates memory.copy which assumes 1-byte alignment and has to check for overlap.

If not targeting the bulk-memory extension then zig-debug instead emits multiple 1-byte loads and stores:

;; zig-debug (no bulk-memory)
(func $test.pass (type 0) (param i32 i32)
  (i32.store8
    (local.get 0)
    (i32.load8_u
      (local.get 1)))
  (i32.store8 offset=1
    (local.get 0)
    (i32.load8_u offset=1
      (local.get 1)))
  (i32.store8 offset=2
    (local.get 0)
    (i32.load8_u offset=2
      (local.get 1)))
  (i32.store8 offset=3
    (local.get 0)
    (i32.load8_u offset=3
      (local.get 1)))
  (i32.store8 offset=4
    (local.get 0)
    (i32.load8_u offset=4
      (local.get 1)))
  (i32.store8 offset=5
    (local.get 0)
    (i32.load8_u offset=5
      (local.get 1)))
  (i32.store8 offset=6
    (local.get 0)
    (i32.load8_u offset=6
      (local.get 1)))
  (i32.store8 offset=7
    (local.get 0)
    (i32.load8_u offset=7
      (local.get 1)))
  (return))

I reran the benchmark above with bulk-memory disabled:

	compile time (ms)	runtime (ms)
zig-debug-wasm32 (mvp+bulk-memory) + wasmtime	36 + 19	12202
zig-debug-wasm32 (mvp) + wasmtime	37 + 21	6120

Getting 2x better performance by disabling bulk-memory certainly suggests that zig-debug is using memory.copy too eagerly.

I think the code responsible is here - it seems like a quick fix. Some experiments show that llvm doesn't switch to memory.copy until n > 16. So for zest I do:

if (from_ptr.equal(to_ptr))
    return;

const byte_count = value_at.repr.sizeOf();
if (byte_count > 64) {
    // Emit memory.copy
    load(c, f, to_ptr);
    load(c, f, from_ptr);
    emitEnum(f, wasm.Opcode.i32_const);
    emitLebI32(f, @intCast(byte_count));
    emitEnum(f, wasm.Opcode.misc_prefix);
    emitLebU32(f, wasm.miscOpcode(wasm.MiscOpcode.memory_copy));
    emitLebU32(f, 0); // memory from
    emitLebU32(f, 0); // memory to
} else {
    // Break into loads and stores
    var offset: u32 = 0;
    const to_add = asAdd(c, to_ptr);
    const from_add = asAdd(c, from_ptr);
    while (offset < byte_count) {
        const remaining = byte_count - offset;
        inline for (.{
            .{ 8, wasm.Opcode.i64_load, wasm.Opcode.i64_store },
            .{ 4, wasm.Opcode.i32_load, wasm.Opcode.i32_store },
            .{ 2, wasm.Opcode.i32_load16_u, wasm.Opcode.i32_store16 },
            .{ 1, wasm.Opcode.i32_load8_u, wasm.Opcode.i32_store8 },
        }) |params| {
            const chunk, const load_op, const store_op = params;
            if (remaining >= chunk) {
                load(c, f, to_add.walue.*);
                load(c, f, from_add.walue.*);
                emitEnum(f, load_op);
                emitLebU32(f, 0); // align
                emitLebU32(f, from_add.offset + offset);
                emitEnum(f, store_op);
                emitLebU32(f, 0); // align
                emitLebU32(f, to_add.offset + offset);
                offset += chunk;
                break;
            }
        } else unreachable;
    }
}

This is what it looks like in practice:

// zest
pass = (c) c
pass([1])
pass([1,2])
pass([1,2,3])

;; wasm
;; 4 byte struct
(func (;1;) (type 1) (param i32 i32)
  (i32.store align=1
    (local.get 1)
    (i32.load align=1
      (local.get 0))))
;; 8 byte struct
(func (;2;) (type 1) (param i32 i32)
  (i64.store align=1
    (local.get 1)
    (i64.load align=1
      (local.get 0))))
;; 12 byte struct
(func (;3;) (type 1) (param i32 i32)
  (i64.store align=1
    (local.get 1)
    (i64.load align=1
      (local.get 0)))
  (i32.store offset=8 align=1
    (local.get 1)
    (i32.load offset=8 align=1
      (local.get 0))))

The second suspect is extra copies:

// zig

pub fn initA(c: i32) [1]i32 {
    return .{c};
}

pub fn initB(c: i32) [1]i32 {
    const tmp: [1]i32 = .{c};
    return tmp;
}

pub fn initC(c: i32) [1]i32 {
    const tmp = [1]i32{c};
    return tmp;
}

;; wasm

(func $test.initA (type 0) (param i32 i32)
  (local i32)

  ;; calculate offset of &result[0] from &result
  (local.set 2
    (i32.add
      (local.get 0)
      (i32.mul
        (i32.const 0)
        (i32.const 4))))
 
  ;; result[0] = 1
  (i32.store
    (local.get 2)
    (local.get 1))

  (return))

(func $test.initB (type 0) (param i32 i32)
  (local i32 i32 i32)

  ;; push stack frame
  (global.set $__stack_pointer
    (local.tee 3
      (i32.and
        (i32.sub
          (local.tee 2
            (global.get $__stack_pointer))
          (i32.const 16))
        (i32.const -16))))

  ;; calculate offset of &tmp[0] from &tmp
  (local.set 4
    (i32.add
      (local.get 3)
      (i32.mul
        (i32.const 0)
        (i32.const 4))))

  ;; set tmp[0] = 1
  (i32.store
    (local.get 4)
    (local.get 1))

  ;; tmp2 = tmp
  ;; this is an unnecessary copy!
  (memory.copy
    (i32.add
      (local.get 3)
      (i32.const 4))
    (local.get 3)
    (i32.const 4))

  ;; result = tmp
  (memory.copy
    (local.get 0)
    (i32.add
      (local.get 3)
      (i32.const 4))
    (i32.const 4))

  ;; pop stack frame
  (global.set $__stack_pointer
    (local.get 2))
  (return))


(func $test.initC (type 0) (param i32 i32)
  (local i32 i32)

  ;; push stack frame
  (global.set $__stack_pointer
    (local.tee 3
      (i32.and
        (i32.sub
          (local.tee 2
            (global.get $__stack_pointer))
          (i32.const 16))
        (i32.const -16))))

  ;; tmp[0] = 1
  ;; (for some reason the offset is computed statically here)
  (i32.store
    (local.get 3)
    (local.get 1))

  ;; result = tmp
  (memory.copy
    (local.get 0)
    (local.get 3)
    (i32.const 4))

  ;; pop stack frame
  (global.set $__stack_pointer
    (local.get 2))
  (return))

Where is this extra copy in initB coming from? I'm not 100% sure, but it seems to happen whenever casting a literal into a const result location. This is pretty common in idiomatic zig. I don't have casts yet, but I'll keep an eye out for that when I add them.

// zest
init-b = (c) {
  tmp = [c]
  tmp
}

;; wasm
(func (;1;) (type 1) (param i32 i32)
  (local i32)

  ;; push stack frame
  ;; (don't need to set global stack pointer, since this function doesn't call any others)
  (local.set 2
    (i32.sub
      (global.get 0)
      (i32.const 4)))

  ;; tmp[0] = c
  (i32.store align=1
    (local.get 2)
    (i32.load align=1
      (local.get 0)))

  ;; result[0] = c
  (i32.store align=1
    (local.get 1)
    (i32.load align=1
      (local.get 2))))

I also noticed that llvm-debug builds often produce weird pointless stores:

// zig
pub fn set(p: *[1]i32, c: [1]i32) void {
    p.* = c;
}

;; wasm
(func $test.set (type 1) (param i32 i32)
  (local i32 i32 i32 i32 i32 i32)
  (local.set 2
    (global.get 0))
  (local.set 3
    (i32.const 16))
  (local.set 4
    (i32.sub
      (local.get 2)
      (local.get 3)))
  (global.set 0
    (local.get 4))
  ;; this store is never used!
  (i32.store offset=12
    (local.get 4)
    (local.get 0))
  (local.set 5
    (i32.load
      (local.get 1)))
  (i32.store
    (local.get 0)
    (local.get 5))
  (local.set 6
    (i32.const 16))
  (local.set 7
    (i32.add
      (local.get 4)
      (local.get 6)))
  (global.set 0
    (local.get 7))
  (return))

If I build with -fstrip these pointless stores disappear, along with many of the extra locals:

(func $test.set (type 1) (param i32 i32)
  (local i32)
  (local.set 2
    (i32.load
      (local.get 1)))
  (i32.store
    (local.get 0)
    (local.get 2))
  (return))

And in the llvm ir the pointless stack slot is referenced by the debug metadata:

; Function Attrs: noredzone nounwind
define internal fastcc void @test.set(ptr nonnull align 4 %0, ptr nonnull readonly align 4 %1) unnamed_addr #0 !dbg !157 {
Entry:
  %2 = alloca ptr, align 4
  store ptr %0, ptr %2, align 4, !dbg !165
  call void @llvm.dbg.declare(metadata ptr %2, metadata !166, metadata !DIExpression()), !dbg !165
  call void @llvm.dbg.declare(metadata ptr %1, metadata !167, metadata !DIExpression()), !dbg !165
  call void @llvm.memcpy.p0.p0.i32(ptr align 4 %0, ptr align 4 %1, i32 4, i1 false), !dbg !168
  ret void, !dbg !168
}

So I assume the difference is something to do with making values visible to dwarf calculations. I don't know how dwarf works with wasm.

All the numbers in the benchmarks above are with -fstrip. Turning it off trashes compile times (except for zig-debug-wasm32 which doesn't produce dwarf yet):

	compile time (ms)	runtime (ms)
llvm-release-x86_64	5089	1831
llvm-debug-x86_64	980	4917
zig-debug-x86_64	270	19847
llvm-release-wasm32 + wasmtime	354 + 15	2500
llvm-debug-wasm32 + wasmtime	169 + 23	4300
zig-debug-wasm32 + wasmtime	38 + 19	11962

Currently I always allocate structs on the shadow stack in case they are passed by reference later. This is bad for loops like:

i mut = 0
while {i < string.len} {
   // do stuff...
}

If string was stored in two wasm locals (string.ptr and string.len) then it's trivial for the backend to figure out that string.len is constant and that it's safe to hoist the read out of the loop. But with string stored on the shadow stack the wasm backend has to do clever reasoning about potential aliases, which will probably fail in non-trivial loops.

The difficulty, as I mentioned earlier, is that how best to represent variables depends on how they are used, but constant propagation changes the use of variables. So doing both at the same time is a problem. It might work better to push some of the work into the infer pass (which I should probably rename to something less specific like analyze). So the breakdown would be:

infer
- do everything that infer currently does
- propagate constants and eliminate dead code
- inline function wrappers
- break struct literals into individual writes (using aliasing information to decide if intermediate copies are needed)
- record how each variable is used in the emitted code
generate
- choose representation for variables
  - if ever passed by reference, store on the shadow stack
  - otherwise break up into individual wasm locals
- use a simpler walue to propagate just addresses/offsets and track the wasm stack
- generate wasm

This will also remove some redundancy, since the walues used in generate at the moment encode the type of the value. If I push that into gthe infer pass then in the generate pass walues will only have to deal with wasm primitives and heap addresses.

When generating wasm for builtin functions I have two options:

If I pass Destination.anywhere to the arguments then they might return constant walues and I can fold the whole function away. But if the first argument returns a constant and the second argument returns Walue.stack then I need to store the second argument into a fresh local, load the first argument, read the second argument from the local, and then emit the function call. Extra locals make me sad!
If I pass Destination.stack to the arguments then they are guaranteed to end up on the stack in the correct order, but they will always return Walue.stack so I miss the chance to propagate constants.

I think the solution here is first move constant propagation into the infer pass, then take option 1 above and use the indirect trick in tir so that I can insert a load between the first and second arguments after looking at their walues.

Optimization improvements can wait until later though. Definitely the next thing is to write the runtime and add strings/lists/hashtables, which will get me a lot closer to a language that one might actually use to write interesting programs. (At the moment programs can't have any side-effects or inputs and must return an integer, which somewhat limits their usefulness.)

Zest currently lives in a private repo because I don't want to give the impression that this is something people should try to play with, file issues for, send me pull requests for etc. But I'm happy to add individual people who have informed opinions about compilers or language design. Email me :)