Making live repls behave

Published 2021-05-18

I like to make live repls that evaluate as you type.

Here is an example for a silly little stack language with singleton sets (0 and 1), set union (|), set product (*) and dup (^, makes an extra copy of the top value on the stack).

[Interactive demo: a live repl where you can edit the program and the memory limit (2^n bytes) and watch the result update as you type.]

Try changing the numbers and seeing how that affects the results.

This post gives an overview of some of the implementation challenges, a rough map of the possible solutions, and then dives into a specific solution I've been experimenting with lately that makes use of some interesting features in zig.

Challenges

It's easy enough to implement a live repl using any old interpreter, but the interpreter in this demo has some properties that are a bit trickier: evaluation runs concurrently so that typing stays responsive, evaluation can be stopped promptly whenever the input changes, and memory usage is limited so that a runaway program can't take down the page.

I think these properties are essential to making live repls actually usable in practice.

The first property is easy to achieve - concurrency is a well understood problem. It's the interaction with the other two properties that makes things interesting. We can reduce it to two questions: how do we stop a running interpreter promptly, and how do we limit its memory usage?

And for both questions the challenge is in correctly cleaning up so that a new interpreter can be started afterwards.

Stopping the interpreter

There are many possible approaches. In situations like this I find it helpful to look for some structure in the space of choices:

Separate process, message passing. We could start the interpreter in a totally separate process/webworker and communicate with it via message passing. When we want to stop the interpreter we just kill the process. This is probably the easiest approach to get right, but if the interpreter requires access to large amounts of existing data (eg for a database query interpreter) then there can be a substantial performance hit.

Separate process, shared memory, kill process. Instead of using message passing we could place the data in shared memory. This fixes the performance hit, but we now have to be very careful that killing the process does not leave the shared memory in an invalid state. The easiest case is when the shared data doesn't change at all while the interpreter is running, in which case we could stick to a 'shared xor mutable' regime. Another downside is that sharing memory between processes requires different implementations for each platform, and on the web is complicated by spectre mitigations and is also not widely supported yet for wasm.

Separate process, shared memory, shared flag. Rather than killing the interpreter process at an arbitrary point, we could have it regularly check some flag in shared memory that tells it to stop. Now, rather than having to ensure that the shared memory is always in a valid state, we only have to ensure that it is in a valid state when we check the flag. But on the other hand, we have to ensure that there is no control flow path that goes a long time without checking the flag, otherwise we won't be able to stop the interpreter promptly.

Same process, separate thread, shared flag. Essentially the same as above. More potential for races etc because all memory is shared by default, but on the other hand much easier to implement across multiple platforms.
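
To make the shared-flag idea concrete, here is a minimal sketch of what the check might look like (stop_requested, evalInterruptible and evalOne are hypothetical names, not code from the demo):

var stop_requested: bool = false; // set by the controlling thread when it wants evaluation to stop

fn evalInterruptible(exprs: []const Expr) !void {
    for (exprs) |expr| {
        // Only check at expression boundaries, where the interpreter's state is consistent.
        if (@atomicLoad(bool, &stop_requested, .SeqCst)) return error.Interrupted;
        evalOne(expr); // hypothetical: evaluate a single expression as usual
    }
}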

Same process, separate thread, kill thread. We could try to run the interpreter in a separate thread and kill the thread when we want it to stop. Unfortunately this is practically impossible to get right. The interpreter thread might be in the middle of modifying the state of the global allocator or changing something in the language runtime when it is killed, leading to bizarre crashes later on. Killing a thread is so error-prone that many standard libraries don't even offer this function any more.

Same process, same thread, regularly yield control to caller, save/restore state manually. We could write the interpreter as a giant state machine so that we can pause evaluation after any state transition. Then either give the caller a 'step' method that they can repeatedly call to do work, or give the caller some flag which will be checked after every state transition. This approach works well for interpreters where each individual operation takes a small, bounded amount of time. If, on the other hand, your interpreter has operations that can take unbounded amounts of time to execute or that have complex control flow, it can be a lot of work to break these down into smaller operations and to reify the state of all possible paths through the control flow. Sqlite is a good example of this approach - most database query plans have operations like 'join' that can take unbounded amounts of time to execute but sqlite breaks these down into a bytecode that has conditionals, loops and operations on index cursors.
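
As a rough illustration of the state-machine style (hypothetical code, not the interpreter from the demo), all of the evaluation state lives in a struct and the caller drives it one transition at a time:

const StepInterpreter = struct {
    exprs: []const Expr,
    next: usize = 0, // index of the next expression to evaluate

    // Performs one transition. Returns false once evaluation is finished.
    // For this to bound pause latency, each individual transition must be small and bounded.
    fn step(self: *StepInterpreter) bool {
        if (self.next == self.exprs.len) return false;
        evalOne(self.exprs[self.next]); // hypothetical: evaluate a single expression
        self.next += 1;
        return true;
    }
};

The caller can then interleave calls to step with its own checks, eg stop as soon as the input changes.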

Same process, same thread, regularly yield control to caller, save/restore state using language support. Many languages have libraries or compiler support for saving/restoring the state of a running function. The names and mechanisms vary - async/await, promises, fibres, coroutines, lightweight threads - but the core idea is always to allow writing code that looks like regular blocking code but that can be paused and resumed at certain points in its execution. This approach is much easier than saving the state manually, since the compiler authors have already done all the hard work. But depending on the implementation it can be much more challenging to verify that the interpreter doesn't have any control flow paths where it can run a long time without yielding. Haskell's lightweight threads, for example, check if they should yield during every allocation. But some operations can run a long time without allocating, eg a relational join between two large tables that doesn't produce any results.

Another way to view this space is by looking at which points the interpreter can stop at. This forms a continuum from fully pre-emptive approaches where the interpreter can stop at any point (eg killing a process) to fully cooperative approaches where the interpreter only stops at manually specified points (eg async/await). The tradeoff is that the more pre-emptive the approach, the more promptly we can stop the interpreter but the more careful we have to be that all shared state is valid at every possible stopping point, whereas the more cooperative the approach, the easier that guarantee becomes but the harder it is to ensure that the interpreter never runs for a long time without reaching a stopping point.

Enforcing memory limits

In some ways this is a much easier problem than stopping the interpreter. Simply count how much memory you are using and on each allocation check if you are over the limit. What makes this difficult is that most languages and libraries assume that allocations are infallible.

The approaches I see are:

Run the interpreter in a separate process. This allows us to use OS-level facilities (eg ulimit on linux) to place limits on memory usage. This is easy, but it limits what approaches we can choose for stopping the interpreter.

Count allocations manually. We can look at all the data-structures used in the interpreter and use our knowledge of the language's memory layout to figure out a lower bound on how much memory we are using. This approach has many downsides: it's easy to miss a data-structure or forget to account for sharing, it depends on language implementation details which might change over time, it can't account for fragmentation or allocator overhead, and we have to read through every library we use to count all its allocations. But the worst part is that it's post-hoc - if we make a library call we usually can't guess how much memory it will allocate until it returns, at which point we may be well over the limit.

Track total allocations from the global allocator. Many language runtimes have access to a global count of total allocated bytes. We can check this before and after each interpreter operation. The main downside is the same as above - it's hard to know in advance whether any given operation will take us over the limit, so we might stop too late.

Use libraries that handle allocation failure. One of the main reasons I've been experimenting with zig is that the standard library allows using multiple allocators and will propagate allocation failures. But you could also do the same in other low-level languages like c or rust (with no_std) by writing your own libraries instead of using the standard library.

In all these approaches we also have the same challenge as we do when stopping the interpreter - allocation points are very frequent and we have to ensure that either all shared memory is in a valid state at each point, or that we can clean up correctly when handling allocation failure.

Experiment

The project I'm currently working on has some goals that make life difficult:

The approach I'm experimenting with in the demo above is to run the interpreter as a zig async function which suspends whenever it exhausts a work budget, and to hand it a memory-limited allocator wrapped in an arena so that allocation failure is just another error and cleanup is a single arena reset.

(I also plan to handle shared data by enforcing shared xor mutable at a very coarse level, but in this demo there is no shared data.)

The code is available here.

So here is how it works. The interpreter is passed an ArenaAllocator and a Bounder. It must use the arena for all allocations and must call bounder.spendBudget() whenever it does some work. Here's what that looks like for the Product operator in the demo above:

fn eval(arena: *ArenaAllocator, bounder: *Bounder, exprs: []const Expr) ![]const []const u64 {
    var stack = std.ArrayList([]const []const u64).init(&arena.allocator);
    for (exprs) |expr| {
        switch (expr) {
            .Product => {
                const left_bag = stack.pop();
                const right_bag = stack.pop();
                const bag = try arena.allocator.alloc([]const u64, left_bag.len * right_bag.len);
                var i: usize = 0;
                for (left_bag) |left_row| {
                    for (right_bag) |right_row| {
                        bounder.spendBudget(); // spend budget for every row produced
                        bag[i] = try std.mem.concat(&arena.allocator, u64, &[_][]const u64{ left_row, right_row });
                        i += 1;
                    }
                }
                try stack.append(bag);
            },
            // similar for Constant, Union, Dup etc...
        }
    }
    return stack.pop();
}

Unbeknownst to the interpreter, spendBudget is actually an async function. If spendBudget is called when the work budget has been used up then the entire call stack is suspended up until the first enclosing async. When the caller is ready to continue it can reset the work budget and resume the interpreter by calling bounder.doWork().

const Bounder = union(enum) {
    // How many units of work are left before the interpreter must suspend.
    HasWorkBudget: usize,
    // The interpreter is suspended and this is the frame to resume.
    Suspended: anyframe,

    fn init() Bounder {
        return .{ .HasWorkBudget = 0 };
    }

    fn spendBudget(self: *Bounder) void {
        if (self.HasWorkBudget == 0) {
            // Budget exhausted - park the whole async call stack here until doWork is called.
            suspend {
                self.* = .{ .Suspended = @frame() };
            }
        }
        self.HasWorkBudget -= 1;
    }

    fn hasWork(self: *Bounder) bool {
        return (self.* == .Suspended);
    }

    fn doWork(self: *Bounder, work_budget: usize) void {
        // Grant a fresh budget and resume the interpreter where it suspended.
        const frame = self.Suspended;
        self.* = .{ .HasWorkBudget = work_budget };
        resume frame;
    }
};

When using the interpreter the api looks almost the same as if we had unrolled all the control flow explicitly by hand.

const eval_frame = async eval(arena, bounder, exprs);
while (bounder.hasWork()) bounder.doWork(1000);
const result = await eval_frame;

The state of the interpreter is stored in eval_frame. The caller can call bounder.doWork() whenever they like (in the demo on this page it's called in a loop that uses setTimeout to avoid blocking the ui). Once the work is finished, the result can be retrieved with await eval_frame without blocking.
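
For the wasm build, the setTimeout loop on the javascript side only needs to call a small exported step function. Here is a sketch of how that glue might look (stepEval and the container-level bounder are hypothetical, not the demo's actual exports):

export fn stepEval() bool {
    // Assumes a container-level `var bounder: Bounder` shared with the code that started eval.
    if (bounder.hasWork()) {
        bounder.doWork(1000); // do a bounded slice of work
        return true; // more work remains - the page should schedule another call
    }
    return false; // evaluation is finished - the result can now be awaited
}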

The zig standard library comes with an allocator that already supports memory limits:

var gpa = GeneralPurposeAllocator(.{
    .enable_memory_limit = true,
}){
    .requested_memory_limit = 1 << 14,
};
var arena = ArenaAllocator.init(&gpa.allocator);

When the interpreter tries to allocate more memory than it is allowed the allocator will return error.OutOfMemory. We can easily bubble this error to the top by wrapping each allocating call in try (and if we forget, the compiler will complain that the error was unhandled).
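
The error then surfaces wherever the interpreter's frame is awaited. Here is a minimal sketch of what the caller might do with it (reportOutOfMemory is a hypothetical helper, not part of the demo):

const result = (await eval_frame) catch {
    // The memory limit was hit somewhere inside eval.
    reportOutOfMemory(); // hypothetical: show an error in the ui
    return;
};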

It could be tricky to correctly free all memory on every possible error path. So instead we wrap gpa in an arena and just free the whole thing whenever we reset the interpreter.

self.arena.deinit();
self.arena = ArenaAllocator.init(&gpa.allocator);

(A more complicated interpreter might need to use a pool rather than an arena so that it can free intermediate values, but the reset code would still look the same.)

Thanks to zig's colorblind async and allocator api the interpreter itself knows very little about what is happening. It doesn't have to be littered with async/await calls or have special return types. All it has to do is use the provided allocator and regularly call spendBudget. We could easily swap from using async to using a separate thread by swapping in a different version of Bounder, without having to change the interpreter code at all.

Async allocation?

When we run out of memory it would be nice if we could just pause the interpreter at the offending allocation and then continue from where we left off, instead of exiting the interpreter and starting from scratch once the memory limit is raised.

I tried to do exactly that in this branch by making an allocator wrapper:

const BoundedAllocator = struct {
    parent: ParentAllocator,
    allocator: Allocator,
    state: union(enum) {
        Ok,
        OutOfMemory: anyframe,
    },

    fn alloc(allocator: *Allocator, n: usize, ptr_align: u29, len_align: u29, ra: usize) ![]u8 {
        const self = @fieldParentPtr(BoundedAllocator, "allocator", allocator);
        while (true) {
            const result = self.parent.allocator.allocFn(&self.parent.allocator, n, ptr_align, len_align, ra);
            if (result) |ok| {
                return ok;
            } else |err| {
                // Suspend and request more memory
                suspend {
                    self.state = .{.OutOfMemory = @frame()};
                }
                if (self.state == .Ok)
                    // Request granted, try again
                    continue
                else
                    // Request denied
                    return err;
            }
        }
    }
    
    // etc...
};

Unfortunately this doesn't work:

[nix-shell:~/bounded-live-eval]$ zig build-exe main.zig
./main.zig:73:5: error: 'BoundedAllocator.alloc' cannot be async
    fn alloc(allocator: *Allocator, n: usize, ptr_align: u29, len_align: u29, ra: usize) ![]u8 {
    ^
./main.zig:66:28: note: required to be non-async here
                .allocFn = alloc,
                           ^
./main.zig:82:17: note: suspends here
                suspend {}
                ^
./main.zig:92:5: error: 'BoundedAllocator.resize' cannot be async
    fn resize(allocator: *Allocator, buf: []u8, buf_align: u29, new_len: usize, len_align: u29, ret_addr: usize) Allocator.Error!usize {
    ^
./main.zig:67:29: note: required to be non-async here
                .resizeFn = resize,
                            ^
./main.zig:101:17: note: suspends here
                suspend {}
                ^

In hindsight, this makes sense. Zig specializes functions on the calling convention used, of which async is just one example. So when we store a function pointer whose value is only known at runtime, its type must include the calling convention it was compiled with. Otherwise we wouldn't know how to call it. The function pointers in the Allocator api are all sync calls, so this BoundedAllocator example can't work.

Todo

On linux I'm unable to detect a difference in performance between this code and a sync version that just runs to completion without yielding. BUT. This demo is interpreting a silly language which doesn't do any real computation, so the performance is probably totally dominated by allocation and cache misses. I need to test interpreting a more interesting language.

In the demo above, if you set the memory limit to 30 and change the code in the repl to 01|^*^*^*^*^*^*^*^*^*^*^* then you might see an exception thrown in the console. It looks like a stack overflow and eval appears many times on the callstack. But I checked in gdb on the native code version and eval is never on the callstack more than once. The stack overflow only happens in wasm. It may be caused by a workaround for wasm's control-flow integrity, which prevents you from mangling your own stack however you like. But I haven't looked into it yet.

On a related note, the colorblind async only extends to non-recursive functions. If eval was recursive then it wouldn't compile at all in async mode without explicitly allocating frames for the recursive call. This isn't terrible - usually in this kind of code I have to avoid recursive functions anyway because there is no way to gracefully catch a stack overflow. But it sounds like #1006 would make this work, and as a bonus would replace the stack overflow with an OutOfMemory error which can be caught gracefully.