0020: hytradboi, milestones, data soup, airtable, self-hosting

Published 2022-02-03

HYTRADBOI

Many new talks have been added to the HYTRADBOI schedule. There are another 15 or so in the pipeline.

It's starting to look pretty exciting.

Milestones

I've been thinking lately about the psychological need for a sense of completion, or at least milestones.

In industrial jobs my work has typically been organized around discrete projects with a clear roadmap like 'support the postgres json api'.

For academics, I imagine papers and conference deadlines produce a similar demarcation of projects and time.

But when doing independent research I've typically fallen into the habit of creating unbounded projects with no internal structure. I don't think this has been good for my sanity.

Ink & Switch have an appealing research model. There's clearly an overarching theme, but the actual research is done in individual three month projects, each of which has a clear goal and finishes with a writeup at the end of the period regardless of what state the project is in.

I think this would kill the sense of sunk costs that keeps me moving down a particular branch of research long after the returns have declined.

With that in mind, I've tried to retroactively separate the history of imp into distinct versions. I think both v0 and v1 were well-scoped projects that successfully answered the motivating questions:

But v2 is a mess that doesn't have a clear goal and is being pulled in many different directions. I wanted to explore using imp as a general-purpose language. I wanted to explore live interaction as a development model. I wanted to explore relational approaches to local-first software. And on top of all of that I was trying to ship something that other people could use.

So I'm going to put a pin in v2. And in the future I think it would be a good idea for each successive effort to have a new version number, a clear focus (especially with regard to research vs development), and a deadline after which I write up the results.

Similarly for dida. If I split the "easier to use and understand" goal into two I think it's fair to say that "easier to understand" is a success. Dida contains all the core ideas of differential dataflow (minus the parallel layer) but is under 2000 lines of direct well-commented code and is accompanied by a detailed explanation of the core algorithms. That's a milestone.

This is a kind of magic trick. I didn't write any new code or give up on any goals, but I no longer feel weighed down by unfinished work.

Data soup

My computer usage is full of tiny CRUD problems that are typically solved either with single-purpose apps or with adhoc manual effort. Here's a random selection off the top of my head:

For most of these the actual logic is not very complicated and the effort lies instead in persistence, cross-device sync, cross-platform gui, automatic deployment etc - the kind of problems I described in Pain we forgot. This is sufficiently effortful that I rarely write code to solve my own problems, even very simple problems like the ones above.

This is why I think a lot of talk about end-user programming misses the point - we don't even have very good solutions for programmer programming yet. There's so much action around different approaches to making it easier to specify the logic, but the logic is the easy part of these problems.

Airtable

Most of the problems above can be solved pretty nicely in airtable.

For example, here is the example from Pain we forgot, including automations which send out emails to gather orders from employees and send the day's order to the caterer:

And here is a simple accounts table with total spending broken down by year, month and tag:

Even my email inbox would look pretty reasonable in airtable:

Airtable is not the first software to follow this database + widgets approach but it's by far the best I've used on many axes. For many use cases it's a great solution. Organizations that have complex and constantly changing workflows are typically poorly served by paying for custom software development - misunderstood requirements, frustrating UIs, slow turnaround on change requests. Airtable is a huge improvement.

So why don't I use airtable to solve my data soup problems?

The biggest downside is the existential risk. Airtable has 'over a hundred engineers and over a million lines of code... most of Airtable's engineers are yet to be hired, and most of Airtable's code is yet to be written, by many orders of magnitude'. Airtable seems to be doing very well but the expected lifespan of even very successful SaaS companies is typically much shorter than the lifespan of personal data. That data can be exported from airtable, but the logic and UI can't. Even if the airtable code base was open source it would be far too large to outlive the company - just preventing the build from bitrotting would probably require a full-time maintainer in the long run, given how javascript ecosystems age.

I also get frustrated by arbitrary limitations of the query model. For example if I upload some bank transactions and group by counterparty it will automatically show the total amount spent per counterparty. But if I sort by amount, it sorts the records within each group by amount rather than sorting the groups by total amount. As far as I can tell there is no way to sort the groups. In my existing accounts script there is a list of (regex,tag) pairs. For each transaction, the first matching regex determines the tag. In airtable I'd love for the list of (regex,tag) pairs to itself be a table so it's easy to edit, but it doesn't seem to be possible to do any kind of query across tables except via linked records. The best I've come up with so far is attaching a js script that, when manually triggered, reads from both tables and mutates the tag column. In imp this kind of query is trivial.
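Both queries are short in a general-purpose language. Here is a minimal sketch in Python (not imp, and not my actual accounts script), where the rules, transactions and function names are all made up for illustration:

```python
import re
from collections import defaultdict

# Hypothetical rules table: ordered (regex, tag) pairs. The first match wins.
RULES = [
    (r"tesco|sainsbury", "groceries"),
    (r"coffee", "eating out"),
    (r".*", "untagged"),
]

# Hypothetical transactions: (counterparty, amount).
TRANSACTIONS = [
    ("Tesco", 40.0),
    ("Corner Coffee", 3.5),
    ("Tesco", 25.0),
    ("Landlord", 900.0),
    ("Corner Coffee", 4.0),
]


def tag_for(counterparty, rules):
    """Return the tag of the first rule whose regex matches, or None."""
    for pattern, tag in rules:
        if re.search(pattern, counterparty, re.IGNORECASE):
            return tag
    return None


def totals_by_counterparty(transactions):
    """Group by counterparty, then sort the groups by total amount, largest first."""
    totals = defaultdict(float)
    for counterparty, amount in transactions:
        totals[counterparty] += amount
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for counterparty, total in totals_by_counterparty(TRANSACTIONS):
        print(counterparty, tag_for(counterparty, RULES), total)
```

Because the rules are just an ordered list of rows, they could live in an editable table alongside the transactions, which is the part that seems awkward to express in airtable.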

The performance would also bug me. The little inbox demo I made takes several seconds to load. The contacts autocomplete popup takes maybe a second, despite the fact that it only has to complete from a list of 3 contacts and they should surely be cached given that I just opened it several times. I have ... 89063 emails in my mailbox. I'm not confident that the UI would take kindly to that. (Loading my mailbox into airtable would also take me into the 'call us' pricing tier.) But with native tools like notmuch-emacs searching in my mailbox typically takes ~50ms.

Finally, I don't like being dependent on an internet connection, especially for things like todo lists, shopping lists, checking if I have room in my budget for donuts etc, which I do from my phone and often in areas with spotty service.

The way I look at it, airtable is this bundle of constraints:

Whereas what I want is:

Which opens up all kinds of fun research questions. What would a CRDT look like for an airtable-like schema editor? How could all those myriad UI interactions be handled without mountains of custom UI code? How did Jamie manage to turn a shopping list into a research project?

Fossil

Fossil is one of the few local-first apps I know that is actually used in anger. It started out as just a DVCS but over time grew a wiki, issue tracker, forum and various other embedded apps. All of which run offline and can be pushed/pulled between repos and even forked.

So I was curious to find out how it worked under the hood.

The underlying data structure is a content-addressed append-only set of artefacts. Forum threads, wiki pages, issues etc are built by summing up the effects of special event artefacts.
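To make "content-addressed append-only set" concrete, here is a minimal Python sketch. It is not Fossil's actual code or file format: the `ArtefactStore` class, its method names and the choice of SHA-256 are assumptions for illustration. The point is that syncing two repos is just set union, so it never conflicts.

```python
import hashlib


class ArtefactStore:
    """A toy content-addressed, append-only set of artefacts (hypothetical)."""

    def __init__(self):
        self._artefacts = {}  # hex digest -> raw bytes

    def add(self, content: bytes) -> str:
        """Store content under the hash of its bytes. Adding it twice is a no-op."""
        key = hashlib.sha256(content).hexdigest()
        self._artefacts.setdefault(key, content)
        return key

    def get(self, key: str) -> bytes:
        return self._artefacts[key]

    def keys(self) -> set:
        return set(self._artefacts)

    def sync_from(self, other: "ArtefactStore") -> None:
        """Pull every artefact we are missing. Nothing is ever modified or removed."""
        for key, content in other._artefacts.items():
            self._artefacts.setdefault(key, content)


if __name__ == "__main__":
    laptop, server = ArtefactStore(), ArtefactStore()
    laptop.add(b"wiki page, version 1")
    server.add(b"forum post")
    laptop.sync_from(server)
    server.sync_from(laptop)
    print(laptop.keys() == server.keys())
```

All the interesting decisions are then pushed into how the event artefacts are interpreted after the sets have been merged.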

Forums are effectively OR-sets - all you can do to a post once it has been made is delete it, leaving a tombstone in the tree.

Wiki pages do last-write-wins. I expected to at least get a merge conflict, but no.

Issues are bags of key-value pairs, where each pair does last-write-wins.
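Per-key last-write-wins can be sketched as a fold over event artefacts. This is a minimal Python sketch, not Fossil's implementation: the event shape `(timestamp, artefact_id, key, value)` and the tie-break on artefact id are my assumptions.

```python
def fold_issue(events):
    """Build the current state of an issue from its event artefacts.

    events: iterable of (timestamp, artefact_id, key, value) tuples
    (a hypothetical shape). Events are applied in timestamp order, with
    artefact_id as an assumed tie-break, so the result does not depend on
    the order in which the events arrived.
    """
    state = {}
    for _timestamp, _artefact_id, key, value in sorted(events):
        state[key] = value  # a later write simply replaces an earlier one
    return state


if __name__ == "__main__":
    events = [
        (1, "a1", "title", "crash on start"),
        (1, "a1", "status", "open"),
        (3, "c3", "status", "closed"),
        (2, "b2", "status", "in progress"),
        (2, "b2", "assignee", "alice"),
    ]
    print(fold_issue(events))
```

Two people editing different keys of the same issue never clobber each other. Only concurrent edits to the same key lose a write.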

In one sense, it's disappointing that there is so little handling of conflicts.

But on the other hand, there is very little handling of conflicts and it seems like it's fine in practice. So maybe many problems can fall to being broken down into atomic facts and doing last-write-wins for each fact?

Self-hosting

I lurked in various discussions of self-hosting recently. One point that seemed rarely challenged is that self-hosting is hard.

That has been my experience for many pieces of software. But self-hosting fossil is really easy.

What makes most software hard to self-host?

What makes fossil easy to self-host?

Other things we could add:

Why isn't this more common? I suspect because most software is optimized for industrial use, not personal use. For industrial uses the operations overhead is not a big deal compared to the development and operational efficiency gained by breaking things up into communicating services. But for personal uses the overwhelming priority is reducing complexity so that nothing fails.

Malloy is yet another 'better sql', but a pretty credible one. It leans hard into nested relational algebra, producing a language that feels a lot like tableau looks. I was initially put off by the fact that it only supported pre-defined joins, but adhoc joins were added recently.

Andy Matuschak published a 2021 retrospective covering the challenges of creative work, life as an independent researcher, the long-term prospects of patronage and the lack of a 'tools for thought' community.