So far the editor has to do these things with text:
- iterate forwards/backwards
- insert/remove strings
- lookup the point corresponding to a given line
- find out the line at a given point
The simplest thing we could possibly do is just store the entire text in a contiguous string. How would this perform for a 2mb text on my 2017 xeon laptop?
- A string is the most efficient choice for iterating - we get a simple loop whose memory accesses are easy to prefetch.
- The worst case for inserts is at the beginning of the string because everything else has to be shifted over. On a 2mb string this takes ~1ms.
- To determine what line a point is on, we have to first do line-wrapping. Wrapping an entire 2mb string takes ~2ms so it's feasible to do this on every edit and save the start point of each line in an array.
- Given a point in the string, we can find the line number with a binary search in the array of line start points. This takes several microseconds, too fast to be worth benchmarking properly.
Wrapping per-edit means that other code in the editor that makes implicit edits (eg smart indentation) has to be careful to batch the edits to avoid triggering line-wrapping multiple times. This is pretty easy in a small project, but an editor with a lot of 3rd party plugins might want to code more defensively.
The performance tradeoffs here are interesting. If we used a more complex data-structure for the text inserts would become faster but iteration would become slower which means that eg line-wrapping might become too slow to do from scratch on each edit and we would have to do some more complicated incremental algorithm.
In the large files branch I did just that - I got nerd-sniped into trying to handle files up to 1gb which may not contain any newlines (eg minified js/html, json state dumps). It increased the line count of the project by about 30% for a fairly niche use-case, added a number of tricky bugs and complicated all downstream code.
There is a substantial benefit to insisting on simple data-structures / algorithms, even at the cost of excluding some potential use-cases.