I missed a couple of months. I've been busy consulting rather than working on zest (other than writing a rough survey of compilation, recompilation, and compile-time evaluation). Reading talk of global recession every week made me want to refill my savings instead of running them down further.
It is nice to be working again for a while. Mostly to have simple, immediate goals. Working on zest is always a little angsty because there are so many things I want to do with it and not enough lifetimes to do them in. But when I'm consulting the horizon is only a few months out and success is clearly defined.
I had a short writing gig, which I've never done before. That may get published sometime in the future, or may just stay internal.
But the main gig at the moment is speeding up a big go codebase. I like optimization gigs because the goal is so clearly defined - it just needs to do the same thing but faster. That means I can work fairly independently and at my own pace. I spend a lot of time looking at different profiles, single-stepping through hot code, trying out quick experiments. I throw away most of the code I write.
My experience of optimization never matches up to what I was taught or what I see online. There's never any obvious bottleneck in the profiles, because if there was they would just fix it instead of hiring me.
The fixes are rarely clever algorithms or micro-optimizations either. Much more often the problem is that huge swathes of code were written in a style that was fine when it was being hit 10s or 100s of times per api call, but is really not fine now that datasets are bigger and the code is being hit 1000000s of times per api call. Suddenly the closure that made the code more readable is responsible for millions of heap allocations, the constant string keys that print nicely when debugging are causing map hashes to show up in the profile, the slice that was defensively copied is drowning you in garbage collection, etc.
Often fixing one problem doesn't work out because it puts more weight through another problem. And fixing that problem won't work because a third problem would also have to be fixed at the same time. It feels like untangling a huge knot. You can't just look at the knot and solve it - it's way too big. You just have to patiently tug on different parts of the rope until you find something that can come loose. Hence all the code I'm throwing away. Even if a given change would be an improvement in isolation, it might just be too tangled up to come loose yet.
But when you do finally find a loose end, it's really satisfying.
go tips
I've done very little work in go before so I've learned a lot in the last few weeks. Much of this was hard to find online - or at least hard to find answers to. I find plenty of people asking questions and usually the reply is "you don't want to do/know this, just do the idiomatic thing, performance doesn't really matter, the compiler might change tomorrow anyway".
perf
The go-to profiling tool for go is pprof, but I haven't found it very useful. The cpu profiling is too low resolution, it doesn't capture kernel code, and it doesn't understand performance counters. The memory profiling is useful sometimes, but the biasing towards allocation size is actually the opposite of what I usually want - I'm trying to improve throughput rather than reduce the heap size, so I'm much more concerned about millions of small allocations than a few big allocations.
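For allocation counts, the benchmark harness is often enough: `b.ReportAllocs()` (or running with `go test -bench . -benchmem`) prints allocs/op, which counts allocations rather than weighting them by size. A minimal sketch, with `hotPath` standing in for whatever hypothetical code is under test:

package foo

import "testing"

func BenchmarkHotPath(b *testing.B) {
    // Report allocs/op and B/op alongside ns/op - allocation counts,
    // not just total bytes.
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        hotPath() // hypothetical function under test
    }
}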
So I've mostly been using perf. `go run` omits debug symbols, so you have to `go build foo.go && perf record ./foo`. `--call-graph fp` seems to produce the most accurate results, although it does chop the top off really deep stack traces.
I've been using hotspot for viewing the traces. It's much nicer than `perf report`. I do occasionally see some nonsensical results but I haven't spent the time to investigate whether they are coming from hotspot, perf, or go debug info.
Filtering in/out by symbol is incredibly useful. Eg I'll usually filter out the setup work in the benchmark, and then if I filter in `runtime.mallocgc` and switch back to the flame graph I can see a breakdown of where time is spent in allocation.
One thing I haven't figured out how to do in go is to get a list of allocations grouped by type. Go is mostly type-erased so there aren't any convenient type tags hanging around on the heap. I think I might be able to get something out of perf by putting an ebpf probe on `mallocgc` and picking out the type argument. But I don't have any experience with ebpf so it'll have to wait until I can interrogate an ebpf friend.
enums
Go doesn't have a builtin concept of enums. The language docs encourage doing something like this:
type Flavour uint8

const (
    FlavourSalty Flavour = iota
    FlavourSweet
    FlavourSour
)
But the names are just variables, not part of the data, so if you print these you just get numbers:
func main() {
    fmt.Printf("%s\n", FlavourSweet)
    // prints: %!s(main.Flavour=1)
    fmt.Printf("%v\n", FlavourSweet)
    // prints: 1
    fmt.Printf("%#v\n", FlavourSweet)
    // prints: 0x1
}
So we need to do a bit more work:
var FlavourNames = []string{
    "Salty",
    "Sweet",
    "Sour",
}

func (f Flavour) String() string {
    return FlavourNames[uint8(f)]
}
func main() {
    fmt.Printf("%s\n", FlavourSweet)
    // prints: Sweet
    fmt.Printf("%v\n", FlavourSweet)
    // prints: Sweet
    fmt.Printf("%#v\n", FlavourSweet)
    // prints: 0x1
}
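For what it's worth, I believe the stringer tool (golang.org/x/tools/cmd/stringer) can generate exactly this kind of String method from the const block:

//go:generate stringer -type=Flavour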
But the default methods for printing composite types ignore `String`, so this still kinda sucks:
type PersonInfo struct {
    likes Flavour
    hates Flavour
}

func main() {
    fmt.Printf("%s\n", PersonInfo{likes: FlavourSweet, hates: FlavourSour})
    // prints: {%!s(main.Flavour=1) %!s(main.Flavour=2)}
    fmt.Printf("%v\n", PersonInfo{likes: FlavourSweet, hates: FlavourSour})
    // prints: {1 2}
    fmt.Printf("%#v\n", PersonInfo{likes: FlavourSweet, hates: FlavourSour})
    // prints: main.PersonInfo{likes:0x1, hates:0x2}
}
The more ergonomic solution is:
type Flavour string

const (
    FlavourSalty Flavour = "Salty"
    FlavourSweet Flavour = "Sweet"
    FlavourSour  Flavour = "Sour"
)
func main() {
    fmt.Printf("%s\n", PersonInfo{likes: FlavourSweet, hates: FlavourSour})
    // prints: {Sweet Sour}
    fmt.Printf("%v\n", PersonInfo{likes: FlavourSweet, hates: FlavourSour})
    // prints: {Sweet Sour}
    fmt.Printf("%#v\n", PersonInfo{likes: FlavourSweet, hates: FlavourSour})
    // prints: main.PersonInfo{likes:"Sweet", hates:"Sour"}
}
But now `Flavour` is more expensive to compare and hash, and can't easily be used as an array index - if we were tallying who likes which flavours we could have used `[3]int64` before and now we'll probably end up with `map[Flavour]int64` instead.
Worse is that if you're deserializing data and you have a string, the temptation is to just write `Flavour(s)` to convert it. But that doesn't check that it's one of the enum options. Libraries that use codegen or reflection for deserialization will do the same thing, because there is no way for them to know that `Flavour` is supposed to be an enum.
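The checked conversion is easy enough to write by hand - a minimal sketch for the string-backed version (`ParseFlavour` is my name, not anything standard):

func ParseFlavour(s string) (Flavour, error) {
    switch f := Flavour(s); f {
    case FlavourSalty, FlavourSweet, FlavourSour:
        return f, nil
    default:
        return "", fmt.Errorf("not a valid Flavour: %q", s)
    }
}

But nothing forces deserialization code to actually call it.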
This bit me last week when I was converting one such enum from strings to uint8 and discovered uncomfortably late in the process that `""` was an oft-used value.
It's frustrating how avoidable this was. There are valid reasons why go can't have general sum types, but enums would have been easy and would have been more ergonomic, safer, and faster than either of the above options.
field order matters
The compiler doesn't reorder struct fields.
type Eg1 struct {
    a uint8
    b int64
    c uint8
}

type Eg2 struct {
    b int64
    a uint8
    c uint8
}

func main() {
    var eg1 Eg1
    var eg2 Eg2
    fmt.Println(unsafe.Sizeof(eg1))
    // prints 24
    fmt.Println(unsafe.Sizeof(eg2))
    // prints 16
}
This only occasionally matters because most types are 8-byte aligned anyway, but it does come up. (In Eg1, a and c are each padded out to 8 bytes to keep b aligned; in Eg2, a and c pack into the same word after b.)
sum types
Suppose we have an interpreter where values can either be strings, decimals, or times. The idiomatic way to deal with this in go is to make an interface:
type Value interface {
    isValue()
}

type NumberValue struct {
    Val udecimal.Decimal
}

type StringValue struct {
    Val string
}

type TimeValue struct {
    Val time.Time
}

func (v NumberValue) isValue() {}
func (v StringValue) isValue() {}
func (v TimeValue) isValue() {}
Unfortunately, this means that converting eg a `NumberValue` to `Value` requires a heap-allocation, and reading a `NumberValue` from a `Value` requires following a pointer. This can get expensive.
The advice I found on the go mailing list is to pack them all into one struct:
type ValueTag int64

const (
    ValueTagNumber ValueTag = iota
    ValueTagString
    ValueTagTime
)

type Value struct {
    tag    ValueTag
    number udecimal.Decimal
    string string
    time   time.Time
}
Unfortunately if you have more kinds of values, or more data in each kind of value, then this struct gets really big. Also it's hard to pass it around by reference without heap-allocating it, and then we're back to the original problem. But passing it around by value requires a lot of copying.
There is a good reason that go can't have inline sum types. It would complicate garbage collection and make it easy to violate memory safety by holding on to an interior pointer while changing the sum type.
But if we're willing to be unsafe (within a small package exporting a safe interface) we can work around the garbage collector. All the gc requires is that:
- Any field that is set in the gc bitset for the type always contains either a valid pointer or nil.
- Any field that is not set in the gc bitset for the type never contains a pointer.
So I believe that it's safe to do this:
type ValueTag uint8

const (
    ValueTagNumber ValueTag = iota
    ValueTagString
    ValueTagTime
)

type Value struct {
    p0    unsafe.Pointer
    data0 uint64
    data1 uint64
    data2 uint8
    data3 uint8
    Tag   ValueTag
}

func FromString(s string) Value {
    return Value{
        p0:    unsafe.Pointer(unsafe.StringData(s)),
        data0: uint64(len(s)),
        Tag:   ValueTagString,
    }
}

func (v Value) ToString() (string, bool) {
    if v.Tag != ValueTagString {
        return "", false
    }
    s := unsafe.String(
        (*byte)(v.p0),
        int(v.data0),
    )
    return s, true
}
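From the safe side of the package boundary it works like any other option-returning go api:

v := FromString("hello")
if s, ok := v.ToString(); ok {
    fmt.Println(s) // prints: hello
}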
Unfortunately for the actual case I tried this on, it didn't have enough of a benefit to be worth the api change. But I hope to be able to come back to it once the surrounding overhead has been cleared out.
defer
Defer can be surprisingly expensive. The rules that I've read are:
- If all defers are statically guaranteed to be hit at most once, and if there are at most 8 defers, then they can be open-coded. Otherwise all defers will be added to a linked list which gets traversed at function exit (because the language guarantees that they will be executed in reverse order).
- If a defer is statically guaranteed to be hit at most once (eg inside an if) then its linked list node will be stack-allocated.
- If a defer may be hit more than once (eg inside a loop) then its linked list node will be heap-allocated. To reduce allocation pressure these will be allocated from a thread-local pool, with a fallback to global pool behind a mutex(!).
However, even in the open-coded case there is still a lot of overhead eg calls to runtime.deferreturn for each defer (I think this is because defers are still executed when unwinding from panics so they can't just be open-coded into the control-flow graph like in zig). And many cases will allocate a closure for the defer too.
//go:noinline
func foo() int64 {
    var i int64
    i += 1
    return i
}

//go:noinline
func bar() int64 {
    var i int64
    defer (func() {
        i += 1
    })()
    return i
}
0000000000491ac0 <main.foo>:
491ac0: b8 01 00 00 00 mov $0x1,%eax
491ac5: c3 ret
0000000000491ae0 <main.bar>:
491ae0: 49 3b 66 10 cmp 0x10(%r14),%rsp
491ae4: 76 7f jbe 491b65 <main.bar+0x85>
491ae6: 55 push %rbp
491ae7: 48 89 e5 mov %rsp,%rbp
491aea: 48 83 ec 30 sub $0x30,%rsp
491aee: 66 44 0f d6 7c 24 28 movq %xmm15,0x28(%rsp)
491af5: c6 44 24 07 00 movb $0x0,0x7(%rsp)
491afa: 48 c7 44 24 08 00 00 movq $0x0,0x8(%rsp)
491b01: 00 00
491b03: 48 c7 44 24 10 00 00 movq $0x0,0x10(%rsp)
491b0a: 00 00
491b0c: 48 8d 05 6d 00 00 00 lea 0x6d(%rip),%rax # 491b80 <main.bar.func1>
491b13: 48 89 44 24 18 mov %rax,0x18(%rsp)
491b18: 48 8d 44 24 10 lea 0x10(%rsp),%rax
491b1d: 48 89 44 24 20 mov %rax,0x20(%rsp)
491b22: 48 8d 44 24 18 lea 0x18(%rsp),%rax
491b27: 48 89 44 24 28 mov %rax,0x28(%rsp)
491b2c: c6 44 24 07 01 movb $0x1,0x7(%rsp)
491b31: 48 8b 44 24 10 mov 0x10(%rsp),%rax
491b36: 48 89 44 24 08 mov %rax,0x8(%rsp)
491b3b: c6 44 24 07 00 movb $0x0,0x7(%rsp)
491b40: 48 8b 54 24 28 mov 0x28(%rsp),%rdx
491b45: 48 8b 02 mov (%rdx),%rax
491b48: ff d0 call *%rax
491b4a: 48 8b 44 24 08 mov 0x8(%rsp),%rax
491b4f: 48 83 c4 30 add $0x30,%rsp
491b53: 5d pop %rbp
491b54: c3 ret
491b55: e8 66 36 fa ff call 4351c0 <runtime.deferreturn>
491b5a: 48 8b 44 24 08 mov 0x8(%rsp),%rax
491b5f: 48 83 c4 30 add $0x30,%rsp
491b63: 5d pop %rbp
491b64: c3 ret
491b65: e8 16 a3 fd ff call 46be80 <runtime.morestack_noctxt.abi0>
491b6a: e9 71 ff ff ff jmp 491ae0 <main.bar>
(In this case I don't think that `deferreturn` is even reachable, but in real code I think it typically is.)
This can all get pretty expensive in short hot functions. In the case where defer is being used because there are a lot of exit points rather than for panic safety, one option is to manually open-code the defer.
Before:
func doStuff() {
    state := setup()
    defer teardown(state)
    if foo(state) {
        return
    } else if bar(state) {
        return
    } else {
        return
    }
}
After:
func doStuff() {
    state := setup()
    doStuffInner(state)
    teardown(state)
}

func doStuffInner(state State) {
    if foo(state) {
        return
    } else if bar(state) {
        return
    } else {
        return
    }
}
Sadly we're then at the mercy of inlining heuristics for `doStuffInner()`, but even un-inlined we're replacing a bunch of overhead and indirect function calls with a single direct function call.
slice scanning
For a function that operates on a stack or appends to some growable array, it's often worth pre-allocating an array with enough capacity that it will never have to grow.
I tried this for some hot code and it did indeed get faster, but then spent more time in garbage collection.
I think that the reason for this is that go doesn't distinguish between constant slices and growable arrays. Instead both are represented by the same `(ptr, len, cap)` struct, and the underlying allocation only stores the data and doesn't have a len field (unlike eg julia). This means that the garbage collector always has to scan the whole allocation, even if the length of every slice referring to it is zero.
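Here's a sketch of the shape of code I mean, with made-up names - the pre-allocated stack only costs one allocation, but because the element type contains pointers the gc scans all 1024 slots on every cycle, even while the stack is nearly empty:

type Node struct {
    children []*Node
}

// countNodes walks the tree with an explicit stack instead of recursion.
func countNodes(root *Node) int {
    stack := make([]*Node, 0, 1024) // pre-allocated, hopefully never grows
    stack = append(stack, root)
    count := 0
    for len(stack) > 0 {
        node := stack[len(stack)-1]
        stack = stack[:len(stack)-1]
        count++
        // The gc scans the whole 1024-pointer backing array regardless
        // of len(stack).
        stack = append(stack, node.children...)
    }
    return count
}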
benchmark_mode
On the one hand you shouldn't run benchmarks on a laptop. On the other hand I only have a laptop.
I have a script that reduces the noise a fair amount. For a while it was broken when everything moved to cgroups_v2, but I finally got around to fixing it. It's split into two to make the sudoing easier:
`benchmark`:
#!/usr/bin/env bash
set -e

if [ "$(cat /sys/class/power_supply/ACAD/online)" != 1 ]
then
    echo "Don't benchmark on battery power"
    exit 1
fi

sudo benchmark_mode

# Run on reserved cpus
# High priority
# Record perf counters
sudo \
    cgexec -g cpuset:benchmark \
    nice -n -20 \
    perf stat \
    sudo -u jamie \
    "$@"
`benchmark_mode`:
#!/usr/bin/env bash
set -e

# Setup benchmark cpus
if [ ! -d /sys/fs/cgroup/benchmark ]
then
    mkdir /sys/fs/cgroup/benchmark
fi
echo 'root' > /sys/fs/cgroup/benchmark/cpuset.cpus.partition
for f in /sys/fs/cgroup/*/cpuset.cpus; do echo '2-15' > $f; done
echo '0-1' > /sys/fs/cgroup/benchmark/cpuset.cpus

# Set scaling_governor to performance on benchmark cores
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor

# Turn off turbo mode on all cpus
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

# Disable ASLR
echo 0 > /proc/sys/kernel/randomize_va_space
I still have to be careful with everything else. Make sure nothing is running in the background. Watch the cpu temperature to make sure I'm not getting thermal throttling. But it's good enough for most work.
niri
I switched from sway to niri. It's worth it for the overview feature alone (I'm always losing my video call window during meetings), but the sideways-scrolling model also works a bit better for me than tiling a single screen and the builtin screenshotting/casting tools work better than what I cobbled together for sway.
The only thing I don't like is that it automatically deletes any empty workspace and moves the others up. This makes it hard to use absolute shortcuts (mod + 1-9) for workspaces because they're always changing position. In sway I just had muscle memory that eg my work notes are always on workspace 9.
linkrot
I run a linkchecker on this site and found 313 broken links. A big chunk of those were from citeseer changing their permalink scheme (you had one job) and from academic department sites getting reorganized.
One of the arguments I've heard for the academic publishing system is that it's better at preserving citations, but that's only true if you cite by doi and indirect through google scholar. In terms of actual urls, I've found that linking to the author's blog is much more reliable than linking directly to a paper.
sea of nodes
There is often a missing stair kind of effect in niche technical topics where the mainstream consensus on techniques doesn't reflect what experts are actually doing.
For example, if you study cs in school you'll learn all about various kinds of grammars and clever ways to parse them. If you read books about compilers they'll spend a bunch of time on table-driven parser generators. If you read PL blogs everyone is really into parser combinators. And then you work on a real project and try to do what seems like the standard thing and... it's awful. And you look around at real industrial programming language implementations and find that almost all of them write their own parsers by hand because that is by far the easiest and most effective option. Everyone who actually works on real compilers knows this, but somehow the knowledge doesn't filter out.
So anyway V8 used sea of nodes for a long time but recently switched back to a traditional CFG. This prompted a long back-and-forth between the V8 folks and Cliff Click, the inventor of sea of nodes, about why exactly it wasn't a good fit for V8. The normal course of things would have been for V8 to make this change quietly and for the discussion to have happened in private, so that you'd only learn about it if you happened to already be part of the tiny network of industrial compiler engineers. Kudos to everyone involved for doing it in public.
llm outsourcing
I've seen a lot of people confident that companies won't replace programmers with llms because llms produce poor quality code.
Early in my career I consulted for a company that outsourced most of their software development. I once spent several weeks of back-and-forth trying to get their outsourcers to correctly url-escape their api calls to the service I was responsible for. The resulting code included this gem:
//*********************************************
// this function trims the trailing spaces on both the sides
//*********************************************
function trim(TRIM_VALUE) {
    if(TRIM_VALUE.length < 1) {
        return"";
    }
    TRIM_VALUE = RTrim(TRIM_VALUE);
    TRIM_VALUE = LTrim(TRIM_VALUE);
    if(TRIM_VALUE == "") {
        return "";
    } else {
        return TRIM_VALUE;
    }
}//End Function
//*********************************************
/*Function for Trimming spaces from right side of a String*/
//*********************************************
function RTrim(VALUE) {
    var w_space = String.fromCharCode(32);
    var v_length = VALUE.length;
    var strTemp = "";
    if(v_length < 0) {
        return"";
    }
    var iTemp = v_length -1;
    while(iTemp > -1) {
        if(VALUE.charAt(iTemp) == w_space) {
        } else {
            strTemp = VALUE.substring(0,iTemp +1);
            break;
        }
        iTemp = iTemp-1;
    } //End While
    return strTemp;
} //End Function
//*********************************************
/*Function for Trimming spaces from left side of a String*/
//*********************************************
function LTrim(VALUE) {
    var w_space = String.fromCharCode(32);
    if(v_length < 1) {
        return"";
    }
    var v_length = VALUE.length;
    var strTemp = "";
    var iTemp = 0;
    while(iTemp < v_length) {
        if(VALUE.charAt(iTemp) == w_space) {
        } else {
            strTemp = VALUE.substring(iTemp,v_length);
            break;
        }
        iTemp = iTemp + 1;
    } //End While
    return strTemp;
} //End Function
This is not atypical for code that I've seen from outsourcing, or even from big domestic consulting companies. If companies are willing to pay for this, they will be willing to pay for something that is cheaper, higher-quality, and has faster turnaround times.
(The company in question did eventually create their own internal software team, and one of their first hires was the one person at the outsourcing company who, at one end of a long game of telephone through multiple layers of managers, was actually doing all the work.)
In the short term this might actually mean more work for me. My hunch is that llms are going to be much better at digging their way into performance holes than they are at climbing out of them, because the latter requires a lot of non-local reasoning and complex tool use.
I can't actually test this at the moment (feeding my client's proprietary code and production data into a third-party system under a personal account would be questionable at best). But after this contract I'd be interested in collaborating with someone who is proficient with llm tools to see if they're at all useful for optimizing some large open-source codebase.
books
Human errors. The first few chapters were good and then the rest felt like filler. But lots of good party factoids. Did you know that humans are way more prone to head-colds than other animals? In part because when we became bipedal our sinuses did not reorient, so now they don't drain properly.
The story of the human body. Really repetitive and full of filler, but was worth reading to catch up on the last decade or two of discoveries. Eg bipedalism came long before serious tool use, so may have been driven more by caloric efficiency moving between receding patches of rainforest.
Evicted. Depressing as hell, but an incredible work. The author spent years living in the poorest neighbourhoods in Milwaukee, making friends with residents and shadowing landlords while they worked, and then used all of that knowledge to know what questions to ask and how to understand the answers when doing wide-scale quantitative research. I also really appreciated the way the book itself was written - any text in quote marks was transcribed from a recording, anything the author didn't directly witness is labeled in footnotes, any time they describe someone as thinking or feeling something it was reported by that person rather than inferred. They also hired a fact-checker to randomly sample claims and match them to sources in their notes/recordings. I wish everything was reported this way.
Character limit. I enjoyed all the crazy stories, but I can't help contrasting it to Evicted. So many times they describe Elon's thoughts or feelings and it's really not clear where that is coming from. They also don't clearly distinguish between directly recorded events and distant recollections of witnesses. Evicted proved that you can be really rigorous about reporting without it hurting the writing at all. And I don't think the embellishment is necessary at all - Elon's actions speak for themselves.