I keep seeing discussions that equate zig's level of memory safety with c's, or occasionally with rust's. Neither is particularly accurate. This is an attempt at a more detailed breakdown.
I'm concerned mostly with security. In practice, no amount of testing seems sufficient to prevent memory-safety vulnerabilities in large programs. So I'm not covering tools like AddressSanitizer that are intended for testing and are not recommended for production use. Instead I'll focus on tools which can systematically rule out errors (eg compiler-inserted bounds checks completely prevent out-of-bounds heap reads/writes).
I'm also focusing on software as it is typically shipped, ignoring eg bounds-checking compilers like tcc or quarantining allocators like hardened_malloc, which are rarely used because of their performance overhead.
Finally, note the 'Updated' date below the title. Zig in particular is still under rapid development and will likely change faster than this article updates. (See the tracking issue for safety mechanisms).
Here are the issues against which c/zig/rust have systematic protection:
issue | c | zig (release-safe) | rust (release)
---|---|---|---
out-of-bounds heap read/write | none | runtime | runtime |
null pointer dereference | none | runtime⁰ | runtime⁰ |
type confusion | none | runtime, partial¹ | runtime² |
integer overflow | none | runtime | runtime³ |
use after free | none | none⁴ | compile time |
double free | none | none⁴ | compile time |
invalid stack read/write | none | none | compile time |
uninitialized memory | none | none | compile time |
data race | none | none | compile time |
- ⁰ optional types
- ¹ tagged unions; doesn't protect against holding a pointer to a value while changing the tag
- ² tagged unions
- ³ not by default, but available via a compiler setting or by linting against unchecked arithmetic
- ⁴ optional protections exist but I expect the runtime overhead to be unacceptable in many domains - see discussion here and here
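To make the 'runtime' cells concrete, here's roughly how those checks surface in rust; zig's release-safe builds panic on the analogous operations. (A hedged sketch, not an exhaustive list of the checks either language performs.)

```rust
fn main() {
    let xs = [10, 20, 30];

    // Out-of-bounds read: direct indexing (`xs[5]`) would panic at
    // runtime; `get` makes the failure an explicit None instead.
    assert_eq!(xs.get(5), None);

    // 'Null pointer dereference': Option<T> forces the empty case to
    // be handled before the value can be used.
    let maybe: Option<i32> = None;
    assert_eq!(maybe.unwrap_or(0), 0);

    // Integer overflow: panics in debug builds by default; in release
    // builds, enable overflow-checks or use the checked_* methods.
    assert_eq!(i32::MAX.checked_add(1), None);
}
```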
There are two clear groups here:
- Spatial memory safety. Mostly runtime mitigations. Nearly identical in both zig and rust. These are easy to implement and probably sufficiently non-controversial that any new systems language will have similar features.
- Temporal memory safety and data race safety. Mostly compile time mitigations. Unique to rust. These are novel, non-trivial to implement and add a significant amount of complexity to the language.
So we can say that zig's spatial memory safety is roughly comparable to rust's, and its temporal memory safety and data race safety are roughly comparable to c's.
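As a sketch of what 'compile time' means for the temporal rows: in rust, ownership moves mean a value is freed exactly once, and any later use of the moved-from variable is rejected by the compiler rather than becoming a runtime bug.

```rust
// Taking `v` by value moves ownership in; the heap buffer is freed
// exactly once, when `v` goes out of scope at the end of this call.
fn sum_owned(v: Vec<i32>) -> i32 {
    v.iter().sum()
}

fn main() {
    let v = vec![1, 2, 3];
    let total = sum_owned(v); // ownership of the buffer moves in here
    // Using `v` again is a compile-time error, which is what rules out
    // use-after-free and double-free:
    // println!("{:?}", v); // error[E0382]: borrow of moved value: `v`
    assert_eq!(total, 6);
}
```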
Zig also has some non-systematic improvements over c with regards to temporal memory safety:
- The standard library includes a set of allocators which don't reuse allocations, preventing use-after-free, and which catch double-free. It's not yet clear how high the runtime and memory overhead will be. Similar allocators do exist for c and are not widely used, which makes me somewhat pessimistic, but I'd be happy to be proved wrong.
- In c uninitialized variables are often used when they can't easily be initialized by a single expression. In zig it's possible to use a labeled block that returns the initial value, or to use an optional type and initialize it to null. Creating an uninitialized variable also requires the `undefined` keyword, which helps flag such cases for review.
- The pervasive allocator api makes it easier to use arena allocation or garbage-collected pools to simplify lifetime management.
- Using `defer` and `errdefer` simplifies resource cleanup inside complicated control flow, reducing the possibility of mistakes.
- Support for generics reduces the chances of casting mistakes.
Zig also has a number of tools to help detect violations of temporal memory safety during testing. These are very helpful for development, but experience with c indicates that they won't be sufficient to eliminate vulnerabilities.
I tried looking at some public breakdowns of security issues from various projects written in c and c++ (mostly sourced from Alex Gaynor's handy summary) to get a sense of the relative frequencies of different kinds of errors:
- Android: ~75% spatial vs ~15% temporal (just eyeballing the pie-chart)
- Windows: Some of the categories don't map neatly to spatial vs temporal. If we assume that 'stack corruption' is always temporal but 'heap corruption' could go either way then we have 23-36% spatial vs 28-41% temporal for 2018. If we narrow down to exploited issues then it's 0% spatial vs 75% temporal.
- Curl: 45% spatial vs 7% temporal (the pie-chart breakdown is only for the 52% of security issues related to memory safety)
- 0day in the wild: Insufficient detail on most, but those explicitly marked as 'use after free' alone account for 5/25 in 2020, 5/21 in 2019, 6/13 in 2018 etc, so the temporal share is at least ~20-46% and, given how many entries are under-described, possibly >50%?
This isn't a very clear picture. The percentages vary wildly between projects. The categories are sufficiently vague that I could be classifying them all wrong. Looking only at fixed issues tells us nothing about how easy they are to exploit, but looking at existing exploits limits us to a very small dataset.
It certainly seems like just fixing spatial memory safety (going from c to zig) is a non-trivial improvement. But I'd like to better understand why actual exploits appear here to rely more often on violating temporal memory safety.
When does this matter?
Rust takes on additional complexity and friction to buy temporal memory safety and data race safety. But sometimes we might be able to buy those more cheaply, eg:
- Systems that can approach temporal memory safety by:
- Never calling free (practical for many embedded programs, some command-line utilities, compilers etc)
- Having very simple ownership and lifetime models (eg many games)
- Making use of some specialized system for managing long-lived state (eg fossil's use of sqlite, edge's use of memgc)
- Making use of fine-grained sandboxing (eg rlbox)
- Systems that can approach data race safety by:
- Being single-threaded
- Using an architecture with strictly controlled sharing (eg glommio, differential dataflow)
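The 'strictly controlled sharing' option can be as simple as confining all mutable state to one thread and communicating over channels. A rust sketch of the shape (the same architecture works in c or zig with a hand-rolled queue):

```rust
use std::sync::mpsc;
use std::thread;

// All mutable state lives on the worker thread; other threads only
// ever send it immutable messages, so data races on that state are
// impossible by construction rather than by compiler proof.
fn sum_via_channel(values: &[i32]) -> i32 {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || rx.iter().sum::<i32>());
    for &n in values {
        tx.send(n).unwrap();
    }
    drop(tx); // closing the channel lets the worker finish
    worker.join().unwrap()
}

fn main() {
    assert_eq!(sum_via_channel(&[1, 2, 3, 4]), 10);
}
```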
Sometimes we might also just choose to bear the cost. For systems with low risk profiles (eg internal software that is never exposed to hostile input) we might decide that debugging the occasional use-after-free is preferable to adding development friction.
There are certainly systems though where none of the above are options. For example, the web spec pretty much mandates that browsers must have complicated ownership models, use pervasive sharing between threads and be constantly exposed to hostile inputs. In such cases it's hard to make an argument for zig, unless alongside some additional system of protection like memgc or rlbox.