What is a database?

Published 2022-04-27

Someone asked me this the other day and I didn't have a glib answer.

There's no value to trying to litigate whether some particular system is or isn't a database - categories aren't real. For any definition of database I could come up with there is some compelling counterexample.

But there is still a clear difference between the ways we tend to work with data in a database and the ways we tend to work with data in a programming language. It's worth thinking about what root causes could lead to such different design choices.

Databases are usually used for storing valuable, long-lived data. This data needs to survive crashes and power outages, which typically means either storing it on disk or replicating it across multiple machines. In either case, we can't use pointers to represent relationships because the value of a pointer depends on the virtual memory mapping in the current process. Some databases use some swizzling mechanism to remap pointers when reading, but it's much more common to use a data model that represents relationships in a representation-independent way (eg foreign keys, graph edges) rather than the pointer-centric models that are common in programming languages.

Resident data is often much larger in databases than in programs (partly as a consequence of being longer-lived). When data is small, it's reasonable to handle hashtable collisions by rehashing into a larger hashtable. It's not reasonable for a single write to your database to require reindexing the entire dataset.

When data is much larger than available cache then there is also a steep penalty for random access. So even when databases use hashtables in memory while executing a query, they'll sometimes partition the hashtable into cache-sized chunks and process each one individually (eg).

The cost model on disk is very different to in memory. The size of a cache line on most machines is 64 bytes, but most spinning disks don't support writes smaller than 512 bytes and the OS filesystem cache typically works in 4kb pages. If you store data on disk in a hashtable and write several 64 byte keys these will, by design, be written to random locations which means you'll pay 4kb of bandwidth for each write. So databases that expect heavy write load usually use data-structures that allow coalescing multiple logical writes into a single 4kb block.

Saturating the bandwidth on a modern NVMe disk requires having multiple reads or writes in flight at any one time. It's hard to do this in hand-coded queries - much easier to have some compiler that can figure out the data dependencies in your query and schedule independent subtasks concurrently. But it's difficult for a compiler to do that in an imperative programming language with mutable variables and loops because it has to prove that it's safe to eg run a loop in a different order. Much easier with a declarative disorderly language like sql or graphql.

Being long-lived means that we maybe don't know in advance all the kinds of queries that we're going to run. It's rare to hot-reload new code into a program in production but it's very common to start running a new query in a production database. So many databases have mechanisms for reorganizing storage on the fly and for re-planning old queries to work with the new storage model. This means that the queries themselves can't reference the storage directly (eg looping over an index) - they need to refer to some more abstract data model. Hence the separation in sql between tables and indexes.

We tend to need to access valuable long-lived data from multiple other programs. If we just had a bunch of programs poking around in some shared memory space we'd get crashes whenever one program temporarily broke some data-structure invariant while another was reading from it. So most databases have some notion of concurrency control to allow safe simultaneous access. This can range all the way from ACID transactions down to single key read/writes or filesystem locks, but it is almost always mediated in some way rather than allowing direct access to the underlying representation. Programming languages tend to provide direct access and leave coordination up to the programmer. But the data in databases is too valuable to trust every accessing program to get it right, so databases prefer techniques that maintain integrity by default. It's pretty easy to write bad code that will leave a programs memory in an invalid state and lead to a crash, but it's usually impossible to write a bad database query that causes other queries or the database itself to crash.

Thinking about design in terms of root causes is useful because you can figure out which parts of a solution are inherent to the problem and which are accidental. This helps understand the tradeoffs between existing systems and also provides a guide for designing new systems for new problem spaces.

Eg suppose you are designing a database that runs in the browser and isn't persisted to disk. Then you don't need concurrency control so you don't need to worry about multi-version data-structures. You aren't likely to have large datasets so you can probably get away with using hashtables as indexes. The set of queries isn't going to change within a single browser session so you don't need to handle live changes to the storage model. But on the other hand you have to deal with the lack of shared memory, which means either limiting yourself to small queries to avoid blocking the main thread or making all access to the database asynchronous.

We can also speculate about how new hardware might affect the design of future data systems. The data-structures and models used in programming languages were designed around the characteristics of RAM, and the data-structures and models in databases around the characteristics of spinning disks (with some recent tweaks for SSDs). So if NVRAM becomes more widely available then we might find that we need to something totally different to adapt (eg twizzler).