Unexplanations: relational algebra is math

Published 2024-03-11

I see this claim appear in various forms: relational algebra is math, is based on math, comes from math, has strong mathematical foundations.

It's a statement that I struggle to assign any precise meaning to, but typically the intent seems to be something like:

I encourage mentally replacing the phrase 'based on math' with 'based on a true story'. This usually provides the correct intuition.

In mathematics, relations are sets of tuples of values. They are usually binary (only two columns) and often contain infinitely many tuples.

They're typically manipulated using set-builder notation, or treated as binary operators or functions. Some examples:

\[ <_{AC} \;= \{\;(a,c) \mid (a,b) \in \;<_{AB}, (b,c) \in \;<_{BC} \}\\ \;\\ a <_{AC} c \;\Leftrightarrow\; a <_{AB} b \;\land\; b <_{BC} c\\ \;\\ <_{AC} \;=\; <_{AB} \circ <_{BC} \]

There isn't much to be said about the properties of relations in general. They're mostly used as a building block for talking about equality, ordering, functions etc. The 'theory of relations' in mathematics exists to about the same extent that the 'theory of structs' exists in C - they're just a common way to structure data.

Relational algebras are a niche branch of formal logic, popularized by Tarski, that use an algebra over binary relations as a point-free alternative to (a subset of?) first-order logic. They aren't widely used. For example, I'd be surprised to see it mentioned in an undergrad-level course.

In databases, relational algebra is a somewhat fuzzy term.

In the relational algebra originally defined by Codd, relations are finite sets of records with named fields. Relations are manipulated using a small set of functions. Some of those functions (eg union, intersect, product) are based on similiar functions in set theory, with some tweaks to prevent duplicate field names and avoid nesting tuples. Others (eg rename, select) are unique to databases (at least in terms of notation and vocabulary - of course the operations can be defined mathematically).

\[ AC = \Pi_{a,c} ( \sigma_{b=b2} \; (AB \times_{} \rho_{b_2/b}(BC)))\\ \]

Codd's relational algebra bears some superficial resemblance to Tarski's relational algebras. Both are ways of expressing first-order logic using functions over sets. But the point of Tarksi's relational algebra is to avoid free boolean variables, whereas Codd's relational algebra uses arbitrary scalar formulae in the select function. And Codd's relational algebra is concerned with computability rather than axiomatization eg providing a difference function instead of a negation function to prevent the creation of infinite relations.

All of which is to say that Codd's relational algebra is not something that is actually used by mathematicians, nor does it benefit from any deep connection with existing bodies of mathematical theory. It's a novel set of operations created to satisfy the needs of 1970s database engineers.

It's main feature is that it's able to express a variety of queries while still being easy to implement efficiently. The relational calculus is often much nicer for the programmer, but it's not immediately obvious how to execute or transform it. The relational algebra, by contrast, has a fixed set of operations over large collections, which lends itself to rewrite-based optimization and to amortizing the overhead of an interpreter. Codd's theorem shows that we can compile the relational calculus to the relational algebra to get the best of both worlds.

So think of Codd's relational algebra not as some deep mathematical truth, but as a particularly nice compiler IR for queries over relations.

Anyway, Codd's relational algebra isn't what modern relational databases are actually based on. We have to add:

Most databases also have:

None of this could be expressed in the original relational algebra, and once you add it all it's not obvious we should even still be calling this an algebra, let alone granting it any mathematical mystique. I'd settle for calling it the 'sql calculus'. Or 'the algebra formerly known as relational'.

It is still a reasonably good compiler IR though, and that's still the most useful way of thinking about it.

Maybe we meant that the relational model is math, rather than the relational algebra?

But the relational model isn't particularly mathematical either. Multisets over records with named fields are not a common sight in math papers. Foreign keys are a compromise for compiler legibility - a mathematician would just let column types range over arbitrary sets instead. The various normal forms are useful design heuristics but they aren't very rigorously defined (eg is a string a composite value?).

As an aside: I often see claims that relational databases owe their performance to the power of the relational model and how easy it is to optimize. This is funny (in a traumatized kind of way) - writing a usable query optimizer is an immense amount of work. When the relational model was first proposed, the main argument against it was that it would never produce acceptable performance compared to eg hierarchical databases. It took at least a decade to prove that wrong, and even today engineers still resort to denormalization when struggling with performance. The relational model is very much succesful in spite of it's performance challenges.

So if we can't say that relational databases are 'based on maths', can we at least say that they have a mathematical description? Sure, but the trouble is that pretty much anything can be described in mathematical terms.

For example, there is a whole family of xml query algebras, many of which do a much cleaner job of traversing the scalar/collection divide than the hodgepodge of subqueries and UNNESTs used in relational databases today. So if relational databases are math then xml databases are even more math.

Unlike xml query algebras, the algebra-formerly-known-as-relational isn't defined anywhere at all. Every relational database uses it's own internal planning language with their own menagerie of operations and subtly different semantics. Since those planning languages are (by design) not user-facing they're also free to change from version to version.

Even the sql spec doesn't use relational algebra. Or any formal model at all. I'm hard pressed to even call it a spec - my copy of chapter 2 of the sql 2016 spec has 994 occurences of the phrase "implementation-defined" and 361 occurences of "implementation-dependent" and those are just the places where they noticed that the spec wasn't complete.

By way of example, here is the tail end of the spec for GROUP BY:

What exactly is a grouped table?

Not a particularly rigorous definition, but it's also demonstrating that to give a composable semantics to sql we actually end up needing data-structures other than first-order relations - relational algebra is not a good fit.

The closest I've seen to a mathematically rigorous model for sql is this 2017 paper which only covers a tiny subset of the language, excluding eg GROUP BY. (Incidentally, I'm 70% certain that they simply forgot to handle correlated variables in their proof of prop 2, making the scope of the paper even smaller).

To summarize:

None of this is itself a criticism of relational databases! The relational model has many useful properties for engineers:

There is no need to make vague appeals to a totally different field in order to justify liking it!

Find more unexplanations here.