Here are two pages from the internet movie database:
There are two things to note about these pages.
-
The data on the page is presented in a hierarchichal structure. The movie page contains a director, a list of genres, a list of actors, and each actor in the list contains a list of characters they played in the movie. You can't sensibly fit all of this into a single flat structure like a relation.
-
The order of the hierarchy isn't the same on both pages. On one page we have movie->actors and on the other page we have actor->movies. So you can't just directly store the hierarchy in your database - you need to be able to traverse relationships in both directions.
So we store all our data in a relational database in flat tables and then whenever we need to render some UI we transform the flat data into whatever hierarchy we need.
Doing this transformation by hand is tedious and error-prone. We call this tedium "the object-relational mismatch" but it isn't really about objects or relations. The fundamental problem is that fitting complex relationships to human vision usually requires constructing some visual hierarchy, but different tasks require different hierarchies.
Whatever database and programming language you use, you will have to deal with this. But it's particularly painful in sql because sql wasn't designed to produce hierarchical data.
sql wasn't built to yield structure
Let's grab the imdb public dataset and try to reproduce the source data for that movie page (or at least a subset of it, because I didn't bother importing all the tables). We want to see an output that looks like this:
{
"title": "Baby Driver",
"director": ["Edgar Wright"],
"writer": ["Edgar Wright"]
"genres": ["Action", "Crime", "Drama"],
"actors": [
{"name": "Ansel Elgort", "characters": ["Baby"]},
{"name": "Jon Bernthal", "characters": ["Griff"]},
{"name": "Jon Hamm", "characters": ["Buddy"]},
{"name": "Eiza González", "characters": ["Darling"]},
{"name": "Micah Howard", "characters": ["Barista"]},
{"name": "Lily James", "characters": ["Debora"]},
{"name": "Morgan Brown", "characters": ["Street Preacher"]},
{"name": "Kevin Spacey", "characters": ["Doc"]},
{"name": "Morse Diggs", "characters": ["Morse Diggs"]},
{"name": "CJ Jones", "characters": ["Joseph"]}
],
}
Let's grab the title first:
postgres=# select primaryTitle from title where tconst = 'tt3890160';
primarytitle
--------------
Baby Driver
And now we need the director:
postgres=# select primaryTitle, person.primaryName
from title, principal, person
where title.tconst = 'tt3890160'
and title.tconst = principal.tconst
and principal.nconst = person.nconst
and principal.category = 'director';
primarytitle | primaryname
--------------+--------------
Baby Driver | Edgar Wright
And the writer:
postgres=# select
primaryTitle,
director.primaryName as director,
writer.primaryName as writer
from title,
principal as principal_director, person as director,
principal as principal_writer, person as writer
where title.tconst = 'tt3890160'
and title.tconst = principal_director.tconst
and principal_director.nconst = director.nconst
and principal_director.category = 'director'
and title.tconst = principal_writer.tconst
and principal_writer.nconst = writer.nconst
and principal_writer.category = 'writer';
primarytitle | director | writer
--------------+--------------+--------------
Baby Driver | Edgar Wright | Edgar Wright
We're already in trouble. If this movie had 2 directors and 2 writers, this query would return 4 rows:
primarytitle | director | writer
--------------+--------------+--------------
Baby Driver | Edgar Wright | Edgar Wright
Baby Driver | Edgar Wright | A. Writer
Baby Driver | A. Director | Edgar Wright
Baby Driver | A. Director | A. Writer
If there was no director in the database then this query would return 0 rows, no matter how many writers there were. Now we don't even know what the movie is called.
primarytitle | director | writer
--------------+--------------+--------------
We can't sensibly fit the data we want into a single relation, and we can't return more than one relation per query. So we have to issue multiple queries:
postgres=# select primaryTitle from title where tconst = 'tt3890160';
primarytitle
--------------
Baby Driver
postgres=# select person.primaryName
from title, principal, person
where title.tconst = 'tt3890160'
and title.tconst = principal.tconst
and principal.nconst = person.nconst
and principal.category = 'director';
primaryname
--------------
Edgar Wright
postgres=# select person.primaryName
from title, principal, person
where title.tconst = 'tt3890160'
and title.tconst = principal.tconst
and principal.nconst = person.nconst
and principal.category = 'writer';
primaryname
--------------
Edgar Wright
postgres=# select person.nconst, person.primaryName
from title, principal, person
where title.tconst = 'tt3890160'
and title.tconst = principal.tconst
and principal.nconst = person.nconst
and principal.category = 'actor'
limit 10;
nconst | primaryname
-----------+---------------
nm5052065 | Ansel Elgort
nm1256532 | Jon Bernthal
nm0358316 | Jon Hamm
nm2555462 | Eiza González
nm8328714 | Micah Howard
nm4141252 | Lily James
nm3231814 | Morgan Brown
nm0000228 | Kevin Spacey
nm1065096 | Morse Diggs
nm1471085 | CJ Jones
postgres=# select principal_character.nconst, principal_character.character
from title, principal, principal_character
where title.tconst = 'tt3890160'
and title.tconst = principal.tconst
and principal.nconst = person.nconst
and principal.category = 'actor'
and principal_character.tconst = principal.tconst
and principal_character.nconst = principal.nconst;
nconst | character
-----------+---------------------
nm5052065 | Baby
nm8328714 | Barista
nm0358316 | Buddy
nm2555462 | Darling
nm4141252 | Debora
nm0000228 | Doc
nm1256532 | Griff
nm1471085 | Joseph
nm1065096 | Morse Diggs
nm3231814 | Street Preacher
Through the magic of joins we have retrieved all the data we need and it only required holding a transaction open for 4 network roundtrips.
All that's left to do now is... the same joins, but inside the backend web server. Because we have to re-assemble these flat outputs into the structure of the page.
Also note that fully half of the data returned is the nconst
column which we didn't even want in the output. We only returned it because we need it as a key so we can repeat the joins that we already did in the database. The more paths you traverse, the more useless join keys you need to send to the backend web server.
All of this is pretty tedious so we invented ORMs to automate it. But:
- Almost all ORMs end up sending multiple queries for the output that we want. If you have a good ORM and you use it carefully it'll send one query per path in the output, like the raw sql above. If you're less careful you might get one query per actor in the film.
- Many ORMs also make a mess of consistency by lazily loading data in separate transactions. So we might generate a page where different parts of the data come from different points in time, which is confusing for users.
- Using an ORM locks you into only using one specific programming language. What if you need to query your data from a different language? You'll probably end up talking to the same ORM through a microservice.
old dogs can sort of learn new tricks
These days sql actually can produce structured data from queries.
A lot of people are mad about this. Whenever I talk about it they reflexively yell things like "structured data doesn't belong in the database" as if there was a universal system of morality that uniquely determined the locations of various data processing tasks.
But I can't help but note again that the structure has to happen somewhere because that's what the output page looks like and that doing it outside the database isn't working very well.
Whenever we're building a UI for humans, whether on the web or native, the main use of the query language is to turn relational data into structured data for the client to render. So it would be really nice if the query language was actually able to produce structured data.
Like this:
select jsonb_agg(result) from (
select
primaryTitle as title,
genres,
(
select jsonb_agg(actor) from (
select
(select primaryName from person where person.nconst = principal.nconst) as name,
(
select jsonb_agg(character)
from principal_character
where principal_character.tconst = principal.tconst
and principal_character.nconst = principal.nconst
) as characters
from principal
where principal.tconst = title.tconst
and category = 'actor'
order by ordering
limit 10
) as actor
) as actors,
(
select jsonb_agg(primaryName)
from principal, person
where principal.tconst = title.tconst
and person.nconst = principal.nconst
and category = 'director'
) as director,
(
select jsonb_agg(primaryName)
from principal, person
where principal.tconst = title.tconst
and person.nconst = principal.nconst
and category = 'writer'
) as writer
from title
where tconst = $1
) as result;
It's not perfect. You can definitely see the duct tape, and the query plan often suffers from the lack of decorrelation. But we can grab all the data needed for the entire page in a single query, with one network roundtrip. Whether you use these features directly or as the output of your ORM, this is a sizable improvement for one of the main usecases of relational databases!
It doesn't matter that this isn't the way things have always worked. Sql is not relational algebra and relational algebra is not math, and neither was carved into stone tablets handed down from Codd.
We make tools to serve our purposes, and our purposes have changed a hell of a lot since the 70s, when the main client of a database was a human typing sql character by character into an interactive transaction on a teletype connected to a mainframe with 500kb of RAM, almost 20 years before the invention of the world wide web.
Maybe it's ok for our tools to evolve to meet new demands. And we can evolve with them.