The Relational Model & SQL: Organizing the World's Data

#TL;DR

In 1970, an IBM mathematician named Edgar Codd published a paper that redefined how humans store and retrieve data. Before Codd, databases were mazes — you had to know the physical layout of data to find anything. Codd said: separate the what from the how. Organize data as mathematical relations (tables), query it with logic, and let the machine figure out the physical details. IBM understood it was brilliant and spent years not shipping it. Oracle read IBM’s own research papers and beat them to market. The language Codd’s colleagues invented to make his math accessible — SQL — became the most widely used data language in history and remains so fifty years later.

#The Maze Problem

Before 1970, databases were built around how data was stored, not what data meant.

The dominant model was hierarchical — IBM’s IMS (Information Management System), used to track parts for the Apollo program, organized records in trees. To find a supplier’s parts, you navigated from the root, through departments, through warehouses, down to the record you wanted. The path was hardcoded into your program. Change the storage structure, and every application that touched it broke.

Network databases (the CODASYL standard) were slightly more flexible — records could have multiple parents — but the programmer still had to traverse explicit pointers between records. You didn’t ask what you wanted. You described how to navigate to it.

This created a fundamental problem: data dependence. Your application knew too much about the physical layout of data. A reorganization of storage — moving a file, adding an index — could require rewriting thousands of programs. Databases were expensive to build and nearly impossible to change.

Edgar Codd had spent years thinking about this problem from the wrong side of the building at IBM’s San Jose Research Lab. He was a mathematician, not an engineer, and he thought the engineers were solving the wrong problem entirely.

#Codd’s Insight: Data Independence

Codd’s 1970 paper — “A Relational Model of Data for Large Shared Data Banks”, published in Communications of the ACM — made a deceptively simple argument:

Organize data as relations (tables of rows and columns). Query it using relational algebra — operations from mathematics that describe what you want, not how to retrieve it. Let a query optimizer figure out the physical access path.

This gave applications data independence: programs didn’t need to know whether data was stored on tape, indexed, partitioned, or reorganized. They just described what they wanted. The database engine’s job was to figure out how to deliver it.

A relation is just a table: a set of tuples (rows), each conforming to the same schema (columns). What made it powerful was the mathematical operations you could apply:

Orders table:             Customers table:
┌────┬─────────┬──────┐  ┌────┬──────────────┬──────────┐
│ id │ cust_id │  amt │  │ id │ name         │ city     │
├────┼─────────┼──────┤  ├────┼──────────────┼──────────┤
│  1 │     101 │  250 │  │101 │ Ada Lovelace │ London   │
│  2 │     102 │   80 │  │102 │ Alan Turing  │ London   │
│  3 │     101 │  430 │  └────┴──────────────┴──────────┘
└────┴─────────┴──────┘

JOIN on cust_id = id WHERE city = 'London':
→ All orders placed by London customers

The key insight: the join. By expressing relationships through shared values (foreign keys) rather than physical pointers, data could be combined in ways the original designers never anticipated. You didn’t have to predict every query when designing the schema — you just had to model the data correctly, and the relational algebra would handle the rest.

#Normalization: Designing Data Without Redundancy

Codd also formalized normalization — a set of rules for structuring tables to minimize redundancy and prevent inconsistencies.

The classic problem: store a customer’s city in every order row, and you have a thousand rows to update when they move. Miss one, and your data is now contradictory. Store the city once in a customers table, reference it by ID, and there’s only one place to update.

Codd defined normal forms — increasingly strict constraints on table design:

-- Bad: city duplicated in every order row
CREATE TABLE orders_bad (
    id       INT,
    cust_id  INT,
    cust_city TEXT,  -- ← duplicated, will go stale
    amount   DECIMAL
);

-- Good: city lives once in customers, joined when needed
CREATE TABLE customers (
    id   INT PRIMARY KEY,
    city TEXT
);
CREATE TABLE orders (
    id      INT PRIMARY KEY,
    cust_id INT REFERENCES customers(id),
    amount  DECIMAL
);

First, second, and third normal form (1NF, 2NF, 3NF) eliminate progressively subtler forms of redundancy. Most practical schemas aim for 3NF. Codd later defined Boyce-Codd Normal Form (BCNF) after a conversation with Don Chamberlin and Ray Boyce that revealed a gap in 3NF.

#SEQUEL: Making Math Readable

Codd’s relational algebra was mathematically elegant and practically unusable for anyone without a math degree. IBM assigned two researchers — Don Chamberlin and Ray Boyce — to turn it into something programmers could read.

Chamberlin attended one of Codd’s talks and came out convinced:

“I saw what Codd was doing, and I thought: I can write a language that anybody can use.”

They called it SEQUEL — Structured English Query Language. The name was later shortened to SQL due to a trademark conflict with an aircraft company.

The design goal was English readability. Instead of relational algebra notation, you’d write something that looked like a sentence:

-- Relational algebra (what Codd's math looked like):
π name, city (σ city='London' (Customers))

-- SQL (what Chamberlin and Boyce designed):
SELECT name, city
FROM customers
WHERE city = 'London';

-- The full power: joining tables, aggregating, filtering
SELECT
    c.name,
    COUNT(o.id)   AS order_count,
    SUM(o.amount) AS total_spent
FROM customers c
JOIN orders o ON o.cust_id = c.id
WHERE c.city = 'London'
GROUP BY c.name
HAVING SUM(o.amount) > 200
ORDER BY total_spent DESC;

SQL described what to retrieve. The database engine decided how — which indexes to use, which join algorithm, whether to scan the whole table or seek to a specific row. This separation of concerns was radical.

#IBM Sat On It. Oracle Didn’t.

IBM built System R (1974–1979), the first relational database implementation, as a research project. It proved the concept worked. IBM then spent years deciding whether to productize it, partly because IMS — their existing hierarchical database — was profitable and deployed everywhere.

Larry Ellison read IBM’s published System R papers. In 1977, he founded Software Development Laboratories — later renamed Oracle — with a simple plan: build what IBM described before IBM did.

Oracle shipped the first commercial relational database in 1979, a year before IBM’s own product. IBM followed with SQL/DS in 1981 and DB2 in 1983. By then, Oracle had a head start it never gave back.

The lesson was not lost on the industry: publishing research is an act of generosity to your competitors.

By 1983, relational databases had displaced hierarchical ones as the default. In 1986, SQL became an ANSI standard. Today, every major database — PostgreSQL, MySQL, SQLite, SQL Server, BigQuery — speaks a dialect of the language two IBM researchers designed to make a mathematician’s ideas accessible.

#ACID: The Promise Databases Make

As relational databases moved into production, engineers needed a guarantee: if the power goes out mid-transaction, what state is the data in?

ACID — formalized in 1983 by Theo Härder and Andreas Reuter — defines the four properties a database transaction must have:

Atomicity — a transaction either completes fully or not at all. No half-written orders.
Consistency — every transaction leaves the database in a valid state. Foreign keys hold. Constraints are respected.
Isolation — concurrent transactions don’t see each other’s intermediate states. Two users booking the last seat get the right answer.
Durability — once committed, a transaction survives crashes. The write is on disk, not just in memory.

BEGIN;
  UPDATE accounts SET balance = balance - 500 WHERE id = 1;
  UPDATE accounts SET balance = balance + 500 WHERE id = 2;
COMMIT;
-- Both updates happen, or neither does. Never just one.

ACID is what separates a database from a file. It’s why banks trust relational databases with money, hospitals trust them with records, and airlines trust them with seat inventory. The guarantees are hard to implement — write-ahead logs, multi-version concurrency control, two-phase locking — but the programmer doesn’t see any of that. They just write SQL and trust the promise.

#What the Relational Model Got Right

Codd’s 1970 paper is one of the most cited in computer science, and the ideas have held up better than almost anything else from that era:

Data independence — swap the storage engine, add an index, partition a table — applications keep working
Declarative queries — describe the result, not the algorithm; let the optimizer find the best path
The join — relationships expressed through values, not pointers, made data combinable in ways nobody predicted
Open standards — SQL’s ANSI standardization meant knowledge transferred between systems

The relational model isn’t perfect. It struggles with hierarchical data (XML, JSON), graph relationships, and time-series at scale. NoSQL emerged in the 2000s partly to escape its constraints. But NoSQL databases consistently added SQL interfaces back — because SQL, it turned out, wasn’t the constraint. It was the solution.

The internet now runs on TCP/IP for communication and SQL for data. Both were designed in the early 1970s. Both are still running.