The Query Optimizer: How SQL Actually Runs

#TL;DR

The relational model’s most underappreciated property is that SQL describes what you want, not how to get it. Between SELECT and the result, there’s a piece of software — the query optimizer — that parses your query, rewrites it, estimates the cost of dozens or thousands of possible execution strategies, and picks one. The modern cost-based optimizer traces back to IBM’s System R project in the late 1970s, specifically to Pat Selinger’s 1979 paper. Forty-seven years later, query optimization is still the single largest source of “why is this query slow” moments in production databases — and understanding how optimizers think is the main skill gap between people who write SQL and people who debug SQL.

#The Job

Given the query:

SELECT c.name, SUM(o.amount)
FROM customers c
JOIN orders o ON o.cust_id = c.id
WHERE c.city = 'London'
GROUP BY c.name
HAVING SUM(o.amount) > 1000;

The database has to decide:

Which table to read first. Read customers, filter to London, then look up orders? Or read orders, then join to customers?
How to execute the JOIN. Nested loop? Hash join? Sort-merge?
Which indexes to use. Is there an index on customers.city? On orders.cust_id?
How to sort or hash for the GROUP BY. Already sorted by c.name? Build a hash table in memory?
When to apply HAVING and filters. Filter as early as possible (predicate pushdown)?

Each of these decisions is independent, and the combinations multiply fast. A 10-table join has 10! = 3.6 million possible orderings, each with multiple algorithms, each with index choices. The optimizer can’t try all of them. It has to be clever about which plans to consider.

#The Four-Stage Pipeline

Every SQL query goes through roughly four stages between being sent and being executed.

┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│  Parse   │──►│ Rewrite  │──►│   Plan   │──►│ Execute  │
└──────────┘   └──────────┘   └──────────┘   └──────────┘

Parse — turn the SQL string into a parse tree. Syntax errors live here. Output: an abstract syntax tree, roughly faithful to what the user typed.

Rewrite (logical planning) — apply transformations that don’t change the semantics but simplify the tree. Examples: view expansion (replace references to a view with its definition), constant folding (WHERE 1 = 1 removed), subquery flattening, predicate pushdown (move filters closer to the tables), outer join simplification.

Plan (physical planning, aka optimization) — decide the actual execution strategy. This is where join ordering, join algorithms, index selection, and aggregation strategies are chosen. The output is an execution plan — a tree of physical operators.

Execute — run the plan. Read pages, apply filters, perform joins, stream results. By the time execution starts, all the interesting decisions have already been made.

#Logical vs. Physical Plans

It’s worth seeing what a plan actually looks like. Postgres’s EXPLAIN output for the query above might produce:

 GroupAggregate  (cost=0.71..24.86 rows=10 width=40)
   Group Key: c.name
   Filter: (sum(o.amount) > 1000)
   ->  Nested Loop  (cost=0.71..24.76 rows=25 width=14)
         ->  Index Scan using customers_city_idx on customers c
               (cost=0.29..8.30 rows=5 width=10)
               Index Cond: (city = 'London')
         ->  Index Scan using orders_cust_id_idx on orders o
               (cost=0.43..3.27 rows=5 width=8)
               Index Cond: (cust_id = c.id)

Reading bottom-up:

Use the customers_city_idx to find rows where city = ‘London’. Estimated 5 rows.
For each, use orders_cust_id_idx to find matching orders. Estimated 5 per customer.
Feed the joined pairs into a GroupAggregate, which sums amounts per name.
Apply the HAVING filter.

The optimizer chose a nested loop join because it expected the outer side (London customers) to be small. With more customers, it might have picked a hash join instead.

The numbers cost=0.71..24.86 are the optimizer’s cost estimate in internal units. The rows=10 is its guess at how many rows this step will produce. Both are estimates, and they’re the crucial inputs to plan selection.

#Cost-Based Optimization

Modern optimizers are cost-based. They estimate how expensive each candidate plan would be, then pick the cheapest. The alternative — rule-based optimization — applied fixed rules regardless of data characteristics, and it largely died in the 1990s because it produced terrible plans for non-trivial workloads.

Cost estimation depends on two things:

Statistics about the data. Every modern database maintains per-column statistics: total row count, distinct value count, value distribution (often as a histogram), null fraction, average row width. Postgres gathers these via ANALYZE. Other databases have similar mechanisms.

A cost model that converts plan operations into cost units. A sequential scan of N rows has cost proportional to the number of pages read. A nested loop of N × M rows has cost proportional to N × M. An index lookup has cost roughly proportional to log M. Each of these is tuned with constants like seq_page_cost (the cost of reading a sequential page) and cpu_tuple_cost (the cost of processing one tuple in memory).

Combining statistics and the cost model gives an estimated cost for any candidate plan. The optimizer explores the plan space — informed by a set of heuristics about which plans are even worth considering — and picks the cheapest.

#Cardinality Estimation: The Thing That Goes Wrong

The key input to cost estimation is cardinality — how many rows a given operation will produce.

For a simple predicate like WHERE city = 'London', the optimizer looks at its histogram of the city column. If ‘London’ makes up 5% of the histogram’s samples, and the table has 10,000 rows, the estimate is 500 rows.

For a JOIN, the optimizer estimates how many rows each side will produce, then how many will match. For two tables with 1,000 and 10,000 rows joining on a foreign key, the optimizer might estimate the join produces 10,000 rows (if the foreign key is fully populated).

Cardinality estimation is the thing that goes wrong most often in real optimizers. Its failures produce the classic “good for small data, catastrophic for large data” plans:

Correlated columns. The optimizer assumes columns are independent. If city = 'London' and country = 'UK' are strongly correlated, the estimate of “rows matching both” is usually too low.
Skewed data. A histogram with 100 buckets can’t capture the fact that 60% of rows have the value 'pending' for a status column, especially if the bucket boundaries don’t align.
Predicate stacking. Multiple filters are often assumed to combine multiplicatively. If each filter keeps 10% of rows, four filters are estimated to keep 0.01% — often way too aggressive.
Stale statistics. If ANALYZE hasn’t been run recently, the distribution the optimizer is using doesn’t match the current data.

When cardinality estimation is wrong, the optimizer picks a plan optimized for the wrong data size. The plan is then expected to process 100 rows but actually processes 10 million, with a nested loop that was great for 100 and catastrophic for 10 million. This is the single most common performance problem in production databases.

EXPLAIN ANALYZE exists specifically to diagnose this — it shows the estimated vs. actual row counts at each step.

->  Nested Loop  (cost=... rows=100 width=...)
              (actual rows=10432892 loops=1)

A 5-order-of-magnitude miss between estimate and actual means the plan is wrong and the reason is cardinality estimation. The fix is usually better statistics, a query rewrite to help the optimizer, or an explicit hint.

#Selecting Indexes

The optimizer considers every index on a relevant column and estimates the cost of each potential access path:

Sequential scan. Read every row of the table. Cost proportional to table size.
Index scan. Traverse an index (B-tree or hash), read matching rows. Cost proportional to result size, not table size — but with additional random I/O per row.
Index-only scan. If all required columns are in the index, skip the table entirely. Much faster.
Bitmap index scan. Build a bitmap of matching row IDs, then read those rows in physical order. Good for medium-sized result sets.

The choice depends on how many rows the filter is expected to return. For 1% of a large table, an index scan is much faster. For 50% of a table, a sequential scan is actually better — the random I/O of the index scan outweighs the savings.

This is why adding an index doesn’t always make a query faster. If the optimizer already chose a sequential scan because the filter is non-selective, adding an index is ignored. If the optimizer picks the new index but the filter is less selective than it looks (cardinality estimation is wrong), performance can get worse.

#Join Ordering

For a query joining N tables, the optimizer has to pick an order. In theory, all N! orderings are candidates. In practice, the optimizer uses dynamic programming to consider only left-deep or bushy plans up to a threshold (often 8-12 tables), and switches to heuristics for larger joins.

Dynamic programming (Selinger-style). Build up larger joins from smaller ones. For 2 tables, try both orderings and all algorithms, pick the best. For 3 tables, take each 2-table sub-plan and add the third table in all possible ways. For N tables, repeat. This produces roughly O(2^N) candidate plans, manageable up to 8-12 tables.

Greedy / heuristic for large joins. Beyond the DP threshold, optimizers fall back to heuristics: “join the smallest first,” “prefer joins with selective filters,” “avoid Cartesian products.” These produce plans quickly but aren’t guaranteed optimal. Postgres has geqo (genetic query optimization) for 12+ table joins, which runs a genetic algorithm on plan candidates.

The Selinger paper from 1979 (“Access Path Selection in a Relational Database Management System”) introduced the DP approach and is still roughly how modern optimizers work. Most of the modern improvements are in better cost estimates and smarter heuristics for large joins, not in the fundamental algorithm.

#Why the Same Query Can Be Fast and Slow

A question developers ask with some regret: “Why did this query take 50 ms yesterday and 5 minutes today?”

Several reasons:

Plan changes due to updated statistics. An ANALYZE ran, changing the data distribution the optimizer sees. The optimizer picked a different plan. The new plan is worse for current data.
Parameter sniffing. The database cached a plan optimized for a specific parameter value. A new invocation with a different parameter value uses the cached plan, which is wrong for the new value.
Data growth crossed a threshold. A sequential scan was fine on a 1,000-row table. At 100,000 rows, it’s slow. The optimizer might not have noticed because its statistics haven’t caught up.
Index bloat or statistics staleness. The optimizer estimates an index will be fast; the actual index has become bloated due to updates and performs much worse than estimated.
Other queries competing for resources. The same plan that was fast on an idle system is slow when the database is busy.
A new index was added or dropped. Access paths have changed.

Diagnosing “why is this slow” usually means running EXPLAIN ANALYZE and looking for two things: big gaps between estimated and actual row counts (cardinality problem), and unexpectedly expensive steps (wrong access path).

#Query Hints: Overruling the Optimizer

Some databases let you give the optimizer hints — directives that force specific access paths or join algorithms. Examples:

Oracle: SELECT /*+ USE_NL(o, c) */ ...
SQL Server: SELECT ... FROM o INNER LOOP JOIN c ...
MySQL: SELECT ... FROM t1 FORCE INDEX (idx1) ...
Postgres: does not natively support hints, though the pg_hint_plan extension adds them.

Query hints are a blunt instrument. They override the optimizer’s cost analysis with your judgment. They’re useful when:

You know the optimizer is consistently wrong for this specific query.
You’ve exhausted other fixes (better statistics, schema changes, query rewrites).
The workload is stable and the optimal plan won’t change.

They’re dangerous because they persist past changes that would otherwise fix themselves. A hint that forces an index scan today will still force that index scan after you drop the index, upgrade the database, or change the data distribution. Postgres’s choice not to support hints is ideological — the position being that if the optimizer is wrong, the optimizer should be fixed, not bypassed.

#Why Optimizers Are Hard

After fifty years of research, query optimization is still an active area. The fundamental difficulties:

The plan space is enormous. For a 20-table join, there are more possible plans than atoms in the observable universe. You can only explore a tiny fraction.
Cardinality estimation is unsolved. Estimators based on histograms, sampling, machine learning (there’s serious research on using ML-based estimators) all fail in predictable ways on realistic data.
Costs depend on caching. A cold-cache sequential scan is very different from a hot-cache one. Optimizers have imperfect models of the buffer pool’s contents.
Workload drift. A plan optimized for a workload from last month may be terrible for this month’s workload. Adaptive query processing (rethinking plans during execution) is an ongoing research area.
Interaction effects. Many queries running concurrently compete for resources in ways an optimizer considering one query at a time can’t model.

Each of these is a live research topic. The query optimizer remains one of the most difficult pieces of software in any database, and it accounts for a large share of the performance engineering effort on every mature database product.

#What the Optimizer Actually Unlocked

The Relational Model post says SQL describes what to retrieve, and the database engine decides how. That statement is only meaningful because the optimizer exists. Without it, “declarative query language” is just syntax. With it, SQL becomes a true abstraction over data access.

The payoff is that the same SQL keeps working as the database improves. Your query might run 100x faster after a database upgrade, because the optimizer got smarter — without you changing a line. New indexes can be added and the optimizer will use them. Data distributions can shift and the optimizer will adapt. The query is a stable contract; the execution is adaptive.

This is the quiet, load-bearing promise the relational model makes. Codd’s 1970 paper argued for data independence — the idea that programs shouldn’t know how data is stored. The optimizer is what delivers on that promise in practice. It’s the component that makes SQL a productive language to write against, and it’s the component that makes SQL a frustrating language to debug against. Both things are true at the same time, and they have been for almost fifty years.