Blueprint · 1,473 words · 6 min read

The URL: Addressing Everything

Berners-Lee's third invention gave every resource on earth a single, stable name — and turned the web from a collection of documents into a navigable space.

#TL;DR

Before the web, every network system had its own addressing scheme. FTP had host-and-path pairs. Gopher had multi-field selectors. Email had user@host. If you wanted to point someone to a resource, you had to explain which system to use and how to use it. Tim Berners-Lee unified all of this into a single string: the Uniform Resource Locator. A URL encodes what protocol to speak, what machine to talk to, and what to ask for — in a format concise enough to print on a napkin. URLs made the hyperlink possible, because a link is just a URL embedded in a document. They made search engines possible, because a search result is just a list of URLs. They became the universal addressing system not just for the web, but for APIs, mobile apps, operating systems, and even physical objects via QR codes. The URL is arguably the most widely used naming system humans have ever created.

#The Address Problem

In 1990, the internet had dozens of information systems, and none of them could point to each other.

FTP required you to know a hostname and a file path: “connect to ftp.mit.edu, go to /pub/papers/, download rfc1738.txt.” Gopher gave you a menu of items, each described by a type code, a selector string, a hostname, and a port. WAIS required a structured query against a named database. Usenet used newsgroup names. Email used user@host.

Each system existed in its own namespace. If you found a useful file on an FTP server and wanted to reference it from a Gopher menu, you couldn’t — not in any standard way. If you wanted a document to link to another document on a different system, there was no common syntax for “here’s where that thing lives.”

Berners-Lee saw this as the core problem. HTTP was the protocol. HTML was the document format. But without a universal addressing scheme, you couldn’t build hyperlinks across systems, and without hyperlinks, you didn’t have a web. You had a collection of isolated servers.

#One String to Name Anything

Berners-Lee’s solution was the Uniform Resource Locator — a single-line string that encoded everything a client needed to retrieve a resource:

scheme://authority/path?query#fragment

https://example.com:443/posts/era-2-url?ref=homepage#the-address-problem
└─┬─┘   └────┬────┘ └┬┘└──────┬───────┘ └─────┬────┘ └────────┬────────┘
scheme      host   port     path            query         fragment

Each part answers a specific question:

  • Scheme — what protocol do I speak? (http, https, ftp, mailto)
  • Authority — who do I talk to? (hostname, optional port)
  • Path — what do I ask for? (hierarchical, like a filesystem)
  • Query — with what parameters? (key-value pairs after ?)
  • Fragment — which part of the result? (client-side, never sent to the server)
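
As a quick check on the breakdown above, Python's standard-library urllib.parse module can split the example URL into the same parts. This is only an illustrative sketch: the variable names are arbitrary, and urlsplit follows the generic syntax rather than any particular browser's parsing rules.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://example.com:443/posts/era-2-url?ref=homepage#the-address-problem"
parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.hostname)  # example.com
print(parts.port)      # 443
print(parts.path)      # /posts/era-2-url
print(parts.query)     # ref=homepage
print(parts.fragment)  # the-address-problem

# parse_qs returns a dict of lists, because a key may appear more than once.
params = parse_qs(parts.query)
print(params)          # {'ref': ['homepage']}
```

The fragment shows up in the parsed result but would be dropped before any request is sent, since it is handled entirely on the client.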

The genius was generality. The URL didn’t just describe web pages. It could describe any resource accessible via any protocol:

http://info.cern.ch/hypertext/WWW/TheProject.html   — a web page
ftp://ftp.uu.net/pub/unix/bsd-sources.tar.gz        — a file on an FTP server
mailto:berners-lee@w3.org                            — an email address
telnet://bbs.example.com:23                          — a telnet session
news:comp.infosystems.www                            — a Usenet newsgroup

One syntax. Every system. A link in an HTML document could point to a web page, an FTP download, or an email address, and the browser would know what to do based on the scheme alone.
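
Dispatching on the scheme alone can be sketched in a few lines of Python. Everything here is hypothetical: the handler functions and the HANDLERS table are invented for illustration and only return a description of what a real client would do. It is not how any specific browser is implemented.

```python
from urllib.parse import SplitResult, urlsplit

# Hypothetical handlers: each describes an action instead of doing real I/O.
def open_web(parts: SplitResult) -> str:
    return f"HTTP GET {parts.path} from {parts.hostname}"

def open_ftp(parts: SplitResult) -> str:
    return f"FTP download {parts.path} from {parts.hostname}"

def open_mail(parts: SplitResult) -> str:
    return f"compose mail to {parts.path}"

def open_telnet(parts: SplitResult) -> str:
    return f"telnet to {parts.hostname} port {parts.port}"

def open_news(parts: SplitResult) -> str:
    return f"open newsgroup {parts.path}"

# The scheme is the only thing consulted to pick a handler.
HANDLERS = {
    "http": open_web,
    "https": open_web,
    "ftp": open_ftp,
    "mailto": open_mail,
    "telnet": open_telnet,
    "news": open_news,
}

def dispatch(url: str) -> str:
    parts = urlsplit(url)
    handler = HANDLERS.get(parts.scheme)
    if handler is None:
        raise ValueError(f"no handler registered for scheme {parts.scheme!r}")
    return handler(parts)

print(dispatch("http://info.cern.ch/hypertext/WWW/TheProject.html"))
print(dispatch("mailto:berners-lee@w3.org"))
print(dispatch("telnet://bbs.example.com:23"))
```

Supporting a new protocol in this sketch means adding one entry to the table; the URL syntax itself does not change.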

#Encoding: Making the Messy World Fit

URLs had to be transmissible over any medium — email, printed paper, spoken aloud, encoded in HTML attributes. That meant they needed to be a subset of ASCII, with no spaces, no ambiguous characters, and a clear escape mechanism for everything else.

The rule: any character outside the safe set gets percent-encoded. Each byte of the character (its UTF-8 bytes, for non-ASCII text) is replaced by % followed by two hexadecimal digits.

Space       →  %20  (or + in form-encoded query strings)
café        →  caf%C3%A9
日本語      →  %E6%97%A5%E6%9C%AC%E8%AA%9E
"quotes"    →  %22quotes%22

This was pragmatic and ugly. It meant URLs for non-English content were unreadable strings of percent codes. It meant every web framework needed a URL encoder and decoder. It meant developers would fight encoding bugs for decades — double-encoding, missing encoding, encoding the wrong component.

But it worked. A URL could be safely embedded in HTML, transmitted in an email header, printed on a billboard, or read aloud over the phone. The encoding made URLs universal carriers of identity, at the cost of human readability for edge cases.
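
The encodings in the table above, and two of the classic bugs, can be reproduced with Python's urllib.parse. This is a small sketch using the standard library's defaults (UTF-8, with / treated as safe). The sample strings other than those in the table are made up for illustration.

```python
from urllib.parse import quote, quote_plus, unquote

# The table entries, encoded byte by byte as UTF-8.
print(quote("café"))        # caf%C3%A9
print(quote("日本語"))       # %E6%97%A5%E6%9C%AC%E8%AA%9E
print(quote('"quotes"'))    # %22quotes%22

# A space is %20 in general, or + in form-style query strings.
print(quote("a b"))         # a%20b
print(quote_plus("a b"))    # a+b

# Double-encoding bug: encoding an already-encoded string escapes the % itself.
once = quote("a b")         # a%20b
twice = quote(once)         # a%2520b
print(twice)
print(unquote(twice))       # a%20b, still encoded after one decode

# Wrong-component bug: quote() leaves "/" alone by default, which suits a whole
# path but not a single segment that happens to contain a slash.
segment = "reports/2024 q1"
print(quote(segment))            # reports/2024%20q1
print(quote(segment, safe=""))   # reports%2F2024%20q1
```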

#The Anatomy of Resolution

When you type a URL into a browser, a cascade of systems activates to turn that string into a response:

https://blog.example.com/posts/hello?lang=en

1. Parse the URL
   scheme = https, host = blog.example.com, path = /posts/hello, query = lang=en

2. DNS lookup
   blog.example.com → 93.184.216.34

3. TCP connection
   Connect to 93.184.216.34:443

4. TLS handshake
   Verify certificate for blog.example.com, establish encrypted channel

5. HTTP request
   GET /posts/hello?lang=en HTTP/1.1
   Host: blog.example.com

6. Server response
   HTTP/1.1 200 OK
   Content-Type: text/html
   ...

Each step relies on a different era’s technology: DNS from the 1980s, TCP/IP from the 1970s, TLS from the 1990s, HTTP from 1991. The URL is the string that binds them all together — the starting point for a chain of lookups that spans the entire internet stack.
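
The six steps can be sketched with nothing but Python's standard library. This is a simplified illustration under stated assumptions, not a real HTTP client: build_request and fetch are hypothetical names, only http and https default ports are known, redirects and response parsing are ignored, and fetch needs network access, so it is defined here but not called.

```python
import socket
import ssl
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def build_request(url: str) -> tuple[str, int, bytes]:
    """Steps 1 and 5: parse the URL and build the HTTP/1.1 request bytes."""
    parts = urlsplit(url)
    host = parts.hostname
    # Simplification: other schemes raise KeyError; a non-default port
    # would also need to appear in the Host header.
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # The fragment is deliberately left out: it never reaches the server.
    request = (
        f"GET {target} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    return host, port, request

def fetch(url: str) -> bytes:
    """Steps 2 to 6 for an https URL. Requires network access."""
    host, port, request = build_request(url)
    # Steps 2 and 3: create_connection resolves the name and opens the TCP socket.
    with socket.create_connection((host, port), timeout=10) as tcp:
        # Step 4: TLS handshake, checking the certificate against the hostname.
        context = ssl.create_default_context()
        with context.wrap_socket(tcp, server_hostname=host) as tls:
            tls.sendall(request)              # Step 5: send the request
            chunks = []
            while chunk := tls.recv(4096):    # Step 6: read until the server closes
                chunks.append(chunk)
    return b"".join(chunks)

host, port, request = build_request("https://blog.example.com/posts/hello?lang=en")
print(host, port)
print(request.decode("ascii"))
```

Keeping the parsing and request-building step separate from the network step makes the pure part easy to inspect without touching DNS, TCP, or TLS.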

#URIs, URNs, and the Naming Debate

The naming around URLs got complicated fast, and the confusion persists today.

The term URL, standardized in RFC 1738 in 1994, describes a string that tells you where something is and how to get it. The standards also define a broader concept: a URI (Uniform Resource Identifier), which can name a resource without necessarily telling you how to retrieve it.

The distinction:

  • A URL is a locator: https://example.com/paper.pdf — tells you the protocol, the server, and the path
  • A URN is a name: urn:isbn:0-201-63361-2 — identifies a book, but doesn’t say where to find it
  • A URI is the umbrella: every URL is a URI, and every URN is a URI

                    URI
                   /   \
                 URL    URN
        (where + how)  (just a name)

In practice, almost every URI you encounter is a URL. URNs never gained wide adoption because a name without a retrieval mechanism isn’t useful on a network built for retrieval. The web runs on locators, not abstract identifiers. The spec says URI; the world says URL. Both are right.
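
Because both forms share the generic URI syntax, one parser handles both. The sketch below, using Python's urllib.parse on the two examples from the list above, shows the structural difference: the locator carries an authority to contact, while the URN has only a scheme and a path. Treat this as an observation, not a classifier, since some locator-like schemes such as mailto: also have no authority.

```python
from urllib.parse import urlsplit

url = urlsplit("https://example.com/paper.pdf")
urn = urlsplit("urn:isbn:0-201-63361-2")

# The URL names a host to talk to; the URN has an empty authority.
print(url.scheme, repr(url.netloc), url.path)  # https 'example.com' /paper.pdf
print(urn.scheme, repr(urn.netloc), urn.path)  # urn '' isbn:0-201-63361-2
```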

#Cool URIs Don’t Change

In 1998, Berners-Lee wrote an essay titled “Cool URIs Don’t Change” — an argument that URLs should be permanent. If a URL worked last year, it should work this year. If someone bookmarked it, cited it in a paper, or linked to it from another page, breaking it destroys value.

His rules were simple:

  • Don’t put technology in the URL (no .php, .asp, .cgi) — technologies change, URLs shouldn’t
  • Don’t put organizational structure in the URL — departments reorganize, URLs shouldn’t break
  • Design URLs as if they’ll last forever, because some of them will

Bad:   /cgi-bin/articles/display.php?id=42&format=html
Good:  /articles/42
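
One common way to move from the "bad" form to the "good" one without breaking old links is to keep answering the legacy URL with a permanent redirect. The sketch below covers only the mapping step, in Python. The function name clean_path and the rule that id must be a plain number are assumptions made for this example, not something the essay prescribes.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit

LEGACY_PATH = "/cgi-bin/articles/display.php"

def clean_path(legacy_url: str) -> Optional[str]:
    """Return the stable path for a legacy article URL, or None if it doesn't match."""
    parts = urlsplit(legacy_url)
    if parts.path != LEGACY_PATH:
        return None
    ids = parse_qs(parts.query).get("id", [])
    # Assumption: exactly one id, made of ASCII digits only.
    if len(ids) != 1 or not (ids[0].isascii() and ids[0].isdigit()):
        return None
    # Other parameters such as format=html are dropped on purpose.
    return f"/articles/{ids[0]}"

print(clean_path("/cgi-bin/articles/display.php?id=42&format=html"))  # /articles/42
print(clean_path("/cgi-bin/articles/display.php?id=abc"))             # None
```

A server using this would answer a matching request with a 301 status and a Location header set to the returned path, so existing bookmarks and citations keep working.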

The web ignored this advice comprehensively. Link rot (URLs that stop working) is endemic. Studies of academic papers, court opinions, and news articles have repeatedly found that a large share of cited links, in some samples half or more, go dead within a decade. The Internet Archive’s Wayback Machine exists in large part because the web can’t keep its URLs alive.

And yet the principle shaped how the best-designed systems work. Wikipedia’s URLs have been stable for twenty years. REST APIs treat URLs as the primary identifier for resources. Frameworks like Rails and Django have URL routing systems that produce clean, stable paths. The battle isn’t won, but it’s not lost either.

#URLs as Platform

The URL outgrew the browser. It became a universal entry point for software:

Deep linking — mobile apps register URL schemes (twitter://status/123, slack://channel/general) so URLs can open specific content inside native apps. When you tap a link on your phone and it opens the right app to the right screen, that’s URL-based dispatch.

QR codes — in everyday use, a QR code is usually a URL encoded as a visual pattern. Restaurant menus, concert tickets, and payment terminals that use QR codes are typically using a URL as a physical-world hyperlink.

API addressing — REST APIs use URLs as resource identifiers: GET /users/42/posts names a specific collection on a specific server. The URL is both the address and the API contract.

Operating systems — macOS uses x-apple-data-detectors:// URLs for system features. Android intents resolve URLs to app handlers. Windows registers protocol handlers. The URL became the universal interface for “open this thing.”
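
The deep-link examples above parse with the same generic rules as web URLs, which leads to a detail that is easy to miss. The Python sketch below uses a hypothetical helper, parse_deep_link, to show that in a custom-scheme URL written with //, the first segment lands in the authority slot, not in the path. How a real app interprets those pieces is up to the app, so this is only an illustration.

```python
from urllib.parse import urlsplit

def parse_deep_link(url: str) -> tuple[str, str, list[str]]:
    """Split a deep link into (scheme, authority, path segments)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return parts.scheme, parts.netloc, segments

# "status" and "channel" sit where a hostname would be in an https URL.
print(parse_deep_link("twitter://status/123"))     # ('twitter', 'status', ['123'])
print(parse_deep_link("slack://channel/general"))  # ('slack', 'channel', ['general'])
```

A dispatcher built on this would typically route on the scheme first, then treat the authority as the name of the screen or action to open.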

#What the URL Got Right

The URL is a design more than three decades old that handles use cases its creator never imagined:

  • Protocol independence — by putting the scheme first, URLs could absorb new protocols without changing the syntax. http: was joined by https:, ws: (WebSocket), data: (inline data), and blob: (binary objects). The format didn’t change. The namespace just grew.
  • Hierarchical paths — the slash-separated path borrowed from Unix filesystems, giving URLs a natural structure that maps to directories, categories, and API resources. This made URLs both human-readable and machine-parseable.
  • Decentralization — nobody owns the URL namespace. Any domain owner can create any path under their domain. There’s no central registry of pages, no approval process. This is why the web could grow from one server to billions without coordination.
  • Universality — the URL became the one identifier that works everywhere: in browsers, in emails, on paper, in QR codes, in APIs, in database records, in log files. No other naming system has achieved this breadth of adoption.

Berners-Lee needed three inventions to build the web: a protocol for transport, a language for documents, and a system for names. Of the three, the naming system may have been the most important. HTTP could be replaced (and partially has been, by HTTP/2 and HTTP/3). HTML could be replaced (and competes with JSON, Markdown, and native apps). But the URL has no successor. It’s the one piece of the original web that everything still depends on — the permanent address of the internet.