蓝图 · 1,504 字 · 6 分钟阅读

PageRank: The Algorithm That Organized the Web

Two Stanford grad students asked a simple question — what if a link is a vote? — and built the algorithm that made the web's infinite library searchable.

#TL;DR

By 1998, the web had millions of pages and no good way to find anything. Search engines like AltaVista indexed the web’s text, but they ranked results by keyword frequency — which meant the best result for any query was whichever page repeated the search term the most. The web was drowning in its own growth. Larry Page and Sergey Brin, two Stanford PhD students, had a different idea: instead of asking “what does this page say?”, ask “what does the rest of the web say about this page?” Their insight was that hyperlinks are implicit votes. A page that many other pages link to is probably important. A page linked to by important pages is even more important. This recursive definition became PageRank — a single algorithm that turned the unstructured web into a ranked, navigable space. It powered Google, reshaped the web’s economics, and established the principle that link structure encodes human judgment.

#The Search Problem

The early web had search engines. They weren’t good.

AltaVista (1995) could index the full text of millions of pages — a technical achievement. But when you searched for “cars,” it returned pages in order of keyword density. Pages that stuffed the word “cars” into invisible text a thousand times ranked higher than actual useful content. Webmasters figured this out immediately. The arms race between search engines and spammers began before Google existed.

Yahoo (1994) tried a different approach: human editors categorized the web by hand, building a directory like a library’s card catalog. This produced high-quality results but couldn’t scale. The web was growing faster than any team of editors could categorize it.

Lycos, Excite, Infoseek — they all used variations of term frequency and document matching. They all had the same fundamental problem: they analyzed each page in isolation. A page’s content can lie about its relevance. The question was whether there was a signal that couldn’t be faked as easily.

In 1996, Larry Page started a research project at Stanford called BackRub — a system to analyze the web’s link structure. His observation was borrowed from academic citation analysis: if a paper is cited by many other papers, it’s probably important. If it’s cited by important papers, it’s even more so.

The web’s hyperlinks worked the same way. When a page links to another page, the author is making a judgment: this page is worth pointing to. A link is an endorsement. The question was how to count those endorsements.

A naïve approach — count inbound links — fails immediately. A page with a million links from spam sites shouldn’t outrank a page with ten links from authoritative sources. The number of links matters, but so does who’s linking.

Page and Sergey Brin formalized this into a recursive definition: a page is important if important pages link to it. But that’s circular — you can’t know which pages are important until you know which pages are important.

The solution was elegant: start with the assumption that all pages are equally important, then iteratively refine. Each page distributes its importance equally among its outbound links. Repeat until the values stabilize. The math is a fixed-point computation on the web’s link graph.

#The Algorithm

PageRank models a random surfer — an imaginary user who clicks links at random. At any page, the surfer either follows a random outbound link (with probability d, the damping factor, typically 0.85) or jumps to a completely random page (with probability 1 - d). The PageRank of a page is the probability that the random surfer ends up there.

def pagerank(graph, damping=0.85, iterations=100):
    """
    graph: dict mapping page → list of pages it links to
    Returns: dict mapping page → PageRank score
    """
    pages = list(graph.keys())
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start equal

    # Invert: who links TO each page?
    inbound = {page: [] for page in pages}
    for page, links in graph.items():
        for target in links:
            if target in inbound:
                inbound[target].append(page)

    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Sum of (linking page's rank / its number of outbound links)
            link_score = sum(
                rank[source] / len(graph[source])
                for source in inbound[page]
                if len(graph[source]) > 0
            )
            new_rank[page] = (1 - damping) / n + damping * link_score
        rank = new_rank

    return rank

# A tiny web
web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
scores = pagerank(web)
# C gets the highest rank: linked by A, B, and D

The damping factor (d = 0.85) means the random surfer has a 15% chance of getting bored and jumping to a random page. This serves two purposes: it guarantees the algorithm converges (no infinite loops in link cycles), and it ensures every page gets a minimum baseline score.

Mathematically, PageRank is the principal eigenvector of the web’s link matrix — a fact that Page and Brin noted in their 1998 paper. The iterative computation is just the power method for finding eigenvectors, applied to a matrix with billions of rows.

#”The Anatomy of a Large-Scale Hypertextual Web Search Engine”

In 1998, Page and Brin published their landmark paper while still at Stanford. It described not just PageRank but the entire architecture of Google — how to crawl billions of pages, build an inverted index, and combine link analysis with text matching to produce ranked results.

The paper was remarkably specific. It included storage formats, data structures, crawl rates, and a frank discussion of the tension between search quality and advertising:

“We expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of consumers.”

They were describing their own company’s future business model — and warning against it.

Google launched as google.stanford.edu in 1998, moved to google.com, and incorporated in September of that year. The results were visibly better than anything else available. Users could feel the difference: search for a topic, and the most authoritative page appeared first, not the most keyword-stuffed. Word of mouth did the rest.

#The SEO Arms Race

PageRank worked because links were honest signals — people linked to pages they genuinely found useful. But the moment PageRank became public knowledge, links became currency.

Link farms — networks of fake sites that linked to each other to inflate PageRank — appeared immediately. Paid links turned PageRank into something you could buy. Guest posting, comment spam, and link exchanges all exploited the same principle: if links are votes, manufacture votes.

Google responded with an escalating series of countermeasures:

  • The nofollow attribute (2005) — told Google not to count a link as a vote. Applied to comments, ads, and user-generated content.
  • The Penguin update (2012) — algorithmically penalized sites with unnatural link profiles
  • Manual penalties — Google’s webspam team manually demoted sites caught buying links

The SEO industry grew into a multi-billion-dollar ecosystem, split between “white hat” practitioners (make genuinely useful content that earns links) and “black hat” practitioners (manufacture signals to game the algorithm). This arms race never ended. It just shifted — from link manipulation to content farms to AI-generated spam. Each generation of manipulation triggers a new generation of algorithmic defenses.

PageRank’s insight — that link structure encodes importance — proved applicable far beyond web search:

Social networks use variations of PageRank to measure influence. Twitter’s (now X’s) recommendation algorithm, Facebook’s News Feed ranking, and LinkedIn’s professional graph all incorporate network-based authority scores.

Academic citation — Google Scholar uses a citation-graph algorithm directly descended from PageRank. A paper’s importance is weighted by the importance of the papers that cite it. This is the domain that inspired Page in the first place.

Fraud detection — financial networks use graph algorithms related to PageRank to identify suspicious transaction patterns. A node connected to many known-fraudulent nodes inherits suspicion.

Recommendation systems — “users who liked this also liked that” is, at its core, a graph-walking algorithm. The link isn’t a hyperlink — it’s a purchase, a rating, a click — but the math is the same.

#What PageRank Got Right

PageRank was published 28 years ago. Google has since added hundreds of ranking signals — machine learning, natural language understanding, user behavior data — and PageRank itself is a much smaller factor than it once was. But the core ideas endure:

  • The web’s structure is information — before PageRank, search engines treated each page as an independent document. Page and Brin showed that the relationships between pages encode human judgments about quality and relevance. This insight — that graph structure is a signal — became foundational to network science, social media, and recommendation systems.
  • Recursive authority — the idea that importance flows through a network, and that a vote from an important node counts more than a vote from an obscure one, solved the spam problem that plagued keyword-based search. It’s the reason Google’s results were dramatically better than AltaVista’s, despite indexing the same web.
  • Algorithms shape economies — PageRank didn’t just organize the web. It determined which pages got traffic, which businesses got customers, and which ideas got amplified. The algorithm created an economy (SEO, content marketing, digital advertising) and a new form of power: the ability to decide what people find when they search. That power has only grown.
  • Simplicity as foundation — the core algorithm is twenty lines of code. A matrix multiplication, iterated until convergence. The elegance of the math is what made it tractable at web scale and what made it extensible — Google could layer complexity on top of a foundation that was clean and well-understood.

Page and Brin started with a question about academic citations and ended up defining how billions of people navigate the world’s information. PageRank made the web usable at the exact moment it was becoming too large to navigate any other way. The web existed before Google. But the web as a useful tool for ordinary people — that started with PageRank.