- What outcome are we optimizing for? โ Query satisfaction: did the user find what they were looking for? Measured by: click-through rate on results, absence of "pogo-sticking" (clicking result โ immediately returning to search), and query abandonment rate. Secondary: time-to-answer (zero-click results like featured snippets resolve the query without a click). This shapes ranking: relevance IS the product, not a feature. Every architectural decision (index freshness, crawl frequency, serving latency) exists to improve result quality.
- Scope of web? โ Billions of pages. The crawler discovers and indexes the web continuously. But not all pages are equally important โ PageRank and freshness determine crawl priority.
- Query types? โ Navigational ("facebook login"), informational ("how to make pasta"), transactional ("buy running shoes"). Each has different ranking signals and result formats.
| In Scope | Out of Scope |
|---|---|
| Web crawling & page fetching | Ads / auction system |
| Inverted index construction | Image / video search |
| Query serving & ranking | Knowledge panels / featured snippets |
| Autocomplete / query suggestions | Personalization (search history) |
| Snippet generation | Safe search filtering |
- Query latency <500ms p99 โ users expect near-instant results.
- Index freshness โ news pages recrawled within minutes; average pages within days-weeks.
- Index completeness โ crawl billions of pages. Prioritize by importance (PageRank-like).
- High availability โ search must never be down. Replicated serving tier.
- The system is TWO separate planes: the OFFLINE plane (crawl + index, batch/streaming) and the ONLINE plane (query serving, real-time). They share the index as the interface.
| Requirement | Decision | Why (and what was rejected) | Consistency |
|---|---|---|---|
| Query latency <500ms at 100K QPS | Sharded inverted index (scatter-gather) | Index split across 10K+ servers. Query fans out to shards in parallel. Each shard returns top-K in <200ms. Merge in <100ms. | โ |
| Offline crawl/index decoupled from online serving | Separate offline and online planes | Crawling is batch/continuous. Serving is real-time. Different scaling characteristics. Coupling them would make both worse. | โ |
| Index freshness for breaking news | Incremental delta index (not full rebuild) | Small in-memory delta index merged with main index at query time. New content searchable in minutes without rebuilding petabyte-scale main index. | โ |
| Rank quality over raw speed | Two-phase ranking: fast retrieval โ ML re-ranking | Phase 1: inverted index retrieves top 1000 candidates (fast). Phase 2: ML model re-ranks with 100s of signals (slow but only on 1000 docs). | โ |
| Crawl billions of pages with priorities | Custom URL frontier (not Kafka/SQS) | Priority queue by PageRank + freshness + change frequency. Standard queues lack per-URL priority and deduplication at billion-scale. | โ |
- URL Frontier: A priority queue of URLs to crawl. Priority based on: PageRank of the domain, freshness (how stale is our copy?), change frequency (news sites recrawled hourly, static sites monthly). Implemented as a distributed priority queue (custom, sharded by domain hash).
- Politeness: Per-domain rate limiting. Max 1 request/sec per domain. Respect
robots.txtandCrawl-delay. This means the frontier must be domain-aware โ dequeue URLs from different domains to maximize parallelism while respecting per-domain limits. - Deduplication: Before adding a discovered URL to the frontier, check if we've already crawled it recently. Use a Bloom filter (space-efficient) for fast "probably seen" checks. For content dedup (same page, different URL), hash the page content (simhash for near-duplicate detection).
- Crawl workers: Thousands of stateless workers. Each: dequeue URL โ DNS resolve (cached) โ fetch with timeout โ store HTML โ extract links โ add new URLs to frontier.
last_crawled_at and estimated_change_frequency per URL. Pages that change often (news homepages) get recrawled every few minutes. Stable pages (Wikipedia articles) get recrawled every few weeks. This maximizes index freshness per crawl budget. Full recrawl would waste 90%+ of resources on unchanged pages.- Sharding: The 100B documents are divided across ~10,000 shards (document-based sharding). Each shard holds ~10M documents and their complete inverted index. A query is sent to ALL shards in parallel; each returns its local top-K.
- Replication: Each shard is replicated 3ร for availability and load distribution. At 300K queries/sec fan-out to 10K shards = 3B sub-queries/sec. With 3 replicas, each replica handles ~1B sub-queries/sec รท 10K shards = ~100K per shard per sec.
- Query execution on a single shard: (1) Look up each query term in the term dictionary. (2) Fetch posting lists. (3) Intersect posting lists (for AND queries) or union (for OR). (4) Score each matching document using BM25 + PageRank + hundreds of other signals. (5) Return top-K with scores.
- Posting list intersection: Use skip pointers (jump ahead in sorted lists) to avoid scanning every entry. For a 2-term AND query, start with the shorter posting list and skip through the longer one. This turns O(n+m) into O(min(n,m) ร log(max(n,m))).
- Two-phase ranking: (1) On each shard, use a fast scoring function (BM25 + static PageRank) to get top-1000 candidates. (2) At the merger, apply a heavier ML-based re-ranker on the top ~200 global candidates to produce the final top 10.
- Ranking signals: BM25 (term frequency / inverse document frequency), PageRank (link authority), domain authority, title match, URL match, content freshness, page speed, mobile-friendliness, user engagement signals (click-through rate from past queries).
- Snippet generation: For each top-10 result, extract the most relevant passage from the cached page that contains the query terms. Highlight matching terms. Done at the merger stage.
- Autocomplete: Separate system. Trie-based prefix search over the most popular queries (precomputed from query logs). Served from in-memory cache. <50ms latency. Fires on every keystroke.
| Data | Store | Scale | Access Pattern |
|---|---|---|---|
| Inverted Index | Custom (SSTable-like on SSD) | 100+ PB, 10K shards ร 3 replicas | Read at query time. Written in batch by indexer. |
| Page Store | Bigtable / custom blob store | 5+ PB compressed HTML | Read for snippet generation. Written by crawler. |
| URL Frontier | Custom distributed priority queue | ~100B URLs tracked | Dequeue by priority + domain fairness. |
| Link Graph | Custom (adjacency list, sharded) | ~1T edges | Batch read for PageRank computation. |
| Query Logs | Kafka โ data warehouse | 8.5B queries/day | For autocomplete, ranking model training, analytics. |
| Autocomplete | In-memory trie (replicated) | ~100M popular queries | Prefix lookup per keystroke. <50ms. |
| Data | Store | Why This Store |
|---|---|---|
| Raw crawled pages | Distributed blob store (Bigtable-like) | Compressed HTML of billions of pages. Key = URL hash. Used by indexer and for cache/snippet generation. |
| Inverted index | Custom sharded index | Term โ posting lists (doc_ids + positions). Sharded across 10,000+ servers. The core data structure for query serving. |
| URL frontier | Distributed priority queue | URLs to crawl, prioritized by PageRank and freshness. Deduplication via URL hash. Billions of entries. |
| PageRank scores | Distributed KV store | Pre-computed per-URL importance scores. Updated in batch (MapReduce). Used by ranking at query time. |
| Knowledge Graph | Graph database | Entities, relationships, attributes. Powers info boxes and structured answers. Billions of facts. |
| Serving index (per DC) | In-memory + SSD | Hot subset of the full index, replicated per data center. Memory-mapped for sub-ms term lookups. |
- Base index: Full rebuild weekly (batch MapReduce over entire page store โ new index shards). Atomic swap.
- Real-time layer: For breaking news, a separate fast-update index ingests recently crawled pages within seconds. Query serving merges base index results with real-time index results.
- Priority recrawl: News sites, social media, high-PageRank sites recrawled every few minutes. Feeds into the real-time layer.
| Extension | Architecture Impact |
|---|---|
| Semantic / Vector Search | Embedding-based retrieval alongside inverted index. ANN (approximate nearest neighbor) index for dense vectors. Hybrid scoring: BM25 + vector similarity. |
| AI Overviews (LLM answers) | Top-K retrieval โ feed results as context to an LLM โ generate summary. Adds ~2-5s latency. Shown alongside traditional results. |
| Personalization | User search history โ user embedding โ personalized ranking signal. Privacy-sensitive โ can be done on-device or with federated learning. |
| Image / Video Search | Separate index per media type. Vision models for content understanding. Cross-modal retrieval (text query โ image results). |
How does the inverted index handle a query like "Barack Obama birthday"?
The query is processed in stages. First, the query processor tokenizes and normalizes: "barack", "obama", "birthday". It may also expand with synonyms or correct spelling. Then, the query is sent to the index shards. Each shard looks up the posting lists for each term โ a posting list is a sorted array of document IDs that contain that term. For "barack" there might be 500M documents, "obama" 800M, "birthday" 2B. The intersection of these three lists gives documents containing all three terms. The intersection is computed efficiently using skip lists โ since posting lists are sorted, you can skip ahead when IDs don't match. After intersection, each matching document is scored using hundreds of ranking signals: term frequency, document PageRank, freshness, user location, and many more. The top ~1000 results per shard are returned to the merger, which combines results from all shards and re-ranks globally. For this particular query, the Knowledge Graph would also fire, returning a direct answer panel ("August 4, 1961") alongside the organic results.
How do you handle index freshness for breaking news?
The crawler has a priority system. Most of the web is crawled on a days-to-weeks schedule based on PageRank and change frequency. But news sites (CNN, NYT, Reuters, etc.) are in a "real-time" crawl tier โ checked every few minutes. When a breaking story is detected (via trending queries, social signals, or crawler discovering new URLs on news sites), the pipeline accelerates: (1) the URL is fast-tracked through crawl, parse, and index within 2-5 minutes, (2) the serving index can do "incremental updates" โ instead of waiting for a full index rebuild, a small delta index is merged with the main index at query time. This delta is in memory and can be updated in seconds. (3) For truly breaking events, Google may also surface results from Twitter/social media or its own news aggregation. The tradeoff: fast-indexed pages have less ranking signal (no PageRank computed yet, no link analysis), so Google applies a "freshness boost" โ newer content from trusted sources is ranked higher than the steady-state algorithm would suggest.