- What outcome is the platform optimizing for? → Developer velocity: code-to-production cycle time. How fast can a developer go from idea → commit → PR → review → merge → deploy? Secondary: collaboration quality (PR review turnaround, code review coverage). This shapes architecture: git push must be fast, PR creation instant, CI integration seamless. Any friction in the inner loop multiplies across millions of developers.
- Core operations? → git push, git clone/fetch/pull (the hot path), web UI browsing (file tree, diff, blame), Pull Requests (create, review, merge), Issues, Search, Actions/CI.
- Protocols? → Git over HTTPS (majority of traffic) and Git over SSH. Web API (REST + GraphQL). WebSockets for real-time updates.
- Read:write ratio for Git ops? → Extremely read-heavy. ~50:1 or higher. Clones and fetches dominate. Pushes are relatively rare.
- Scale? → ~150M developers, ~420M repositories, ~3B files indexed for search, ~100M pull requests/year.
- Data durability requirement? → Absolute. Losing a repository's data is catastrophic. Three replicas minimum.
| In Scope | Out of Scope |
|---|---|
| Git hosting (push/clone/fetch over HTTPS+SSH) | GitHub Copilot / AI features |
| Repository storage & replication (Spokes) | GitHub Packages (registry) |
| Web UI (file browsing, diffs, blame) | GitHub Pages (static hosting) |
| Pull Request workflow (review, merge) | GitHub Mobile (client app) |
| Code search across all repos | Billing / marketplace |
| CI/CD (Actions), high level | GitHub Codespaces (cloud IDE) |
| Webhooks & notifications | Enterprise Server (on-prem) |
- UC1 (Clone/Fetch): Developer runs `git clone https://github.com/org/repo` → system locates repo replica, streams packfile to client. Must handle repos from 1 KB to 100 GB.
- UC2 (Push): Developer runs `git push` → system receives new objects, updates refs, replicates to all replicas atomically, fires webhooks, triggers Actions.
- UC3 (Pull Request): Developer opens PR → system computes diff, runs merge checks, enables code review (inline comments, approvals), and merges when ready.
- UC4 (Browse): User visits `github.com/org/repo` → system renders file tree, README, commit history. Most visited page on GitHub. Must be sub-second.
- UC5 (Search): Developer searches `"handleAuth" language:typescript` → system searches across 3B+ files and returns results in seconds.
- Durability: Zero repository data loss. Every push is replicated to 3 independent servers before being acknowledged.
- Availability: Git reads (clone/fetch) must remain available even if a replica server fails. 99.99% target.
- Latency: Web UI page load <500ms. Git clone of medium repo (<1GB) starts streaming within 1-2s.
- Consistency: After a successful `git push`, all subsequent reads (from any replica) must see the new refs. Strong consistency for writes.
- Isolation: A "hot" repository (millions of clones) must not degrade performance for other repos sharing the same server ("fate sharing" avoidance).
`git clone` of a 500MB repo requires streaming 500MB of packfile from disk to network. The bottleneck is disk throughput and network bandwidth, not CPU. This means the storage layer design dominates everything.

| Requirement | Decision | Why (and what was rejected) | Consistency |
|---|---|---|---|
| Zero repository data loss | Spokes: 3-replica synchronous replication | Push not acknowledged until all 3 replicas confirm. Git-level replication understands packfile semantics. Disk-level (DRBD) doesn't. | CP |
| Repository data is filesystem, not relational | Bare Git repositories on Spokes file servers (not MySQL for git objects) | Git objects are a DAG of files. Storing packfiles in a relational DB adds overhead with zero benefit. Filesystem is the natural fit. | N/A |
| Metadata: repos, issues, PRs need joins | MySQL sharded by repo_id (Vitess) | Most queries are repo-scoped. Vitess manages shard routing. All repo data co-located for single-shard queries. | CP |
| Code search across 200M repos | Custom trigram index (not Elasticsearch) | Trigram index enables regex search. ES can't do arbitrary regex. Trigram pre-filtering reduces search space 99.99%. | N/A |
| Massive read amplification on web UI | Memcached for query result caching | A single repo page triggers dozens of DB queries. Memcached reduces DB load 10x. Simple GET/SET; no data structures needed. | N/A |
| CI: ephemeral, untrusted code execution | Ephemeral VMs (not containers) | Hardware-level isolation. Container escape = game over for shared kernel. VMs destroyed after each workflow run. | N/A |
Load Balancers (INGRESS)
- L7 (HTTPS): web UI, API, git-over-HTTPS
- L4 (TCP): git-over-SSH (long-lived connections)
- Separate LBs because the traffic profiles differ
Web Frontend (APP)
- Ruby on Rails application
- Renders pages: file tree, PRs, issues, diffs
- REST API + GraphQL API
- Stateless, so it scales horizontally
SSH Gateway (AUTH)
- Terminates SSH, authenticates public keys
- Maps SSH key → user account → repo permissions
- Proxies Git protocol to Git Backend
Git Backend (RPC) (CORE)
- Stateless workers that execute Git operations
- git-upload-pack (for clone/fetch)
- git-receive-pack (for push)
- Talks to Spokes for storage layer access
Spokes (Git Storage) (STORAGE)
- 3 replicas per repo on 3 independent servers
- Three-phase commit for writes
- Reads load-balanced across replicas
- Federated filesystem (bare metal, ext4)
MySQL (Metadata) (DATA)
- Users, orgs, permissions, repo metadata
- PRs, issues, comments, reviews
- Sharded by repo_id / org_id
- Strong consistency for permissions
Search Service (SEARCH)
- Code search across 3B+ files
- Custom search engine (not Elasticsearch)
- Indexes trigrams for substring matching
- Async re-index on push events
Actions (CI/CD) (COMPUTE)
- Event-driven: triggered by push, PR, schedule
- Job queue → runner pool (ephemeral VMs)
- Artifacts stored in object storage (S3/Azure Blob)
How Git Stores Data (quick primer)
Spokes Architecture
- Three replicas per repo: Every repo is stored as a bare Git repository on 3 independently chosen file servers. Servers are bare metal machines with ext4 filesystems and large SSDs.
- Routing table: A mapping from `repo_id → [server_A, server_B, server_C]`. Stored in MySQL, cached aggressively. The Git Backend looks up this table to find where a repo lives.
- Reads: Served from ANY of the 3 replicas. Load-balanced for performance. If one replica is busy or down, reads route to another.
- Writes: Coordinated by a Spokes proxy via three-phase commit across all 3 replicas. Write succeeds only if ALL 3 replicas confirm.
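The read and write rules above can be sketched in a few lines. This is a minimal in-memory illustration, not the Spokes implementation: the `Replica` and `SpokesProxy` classes, the phase names (can-commit, pre-commit, do-commit, borrowed from the textbook three-phase commit), and the use of a plain dict for refs are all assumptions made for the example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for one file server holding a bare repo."""
    name: str
    healthy: bool = True
    refs: dict = field(default_factory=dict)    # committed ref -> sha
    staged: dict = field(default_factory=dict)  # prepared, not yet committed

    def can_commit(self, ref, old_sha):
        # Phase 1: vote yes only if reachable and the ref has the expected value.
        return self.healthy and self.refs.get(ref) == old_sha

    def pre_commit(self, ref, new_sha):
        # Phase 2: stage the update without exposing it to readers.
        self.staged[ref] = new_sha

    def do_commit(self, ref):
        # Phase 3: make the staged update visible.
        self.refs[ref] = self.staged.pop(ref)


class SpokesProxy:
    def __init__(self, routing_table):
        # routing_table: repo_id -> list of three Replica objects
        self.routing_table = routing_table

    def read_ref(self, repo_id, ref):
        # Reads may be served by any healthy replica.
        candidates = [r for r in self.routing_table[repo_id] if r.healthy]
        if not candidates:
            raise RuntimeError("no healthy replica for repo %r" % repo_id)
        return random.choice(candidates).refs.get(ref)

    def write_ref(self, repo_id, ref, old_sha, new_sha):
        replicas = self.routing_table[repo_id]
        # The write succeeds only if ALL replicas vote yes.
        if not all(r.can_commit(ref, old_sha) for r in replicas):
            return False
        for r in replicas:
            r.pre_commit(ref, new_sha)
        for r in replicas:
            r.do_commit(ref)
        return True
```

In this sketch a push that finds one replica down is rejected outright, while reads keep working from the remaining replicas, which mirrors the asymmetry between the read and write paths described above.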
Three-Phase Commit (Write Path)
git clone internally). Tradeoff: more complex protocol than DRBD, but dramatically more flexible and reliable.

Handling Hot Repos & Fate Sharing
- Problem: A viral repo (millions of clones) on Server A degrades performance for ALL other repos on Server A.
- Solution: Monitor per-server I/O load. If a repo becomes "hot," Spokes can add additional read replicas (temporarily increase from 3 to 5+) and rebalance the routing table. The hot repo's reads spread across more servers.
- Fork deduplication: When a repo is forked, the fork initially shares the same object store (objects are content-addressed, identical). Only new objects from the fork are stored separately. This saves enormous storage: a repo with 10K forks doesn't store 10K copies of the same objects.
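Fork deduplication follows directly from content addressing, which a small sketch can make concrete. The `git_blob_id` function reproduces how Git names a blob in a SHA-1 repository (hash of a `blob <size>\0` header plus the content); the `SharedObjectStore` class and its methods are hypothetical names invented for this illustration and say nothing about how the real storage layer is organized.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object id Git assigns to a blob in a SHA-1 repository."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class SharedObjectStore:
    """Hypothetical object store shared by a repo and all of its forks."""

    def __init__(self):
        self.objects = {}       # object id -> bytes, stored exactly once
        self.repo_objects = {}  # repo name -> set of object ids it references

    def write(self, repo, content):
        oid = git_blob_id(content)
        # Identical content hashes to the same id, so it is never stored twice.
        self.objects.setdefault(oid, content)
        self.repo_objects.setdefault(repo, set()).add(oid)
        return oid

    def fork(self, parent, child):
        # A fork copies references to objects, not the objects themselves.
        self.repo_objects[child] = set(self.repo_objects.get(parent, ()))
```

Forking costs one set of ids regardless of repository size; only objects written to the fork afterwards add to `objects`.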
PR Data Model
Diff Computation
- What: The diff between `base_sha` (target branch) and `head_sha` (source branch). Actually a three-way diff: find the merge base, then diff base→head and base→target.
- Where: Computed on the Git Backend by calling `git diff` on the Spokes replica. For large diffs (thousands of files), this is expensive.
- Caching: Diffs are cached by the key `(base_sha, head_sha)`. Since SHAs are immutable, this cache NEVER needs invalidation. A new push to the PR branch changes `head_sha` → new cache key → new diff computed. Old cache entries still valid (for viewing old versions of the PR).
- Mergeability check: Async background job runs `git merge --no-commit` to test if the PR can merge cleanly. Result stored in `mergeable` field. Re-computed when either branch changes.
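The caching rule can be sketched as a cache with no invalidation path at all. This is a minimal illustration under stated assumptions: `DiffCache` is a hypothetical name, the store is an in-process dict rather than a dedicated cache tier, and `compute_diff` is injected so the example does not shell out to `git diff`.

```python
from typing import Callable, Dict, Tuple


class DiffCache:
    """Cache keyed by (base_sha, head_sha); entries are never invalidated."""

    def __init__(self, compute_diff: Callable[[str, str], str]):
        self._compute = compute_diff
        self._entries: Dict[Tuple[str, str], str] = {}
        self.misses = 0

    def get(self, base_sha: str, head_sha: str) -> str:
        key = (base_sha, head_sha)
        if key not in self._entries:
            # Only a never-seen SHA pair triggers the expensive computation.
            self.misses += 1
            self._entries[key] = self._compute(base_sha, head_sha)
        return self._entries[key]
```

A push to the PR branch simply produces a different `head_sha`, so it lands on a new key while older entries remain valid for viewing earlier versions of the PR.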
Merge Strategies
| Strategy | Git Command | Result |
|---|---|---|
| Merge commit | git merge --no-ff | Creates a merge commit with two parents. Preserves full branch history. |
| Squash and merge | git merge --squash | Combines all PR commits into a single commit on the target branch. Clean history. |
| Rebase and merge | git rebase + fast-forward | Replays PR commits on top of target. Linear history. No merge commit. |
- Merge operation is a `git push` under the hood, so it goes through the full Spokes three-phase commit.
- Branch protection rules checked server-side BEFORE the merge: required reviewers? required status checks passed? no force-push?
- Merge queue (for busy repos): PRs lined up and merged sequentially, each tested against the latest base. Prevents "merge skew" where two PRs that individually pass CI break when combined.
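The merge queue behaviour can be sketched as a loop that tests each PR against whatever has already been merged. This is a simplified model, not the production algorithm: the branch is represented as a list of PR ids, and `passes_ci` is a hypothetical callback standing in for a full CI run.

```python
from collections import deque


def run_merge_queue(base, queued_prs, passes_ci):
    """Merge PRs one at a time, testing each against the latest base.

    base       -- list describing what is already on the target branch
    queued_prs -- PR ids in queue order
    passes_ci  -- callable taking the would-be branch state, returns bool
    """
    queue = deque(queued_prs)
    merged, rejected = list(base), []
    while queue:
        pr = queue.popleft()
        candidate = merged + [pr]  # latest base plus this PR
        if passes_ci(candidate):
            merged = candidate     # later PRs are tested on top of this one
        else:
            rejected.append(pr)
    return merged, rejected
```

Because every candidate includes all previously merged PRs, a PR that only breaks in combination with an earlier one is caught before it reaches the target branch.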
- Index structure: Trigram index. Every 3-character substring of every file is indexed. Query `"handleAuth"` → look up trigrams `han, and, ndl, dle, leA, eAu, Aut, uth` → intersect posting lists → candidate files → verify with full-text scan.
- Index size: ~150 TB for 3B files. Distributed across a cluster of search nodes. Sharded by repository.
- Indexing pipeline: Push event → Kafka → search indexer reads new/changed files from Spokes replica → updates trigram index. Near-real-time: new code searchable within ~30 seconds of push.
- Ranking: Results ranked by: repo popularity (stars, forks), file path relevance, number of matches, language match. Not just "first match found."
- Scope: User can scope to a single repo, an org, or all of GitHub. Single-repo search uses the repo's own index partition, so it is very fast. Global search fans out across all partitions: slower, but still seconds.
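The lookup path described above (trigrams, posting-list intersection, then verification) can be sketched with an in-memory index. The `TrigramIndex` class is a hypothetical toy: it keeps whole files in a dict and ignores sharding, ranking and incremental updates.

```python
from collections import defaultdict


def trigrams(text):
    """All contiguous 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> paths containing it
        self.files = {}                   # path -> content (for verification)

    def add(self, path, content):
        self.files[path] = content
        for gram in trigrams(content):
            self.postings[gram].add(path)

    def search(self, query):
        grams = trigrams(query)
        if grams:
            # Candidate files must contain every trigram of the query.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        else:
            # Queries shorter than 3 characters cannot be pre-filtered.
            candidates = set(self.files)
        # Verification step: trigram hits can be false positives.
        return sorted(p for p in candidates if query in self.files[p])
```

The verification scan matters: a file containing `handle` and `leAuth` separately has every trigram of `handleAuth` without containing the string itself.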
func.*Auth regex) and can be optimized for the specific query patterns developers use. Tradeoff: massive engineering investment to build and maintain a custom search engine.

| Data | Store | Why This Store |
|---|---|---|
| Repository metadata | MySQL (sharded) | Repos, issues, PRs, users. Sharded by repo_id. Vitess for shard management. Heavy read load from web UI. |
| Git objects (packfiles) | Spokes file servers (bare metal, ext4) | Actual git data. 3-replica Spokes replication. Append-only (git objects are immutable). ~200M repos. |
| Code search index | Custom trigram index (Blackbird) | Source code indexed for full-text search. Updated on push via Kafka event. Trigram index for regex search. |
| Background jobs | Redis | Sidekiq job queue for webhooks, notifications, CI triggers. Job state and dedup keys. |
| Session & cache | Memcached | Page fragment caching. User sessions. DB query result caching. Massive read amplification reduction. |
| Webhook events | Kafka | push, pull_request, issues events. Consumed by Actions, external integrations, analytics. |
| CI artifacts & logs | S3 | Actions workflow logs, build artifacts. Lifecycle policy for retention (90 days default). |
- Every push, PR, issue change fires webhooks to configured URLs. At scale: ~10M webhook deliveries/day.
- Internal event bus (Kafka): All state changes published as events. Consumed by: search indexer, notification system, Actions dispatcher, analytics, audit log.
- Webhook delivery: At-least-once delivery with exponential backoff retries. If endpoint is down, retry for up to 3 days. Event payload includes a delivery ID for deduplication.
- Fan-out for notifications: A push to a popular repo (10K watchers) generates 10K notifications. Fan-out handled asynchronously via the event bus. Notifications delivered via email, web UI badge, and mobile push.
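The delivery guarantees above can be sketched from both sides. The source only states exponential backoff, a 3-day retry window and a delivery ID, so the 10-second base delay, the doubling factor, and the `retry_schedule` and `WebhookReceiver` names are assumptions chosen for illustration.

```python
from datetime import timedelta


def retry_schedule(base=timedelta(seconds=10), factor=2,
                   max_window=timedelta(days=3)):
    """Offsets from the first failure at which retries fire.

    base and factor are assumed values; only the 3-day window is from the text.
    """
    offsets, delay, elapsed = [], base, timedelta(0)
    while elapsed + delay <= max_window:
        elapsed += delay
        offsets.append(elapsed)
        delay *= factor  # exponential backoff
    return offsets


class WebhookReceiver:
    """Consumer side: at-least-once delivery means duplicates must be dropped."""

    def __init__(self):
        self.seen = set()
        self.processed = []

    def handle(self, delivery_id, payload):
        if delivery_id in self.seen:
            return False  # duplicate of an earlier delivery, ignore
        self.seen.add(delivery_id)
        self.processed.append(payload)
        return True
```

With these assumed parameters the sender gives up after 14 retries; a receiver that records delivery IDs can process each event once even when a retry arrives after a slow but successful first attempt.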
- Authentication: SSH public keys (mapped to users), personal access tokens, OAuth apps, GitHub Apps (per-repo scoped). Two-factor authentication.
- Authorization: Role-based: owner, admin, write, triage, read. Per-repo, per-org, per-team. Checked on EVERY Git operation and API call.
- Branch protection: Server-side enforcement. Cannot be bypassed by client. Checks: required reviews, required status checks, prevent force-push, require signed commits.
- Secret scanning: Every push is scanned for accidentally committed secrets (API keys, tokens). If found, the secret provider is notified for revocation.
- Audit log: Enterprise feature. Every action is logged: who pushed what, who changed permissions, who accessed what repo.
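The role-based check can be sketched as an ordered enum compared against a per-operation minimum. The role names come from the list above; their ordering, the `REQUIRED` mapping of operations to roles, and the `is_allowed` helper are assumptions made for this example.

```python
from enum import IntEnum


class Role(IntEnum):
    # Assumed ordering: each role includes the rights of the ones below it.
    READ = 1
    TRIAGE = 2
    WRITE = 3
    ADMIN = 4
    OWNER = 5


# Hypothetical minimum role per operation.
REQUIRED = {
    "clone": Role.READ,
    "fetch": Role.READ,
    "label_issue": Role.TRIAGE,
    "push": Role.WRITE,
    "change_settings": Role.ADMIN,
}


def is_allowed(grants, user, repo, operation):
    """grants maps (user, repo) -> Role; a missing entry means no access."""
    role = grants.get((user, repo))
    return role is not None and role >= REQUIRED[operation]
```

Running this check on every Git operation and API call is cheap as long as the grants lookup is cached, which is what the permissions cache entry in the caching table is for.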
| What | Cache Layer | Invalidation |
|---|---|---|
| Repo metadata (stars, description) | Memcached (TTL 60s) | Cache-aside, TTL-based |
| File tree rendering | Memcached (keyed by tree SHA) | Never (SHAs are immutable) |
| README render (markdown→HTML) | Memcached (keyed by blob SHA) | Never (SHAs are immutable) |
| Diff computation | Dedicated diff cache (keyed by base_sha+head_sha) | Never (SHAs are immutable) |
| Spokes routing table | In-memory (Git Backend) | Event-driven on repo move |
| User permissions | Redis (TTL 30s) | Event-driven on permission change |
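The two mutable rows of the table (TTL-based repo metadata and event-invalidated permissions) can be sketched with one small cache-aside helper. `TTLCache` is a hypothetical in-process stand-in for Memcached or Redis; the injectable `clock` exists only so the behaviour can be exercised without sleeping.

```python
import time


class TTLCache:
    """Cache-aside with a TTL plus explicit, event-driven invalidation."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # key -> (value, expires_at)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]           # fresh hit
        value = loader(key)           # miss or expired: go to the database
        self.entries[key] = (value, now + self.ttl)
        return value

    def invalidate(self, key):
        # Called from an event handler, e.g. on a permission change.
        self.entries.pop(key, None)
```

SHA-keyed rows need neither mechanism: their keys change whenever the content does, so stale entries are never read.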
| Extension | Architecture Impact |
|---|---|
| Copilot / AI Code Review | LLM inference pipeline triggered on PR events. Needs access to diff + full codebase context. Adds latency-sensitive AI inference to the PR workflow. Model serving infrastructure separate from Git storage. |
| Codespaces (Cloud IDE) | Persistent dev containers linked to repos. Requires compute orchestration (Kubernetes), persistent volumes, and low-latency Git access from the container to Spokes. Different scaling model: long-running stateful workloads vs. short request-response. |
| Multi-region Spokes | Geo-replicate repos across data centers for lower latency. Writes still go to primary DC (consistency). Reads served from nearest geo-replica. Cross-DC replication adds latency to the three-phase commit, which must be optimized for WAN. |
| Git LFS at Scale | Large files (videos, datasets) stored in object storage (S3), not in Git objects. LFS pointer files in the repo reference blobs in S3. Separate storage path, separate CDN delivery, separate billing. |
| Merge Queue Improvements | For monorepos with 100+ PRs/day: speculative merging (test multiple PRs combined), bisection on failure (find which PR broke the build), priority ordering. |
How does Spokes replication work and why not just use database replication?
Git is a filesystem-level data structure, not a database. A git repository is a directed acyclic graph of objects (blobs, trees, commits) stored as files on disk. Database replication doesn't understand this structure. Spokes replication works at the git level: when a user pushes to a repository, the primary Spokes server receives the packfile, writes it to local disk, then replicates the packfile to 2 other servers. The replication is synchronous: the push isn't acknowledged until all 3 copies confirm. Each replica is a full, independent git repository that can serve reads directly. The key advantage over filesystem-level replication (like DRBD): Spokes understands git semantics, so it can optimize. For example, if two pushes happen in quick succession, it can send just the delta rather than re-syncing the entire repository. It also handles split-brain correctly by using the primary as the source of truth for write ordering.
What happens during a "git push" to a repository with thousands of CI workflows?
The push itself is synchronous and fast: git objects are written to the primary Spokes server and replicated. But the push ALSO triggers a cascade of asynchronous events via Kafka: (1) a `push` webhook event is published, (2) the Actions scheduler evaluates all workflow files in `.github/workflows/` against the push event (branch, path filters), (3) matching workflows are queued as jobs. With thousands of workflows, the scheduler might queue hundreds of CI jobs simultaneously. Each job needs an ephemeral VM, and this is where the real scaling happens. The Actions infrastructure maintains pools of pre-warmed VMs by runner type (ubuntu-latest, windows-latest). Jobs are assigned to VMs via a scheduler that considers: queue priority (paid plans get priority), runner availability, and geographic locality. If the pool is exhausted, jobs wait in queue. The push author doesn't wait for any of this; they get their push acknowledgment in seconds. CI status updates flow back via the Checks API.
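Step (2), evaluating workflows against the push event, can be sketched as a filter over parsed workflow definitions. This is a loose approximation: workflows are plain dicts with hypothetical `name`, `branches` and `paths` keys, and Python's `fnmatch` stands in for the real filter syntax, whose glob rules differ (for example, `*` in `fnmatch` also matches `/`).

```python
from fnmatch import fnmatch


def workflow_matches(workflow, branch, changed_paths):
    """True if the push passes the workflow's branch and path filters."""
    branches = workflow.get("branches")
    paths = workflow.get("paths")
    # A missing filter means "match everything".
    if branches and not any(fnmatch(branch, pat) for pat in branches):
        return False
    if paths and not any(
        fnmatch(changed, pat) for changed in changed_paths for pat in paths
    ):
        return False
    return True


def jobs_to_queue(workflows, branch, changed_paths):
    """Names of the workflows the scheduler would enqueue for this push."""
    return [
        w["name"] for w in workflows
        if workflow_matches(w, branch, changed_paths)
    ]
```

Filtering happens before any VM is requested, so a repository with thousands of workflow files only pays for the ones whose filters match the push.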
How would you design the code search feature to handle regex across 200M repositories?
This is actually one of GitHub's hardest problems. Naive approach: Elasticsearch with full-text indexing, which works for keyword search but can't handle regex. GitHub's solution (Blackbird) uses a trigram index: every 3-character sequence in every file is indexed. A regex like `func.*Handler` is decomposed into trigrams: "fun", "unc", "Han", "and", "ndl", "dle", "ler" → the index finds files containing ALL these trigrams (fast intersection), then the actual regex is applied to only those candidate files (slow but on a small set). The index is sharded by repository and built incrementally: on each push, only changed files are re-indexed. For 200M repos, the index is ~100TB. Search queries are scatter-gathered across shards with a timeout; if some shards are slow, we return partial results. The UX shows "Results from X repositories" and allows filtering by language, org, or repo. The key insight: the trigram index reduces the search space by 99.99%, making regex tractable at scale.
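The decomposition step can be sketched for the simple case in the example. This is a deliberately narrow illustration, not how Blackbird plans queries: `required_trigrams` and `search_regex` are hypothetical helpers, only literals separated by `.`, `.*` or `.+` are supported, and a substring test stands in for the posting-list lookup.

```python
import re

WILDCARD = re.compile(r"\.[*+]?")              # ".", ".*" or ".+"
UNSUPPORTED = re.compile(r"[\\\[\](){}|?^$*+]")  # anything else is rejected


def required_trigrams(pattern):
    """Trigrams every match must contain, for literal runs split by wildcards."""
    runs = WILDCARD.split(pattern)
    for run in runs:
        if UNSUPPORTED.search(run):
            raise ValueError("pattern too complex for this sketch: %r" % pattern)
    grams = []
    for run in runs:
        for i in range(len(run) - 2):
            gram = run[i:i + 3]
            if gram not in grams:
                grams.append(gram)
    return grams


def search_regex(files, pattern):
    """files: path -> content. Pre-filter by trigrams, then run the regex."""
    needed = required_trigrams(pattern)
    regex = re.compile(pattern)
    hits = []
    for path, content in files.items():
        # 'gram in content' stands in for an index lookup per trigram.
        if all(gram in content for gram in needed) and regex.search(content):
            hits.append(path)
    return sorted(hits)
```

For `func.*Handler` this yields exactly the seven trigrams listed above; the regex itself only runs on files that survive the pre-filter.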