- What outcome are we optimizing for? → For CDN: cache hit ratio (% of requests served from edge without hitting origin). For security: attack mitigation rate (% of malicious requests blocked). For both: latency reduction vs. direct-to-origin. This tells us the system's value is KEEPING TRAFFIC AT THE EDGE: every request that reaches the origin is a partial failure. Architecture must minimize origin touches.
- Core product? CDN only, or also DNS, security, and edge compute? → Full stack: DNS + CDN + WAF + DDoS + edge compute (Workers). All run on the same edge infrastructure.
- Deployment model? → Every service runs on every server in every data center. NOT specialized clusters per function.
- How do customers onboard? → Change DNS nameservers to Cloudflare. We become the authoritative DNS AND the reverse proxy. All traffic flows through us.
- Scale? → 330+ data centers globally, ~50M HTTP requests/sec, ~25M DNS queries/sec, protecting ~25M customer domains.
- DDoS scale? → Must absorb multi-Tbps attacks without impacting other customers.
| In Scope | Out of Scope |
|---|---|
| DNS resolution (authoritative) | Domain registrar |
| Anycast request routing | Email routing internals |
| TLS termination at edge | Certificate authority operations |
| CDN caching (+ tiered cache) | Object storage (R2) internals |
| WAF / DDoS protection | Zero Trust / SASE product |
| Edge compute (Workers) | Billing / customer dashboard |
| Origin connection (Tunnels) | Stream video product |
- UC1: User in Tokyo requests `example.com/image.png` → DNS resolves to nearest DC via anycast → edge cache HIT → served in <50ms. Origin never touched.
- UC2: Cache MISS → edge forwards to origin (or tiered cache parent) → response cached at edge → served to user. Subsequent requests from that region served from cache.
- UC3: DDoS attack sends 5 Tbps of traffic at a customer → anycast distributes across 330 DCs → each DC absorbs its share → attack mitigated at the edge, origin unaffected.
- UC4: Customer deploys a Cloudflare Worker → user request executes custom JavaScript at the edge DC closest to the user → <5ms cold start, no origin needed.
- Latency: <50ms for cached content (edge → user). Within 50ms of ~95% of the world's Internet-connected population.
- Availability: If one DC goes down, anycast automatically routes traffic to the next nearest DC. No customer-visible downtime beyond the BGP convergence window.
- Multi-tenancy: Millions of customers share the same infrastructure. One customer's DDoS attack must not impact another customer's performance.
- Homogeneous edge: Every service runs on every server. No specialized "cache servers" vs "WAF servers." This is a fundamental architectural principle.
- Configuration propagation: When a customer changes a WAF rule, it must propagate to all 330 DCs within seconds.
| Requirement | Decision | Why (and what was rejected) | Consistency |
|---|---|---|---|
| Survive any single DC failure | BGP Anycast across 330+ DCs | Dead DC stops announcing BGP route → traffic auto-routes to next nearest DC. No DNS change, no failover script. | AP |
| Edge compute with multi-tenant isolation | V8 isolates (not containers) | <5ms cold start, <1MB memory. Containers: ~500ms start, ~50MB. At 10K tenants/server, containers are roughly 100x more expensive. | N/A |
| DDoS mitigation at Tbps scale | Distributed across all DCs (not centralized scrubbing) | 1Tbps attack / 330 DCs = ~3Gbps per DC, which is manageable. Centralized scrubbing creates a bottleneck and single point of failure. | AP |
| Config changes to 330 DCs in <5 seconds | Custom KV distribution (not database replication) | Purpose-built push system over persistent TCP. Database consensus across 330 nodes would take minutes, not seconds. | AP (eventual, <5s) |
| Cache must serve during origin failure | Serve stale on origin error | Better to show a 5-minute-old page than an error page. Cache-Control: stale-if-error pattern (stale-while-revalidate covers background refresh). | AP |
| Every request inspected for attacks | WAF in hot path (compiled Lua/Rust rules) | Rule evaluation must add <1ms latency. Interpreted regex would add 10-50ms. Compiled rules are pre-optimized at deploy time. | N/A |
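The "serve stale on origin error" decision in the table can be sketched as follows. This is a minimal illustration, not Cloudflare's implementation: the class name `StaleOnErrorCache`, the `OriginError` exception, and the default windows are assumptions, and the `stale_if_error` parameter is named after the `stale-if-error` Cache-Control extension.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


class OriginError(Exception):
    """Hypothetical error raised by the origin fetcher on 5xx or timeout."""


@dataclass
class Entry:
    body: bytes
    stored_at: float
    max_age: float         # seconds the entry counts as fresh
    stale_if_error: float  # extra seconds it may be served if origin fails


class StaleOnErrorCache:
    def __init__(self, fetch_origin: Callable[[str], bytes],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_origin
        self._clock = clock
        self._entries: Dict[str, Entry] = {}

    def get(self, url: str, max_age: float = 60.0,
            stale_if_error: float = 300.0) -> Tuple[bytes, str]:
        now = self._clock()
        entry = self._entries.get(url)
        if entry and now - entry.stored_at <= entry.max_age:
            return entry.body, "HIT"
        try:
            body = self._fetch(url)
        except OriginError:
            # Origin is down: fall back to the stale copy if it is recent enough.
            if entry and now - entry.stored_at <= entry.max_age + entry.stale_if_error:
                return entry.body, "STALE"
            raise
        self._entries[url] = Entry(body, now, max_age, stale_if_error)
        return body, "MISS"
```

The injected `clock` exists only so the behavior can be exercised without real waiting; a production cache would also bound memory and coalesce concurrent refreshes.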
BGP Anycast Router (NETWORK)
- Announces same IP from all 330 DCs
- BGP routing sends traffic to nearest DC
- Layer 3/4 DDoS mitigation (hardware)
DNS Resolver (EDGE)
- Authoritative DNS for customer domains
- Returns anycast IP for customer's site
- DNSSEC signing, GeoDNS
TLS Termination (EDGE)
- Terminate TLS at edge (not origin)
- Session resumption, 0-RTT
- Automatic certificate management (Let's Encrypt)
WAF / Security Pipeline (EDGE)
- Managed rulesets + custom rules
- Rate limiting, bot management
- Layer 7 DDoS detection
Cache Engine (EDGE)
- SSD-backed content cache
- Cache rules per customer domain
- Tiered cache (edge → regional → origin)
Workers Runtime (EDGE)
- V8 isolates (not containers)
- Execute customer JS at the edge
- <5ms cold start, per-Worker isolation
Origin Connector (EDGE)
- HTTP/2 connection pooling to origins
- Cloudflare Tunnel (encrypted tunnels)
- Argo Smart Routing (fastest path)
Config Store (local) (EDGE)
- Local replica of all customer configs
- Updated via global config propagation
- In-memory + SSD. Hot-path lookups <1ms.
How Anycast Works
- Same IP, everywhere: Every Cloudflare DC announces the SAME IP prefix (e.g., 104.16.0.0/13) via BGP to all upstream ISPs. When a user sends a packet to 104.16.x.x, BGP routing naturally delivers it to the topologically nearest DC.
- No client configuration: The user's ISP's routing table picks the best path. No DNS tricks, no client-side logic. Works at the IP layer.
- Automatic failover: If a DC goes offline, it stops announcing its BGP routes. Once BGP converges (~30-90s), all traffic is re-routed to the next nearest DC. Customer impact is limited to possible timeouts during convergence and slightly higher latency afterward.
- DDoS by architecture: A 5 Tbps attack aimed at one IP gets distributed across all DCs announcing that IP. With an even spread, each DC absorbs ~15 Gbps (5 Tbps / 330 DCs), well within capacity. In practice the split follows where attack sources sit topologically, but no single DC has to absorb the whole attack.
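The anycast behavior described in the list above can be approximated with a toy model. This is a sketch under simplifying assumptions: real BGP best-path selection uses AS-path length and routing policy, not a latency table, and the DC codes and function names here are hypothetical.

```python
from typing import Dict, Set


def nearest_dc(cost: Dict[str, float], announcing: Set[str]) -> str:
    """Pick the announcing DC with the lowest path cost (stand-in for BGP best path)."""
    candidates = {dc: c for dc, c in cost.items() if dc in announcing}
    if not candidates:
        raise RuntimeError("no DC is announcing the prefix")
    return min(candidates, key=lambda dc: candidates[dc])


def per_dc_share_gbps(attack_tbps: float, dc_count: int) -> float:
    """Even-split approximation of attack volume per DC, in Gbps."""
    return attack_tbps * 1000 / dc_count


# A client whose cheapest path leads to Tokyo (hypothetical cost values).
client_cost = {"NRT": 5.0, "KIX": 12.0, "SIN": 70.0}
announcing = {"NRT", "KIX", "SIN"}
primary = nearest_dc(client_cost, announcing)

# Tokyo withdraws its route; after convergence the same client lands elsewhere.
failover = nearest_dc(client_cost, announcing - {"NRT"})
```

Note that failover needs no client-side change: the only input that differs between the two calls is the set of DCs announcing the prefix.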
Onboarding: How a Customer Joins
Cache Architecture
| Layer | Location | Purpose | Cache Hit Rate |
|---|---|---|---|
| L1: Hot (RAM) | Each server | Most frequently accessed objects. Latency: <1ms. | ~40-60% |
| L2: Warm (SSD) | Each server | Larger working set. Latency: ~1-5ms. | +20-30% |
| L3: Tiered Cache | Regional hub DC | Shared cache for a region's edge DCs. Avoid hitting origin. | +10-15% |
| L4: Cache Reserve | R2 (object store) | Persistent cache. Survives eviction. Eliminates origin egress. | +5% |
| MISS | Origin server | Only reached on full cache miss through all tiers. | ~5-10% of requests |
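A lookup through the tiers in the table above might look like the sketch below. The tier names, the `Tier` class, and the promote-on-hit policy are illustrative assumptions; eviction, TTLs, and network hops between tiers are deliberately omitted.

```python
from typing import Callable, Dict, List, Tuple


class Tier:
    """One cache layer; a plain dict stands in for RAM, SSD, or a remote store."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, bytes] = {}


def tiered_get(key: str, tiers: List[Tier],
               fetch_origin: Callable[[str], bytes]) -> Tuple[bytes, str]:
    """Return (body, name of the layer that served it), checking fastest tier first."""
    for i, tier in enumerate(tiers):
        if key in tier.store:
            body = tier.store[key]
            for faster in tiers[:i]:  # promote into every faster tier
                faster.store[key] = body
            return body, tier.name
    body = fetch_origin(key)  # full miss through all tiers
    for tier in tiers:
        tier.store[key] = body
    return body, "ORIGIN"
```

With tiers ordered `[RAM, SSD, regional, reserve]`, the origin is contacted only when every layer misses, which is the property the hit-rate column is describing.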
Tiered Cache Topologies
- Standard (flat): Each DC acts as a direct reverse proxy. Cache miss → origin. Simple but origin gets hammered (up to 330 concurrent requests for the same asset).
- Smart Tiered Cache: Cloudflare automatically selects regional "parent" DCs based on latency. Cache miss at edge → check parent → check upper tier → origin. Origin hit only ONCE, then content fans out through the hierarchy.
- Custom topology: Enterprise customers can define their own tier structure (e.g., "always check Singapore before hitting my origin in US-East").
Cache Purge
- Single URL purge: Customer sends API call → purge propagated to all 330 DCs via config distribution → each DC evicts that URL from local cache. Takes <5 seconds globally.
- Purge by tag/prefix: Customer tags cached objects (e.g., `Cache-Tag: product-123`). Purge by tag evicts all objects with that tag across all DCs.
- Purge everything: Nuclear option. Evicts ALL cached content for a domain. The cache miss rate spikes toward 100% temporarily, so the origin must be protected from the thundering herd.
- Thundering herd mitigation (coalescing): When a popular object is purged and 1,000 concurrent requests arrive, only ONE request goes to origin. The rest wait for the first response, which is then served to all waiting requests. Critical for preventing origin overload after purge.
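Request coalescing as described in the last bullet is often implemented with a "single-flight" pattern. The sketch below uses threads and is only an illustration: the `Coalescer` and `_Call` names are hypothetical, and a real edge proxy would typically do this in an async event loop and store the result in cache afterwards.

```python
import threading
from typing import Callable, Dict, Optional


class _Call:
    """State shared by the leader and the followers waiting on one key."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: bytes = b""
        self.error: Optional[BaseException] = None


class Coalescer:
    """Collapse concurrent fetches for the same key into one origin request."""

    def __init__(self, fetch_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_origin
        self._lock = threading.Lock()
        self._inflight: Dict[str, _Call] = {}

    def get(self, key: str) -> bytes:
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if call is None:
                call = _Call()
                self._inflight[key] = call
        if leader:
            try:
                call.result = self._fetch(key)  # the only origin request
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()  # wake every waiting follower
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.result
```

The first caller for a key becomes the leader and fetches; every caller that arrives while the fetch is in flight blocks on the same event and receives the leader's result or error.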
Custom cache keys: the key can vary on request headers (e.g., `Accept-Language` for localized content) or include cookies. Custom cache keys enable powerful caching strategies (e.g., cache different versions for mobile vs. desktop).
Security Pipeline (per request)
DDoS Mitigation Architecture
| Layer | Attack Type | Mitigation |
|---|---|---|
| L3/4 (Network) | SYN floods, UDP floods, amplification | Kernel-level packet filtering at the edge (XDP/eBPF in the Linux kernel). Drops packets BEFORE they reach the application layer. Anycast distributes volume. |
| L7 (Application) | HTTP floods, slowloris, credential stuffing | Behavioral analysis: requests/sec per IP, request patterns, JS challenges. Machine learning on traffic patterns across all customers (collective intelligence). |
| DNS | DNS amplification, query floods | Anycast DNS infrastructure. Rate limiting per source IP. Response rate limiting (RRL). |
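The "rate limiting per source IP" entries in the table are commonly built on a token bucket. The following is a minimal single-process sketch; the class name and parameters are assumptions, and a production limiter would also expire idle entries and run far closer to the packet path.

```python
import time
from typing import Callable, Dict, Tuple


class PerIpTokenBucket:
    """Allow `rate_per_s` sustained requests per IP, with bursts up to `burst`."""

    def __init__(self, rate_per_s: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_s
        self.burst = burst
        self._clock = clock
        self._state: Dict[str, Tuple[float, float]] = {}  # ip -> (tokens, last_seen)

    def allow(self, ip: str) -> bool:
        now = self._clock()
        tokens, last = self._state.get(ip, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._state[ip] = (tokens, now)
        return allowed
```

Each IP gets an independent bucket, so one noisy source exhausts only its own allowance.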
V8 Isolate Architecture
- NOT containers: Containers have ~50-500ms cold start. Unacceptable for edge latency targets. Instead, use V8 isolates, the same sandboxing primitive Chrome's JavaScript engine uses to separate execution contexts.
- Per-Worker isolation: Each customer's Worker runs in its own V8 isolate, which can be reused across requests to that Worker. Memory is not shared between customers. An isolate for Customer A cannot access Customer B's data.
- Cold start: <5ms. V8 isolates spin up 100x faster than containers. Many are pre-warmed (kept alive between requests to the same Worker).
- Resource limits: CPU time: 50ms per request (free tier), 30s (paid). Memory: 128MB per isolate. No filesystem access. Network: only outbound fetch().
- Execution model: Worker intercepts request BEFORE cache lookup. Can rewrite request, generate response, or pass through to cache/origin. This makes Workers a programmable proxy layer.
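The execution model in the last bullet (Worker first, then cache, then origin) can be modeled as a small pipeline. This is a conceptual sketch in Python rather than Worker JavaScript: the `Request`, `Response`, `handle`, and `example_worker` names are hypothetical, and a real Worker calls `fetch()` itself instead of returning a rewritten request.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, Optional, Union


@dataclass(frozen=True)
class Request:
    path: str
    headers: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Response:
    status: int
    body: bytes


WorkerFn = Callable[[Request], Union[Request, Response]]


def example_worker(req: Request) -> Union[Request, Response]:
    if req.path == "/health":
        return Response(200, b"ok")       # generate a response at the edge
    if req.path == "/old":
        return replace(req, path="/new")  # rewrite before the cache lookup
    return req                            # pass through unchanged


def handle(req: Request, worker: Optional[WorkerFn], cache: Dict[str, Response],
           fetch_origin: Callable[[Request], Response]) -> Response:
    if worker is not None:
        outcome = worker(req)
        if isinstance(outcome, Response):
            return outcome  # cache and origin are never consulted
        req = outcome
    if req.path in cache:
        return cache[req.path]
    resp = fetch_origin(req)
    cache[req.path] = resp
    return resp
```

Because the Worker runs before the cache lookup, a rewrite changes which cache entry is used, which is what makes the Worker a programmable proxy layer.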
| Data | Store | Why This Store |
|---|---|---|
| Cached content | Local SSD (per DC) | LRU cache of HTTP responses. Key = URL + Vary headers. Tiered: hot (memory) → warm (SSD). Eviction under memory pressure. |
| Configuration | KV Store (replicated) | Customer rules, DNS zones, WAF policies. Replicated to all 330+ DCs within seconds via Quicksilver-like system. |
| Workers KV | Distributed KV | Key-value store accessible from Workers. Eventually consistent globally. Used by customers for edge state. |
| DNS zones | In-memory per DC | Authoritative DNS records loaded at startup and updated via config push. Must be in memory for sub-ms response. |
| Analytics & logs | ClickHouse (central) | Request logs aggregated centrally. Not on hot path: shipped async. Powers the customer analytics dashboard. |
| TLS certificates | Local disk per DC | Customer SSL certs cached locally. Fetched from central store on first request and cached. Auto-renewed. |
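The cached-content row above gives the cache key as URL plus Vary headers. One way to derive such a key is sketched below; the function name, the hashing choice, and the `vary_on` parameter are assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Mapping


def cache_key(url: str, headers: Mapping[str, str],
              vary_on: Iterable[str] = ()) -> str:
    """Build a deterministic key from the URL and the selected request headers."""
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    parts = [url]
    # Sort header names so the key does not depend on the order they were listed in.
    for name in sorted(h.lower() for h in vary_on):
        parts.append(f"{name}={lowered.get(name, '')}")
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()
```

Headers not listed in `vary_on` are ignored, so two requests that differ only in an unlisted header share one cache entry, which keeps the hit ratio high.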
- System: A global KV distribution system (similar to Cloudflare's internal "Quicksilver") pushes config changes to all 330 DCs.
- Mechanism: Central API writes config delta → publishes to distribution layer → each DC subscribes and applies locally. NOT a database replicated globally; it's a purpose-built eventually-consistent config store.
- Target: <5 seconds from API call to config live at every DC worldwide.
- Volume: 25M domains × average 1 change/day = ~300 config changes/second globally. Small payloads (deltas, not full configs).
- Consistency: Eventually consistent. Brief window where some DCs have old rules, others have new. Acceptable for WAF rules (a few seconds of old rules is fine). NOT acceptable for certificate rotation (must be atomic, so it uses a separate mechanism).
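The volume estimate above is easy to check, and it also yields the fan-out load on the distribution layer. The constants come from the figures in this section; the fan-out line assumes every change is delivered to every DC, which is an assumption of this sketch.

```python
DOMAINS = 25_000_000
CHANGES_PER_DOMAIN_PER_DAY = 1
DATA_CENTERS = 330
SECONDS_PER_DAY = 86_400

# Writes arriving at the central API.
changes_per_second = DOMAINS * CHANGES_PER_DOMAIN_PER_DAY / SECONDS_PER_DAY

# Deliveries leaving the distribution layer if every DC receives every delta.
deliveries_per_second = changes_per_second * DATA_CENTERS
```

The write rate works out to roughly 290 per second, consistent with the ~300 figure, while total deliveries land in the tens of thousands per second. That gap is why payloads are kept to small deltas.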
- Shared-nothing per request: Every request is processed independently. No shared state between customers on the hot path. Customer A's volumetric DDoS attack doesn't consume application resources for Customer B because it is dropped at L3/4 before reaching the application layer; L7 floods are contained by per-customer rate limiting.
- Fair resource allocation: Per-customer rate limits on API calls. Per-domain CPU budget for Workers. Cache eviction based on hit rate (cold content evicted before hot content regardless of customer).
- Blast radius: A misconfigured WAF rule for Customer A only affects Customer A's traffic. Other customers' traffic flows through a completely independent rule evaluation (different config lookup).
- Per-customer analytics: Request count, bandwidth, cache hit ratio, threat events, Worker CPU time; all available via dashboard and GraphQL API.
- Per-DC health: Request throughput, error rate, cache hit ratio, DDoS mitigation events, BGP route stability.
- Global: Config propagation latency, inter-DC connectivity, origin reachability per customer.
| Extension | Architecture Impact |
|---|---|
| Durable Objects (strongly consistent edge state) | Single-writer per object, colocated with the user. Enables real-time collaboration, game state, counters at the edge. Fundamentally different from KV (eventually consistent). |
| AI inference at the edge | Run ML models in Workers. Requires GPU resources at edge DCs. Scheduling challenge: which DCs have GPU capacity? |
| Zero Trust / SASE | Corporate traffic routed through Cloudflare (not just web traffic). Identity-based access policies. Extends from "protect websites" to "protect corporate networks." |
| R2 (S3-compatible object storage) | Leverages edge network for delivery but needs centralized (or regionally replicated) storage for persistence. Different consistency model from cache. |
How does Anycast handle a data center going offline?
Anycast is a routing technique, implemented with BGP, in which multiple DCs announce the same IP prefix. Routers on the internet choose the "nearest" DC through BGP best-path selection (AS-path length and local routing policy), which usually but not always matches geographic proximity. If a DC goes offline, it stops announcing the BGP route; within 30-90 seconds, internet routers converge and traffic automatically flows to the next nearest DC. There's no DNS change, no failover script, no manual intervention. The elegance is that this works at the network layer, below any application logic. The 30-90 second convergence time is the main limitation: during this window, some users might experience timeouts as their packets are still routed to the dead DC. To mitigate this, Cloudflare does health-checking within DCs: if a server is unhealthy, the DC's internal load balancer removes it before it affects BGP. The DC only withdraws its BGP route if the ENTIRE DC is unreachable (power failure, network cut). For individual server failures, the impact is absorbed internally.
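The split between server-level and DC-level failure handling in this answer can be expressed as two small decisions. This is a hedged sketch: the function names and the `uplink_ok` flag are hypothetical, and real deployments use richer health signals than a boolean per server.

```python
from typing import Dict, List


def servers_in_rotation(server_health: Dict[str, bool]) -> List[str]:
    """Servers the DC's internal load balancer keeps sending traffic to."""
    return sorted(name for name, ok in server_health.items() if ok)


def should_announce(server_health: Dict[str, bool], uplink_ok: bool) -> bool:
    """Keep announcing the anycast prefix unless the whole DC is unusable."""
    return uplink_ok and any(server_health.values())
```

A single failed server shrinks the rotation but leaves the BGP announcement untouched, so the internet never sees a routing change and no convergence delay is incurred.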
How do you push a configuration change to 330 data centers in under 5 seconds?
This is the Quicksilver problem. Traditional approaches fail: database replication is too slow (consensus across 330 nodes), and CDN cache invalidation is eventually consistent with unknown latency. Cloudflare built a purpose-built KV distribution system. The architecture: (1) a customer makes a change via the API (central), (2) the change is written to a central store and assigned a monotonic sequence number, (3) every DC maintains a persistent connection to the central distribution layer, (4) changes are pushed to all DCs in parallel (not serially), (5) each DC applies changes in sequence-number order, ensuring consistency. The system is optimized for small, frequent writes (rule changes, DNS updates) rather than bulk data. It uses a custom binary protocol over persistent TCP connections. The 5-second target is p95; most changes propagate in <2 seconds. For a DC that is temporarily unreachable, the distribution layer retains the change log, and on reconnection the DC replays everything after its last applied sequence number. This is eventually consistent by design: two DCs might briefly disagree on a config, which is acceptable because the window is <5 seconds.
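Steps (2) and (5) of this answer, sequence-numbered deltas applied strictly in order, can be sketched as a replica that buffers out-of-order arrivals. The `ConfigReplica` class, the `Delta` tuple layout, and the `catch_up` helper are illustrative assumptions, not a description of Quicksilver's internals.

```python
from typing import Dict, List, Optional, Tuple

# (sequence number, key, new value or None to delete the key)
Delta = Tuple[int, str, Optional[str]]


class ConfigReplica:
    """Per-DC config store that applies deltas strictly in sequence order."""

    def __init__(self) -> None:
        self.applied_seq = 0
        self.data: Dict[str, str] = {}
        self._pending: Dict[int, Delta] = {}

    def receive(self, delta: Delta) -> None:
        seq = delta[0]
        if seq <= self.applied_seq:
            return  # duplicate delivery, already applied
        self._pending[seq] = delta
        # Apply every delta that is now contiguous with what has been applied.
        while self.applied_seq + 1 in self._pending:
            _, key, value = self._pending.pop(self.applied_seq + 1)
            if value is None:
                self.data.pop(key, None)
            else:
                self.data[key] = value
            self.applied_seq += 1


def catch_up(replica: ConfigReplica, log: List[Delta]) -> None:
    """After a reconnect, replay everything newer than the replica's position."""
    for delta in log:
        if delta[0] > replica.applied_seq:
            replica.receive(delta)
```

Because a gap blocks application of later deltas, two replicas that have applied the same sequence number always hold identical data, even if deliveries arrived in different orders.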
Why V8 isolates instead of containers for Workers?
Startup time and density. A container (even a lightweight one) takes 50-500ms to cold start and uses 10-50MB of memory. A V8 isolate starts in <5ms and uses <1MB. On an edge server handling 10,000 requests/second across 100 different customer Workers, containers would need 100 containers consuming 1-5GB of memory. V8 isolates for the same workload use <100MB. The tradeoff: V8 isolates only support JavaScript/WASM (no arbitrary binaries), and they have strict limits (128MB memory, 50ms CPU per request for free tier). But for the edge computing use case (lightweight request transformation, A/B routing, header manipulation, simple API responses), these limits are rarely hit. Security isolation comes from V8's sandbox: each isolate has its own heap and cannot access another isolate's memory. This is the same isolation model Chrome uses between tabs. It's not as strong as container/VM isolation (no kernel separation), but it's sufficient for the threat model (customer code manipulating HTTP requests, not running untrusted binaries).
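The density comparison in this answer is simple arithmetic, reproduced here so the inputs are easy to vary. The per-unit figures are the ones quoted above, with 1MB taken as the upper bound for an isolate; the helper name is hypothetical.

```python
from typing import Tuple


def footprint_mb(units: int, per_unit_mb: float) -> float:
    """Total memory in MB for `units` sandboxes of `per_unit_mb` each."""
    return units * per_unit_mb


WORKERS = 100
CONTAINER_MB: Tuple[int, int] = (10, 50)  # low and high per-container estimates
ISOLATE_MB = 1                            # upper bound per isolate

container_total_gb = tuple(footprint_mb(WORKERS, mb) / 1000 for mb in CONTAINER_MB)
isolate_total_mb = footprint_mb(WORKERS, ISOLATE_MB)
```

At 100 Workers the gap is 1-5GB versus at most 100MB; the same helper shows the gap growing linearly as the tenant count per server rises.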