- What outcome is the product optimizing for? โ "Meaningful social interactions" (MSI) โ not raw engagement time. After 2018, Facebook shifted from maximizing time-on-site to prioritizing interactions between friends over passive content consumption. This shapes feed ranking: a friend's post with 2 comments outranks a viral video with 10K views. The architecture must support re-ranking by interaction quality, not just click-through rate.
- Core features? Profiles, friend connections, posting, news feed? Or also groups, pages, marketplace, stories, events? โ Focus on the core social graph + news feed. Mention others as extensions.
- Feed type? Chronological or ranked/algorithmic? โ Ranked feed is more interesting to design. Chrono as fallback.
- Content types? Text, photos, videos, links? โ Text + photos in scope. Video as extension (different pipeline).
- Notifications? In scope? โ Acknowledge but keep out of deep dive. Similar pattern to feed fanout.
- Scale? โ ~2 billion registered users, ~500M DAU. Average user has ~300 friends.
- Geography? โ Global. Multi-region is in scope for discussion.
| In Scope | Out of Scope |
|---|---|
| User profiles & auth | Messenger / real-time chat |
| Social graph (friends) | Groups & pages |
| Create posts (text + photos) | Video upload & streaming |
| News feed (ranked) | Stories, Reels |
| Likes & comments | Marketplace, Events, Ads |
| Media storage & serving | Search (beyond friend lookup) |
- UC1: User opens app โ sees personalized, ranked news feed of posts from friends
- UC2: User creates a post (text + optional photos) โ post appears in friends' feeds
- UC3: User likes or comments on a post โ visible to post author and other viewers
- UC4: User sends/accepts friend request โ bidirectional connection established
- Feed latency โ <500ms to render first screen of feed. Stale content (seconds old) is acceptable.
- Availability > consistency for feed โ it's better to show a slightly stale feed than to show nothing.
- Post creation must be durable โ once a user hits "Post," we must not lose it. Strongly consistent write.
- Social graph must be strongly consistent โ if I unfriend someone, they should not see my posts. Eventual is not acceptable here.
- Read-heavy โ feed reads vastly outnumber post writes. Most users consume, few produce.
- Global โ users everywhere. Latency-sensitive reads need to be served from nearby regions.
| Requirement | Decision | Why (and what was rejected) | Consistency |
|---|---|---|---|
| Feed reads at 150K QPS, <500ms | Hybrid fanout (write for normal, read for celebrities) | Fanout-on-write for 500 friends = 500 Redis writes. For 50M followers, write storm is unacceptable โ pull at read time. | AP |
| Posts must never be lost | PostgreSQL for posts (ACID) | User-generated content is irreplaceable. Redis cache is reconstructable; posts are not. | CP |
| Social graph at millions of QPS | PostgreSQL + Memcached (not Neo4j) | Adjacency list handles complex queries. Memcached absorbs 95% of reads. Neo4j doesn't scale horizontally. | AP |
| Media at petabyte scale | S3 + CDN (not database BLOBs) | Blobs don't belong in relational DBs. S3 = infinite scale, CDN = edge caching. 95% of media reads served from CDN. | โ |
| Async fanout, notifications, analytics | Kafka (not synchronous calls) | Post.created event consumed independently by fanout, notification, and analytics services. Decoupled failure domains. | โ |
| Feed staleness < 30s acceptable | Eventual consistency for feed reads | Showing a 30-second-stale feed is infinitely better than showing nothing during a partition. | AP |
๐ฑ Client Apps CLIENT
- Web, iOS, Android
- REST API for CRUD, polling/long-poll for feed refresh
๐ API Gateway + LB EDGE
- AuthN (JWT), rate limiting, routing
- SSL termination, geographic routing
๐ Post Service WRITE PATH
- Create / edit / delete posts
- Store in Posts DB, publish event to fanout
๐ฐ Feed Service HOT PATH
- Serve precomputed feed from cache
- Merge with celebrity posts at read time
- Apply ranking model
๐ข Fanout Service ASYNC
- On new post โ push post ID into friends' feed caches
- Hybrid: fanout-on-write for normal users, skip for celebrities
๐ฅ Social Graph Service CORE
- Friend requests, accept/reject, unfriend
- "Get friends of user X" โ used by fanout
๐ผ๏ธ Media Service MEDIA
- Upload photos โ S3/blob store
- Async resize/compress โ multiple sizes
- Serve via CDN
๐ Notification Service ASYNC
- Consumes events: likes, comments, friend accepts
- Push notifications, in-app badge counts
Fanout-on-Write (Push Model)
- When user posts โ immediately push post_id into every friend's feed cache
- โ Feed reads are instant โ just fetch from precomputed cache
- โ Read latency: ~10ms (cache hit)
- โ Write amplification: 1 post โ 300 cache writes (avg)
- โ Celebrity problem: 1 post from a user with 10M friends โ 10M cache writes
- โ Wasted work: pushing to inactive users who never read
Fanout-on-Read (Pull Model)
- When user loads feed โ query all friends' recent posts, merge, rank, return
- โ No write amplification โ posts stored once
- โ No wasted work โ only computed when user actually opens feed
- โ Slow reads: must query 300+ friends' posts, merge, rank at request time
- โ At 150K reads/sec, this creates enormous DB load
Feed Cache Architecture
Fanout Service Design
- Triggered by: Kafka event
post.created - Step 1: Check author's friend count. If >10K โ mark as celebrity, skip fanout, add author to
celebrity_usersset. - Step 2: Query Social Graph for author's friend list.
- Step 3: Filter to only active users (logged in within 30 days) โ don't waste writes on dormant accounts.
- Step 4: For each active friend, ZADD post_id to their
feed:{friend_id}sorted set in Redis. - Step 5: Batch the writes. Process in chunks of 1000 friends with pipelining.
- Throughput: With 300M posts/day and avg 200 active friends (after filtering), that's ~60B feed inserts/day or ~700K/sec. A Redis cluster with 20-30 shards handles this.
Feed Ranking
- Lightweight ranking model applied at read time after merging. Not a full ML inference per request (too slow at 150K/sec).
- Pre-computed features stored on each post: engagement_score (likes + comments + shares, decayed by time), author_affinity_score (how much the reader interacts with this author), content_type_weight (photos rank higher than text-only).
- Ranking formula:
score = engagement ร affinity ร recency_decay ร content_weight - Affinity scores are precomputed offline (batch job) and cached per user-pair. Updated daily.
?cursor=1707890400&limit=20. This is stable even as new posts are inserted (unlike offset which shifts).Post Write Path
- Step 1: Client uploads photo(s) directly to Media Service via presigned S3 URL โ returns
media_id. This decouples large file upload from the post creation API. - Step 2: Client sends
POST /postswith text +[media_ids]. Post Service writes to Posts DB (PostgreSQL). - Step 3: Post Service publishes
post.createdevent to Kafka. - Step 4: Async consumers: Fanout Service (feed), Media Processing (resize/compress), Notification Service.
Media Processing Pipeline
- Trigger: Kafka event
media.uploaded - Processing: Generate 4 sizes (thumbnail 150px, small 320px, medium 720px, large 1080px). Strip EXIF for privacy. Apply content moderation (async ML scan).
- Storage: All sizes to S3. CDN pulls from S3 origin.
- Serving: Client receives CDN URL pattern:
cdn.fb.com/{media_id}/{size}.jpg. Client picks size based on viewport.
| Data | Store | Access Pattern | Consistency |
|---|---|---|---|
| User Profiles | PostgreSQL + Redis cache | Read by user_id, low write freq | Strong |
| Social Graph | MySQL (sharded) + Redis cache | Get friends, check friendship. 700K reads/sec from fanout | Strong for unfriend, eventual for display |
| Posts | PostgreSQL (sharded by author_id) | Write ~3.5K/sec, read by post_id (cache-backed) | Strong writes, eventual reads from cache |
| Feed Cache | Redis Cluster (~25TB) | Sorted set per user. 700K writes/sec (fanout), 150K reads/sec | Eventual (best-effort) |
| Post Cache | Redis / Memcached | Hydrate feed: MGET 20 posts per feed load | Eventual (TTL: 5 min) |
| Like/Comment Counts | Redis (counter cache) โ flush to Postgres | Increment on action, read on feed render | Eventual (seconds stale) |
| Photos / Media | S3 + CDN | Write 150TB/day. Read from CDN edge. | Eventual (CDN TTL) |
| Events | Kafka | post.created, post.liked, friend.accepted, etc. | Ordered per partition |
| Affinity Scores | Precomputed (Spark) โ Redis | Read at feed ranking time. Batch updated daily. | Stale by design (daily batch) |
{posts: [{id, author, content, media_urls, like_count, comment_count, created_at}], next_cursor}{content, media_ids[], visibility}Response:
{post_id, created_at}{content}{target_user_id}{action: "accept" | "reject"}{content_type, file_size}Response:
{upload_url, media_id} โ client PUTs file directly to S3| Data | Store | Why This Store |
|---|---|---|
| Posts & comments | PostgreSQL (sharded) | Sharded by user_id. Post creation requires ACID. Write-once, read-many. Eventual consistency acceptable for reads. |
| Social graph | PostgreSQL + Memcached | Adjacency list in DB, heavily cached. "Friends of X" is the hottest query. Graph traversal uses TAO-like cache layer. |
| News feed cache | Redis | Precomputed feed per user (fanout-on-write for normal users). List of post_ids, ~500 entries. TTL-based invalidation. |
| Media (images/video) | S3 + CDN | Original โ S3. Resized variants generated async. CDN serves 95%+ of media requests. Origin fallback for cold content. |
| Session & auth tokens | Redis | Short-lived tokens with TTL. Distributed across regions. Invalidated on password change. |
| Event stream | Kafka | post.created, post.liked, comment.added. Consumed by fanout service, notifications, analytics, ad targeting. |
| At Scale | What Breaks | Mitigation |
|---|---|---|
| 10ร (5B DAU) | Feed cache size explodes (~250TB Redis). Fanout throughput needs 7M/sec writes. Social graph DB shards become hot. | Tiered caching: only cache feeds for active users (opened app in 7 days). Shard Redis by user_id hash. Move social graph to TAO-like custom cache layer. |
| 100ร current | Single Kafka cluster can't handle event volume. Ranking model inference becomes bottleneck. Cross-region latency for global users. | Regional Kafka clusters. Pre-score posts during fanout (embed ranking features on write). Full multi-region deployment with independent feed services per region. |
| Data | Model | Rationale |
|---|---|---|
| Post creation | Strong | User expects to see their own post immediately. Read-your-writes consistency. |
| Feed content | Eventual (seconds) | Slightly stale feed is invisible to users. Availability > freshness. |
| Friendship state | Strong for unfriend/block | Privacy: unfriended user must not see new posts. Eventual for display (friend list). |
| Like/comment counts | Eventual (seconds) | Approximate counts are fine. Denormalized counters flushed periodically. |
| Celebrity feed merge | Eventual (minutes) | Pull-at-read caches celebrity posts for 1-2 min TTL. |
- Golden signals per service: Feed Service p99 latency, Fanout lag (time from post.created to last feed cache write), Post write success rate.
- Business metrics: Feed engagement rate (likes/impressions), feed freshness (age of newest post at load time), fanout completion rate within 5s.
- Distributed tracing: Trace from post creation โ Kafka โ fanout โ feed cache write. Trace from feed load โ cache read โ hydration โ ranking โ response.
- Alerting: Fanout lag > 30s, feed p99 > 1s, Post DB replication lag > 2s, CDN hit ratio drops below 95%.
- Visibility enforcement: Feed Service checks post visibility (public/friends/only_me) against the viewer's relationship to the author. This check happens AFTER hydration, before returning to client.
- Unfriend/block is synchronous: On unfriend, immediately invalidate the ex-friend's feed cache entries from this user. On block, add to a block list checked at feed render time.
- Photo privacy: EXIF stripping on upload. Photo URLs are not guessable (UUID-based paths). Can add signed URLs with expiration for extra protection.
- Rate limiting: Post creation: 50/day per user. Likes: 500/hour. Friend requests: 100/day. Prevents spam and abuse.
| Extension | Why It Matters | Architecture Impact |
|---|---|---|
| Groups & Pages | Content beyond friend graph | New entity types in feed. Group membership is a separate graph. Fanout becomes topic-based (group subscribers), not just friend-based. |
| Video | Highest engagement content | Completely separate pipeline: transcoding (FFmpeg farm), adaptive bitrate (HLS/DASH), dedicated video CDN. Much more expensive than photos. |
| Stories | Ephemeral, high-engagement format | Separate from feed. TTL-based storage (24h auto-delete). Ring-based UI means different access pattern than infinite scroll feed. |
| Real-Time Notifications | Engagement & retention driver | WebSocket/SSE connection layer. Fan-out similar to feed but smaller (notify post author, not all friends). Priority queue for different notification types. |
| Full ML Ranking | Dramatically improves feed quality | Feature store (precomputed user/post features). Lightweight model served at read time. A/B testing infrastructure to measure engagement impact. |
| Content Moderation | Policy compliance, safety | Async ML pipeline on post.created events. Flag/remove before fanout completes. Separate review queue for borderline content. |
| Multi-Region | Global latency | Regional feed caches and CDN. Cross-region replication for social graph. Posts replicated async. User routed to nearest region. |
Why fanout-on-write for regular users but fanout-on-read for celebrities?
It's a cost tradeoff driven by follower count. When a regular user with 500 friends posts, fanout-on-write means we push that post_id into 500 feed caches โ 500 Redis writes, done in <100ms via pipeline. The reader path is trivial: read their precomputed feed from Redis. But when a celebrity with 50M followers posts, fanout-on-write would mean 50M Redis writes โ that could take minutes and create a massive write spike. So for celebrities, we don't write to anyone's feed. Instead, when a user loads their feed, the Feed Service merges their precomputed feed (from regular friends) with a real-time pull of recent posts from celebrities they follow. The threshold (say, >500K followers) is tunable. This hybrid approach means 99% of users get instant feed reads, and the 0.01% who are celebrities don't cause write storms.
How would you handle a post that goes viral โ 10M likes in an hour?
The like counter is the bottleneck, not the like event itself. If we increment a counter in PostgreSQL for every like, that single row becomes a write hotspot. The solution is multi-layered: (1) likes are published to Kafka immediately โ the event is durably captured, (2) a counter service aggregates likes in Redis using INCR (which is atomic and handles thousands of increments/sec), (3) periodically (every 30s), the Redis counter is flushed back to PostgreSQL as a batch update. The user sees a "fuzzy" count that's accurate within 30 seconds โ which is fine because Facebook already shows "10M" not "10,247,391." For the post author's notification, we don't send 10M individual notifications โ we batch: "Your post received 1M new likes" every hour. The Kafka stream also feeds the ranking algorithm to boost the post's visibility.
What's your strategy for cache invalidation on the social graph?
This is one of the hardest problems. The social graph cache is a TAO-like layer where "friends of user X" is cached in Memcached. Invalidation happens on writes: when user A unfriends user B, we invalidate both A's and B's friend lists. The tricky case is transitivity: if A's privacy settings say "friends of friends can see my posts," then unfriending B potentially changes visibility for all of B's friends โ but we do NOT eagerly invalidate all of those. Instead, we use a short TTL (5 minutes) on graph cache entries and accept that for up to 5 minutes, someone might see a post they technically shouldn't. This is a deliberate consistency tradeoff: real-time graph consistency at scale would require invalidating millions of cache entries on every friend/unfriend action, which happens thousands of times per second globally. The 5-minute staleness window is acceptable for the social use case.
How do you prevent a single user's data from being spread across too many shards?
We shard by user_id, which means ALL of a user's data (posts, friends, settings) lives on the same shard. This is critical because the most common query pattern is "load user X's profile and recent posts" โ a single-shard query. The risk is hot shards: a celebrity's shard handles more reads. We mitigate this with: (1) read replicas per shard โ the celebrity's shard might have 5 replicas while a quiet shard has 2, (2) the Memcached layer absorbs most reads so the DB only sees cache misses, and (3) we can split a hot shard by moving ranges of user_ids to a new shard (this is a background operation). What we explicitly avoid is cross-shard queries โ things like "find all posts mentioning keyword X" go through a separate search index (Elasticsearch), not a scatter query across all user shards.
If you had to choose between losing a post and showing a duplicate post, which and why?
Show the duplicate. The user can scroll past a duplicate, but a lost post (especially one with sentimental value โ a birth announcement, a memorial) is unrecoverable. This is why post creation uses write-ahead logging to Kafka before acknowledging to the user โ even if the database write fails, the event is in Kafka and can be replayed. The duplicate scenario happens when the client retries after a timeout: the post was actually written, but the acknowledgment was lost. We use client-generated idempotency keys (UUID per post attempt) to detect and suppress duplicates at the API layer. If the dedup check fails (Redis with the idempotency key is down), we accept the duplicate and rely on a background dedup job that scans for identical content from the same user within a 60-second window. Availability and durability over perfect consistency.