- What outcome are we optimizing for? โ Watch time (total minutes viewed), not views. A 10-minute video watched to completion is worth more than a clickbait video abandoned after 5 seconds. Secondary: creator ecosystem health (upload rate, creator revenue). This shapes the recommendation engine (optimize for predicted watch time, not clicks) and the transcoding pipeline (quality must be good enough that users don't abandon due to buffering).
- Upload vs. streaming priority? โ Streaming is 1000:1 more frequent than uploads. But upload reliability matters for creators โ a failed upload of a 2-hour video is devastating. Design both, optimize for streaming.
- Live streaming in scope? โ Out of scope for now. Fundamentally different pipeline (real-time transcoding vs. batch).
| In Scope | Out of Scope |
|---|---|
| Video upload & transcoding pipeline | Live streaming |
| Video streaming (adaptive bitrate) | Shorts / vertical video |
| Home feed / recommendations | Monetization / ads |
| Likes, comments, subscriptions | Content moderation (mention only) |
| Search | Creator Studio / analytics dashboard |
- Video start time <2s โ users abandon if buffering takes too long. CDN + adaptive bitrate are essential.
- Upload processing within 10โ30 min โ creator sees "processing" then "published." Async pipeline.
- Massive storage โ petabytes of video. Cost optimization is a first-class concern.
- Global โ users everywhere. CDN with multi-region origin.
- Read-heavy โ ~1000:1 watch-to-upload ratio. System optimized for streaming reads.
๐ค Upload Service WRITE
- Resumable upload (tus protocol)
- Store raw video to S3
- Publish transcode job to queue
๐ฌ Transcoding Pipeline ASYNC HEAVY
- FFmpeg workers (auto-scaling pool)
- Encode to 5 resolutions (360pโ4K)
- Generate HLS/DASH manifests
๐บ Streaming Service HOT PATH
- Serve video manifest + segments
- Adaptive bitrate selection
- CDN-first, origin fallback
๐ฐ Feed / Rec Service READ
- Personalized home feed
- ML ranking model
- Precomputed candidate lists
๐ Search Service READ
- Elasticsearch for video metadata
- Title, description, tags, captions
๐ Video Metadata Service CORE
- Video info, channel, view counts
- Likes, comments
- Heavily cached
| Requirement | Decision | Why (and what was rejected) | Consistency |
|---|---|---|---|
| Transcoding takes 15-30 min โ can't block upload | Kafka/SQS queue โ worker pool | Acknowledge upload in seconds (file in S3). Transcode async with auto-scaling workers. Synchronous would timeout HTTP connections. | โ |
| Video storage at petabyte/month scale | S3 with lifecycle tiering | Hot (CDN) โ Warm (S3 Standard) โ Cold (Glacier). 80% of videos have <1K views โ tiered storage saves millions/month. | โ |
| Analytics: "views per country per hour" over billions | ClickHouse (not PostgreSQL) | Columnar store scans 1B+ rows/sec reading only needed columns. PostgreSQL reads full rows โ 10-100x slower for aggregations. | AP |
| Video start time <2s globally | CDN + Adaptive Bitrate Streaming (HLS) | CDN edge caching + client-side ABR means first segment loads from nearest PoP at optimal quality. Origin rarely hit. | โ |
| Metadata needs relational integrity | PostgreSQL for video metadata | Videos belong to channels, have tags, comments, playlists โ relational. Search index (ES) is secondary, derived from Postgres. | CP |
| Upload spikes after events (Super Bowl) | Auto-scaling worker pool, not fixed capacity | Workers scale 100โ1000+ based on queue depth. Fixed capacity wastes money 99% of the time or fails during spikes. | โ |
- Resumable upload: Large videos (multi-GB) need resumable uploads. Use the tus protocol or Google's resumable upload API. Client can resume from last byte on network failure. Upload directly to S3 via presigned multipart upload.
- Transcoding DAG: Not a single job โ it's a directed acyclic graph: (1) probe input format โ (2) transcode to 5 resolutions in parallel โ (3) generate HLS segments + manifests โ (4) generate thumbnails โ (5) extract/generate captions โ (6) content moderation scan โ (7) update metadata โ (8) push to CDN origin.
- Worker pool: Spot/preemptible instances for cost (60-80% savings). Jobs are idempotent and retryable โ if a spot instance is reclaimed, the job retries on another. Kubernetes with auto-scaling based on queue depth.
- Adaptive Bitrate Streaming (ABR): Client player downloads the HLS master manifest, which lists all available resolutions. Player starts with a low resolution for fast start, then switches up as bandwidth estimate stabilizes. On bandwidth drop โ seamlessly drops to lower quality. This is the standard HLS/DASH behavior.
- CDN strategy: Multi-CDN (CloudFront + Akamai + Fastly) for redundancy and geo optimization. Popular videos cached at edge (thousands of PoPs). Long-tail videos cached at regional PoPs. Cache key:
{video_id}/{resolution}/segment_{n}. - Prewarming: When a video is published, pre-push the first few segments of each resolution to CDN edges in the creator's region. For viral videos, aggressively pre-warm globally.
- Two-stage architecture: (1) Candidate generation โ narrow 1B videos down to ~1000 using lightweight models (collaborative filtering, content-based, subscription-based). (2) Ranking โ score the 1000 candidates with a heavier ML model (deep neural net) that predicts engagement probability. Return top 20.
- Candidate sources: Subscribed channels (recent uploads), watch history similarity (collaborative filtering), topic/category affinity, trending videos, and "explore" candidates for diversity.
- Ranking features: Video age, view velocity, user's watch history, user-channel affinity, video length preference, time of day, device type.
- Serving: Pre-generate candidate lists offline (batch, hourly). Ranking is online (per request) but fast (~50ms with model serving via TensorFlow Serving / TorchServe).
| Data | Store | Access Pattern | Consistency |
|---|---|---|---|
| Video Files | S3 (tiered) + CDN | Write once, read many. CDN-first. ~1 EB total. | Eventual (CDN) |
| Video Metadata | PostgreSQL + Redis cache | Read by video_id. Heavily cached for watch page. | Strong write, eventual read |
| View / Like Counts | Redis (counter) โ flush to DB | 5B increments/day. Approximate display. | Eventual (30s stale) |
| Comments | PostgreSQL (sharded by video_id) | Append-heavy. Read paginated by video_id. | Strong |
| Subscriptions | PostgreSQL + Redis cache | "Who does user X subscribe to?" for feed. Graph-like. | Strong |
| Search Index | Elasticsearch | Video title, description, tags, auto-captions. | Near-real-time |
| Rec Candidates | Feature Store (Redis/Bigtable) | Pre-generated per user. Read at feed load. | Stale (hourly batch) |
| Transcode Jobs | SQS / Kafka | Job queue for transcoding workers. | At-least-once |
| View Events | Kafka โ ClickHouse | Streaming analytics. 58K events/sec. | Eventual |
| Data | Store | Why This Store |
|---|---|---|
| Raw uploads | S3 | Original video files. Write-once, read by transcoding pipeline. Lifecycle policy โ Glacier after 90 days. |
| Transcoded segments | S3 + CDN | HLS/DASH segments at 5 resolutions. CDN-first serving. Origin fallback for cold videos (<10 views/day). |
| Video metadata | PostgreSQL | Title, description, upload date, channel, privacy settings. Relational joins for channel pages. |
| View counts & analytics | ClickHouse | Columnar store for time-series analytics. "Views per country per hour" queries across billions of rows. |
| Search index | Elasticsearch | Video metadata + auto-generated captions. Full-text search with relevance scoring. |
| Transcode job queue | Kafka / SQS | Durable queue for upload โ transcode pipeline. Dead-letter queue for failed transcodes. |
| Recommendation features | Redis | Precomputed candidate lists per user. Refreshed hourly by ML pipeline. ~1000 video_ids per user. |
- Storage tiering: ~90% of videos get <1% of total views. Move to Glacier after inactivity. Saves ~70% storage cost.
- Codec optimization: AV1 encoding reduces bandwidth ~30% vs. H.264 at same quality. Encode popular videos in AV1; long-tail stays H.264 (not worth the encode cost).
- CDN cost: Multi-CDN bidding โ route traffic to cheapest CDN per region. Peer-to-peer assist for live events.
| Extension | Architecture Impact |
|---|---|
| Live Streaming | Completely different pipeline: ingest (RTMP) โ real-time transcode โ low-latency HLS (LL-HLS) โ CDN. No pre-processing possible. |
| Shorts | Vertical video format. Different transcoding profiles. Feed is swipe-based (infinite scroll of short videos), preloading next 3-5 videos. |
| Content Moderation | ML pipeline on upload: nudity, violence, copyright (Content ID fingerprinting). Block before publish or flag for human review. |
| Offline Downloads | DRM-encrypted video stored on device. License server controls access. Separate download manifest from streaming manifest. |
Why not transcode on upload instead of queueing?
Because transcoding is CPU-intensive and variable-duration. A 10-minute 4K video takes ~15-30 minutes to transcode into all 5 resolutions. If we did this synchronously, the upload API would need to hold the connection open for 30+ minutes โ HTTP timeouts would kill it. More importantly, upload traffic is spiky: after a major event (Super Bowl, product launch), thousands of creators upload simultaneously. A synchronous model would require provisioning for peak transcoding capacity at all times. The queue-based model lets us: (1) acknowledge the upload in seconds (file is in S3), (2) transcode at our own pace with backpressure (workers pull from queue), (3) auto-scale workers based on queue depth, (4) retry failed transcodes without re-uploading. The creator sees "Processing..." for 10-30 minutes, which is an accepted UX pattern. We do prioritize: verified creators and scheduled premieres get higher queue priority.
How would you handle a video that suddenly gets 100M views in an hour?
The CDN handles this naturally โ that's its entire purpose. The video segments are cached at edge PoPs worldwide. After the first few thousand views, the segments are cached at every major PoP, and 99.9% of requests never hit origin. The real challenges are: (1) view counting โ 100M Kafka events/hour needs partitioning by video_id so a single consumer isn't overwhelmed, (2) comment section โ this becomes a write hotspot, so we use the same approach as Facebook likes: Redis INCR for the counter, Kafka for the event stream, batch flush to PostgreSQL, (3) recommendation cascade โ the video appears in many users' "Trending" feeds, which increases load on the recommendation serving layer. But since recs are pre-computed in Redis, this is just more cache reads. The one thing we'd actively do is ensure the video's origin copy is replicated to additional S3 regions in case the primary region's CDN connectivity degrades.
Why ClickHouse for analytics instead of PostgreSQL?
YouTube analytics queries are almost always "aggregate X over time window, grouped by dimension." Example: "total watch time per country per day for the last 90 days." This is a columnar workload: you're scanning billions of rows but only reading 3-4 columns (watch_time, country, date, video_id). ClickHouse is purpose-built for this: columnar storage means it only reads the columns needed, compression ratios are 10-20x better than row-based stores, and it can scan 1B+ rows/second on a modest cluster. PostgreSQL would need to read entire rows, can't compress as effectively, and aggregate queries would take minutes instead of seconds. The tradeoff: ClickHouse has no transactions, limited UPDATE support, and no foreign keys. But analytics data is append-only (events are immutable), so we don't need any of those features.
How do you decide which resolution to serve during adaptive bitrate streaming?
The client decides, not the server โ that's the key insight. We use HLS (HTTP Live Streaming) where the manifest file lists all available resolutions with their bandwidth requirements. The video player measures its actual download speed by timing how long each 2-second segment takes to fetch. If the last segment downloaded at 5Mbps, the player selects the highest resolution that requires โค5Mbps. If speed drops (user enters a tunnel), the player transparently switches to a lower resolution โ the next segment is fetched at 480p instead of 1080p. The server's job is just to have all segments pre-transcoded and available. The CDN serves whatever resolution the client requests. This client-side ABR is why we transcode into 5 resolutions upfront: 360p (0.5Mbps), 480p (1Mbps), 720p (2.5Mbps), 1080p (5Mbps), 4K (20Mbps). Each is independently cached at the CDN.
What's your content moderation pipeline?
It's a multi-stage funnel designed to catch harmful content before it reaches viewers while minimizing false positives. Stage 1 (upload time): lightweight ML classifiers check thumbnail and first 30 seconds for NSFW content, known CSAM hashes (PhotoDNA), and Copyright ID audio fingerprints. Takes <60 seconds. If flagged with high confidence (>95%), the video is blocked and queued for human review. Stage 2 (pre-publish): full-video analysis runs during transcoding โ scene classification, speech-to-text for hate speech detection, frame-by-frame nudity detection. If flagged, the video stays in "processing" state (creator doesn't know it's flagged). Stage 3 (post-publish): community reports trigger re-review. Viral videos (>10K views/hour) get automatic re-scan with higher-sensitivity models. Stage 4: human reviewers handle appeals and ambiguous cases. The key tension: aggressive moderation catches bad content but also blocks legitimate content (false positives). We tune for high precision on auto-blocks and high recall on "queue for human review."