- What outcome are we optimizing for? → Link click-through rate and analytics accuracy. A URL shortener's value isn't the redirect; it's the data about who clicked, when, and from where. Secondary: redirect latency (every millisecond matters because the redirect is in the critical path of user intent). This shapes the architecture: the redirect path must be the fastest thing in the system (<10ms), while analytics can be eventually consistent (batch-processed seconds later).
- Read:write ratio? → Massively read-heavy. A link is created once but clicked thousands or millions of times, for a ratio of roughly 100:1 to 1000:1. Optimize entirely for the redirect hot path.
- Link lifetime? → Most clicks happen within 48 hours of creation (social media spike), but links must work forever: a broken short link destroys trust.
| In Scope | Out of Scope |
|---|---|
| Shorten a URL → short code | User accounts / auth (assume API key) |
| Redirect short URL → original | Link-in-bio pages |
| Custom aliases (vanity URLs) | QR code generation |
| Click analytics (count, geo, device) | A/B testing / link routing |
| Link expiration (optional TTL) | Branded domains |
- Redirect latency <10ms: this is the single most critical path. Every extra ms affects millions of users.
- Read:write ratio ~100:1 or higher: links are created once and clicked thousands to millions of times. Massively read-heavy.
- High availability: if the redirect service is down, every short link on the internet is broken.
- Short codes must be unique, with no collisions. Eventual consistency on analytics is fine.
- Links are effectively immutable: once created, the mapping never changes (unless deleted). This is extremely cache-friendly.
Shortening Service (WRITE)
- Generate unique short code
- Store mapping in DB + cache
Redirect Service (HOT PATH)
- Resolve short code → URL
- 301/302 redirect, <10ms
- Read from cache (Redis), DB fallback
Analytics Service (ASYNC)
- Consume click events from Kafka
- Aggregate: clicks, geo, device, referrer
- Store in time-series / OLAP store
ID Generator (INFRA)
- Pre-generate unique IDs
- Base62-encode to short code
- No collision, no coordination needed
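The Base62 step used by the ID Generator can be sketched as follows. This is a minimal illustration, not the system's actual encoder: the alphabet ordering (digits, then lowercase, then uppercase) and the function names are assumptions, and any fixed 62-character ordering would work equally well.

```python
import string

# Assumed alphabet order: 0-9, a-z, A-Z. Any fixed ordering of 62 symbols works.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)  # 62


def base62_encode(n: int) -> str:
    """Encode a non-negative integer ID as a base62 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(chars))


def base62_decode(code: str) -> int:
    """Inverse of base62_encode."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

With this alphabet, every ID below 62^7 fits in at most 7 characters, which is where the 7-character code length comes from.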
| Requirement | Decision | Why (and what was rejected) | Consistency |
|---|---|---|---|
| Redirect latency <10ms (the product IS the redirect) | Redis cache-first for URL lookups | 95% cache hit rate. Power law: 1% of URLs get 90% of clicks. Cache miss → PostgreSQL fallback. Redis GET is <1ms. | AP |
| URL mappings must be durable | PostgreSQL for URL store (not just Redis) | Redis is volatile. URL mappings are permanent records: a broken link damages customer trust. PostgreSQL provides durability + ACID. | CP |
| 100:1 read-to-write ratio | Separate read path (Redis) from write path (PostgreSQL) | Writes go to PostgreSQL, then populate Redis. Reads never touch PostgreSQL unless cache miss. Each path scales independently. | N/A |
| Click analytics at billions of events | Kafka → ClickHouse (not PostgreSQL) | Append-only event stream. ClickHouse columnar scan for "clicks by country by hour" is 100x faster than PostgreSQL for aggregations. | AP |
| Short codes must be globally unique | Pre-allocated ID ranges per server (Snowflake-style) | No centralized counter bottleneck. Each server has its own range. Base62 encoding produces short strings. Coordination only on range allocation. | N/A |
Options Compared
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Hash (MD5/SHA) | Hash the long URL, take first 7 chars | Deterministic, same URL → same code | Collisions guaranteed at scale. Need collision resolution loop → DB contention. |
| Random | Generate random 7-char base62 string | Simple | Must check for collision in DB. At 180B URLs, about 5% of the 62^7 (≈3.5 trillion) code space is taken, so roughly 1 in 20 new codes collides and needs a retry. |
| Counter (auto-increment) | Central counter, base62-encode | Zero collisions, sequential | Single point of failure. Sequential codes are predictable (security concern). |
| Range-based (Snowflake-like) | Each server gets a pre-allocated range of IDs. Base62-encode. | Zero collisions, no coordination at write time, horizontally scalable | Slightly more complex setup. Unused IDs if server dies mid-range. |
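To put a number on the "Random" row, the short calculation below estimates how often a random 7-character base62 code collides. It is a back-of-the-envelope sketch that assumes codes are drawn uniformly at random; the function names are illustrative, and the second function uses the standard birthday-bound approximation.

```python
import math

CODE_LENGTH = 7
SPACE = 62 ** CODE_LENGTH  # 3,521,614,606,208 possible codes


def collision_probability(existing: int, space: int = SPACE) -> float:
    """Chance that one freshly generated random code is already taken."""
    return existing / space


def birthday_threshold(space: int = SPACE, p: float = 0.5) -> float:
    """Approximate number of random codes at which the chance of at
    least one collision among them reaches p (birthday approximation)."""
    return math.sqrt(2 * space * math.log(1 / (1 - p)))
```

At 180B stored URLs, `collision_probability(180_000_000_000)` comes out at about 0.05, so the write path would need a retry loop. The birthday bound also shows that the first collision is likely after only a few million random codes, long before the space fills up.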
Custom Aliases (Vanity URLs)
- User provides desired code (e.g., `bit.ly/my-brand`). Check uniqueness in DB. If taken → error. If free → insert.
- Custom aliases use a separate namespace: they don't consume from the ID range. Stored in the same table with an `is_custom` flag.
- Validation: alphanumeric + hyphens, 3-30 chars, no reserved words.
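The validation rule for custom aliases could look roughly like this. It is a sketch under assumptions: the reserved-word list is hypothetical (the text only says reserved words are rejected), and the uniqueness check against the database is deliberately left out.

```python
import re

# Hypothetical reserved list; the source does not enumerate reserved words.
RESERVED_WORDS = {"api", "admin", "analytics", "health"}

# Alphanumeric characters and hyphens, 3 to 30 characters long.
ALIAS_PATTERN = re.compile(r"[A-Za-z0-9-]{3,30}")


def validate_alias(alias: str) -> bool:
    """Return True if the alias passes the format and reserved-word checks.

    Uniqueness is not checked here; that is left to the database.
    """
    if ALIAS_PATTERN.fullmatch(alias) is None:
        return False
    return alias.lower() not in RESERVED_WORDS
```

For example, `validate_alias("my-brand")` passes, while `"ab"` (too short) and `"my brand"` (contains a space) are rejected.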
Caching Strategy
- L1: Application-local cache. In-process LRU cache (1GB per server), ~1ms. Holds the hottest 10M entries per node. Hit rate: ~70%.
- L2: Redis Cluster. Distributed cache, ~3ms. Holds all recently-accessed entries (~50GB). Hit rate: ~99%+ after L1 miss.
- L3: Database. DynamoDB or sharded Postgres, ~10-20ms. Cold fallback. Should handle <1% of traffic.
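The three-tier lookup above can be sketched as a read-through resolver. Everything here is a simplified stand-in: a plain dict plays the role of Redis, `db_lookup` is a hypothetical callable for the database read, and the class names are illustrative.

```python
from collections import OrderedDict
from typing import Callable, Dict, Optional


class LRUCache:
    """Tiny in-process LRU cache (the L1 tier)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data = OrderedDict()  # short_code -> long_url, oldest first

    def get(self, key: str) -> Optional[str]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: str) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used


class TieredResolver:
    """L1 (local LRU) -> L2 (dict standing in for Redis) -> L3 (database)."""

    def __init__(self, l1: LRUCache, l2: Dict[str, str],
                 db_lookup: Callable[[str], Optional[str]]) -> None:
        self.l1 = l1
        self.l2 = l2
        self.db_lookup = db_lookup

    def resolve(self, code: str) -> Optional[str]:
        url = self.l1.get(code)
        if url is not None:
            return url
        url = self.l2.get(code)
        if url is None:
            url = self.db_lookup(code)  # cold path
            if url is None:
                return None
            self.l2[code] = url  # populate L2 on the way back
        self.l1.put(code, url)  # populate L1 on the way back
        return url
```

Because mappings are effectively immutable, populating both tiers on the way back carries little staleness risk, which is what makes the high hit rates plausible.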
301 vs 302 Redirect
- 301 (Permanent): Browser caches the redirect. Subsequent clicks from that browser never hit our servers. Great for performance, bad for analytics: we can't count repeat clicks.
- 302 (Temporary): Browser hits us every time. We can count every click. Slightly worse performance for the end user.
- Decision: Default to 302 to enable analytics. Offer 301 as an option for users who don't need analytics and want maximum performance.
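The decision above amounts to a small per-link switch. The sketch below assumes a hypothetical `analytics_enabled` setting on each link. The `Cache-Control` values are also assumptions added for illustration; the text itself only specifies the status codes.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Link:
    long_url: str
    analytics_enabled: bool = True  # hypothetical per-link setting


def build_redirect(link: Link) -> Tuple[int, Dict[str, str]]:
    """Return (status_code, headers) for a resolved link."""
    if link.analytics_enabled:
        # 302: the browser comes back on every click, so each one is counted.
        # "no-store" is an assumed header value, not taken from the source.
        return 302, {"Location": link.long_url, "Cache-Control": "no-store"}
    # 301: the browser may cache the redirect and skip our servers next time.
    # The one-day max-age is an arbitrary illustrative value.
    return 301, {"Location": link.long_url,
                 "Cache-Control": "public, max-age=86400"}
```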
- Capture: On every redirect, publish a lightweight event to Kafka: `{short_code, timestamp, ip, user_agent, referrer}`. This is fire-and-forget and doesn't block the 301/302 response.
- Stream processing: Kafka → Flink/Spark Streaming → aggregate by link_id, time window (1min, 1hr, 1day), geo (IP → country), device (user-agent parsing).
- Storage: Pre-aggregated counters in Redis (real-time) + raw events to ClickHouse/Druid (historical analytics).
- Query API: `GET /analytics/{code}?range=7d` reads from pre-aggregated data. Response in <200ms.
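The time-window aggregation in the stream-processing step can be illustrated in plain Python. This is an in-memory sketch rather than a Flink or Spark job: events are assumed to be dicts carrying `short_code` and an integer Unix `timestamp`, and windows are simple tumbling windows aligned to the epoch.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Window sizes named in the text, in seconds.
WINDOW_SECONDS = {"1min": 60, "1hr": 3600, "1day": 86400}


def window_start(timestamp: int, window: str) -> int:
    """Round a Unix timestamp down to the start of its tumbling window."""
    size = WINDOW_SECONDS[window]
    return timestamp - (timestamp % size)


def aggregate_clicks(events: Iterable[dict],
                     window: str) -> Dict[Tuple[str, int], int]:
    """Count clicks per (short_code, window_start)."""
    counts = Counter()
    for event in events:
        key = (event["short_code"], window_start(event["timestamp"], window))
        counts[key] += 1
    return dict(counts)
```

The resulting `(short_code, window_start) → count` pairs are the kind of pre-aggregated rows the query API would read, so a 7-day range touches a bounded number of counters instead of raw events.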
| Data | Store | Access Pattern | Consistency |
|---|---|---|---|
| URL Mappings | DynamoDB | Key-value by short_code. 500K reads/sec (from cache). | Strong on write, eventual read (cache) |
| Hot URL Cache | Redis Cluster + local LRU | Two-tier: local (~1ms) + Redis (~3ms). 99%+ hit rate. | Eventual (immutable data = no staleness) |
| Real-Time Stats | Redis (counters) | Increment on click. Read for dashboard. | Eventual (seconds) |
| Historical Analytics | ClickHouse | Columnar scan. Aggregate queries over time ranges. | Eventual (Kafka lag, seconds) |
| Click Events | Kafka | 115K events/sec avg. Partitioned by short_code. | Ordered per partition |
| ID Ranges | PostgreSQL (small) | Allocated per server. Accessed every ~1M requests. | Strong (transactional) |
| Data | Store | Why This Store |
|---|---|---|
| URL mappings | PostgreSQL | short_code → long_url. Sharded by short_code hash. The core lookup table: must be fast and durable. |
| Hot URL cache | Redis | 90% of redirects hit <1% of URLs. LRU cache with TTL. Cache miss → read-through to PostgreSQL. |
| Click events | Kafka → ClickHouse | Every redirect generates a click event. Kafka for durability. ClickHouse for analytics queries. |
| User accounts & API keys | PostgreSQL | Relational data. API key → user mapping. Rate limit counters per key. |
| Link metadata | PostgreSQL | Custom aliases, tags, expiration dates, QR codes. Lower traffic than redirect path. |
- Link preview: Add a `+` suffix to any short URL to see the destination without redirecting (e.g., `bit.ly/abc1234+`).
- Malware/phishing scanning: Check destination URL against Safe Browsing API at creation time. Periodic re-scan for existing links.
- Rate limiting: 100 URLs/min per API key for creation. No rate limit on redirects (public).
- Enumeration protection: Short codes are base62-encoded integers, but ranges are non-sequential per server. Not easily enumerable. Custom aliases are user-chosen and therefore not predictable.
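The creation rate limit (100 URLs/min per API key) could be enforced with a sliding-window log, sketched below. This is a single-process illustration with an injectable clock. A real deployment would need shared state across API servers, which is not shown. The class name is hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key in any `window_seconds` span."""

    def __init__(self, limit: int = 100, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # api_key -> timestamps of accepted requests

    def allow(self, api_key: str) -> bool:
        now = self.clock()
        hits = self._hits[api_key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Only the creation endpoint would call `allow`; the redirect path stays unthrottled, as stated above.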
| Extension | Architecture Impact |
|---|---|
| Link-in-bio pages | New entity type (page → list of links). Different read pattern (render page vs. redirect). |
| A/B testing (split URLs) | Redirect service needs routing rules per link. Short code maps to multiple destinations with traffic weights. |
| Geo-targeted redirects | IP → country lookup at redirect time. Route to different URLs per geo. Slight latency increase. |
| Deep link support (mobile) | User-agent detection at redirect. Route to app store or app deep link vs. web URL. |
How do you generate globally unique short codes without a centralized counter?
We use a combination of techniques depending on the link type. For auto-generated codes, each API server holds a pre-allocated range of IDs drawn from a central sequence (similar in spirit to Twitter Snowflake, in that no per-ID coordination is needed). Server A gets IDs 1 to 1,000,000 and Server B gets 1,000,001 to 2,000,000. Within its range, each server increments locally with no coordination, and the ID is Base62-encoded to produce a short string. For custom aliases (vanity URLs like bit.ly/my-brand), we do a uniqueness check against PostgreSQL. This requires either a distributed lock or a unique constraint on the short_code column; we use the DB constraint because custom aliases are rare (maybe 1% of link creations). The pre-allocated range approach lets us create links at high throughput without a centralized bottleneck, and the ranges are large enough that a server rarely exhausts one and has to request another.
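A minimal sketch of the range-allocation idea follows. `RangeStore` is an in-memory stand-in for the central sequence (which the design keeps in a small transactional table), and both class names are hypothetical. The point is that the store is contacted only once per range, not once per ID.

```python
import threading
from typing import Tuple


class RangeStore:
    """In-memory stand-in for the central sequence that hands out ranges."""

    def __init__(self, range_size: int) -> None:
        self.range_size = range_size
        self._next_start = 1
        self._lock = threading.Lock()

    def allocate(self) -> Tuple[int, int]:
        """Return an inclusive (first_id, last_id) range no one else holds."""
        with self._lock:
            start = self._next_start
            self._next_start += self.range_size
        return start, start + self.range_size - 1


class IdGenerator:
    """Per-server generator: increments locally, refills when exhausted."""

    def __init__(self, store: RangeStore) -> None:
        self._store = store
        self._lock = threading.Lock()
        self._next, self._last = store.allocate()

    def next_id(self) -> int:
        with self._lock:
            if self._next > self._last:
                # Range used up: the only point that touches the central store.
                self._next, self._last = self._store.allocate()
            value = self._next
            self._next += 1
            return value
```

If a server dies mid-range, the IDs it had not yet handed out are simply never used, which is the "unused IDs" cost noted in the comparison table.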
What's your cache invalidation strategy when a link is edited or deleted?
We use a write-through cache pattern: when a link is updated in PostgreSQL (URL changed, link disabled, expiration set), the same API call also updates or invalidates the Redis cache entry. For link deletion or disabling, we set a "tombstone" in Redis: the key still exists but maps to a 404/410 response. This keeps requests for the deleted link from falling through to PostgreSQL (which would also return nothing, but with higher latency). The tombstone has a 24-hour TTL. After it expires, requests for the deleted code hit PostgreSQL (which returns nothing), and that miss is cached only briefly (negative caching with a short TTL, so that a re-created code becomes visible quickly). For global cache consistency across PoPs, updates are published to a Kafka topic, and each PoP's cache subscriber invalidates locally. This leaves a brief window (typically <2 seconds) in which a link edit made in one region is stale in another. For the redirect use case this is acceptable: serving the old URL for 2 seconds is harmless.
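The write-through and tombstone behavior can be sketched with an in-memory cache. The assumptions are stated plainly: a dict stands in for PostgreSQL, another dict with expiry timestamps stands in for Redis, the status codes returned are illustrative, and the short negative-cache TTL and cross-PoP Kafka invalidation are omitted for brevity.

```python
import time
from typing import Callable, Dict, Optional, Tuple

TOMBSTONE = object()        # sentinel meaning "deleted, answer 410"
TOMBSTONE_TTL = 24 * 3600   # 24 hours, as described in the text


class LinkCache:
    def __init__(self, db: Dict[str, str],
                 clock: Callable[[], float] = time.time) -> None:
        self.db = db              # stand-in for PostgreSQL
        self.clock = clock
        self._entries = {}        # code -> (value, expires_at or None)

    def update(self, code: str, url: str) -> None:
        """Write-through: persist first, then refresh the cache entry."""
        self.db[code] = url
        self._entries[code] = (url, None)

    def delete(self, code: str) -> None:
        """Remove from the store and leave a tombstone in the cache."""
        self.db.pop(code, None)
        self._entries[code] = (TOMBSTONE, self.clock() + TOMBSTONE_TTL)

    def resolve(self, code: str) -> Tuple[int, Optional[str]]:
        entry = self._entries.get(code)
        if entry is not None:
            value, expires_at = entry
            if expires_at is None or self.clock() < expires_at:
                if value is TOMBSTONE:
                    return 410, None   # answered without touching the DB
                return 302, value
            del self._entries[code]    # expired entry, fall through
        url = self.db.get(code)
        if url is None:
            return 404, None
        self._entries[code] = (url, None)
        return 302, url
```

While the tombstone is live, requests for a deleted code never reach the database; once it expires, the lookup falls through and the database's empty answer is returned.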