- Restate the problem in your own words — confirm your understanding before designing anything.
- Identify 2–3 core use cases — ask "what must this system absolutely do?" Prioritize ruthlessly.
- Users & scale — Who are the users? DAU/MAU? Growth trajectory?
- Access patterns — Read-heavy or write-heavy? Bursty or steady? Real-time or batch?
- Non-functional requirements — Latency (p50/p99)? Consistency vs. availability? Durability? Compliance?
- Define scope explicitly — "I'll focus on X and Y, treating Z as out of scope — sound good?"
- Write requirements on the Excalidraw canvas — this becomes your contract for the rest of the interview.
- Traffic — DAU × avg requests/user ÷ 86,400 = QPS. Separate reads & writes. Apply peak multiplier (2–5×).
- Storage — Avg object size × new records/day × retention. Distinguish hot vs. cold data.
- Bandwidth — Write QPS × payload size (in). Read QPS × response size (out).
- Memory (caching) — 80/20 rule: cache 20% of daily data. Cache size = 0.2 × daily reads × avg size.
- Highlight the numbers that drive decisions — "At 50K reads/sec, we need a cache. At 2TB/day, we need partitioning early."
- Start at the client, work inward — Client → CDN/LB → API Gateway → Services → Data Stores.
- Draw read path and write path separately — narrate each as you draw.
- Label each service with its responsibility (e.g., "User Service," "Feed Service").
- Show data stores with what they hold and rough tech choice. Don't over-commit yet.
- Show async pathways — queues, event buses, consumers.
- Show external dependencies — third-party APIs, CDNs, DNS.
"Does this high-level flow make sense before I go deeper?""I think the most interesting challenges are X and Y — shall I start there?"Data Model & Schema
- Core entities & relationships
- Primary access patterns
- Schema evolution over time
- Sketch key tables/collections
Storage Technology
- Why this tech over alternatives?
- Frame as tradeoff: gain X, lose Y
- Match to access pattern
- Name what you considered & rejected
Partitioning & Sharding
- Partition key choice & rationale
- Hot partition mitigation
- Re-partitioning strategy
- Interaction with access patterns
Caching Strategy
- Which layer(s) benefit?
- Policy: aside / through / behind
- Invalidation: TTL / event / version
- Thundering herd mitigation
Async Processing
- What's async and why?
- Delivery: at-least-once / exactly-once
- Failures, retries, dead-letter queues
- Consumer idempotency
API Design
- Key endpoints & methods
- Pagination (cursor vs. offset)
- Rate limiting & auth
- Versioning approach
Fault Tolerance
- Single node failure → what happens?
- AZ failure → what happens?
- Downstream timeout → what happens?
- Circuit breakers, bulkheads, retries w/ backoff
- Graceful degradation
Consistency Model
- Where eventual? Where strong? Why?
- Conflict resolution strategy
- Stale reads: cache vs. DB
Scalability Bottlenecks
- What breaks at 10×? At 100×?
- Walk each component: compute, storage, network
- Hardest thing to scale?
Security
- AuthN/AuthZ (OAuth, JWT, RBAC)
- Encryption at rest & in transit
- Input validation, rate limiting
- PII handling
Observability
- Golden signals: latency, traffic, errors, saturation
- Structured logging
- Distributed tracing
- Alerting without fatigue
Operational Readiness
- Deployment strategy (blue/green, canary)
- Runbooks & on-call
- Capacity planning
- Chaos engineering
- Summarize in 2–3 sentences — the architecture, key decisions, and why they fit the requirements.
- Acknowledge deferred work — "I intentionally kept X simple — in production I'd add Y."
- System evolution — What changes at 10× or 100×? What's in v2 vs. v1?
- Operational investments — Better observability, chaos engineering, automated scaling.
- Tech debt & migration paths — "The denormalized schema works now, but we'd need CDC for analytics later."
📦 Data Storage
| Option | Reach For It When… | Watch Out For… |
|---|---|---|
| Relational | ACID transactions, complex joins, strong consistency, stable schema | Horizontal write scaling is hard |
| Document | Flexible schema, denormalized reads, key-document access | No joins — pay with data duplication |
| Wide-Column | Massive write throughput, time-series, range scans | Query patterns must be known upfront |
| Key-Value | Ultra-low latency lookups by key. Sessions, caches, flags | Zero query flexibility beyond the key |
| Graph | Relationships ARE the data — social graphs, fraud detection | Niche. Don't use just for foreign keys |
| Search | Full-text search, faceted filtering, log analytics | Not a primary store — always pair with source of truth |
| Time-Series | Metrics, IoT, event logs. Append-heavy, time-range queries | Limited non-time access patterns |
| Object Store | Blobs: images, videos, files, backups. Cheap & durable | High latency. Not for transactional data |
🔗 Communication Patterns
| Option | Reach For It When… | Watch Out For… |
|---|---|---|
| REST | Request-response, CRUD, public APIs, broad compatibility | Over-fetching / under-fetching, chatty |
| GraphQL | Flexible queries, multiple frontends with different data needs | Server complexity, N+1, harder caching |
| gRPC | Service-to-service, latency/bandwidth matters, streaming, typed contracts | Not browser-friendly, binary = harder debug |
| Queue | Point-to-point async, task offloading, write buffering | Not great for fan-out, consumed once |
| Event Stream | Fan-out, multiple consumers, replay, high throughput, event-driven | Ops complexity, ordering per partition only |
| WebSocket | Real-time bidirectional — chat, collab editing, live updates | Stateful connections complicate scaling |
| SSE | Server→client push, one-directional. Live feeds, notifications | One-way only, less control than WS |
⚙️ Compute & Deployment
| Option | Reach For It When… | Watch Out For… |
|---|---|---|
| Monolith | Early stage, small team, simple domain. Don't prematurely distribute | Can't scale individual components |
| Microservices | Independent teams, independent deploy, different scaling profiles | Distributed systems complexity everywhere |
| Serverless | Bursty traffic, event-driven glue, low ops overhead | Cold starts, time limits, vendor lock-in |
| Containers + K8s | Many services needing orchestration, discovery, auto-scaling | Significant ops overhead. Overkill for simple systems |
what needs to scale and why, not deployment tooling.🗄️ Caching
| Layer | When | Invalidation |
|---|---|---|
| Client / Browser | Static assets, predictable-TTL API responses | TTL, cache-control headers |
| CDN | Static content, media, geo-distributed users | TTL, purge on deploy |
| App (Redis) | Hot reads, query results, sessions, aggregations | Cache-aside + TTL (default), event-driven |
| Local / In-proc | Config, feature flags, small lookups | Short TTL, risk of cross-instance inconsistency |
⚡ Event-Driven Patterns
| Pattern | When | Tradeoff |
|---|---|---|
| Notification | "Something happened" — lightweight trigger | Consumers may need callback for details |
| State Transfer | Include full state so consumers don't call back | Larger payloads, data duplication |
| Event Sourcing | Full audit trail, rebuild state by replaying events | Complexity, eventual consistency |
| CQRS | Read/write models have different shapes or scale needs | Two models to maintain, sync lag |
| Saga | Distributed transactions where 2PC isn't feasible | Complex failure handling, compensating txns |
🔒 Consistency & Coordination
| Pattern | When | Cost |
|---|---|---|
| Strong (single leader) | Financial txns, inventory — stale reads cause harm | Higher latency, write bottleneck |
| Eventual | Feeds, analytics, notifications — staleness OK | Simpler to scale, possible stale reads |
| Read Replicas | Scale reads without sharding, tolerate lag | Stale reads from replicas |
| Multi-Leader | Multi-region writes | Conflict resolution is hard |
| Consensus | Leader election, distributed locks, config mgmt | Quorum latency. Not for hot-path data |
🌐 Load Balancing & API Gateway
| Option | When | Note |
|---|---|---|
| L4 (TCP) | Raw throughput, simple round-robin, no content inspection | Fastest, least flexible |
| L7 (HTTP) | Path routing, header inspection, SSL termination, rate limiting | Most common choice |
| API Gateway | AuthN, rate limiting, request transform, multi-backend routing | Single entry point for clients |
| Service Mesh | mTLS, fine-grained traffic control, network-level observability | Only at significant microservice scale |
- 1. State the requirement — "We need low-latency reads at 100K QPS with a simple key-based access pattern."
- 2. Name the decision criteria — "I'm optimizing for read latency and horizontal scalability over query flexibility."
- 3. Make the choice — "A key-value store like DynamoDB fits here."
- 4. Name the tradeoff — "We give up ad-hoc querying, but our access patterns are well-defined."
- 5. Mention what you rejected — "I considered Postgres with read replicas, but at this QPS with key lookups, a purpose-built KV store is more efficient."
- Don't panic — Constraints aren't traps. Say: "Great constraint. Let me think about how that changes the design."
- Trace the impact — Walk through which components are affected and which aren't.
- Revise the diagram live — Cross out the old approach, draw the new one, narrate your reasoning.
- Acknowledge uncertainty — "I haven't worked with this exact pattern, but my instinct is X because of Y."
Think out loud. The interviewer evaluates your reasoning, not your final diagram. Silence is your enemy.
Drive the conversation. At principal level, you lead. Don't wait for "what about caching?" — bring it up.
Stay grounded in requirements. Every decision traces to Phase 1. Catch yourself gold-plating → refocus.
Name the tradeoffs. "We gain X, give up Y, acceptable because Z." This is the principal-level sentence pattern.
"It depends" is fine — but always follow with what it depends on.
Excalidraw is a living doc. Annotate decisions, write numbers, mark failure zones. It should tell the story alone.
Manage your time. If 10 min on one component with 2 more to go — say so and move on.
Treat interviewer as collaborator. Ask preferences. Respond to cues. It's a design conversation, not a presentation.