01Clarify the Problem & Scope5–7 min
"We're designing a cloud monitoring and observability platform. It ingests three pillars of observability data β€” metrics, logs, and traces β€” from customers' infrastructure. Users build dashboards, set alerts, and search logs to debug production issues."
Questions I'd Ask
  • What outcome are we optimizing for? β†’ Customer Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR). Datadog exists to make incidents shorter. This means: ingestion lag must be minimal (stale data = late alerts = longer MTTD), dashboards must load fast (slow dashboards = slower diagnosis = longer MTTR), and alerts must be reliable (false positive fatigue β†’ engineers ignore alerts β†’ longer MTTD). The architecture optimizes for FRESHNESS and QUERYABILITY of observability data.
  • Which pillars? β†’ Three pillars of observability: metrics (time-series numerical data), logs (unstructured text events), traces (distributed request tracking). Each has fundamentally different storage and query patterns.
  • Multi-tenant? β†’ Yes, thousands of customers sharing infrastructure. Noisy-neighbor isolation is critical β€” one customer's traffic spike can't degrade another's dashboards.
Agreed Scope
In ScopeOut of Scope
Metrics ingestion & time-series storageAPM / distributed tracing deep dive
Log aggregation & searchSynthetics / browser testing
Dashboards & visualizationSecurity monitoring (SIEM)
Alerting & notificationCI/CD integration
Agent-based collectionNetwork monitoring
Non-Functional Requirements
  • Ingestion latency <30s β€” metric data points should be queryable within 30 seconds of emission.
  • MASSIVELY write-heavy β€” millions of data points per second. Writes dominate reads 100:1.
  • Multi-tenant β€” thousands of customers sharing infrastructure. Strict tenant isolation. Noisy neighbor prevention.
  • Alerting must be reliable β€” missing an alert during an outage is the worst possible failure. Alert evaluation must be near-real-time (<60s latency).
  • Query flexibility β€” users query by arbitrary tag combinations: avg(cpu.usage){env:prod, service:api, region:us-east}. Tags are high-cardinality.
  • Retention tiering β€” full resolution for 15 days, downsampled for 15 months, aggregated for 5 years.
The defining tension: ingestion must handle millions of data points/sec (write-optimized, append-only), while queries need sub-second aggregation across arbitrary tag combinations (read-optimized, indexed). Time-series databases exist specifically to resolve this tension.
02Back-of-the-Envelope Estimation3–5 min
Metric Data Points / Sec
~10M
50K customers Γ— avg 200 hosts Γ— ~1000 metrics/host Γ· 10s interval
Unique Time Series
~1B
Each unique {metric_name + tag_set} = one series. High cardinality.
Metrics Storage / Day
~8 TB
10M pts/sec Γ— 86,400 sec Γ— 8 bytes/point (compressed)
Log Events / Sec
~5M
Avg 1KB per log line β†’ ~5 GB/sec β†’ ~430 TB/day raw
Dashboard Queries / Sec
~50K
Each dashboard has ~20 widgets, each widget = 1 query. ~2,500 dashboards viewed/sec.
Active Alerts
~5M
50K customers Γ— avg 100 alert rules. Each evaluated every 30-60s.
Key insight #1: 10M metric writes/sec is the primary throughput challenge. This requires a purpose-built time-series storage engine β€” no general-purpose DB can handle this.
Key insight #2: 5M alert evaluations every 30–60s = ~100K-170K queries/sec just for alerting. This is MORE traffic than user dashboards. The alerting engine is a bigger consumer of the query tier than humans are.
Key insight #3: Logs at 430 TB/day dwarf metrics in raw volume. Different storage strategy needed (append-only, columnar, compressed, full-text indexed).
03High-Level Design8–12 min
Key Architecture Decisions
"Here's WHY I chose each technology β€” mapping requirements to tradeoffs. Every choice has a rejected alternative and a consequence."
RequirementDecisionWhy (and what was rejected)Consistency
Millions of metrics/sec across thousands of tenantsCustom TSDB (not Prometheus/InfluxDB)Multi-tenant by design. Per-tenant quotas, dynamic tag cardinality, pre-aggregation at write. Single-tenant tools (Prometheus) can't do per-tenant isolation.β€”
Metrics, logs, and traces are different workloadsIndependent pipelines per data typeMetrics pipeline scales independently from logs pipeline. A log volume spike doesn't affect metrics latency. Shared pipeline would create noisy neighbor issues.β€”
Noisy neighbor isolationPer-tenant rate limits + quotas at every layerIngestion (Kafka partitioned by tenant_id), processing (per-tenant CPU quotas), query (per-tenant concurrency limits). One customer can't degrade service for others.β€”
15-month metric retentionPre-aggregation rollups: 1s β†’ 10s β†’ 60s β†’ 5mRaw 1s data for recent queries. Rolled up to 5m for old data. 300x storage reduction over 15 months.β€”
Logs: full-text search + structured queriesCustom columnar log store (not Elasticsearch)Columnar compression 10x better than ES for structured logs. Full-text index for search. ES struggles at Datadog's ingestion volume.β€”
Unified Ingestion Pipeline
Questions I'd Ask DD Agent on every host Intake API regional endpoints, LB Kafka partitioned by {tenant_id, data_type} Metrics Logs Traces Pipeline Pipeline Pipeline TSDB Log Store all three queryable via unified API dashboards + alerting consume query API Trace Store
High-Level Architecture CLIENTS EDGE / LOAD BALANCING MESSAGE QUEUE / ASYNC APPLICATION SERVICES DATA STORES archive πŸ–₯️ Customer Hosts servers Β· containers Β· k8s πŸ‘€ DevOps / SRE dashboards Β· alerts πŸ• DD Agent per-host collector 🌐 Intake API regional LBs πŸ“¨ Kafka partitioned by tenant Γ— type πŸ“Š Metrics Pipeline aggregate Β· store πŸ“ Logs Pipeline parse Β· index πŸ”— Traces Pipeline sample Β· analyze πŸ”” Alert Engine monitors Β· PagerDuty πŸ“ˆ Custom TSDB time-series metrics πŸ”Ž Log Store Elasticsearch / custom πŸ”— Trace Store ClickHouse πŸ“¦ S3 long-term archive

πŸ”Œ DD Agent EDGE

  • Runs on every customer host
  • Collects OS/container/app metrics
  • Tails log files, forwards
  • Batches + compresses for efficiency

πŸ“₯ Intake API INGESTION

  • Regional endpoints (US, EU, AP)
  • AuthN via API key
  • Rate limiting per tenant
  • Route: metrics β†’ Kafka topic, logs β†’ Kafka topic

πŸ“Š Metrics TSDB HOT PATH

  • Custom time-series engine
  • Inverted index on tags
  • Columnar, compressed storage
  • Rollup / downsampling

πŸ“ Log Store VOLUME

  • Full-text search + structured fields
  • 430 TB/day ingestion
  • ClickHouse or custom columnar

🚨 Alerting Engine CRITICAL

  • Evaluates 5M rules every 30-60s
  • Queries TSDB, evaluates threshold
  • Sends PagerDuty, Slack, email

πŸ“ˆ Dashboard Service READ

  • Users compose dashboards with widgets
  • Each widget = a TSDB query
  • Auto-refresh every 30s
04Deep Dives25–30 min
Deep Dive 1: Metrics Ingestion & TSDB (~12 min)
The core challenge: Ingest 10M data points/sec with arbitrary tag combinations, store for 15 days at full resolution, and query with sub-second aggregation across any tag filter.
── Metric Data Point ── { metric: "cpu.usage", value: 72.5, timestamp: 1707890400, tags: { host: "web-prod-42", env: "production", service: "api-gateway", region: "us-east-1", container_id: "abc123" // high cardinality! }, tenant_id: "customer_xyz" } ── Unique time series = metric_name Γ— unique tag_set ── "cpu.usage{host=web-42,env=prod}" is ONE time series ── A customer with 1000 hosts Γ— 500 metrics = 500K unique series

TSDB Architecture (inspired by Gorilla + Prometheus)

  • Write path: Kafka β†’ Metrics Consumer β†’ writes to in-memory buffer (WAL-backed). When buffer fills (every 2 hours), flush to compressed on-disk block. Each block: sorted by series ID, delta-of-delta timestamp encoding, XOR float compression. Achieves ~1.4 bytes/data point.
  • Tag index: Inverted index mapping tag-value β†’ set of series IDs. env:prod β†’ {series_1, series_5, series_99, ...}. Stored alongside TSDB blocks. Enables fast filter: "give me all series where env=prod AND service=api".
  • Query path: Parse query β†’ resolve tag filters via inverted index β†’ fetch matching series' data blocks β†’ decompress β†’ aggregate (avg, sum, max, p99) over requested time range β†’ return.
  • Sharding: By tenant_id + metric_name hash. Each shard handles a subset of time series. Replicated for HA.
Why custom TSDB over InfluxDB/TimescaleDB? At 10M pts/sec across 1B unique series, off-the-shelf TSDBs hit limits in ingestion throughput and tag cardinality. A custom engine (like Datadog's actual backend, or VictoriaMetrics/Thanos-style) gives control over compression, indexing, and multi-tenancy. Tradeoff: massive engineering investment. In an interview, acknowledge this and say "at smaller scale, I'd use InfluxDB or TimescaleDB."

Retention & Downsampling

  • 0–15 days: Full resolution (10s intervals). Stored in hot tier (SSD).
  • 15 days–15 months: Downsampled to 5-min intervals. Background job aggregates (avg, min, max, count, sum). Stored in warm tier (HDD/S3).
  • 15 months–5 years: 1-hour intervals. Cold tier (S3).
  • Downsampling reduces storage by ~30Γ— for the 15dβ†’15mo tier and another 12Γ— for the 15moβ†’5y tier.
Deep Dive 2: Log Aggregation (~8 min)
The core challenge: Ingest 5M log events/sec (430 TB/day), enable full-text search and structured field filtering, and return results in <5 seconds even over large time ranges.
  • Ingestion: Kafka β†’ Log Pipeline β†’ parse structured fields (JSON, key-value) β†’ extract log level, service, timestamp β†’ index β†’ store.
  • Processing: Log Pipeline applies customer-defined processing rules: grok parsing (extract fields from unstructured logs), remapping attributes, filtering (drop debug-level logs to save cost), enrichment (add geo data from IP).
  • Storage: ClickHouse (columnar, compressed). Each log line stored with: timestamp, message (full text), parsed attributes (service, level, host, etc.), tenant_id. Partitioned by day and tenant. Full-text indexed on the message field.
  • Query: Hybrid: structured field filters (ClickHouse columnar scan) + full-text search (inverted index on message). The combination is what makes log search powerful: service:api-gateway level:error "timeout exceeded".
Why ClickHouse for logs over Elasticsearch? At 430 TB/day, Elasticsearch's indexing overhead becomes prohibitively expensive (RAM for inverted indexes, JVM GC pressure). ClickHouse ingests at higher throughput with lower resource usage due to columnar compression. Tradeoff: ClickHouse's full-text search is less mature than ES's, so we build a lightweight inverted index (bloom filters per column partition) on top. For customers who need heavy full-text search, consider a hybrid with ES for a smaller retention window.
Deep Dive 3: Alerting Engine (~5 min)
The core challenge: Evaluate 5M alert rules every 30–60s. Each rule = a TSDB query + threshold condition. Missing an alert during a customer's outage is catastrophic for our business.
  • Architecture: Alert rules are distributed across a fleet of alert evaluator workers. Each worker owns a partition of rules (sharded by tenant_id). Every 30s, each worker evaluates its rules by querying the TSDB.
  • Evaluation: Rule: avg(cpu.usage){env:prod} > 90% for 5 min. Worker queries TSDB for the last 5 min of data, computes avg, compares to threshold. If breached β†’ fire alert. If recovered β†’ resolve alert.
  • State management: Each alert has state: OK, WARN, ALERT, NO_DATA. State transitions stored in PostgreSQL. Prevents duplicate notifications (only notify on state change).
  • Notification: On state change β†’ publish to notification queue β†’ fan-out to configured channels: PagerDuty, Slack, email, webhook. Retry on failure.
Why not stream processing (Flink) for alerting? Streaming would give lower latency (process each data point as it arrives). But 5M rules Γ— 10M data points/sec creates an explosion of evaluations. Batch evaluation every 30s is more efficient β€” one query covers all data points in the window. Tradeoff: worst-case 30s alerting latency, which is acceptable for most use cases. For ultra-low-latency alerting (<5s), offer a premium tier that uses stream processing for a subset of rules.
Deep Dive 4: Data Model & Storage Summary
DataStoreScaleAccess Pattern
Metrics (hot)Custom TSDB (SSD)10M pts/sec write, 1B series, 15d retentionWrite-heavy, aggregation queries by tags
Metrics (warm/cold)S3 + query engineDownsampled. Years of retention.Read-only historical queries
LogsClickHouse5M events/sec, 430 TB/dayFull-text search + structured filters
Alert Rules & StatePostgreSQL5M rules. Low write frequency.Read rules for evaluation, write state changes
Dashboard DefinitionsPostgreSQL~1M dashboards. User-configured.CRUD by user. Read on page load.
Ingestion BufferKafka10M+ events/sec combinedDecouple agents from storage. Absorb spikes.
Tenant MetadataPostgreSQL + cache50K tenants. API keys, rate limits, config.Read at auth time. Cached.
05Cross-Cutting Concerns10–12 min
Multi-Tenancy & Noisy Neighbor
  • Per-tenant rate limiting: At the Intake API, enforce ingestion limits per API key. Excess data β†’ queued, then dropped with warning.
  • Query isolation: All queries include tenant_id in the filter. TSDB sharding by tenant ensures one customer's heavy query doesn't starve others.
  • Fair scheduling: Alert evaluator and query engine use per-tenant quotas. Large tenants get proportionally more resources but can't monopolize.
Storage Architecture Summary
What goes where and why. Each data store is chosen for its access pattern β€” not by default. The question isn't "which database?" but "what are the read/write patterns, consistency requirements, and scale characteristics?"
DataStoreWhy This Store
Metrics (time-series) Custom TSDB Purpose-built for high cardinality. Tag-based indexing. Rollup: 1s β†’ 10s β†’ 60s β†’ 5m over retention.
Logs Custom log store Columnar storage with full-text index. Retention tiers: online (15 days) β†’ archive (S3, years).
Traces (APM) ClickHouse Distributed traces with spans. Sampled at ingestion (keep 100% of errors, sample OK traces).
Alert state PostgreSQL Monitor definitions, alert history, notification routing. Transactional for state machine.
Long-term archive S3 Metrics, logs, and traces archived for compliance. Rehydratable on demand.
Ingestion pipeline Kafka Partitioned by {tenant_id, data_type}. Decouples intake from processing. Backpressure handling.
Failure Scenarios
TSDB shard downReplica takes over reads. WAL replayed on recovery. Ingestion buffered in Kafka (hours of retention). Metrics delayed but not lost.
Kafka partition lagConsumers behind β†’ metrics delayed. Alerts may evaluate stale data β†’ better to alert on stale data than not alert at all. Dashboard shows "data may be delayed" warning.
Customer sends 100Γ— normal volumeRate limiter at Intake API caps ingestion. Excess dropped. Customer notified. Other tenants unaffected.
Alert evaluator fleet overloadedAlert evaluation falls behind schedule. PRIORITY: critical alerts evaluated first. Best-effort for info/warning severity. Auto-scale evaluator fleet.
Security & Access Control
Security & Access Control. Multi-tenant isolation is the primary security concern: thousands of customers share infrastructure, and a metric from Company A must never be visible to Company B. Isolation is enforced at every layer: (1) Kafka partitions include tenant_id β€” consumers only read their tenant's partitions, (2) TSDB queries inject tenant_id as a mandatory filter (enforced at the query engine level, not the application), (3) API keys are scoped to a specific org β€” an API key can only submit data to and query from its org, (4) customer data can be configured to stay in specific regions (EU-only for GDPR). The Datadog Agent running on customer hosts is open-source and auditable β€” customers can verify it only sends the metrics they configure. For enterprise customers: SAML SSO, audit logs of all query access, and RBAC on dashboards/monitors (an SRE can see infrastructure metrics but not application-level PII in logs). SOC 2 Type II and HIPAA BAA available. Data retention policies are configurable per customer.
Scalability
Scalability. Datadog ingests millions of data points per second across all tenants. The key scaling patterns: (1) Kafka-first architecture: all data hits Kafka before any processing, providing a durable buffer that absorbs traffic spikes. Partitioning by tenant_id ensures tenant data is processed in-order. (2) Horizontal pipeline scaling: metrics, logs, and traces have independent processing pipelines that scale independently. A customer sending 10x more logs doesn't affect metrics pipeline performance. (3) Multi-tenant storage: the custom TSDB is designed for high cardinality (millions of unique tag combinations per tenant). Metrics are pre-aggregated at ingestion: raw 1-second data points are rolled up to 10s, then 60s, then 5m as they age, reducing storage 300x over 15 months. (4) Query scaling: dashboards often have 50+ widgets, each a separate query. These are parallelized across the TSDB cluster. The noisy neighbor problem is addressed via per-tenant query quotas β€” a single customer running an expensive query can't slow down the cluster for others.
Monitoring & SLOs
Monitoring & SLOs. Key SLOs: metrics queryable within 30 seconds of emission (ingestion freshness), dashboard render time p95 <2s, alert evaluation lag <60 seconds. Per-tenant health monitoring: each customer has a health score based on ingestion rate, query error rate, and alert trigger reliability. Anomaly detection on ingestion volume β€” if a customer's metric volume drops 50% unexpectedly, a system health alert fires (possible agent misconfiguration). Internal dogfooding: Datadog monitors itself with Datadog β€” every internal service emits metrics and traces through the same pipeline. Incident management: automated runbooks for common failures (Kafka consumer lag, TSDB compaction storms, query timeout spikes). Error budget tracked per service.
06Wrap-Up & Evolution3–5 min
"This system is defined by WRITE-HEAVY ingestion across three distinct data types β€” metrics, logs, traces β€” each with different storage and query characteristics. The architecture unifies ingestion through Kafka, then splits into specialized storage engines: a custom TSDB for metrics (optimized for tag-based aggregation), ClickHouse for logs (optimized for columnar scan + text search), and the alerting engine as the most critical consumer β€” evaluating 5M rules every 30s against the TSDB. Multi-tenancy is enforced at every layer through rate limiting, query isolation, and fair scheduling."
ExtensionArchitecture Impact
APM / Distributed TracingThird storage backend optimized for trace spans. Span β†’ trace assembly. Service dependency map derived from trace data.
Anomaly DetectionML models trained per time series. Run alongside alerting engine. Detect deviations without user-configured thresholds.
Log-to-MetricsGenerate metric time series from log patterns (e.g., count ERROR logs per service). Reduces log storage costs while preserving signal.
Notebooks / InvestigationsCollaborative debugging tool. Combines metrics, logs, traces in a single timeline. Reads from all three storage backends.
Closing framing: Datadog is defined by the tension between write-optimized ingestion (append-only, high throughput, multi-tenant) and read-optimized querying (arbitrary tag filtering, aggregation, sub-second response). The TSDB resolves this with an inverted tag index (for reads) combined with columnar, compressed storage (for write throughput). The alerting engine β€” not humans β€” is the system's biggest query consumer.
07 Interview Q&APractice
"Here are the hardest questions an interviewer would ask about this design, and how to answer them. Each answer demonstrates deep understanding of the tradeoffs, not just surface knowledge."
Q1

Why build a custom TSDB instead of using Prometheus or InfluxDB?

A

Scale and multi-tenancy. Prometheus is designed for single-tenant, pull-based monitoring. It works beautifully for one organization's metrics, but it doesn't have: (1) multi-tenant isolation with per-tenant quotas, (2) push-based ingestion at millions of points/second across thousands of tenants, (3) dynamic tag-based queries across arbitrary time ranges (Prometheus PromQL is great but the storage engine struggles above ~10M active time series). InfluxDB has similar single-tenant limitations. Datadog's custom TSDB is optimized for: high cardinality (millions of unique tag combinations), multi-tenant storage with per-tenant compaction, pre-aggregation at write time (reducing query-time work), and configurable retention tiers. The tradeoff: building a custom TSDB is a massive engineering investment (team of 50+ engineers). It only makes sense at Datadog's scale (trillions of data points). For smaller deployments, Prometheus or InfluxDB are the right choice.

Q2

How do you handle the noisy neighbor problem in a multi-tenant system?

A

At every layer with isolation mechanisms. Ingestion: per-tenant rate limits on the Intake API β€” if a customer suddenly sends 10x their normal volume, excess data is buffered in Kafka (not dropped) but processed at a throttled rate. This prevents one tenant's spike from consuming all Kafka consumer capacity. Processing: per-tenant resource quotas in the pipeline workers β€” each tenant gets a fair share of CPU/memory. If tenant A's custom log parsing rules are expensive, it slows down tenant A's processing, not tenant B's. Query: per-tenant query concurrency limits and execution time limits. A dashboard with 100 widgets (each a query) from one customer doesn't monopolize the TSDB query workers. Alert evaluation: monitors are distributed across an evaluation cluster β€” each evaluator handles a fixed number of monitors, and monitors from different tenants are interleaved to prevent hotspots. The key insight: noisy neighbor is not one problem but many problems at different layers, each requiring its own isolation mechanism.