- What outcome are we optimizing for? β Customer Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR). Datadog exists to make incidents shorter. This means: ingestion lag must be minimal (stale data = late alerts = longer MTTD), dashboards must load fast (slow dashboards = slower diagnosis = longer MTTR), and alerts must be reliable (false positive fatigue β engineers ignore alerts β longer MTTD). The architecture optimizes for FRESHNESS and QUERYABILITY of observability data.
- Which pillars? β Three pillars of observability: metrics (time-series numerical data), logs (unstructured text events), traces (distributed request tracking). Each has fundamentally different storage and query patterns.
- Multi-tenant? β Yes, thousands of customers sharing infrastructure. Noisy-neighbor isolation is critical β one customer's traffic spike can't degrade another's dashboards.
| In Scope | Out of Scope |
|---|---|
| Metrics ingestion & time-series storage | APM / distributed tracing deep dive |
| Log aggregation & search | Synthetics / browser testing |
| Dashboards & visualization | Security monitoring (SIEM) |
| Alerting & notification | CI/CD integration |
| Agent-based collection | Network monitoring |
- Ingestion latency <30s β metric data points should be queryable within 30 seconds of emission.
- MASSIVELY write-heavy β millions of data points per second. Writes dominate reads 100:1.
- Multi-tenant β thousands of customers sharing infrastructure. Strict tenant isolation. Noisy neighbor prevention.
- Alerting must be reliable β missing an alert during an outage is the worst possible failure. Alert evaluation must be near-real-time (<60s latency).
- Query flexibility β users query by arbitrary tag combinations:
avg(cpu.usage){env:prod, service:api, region:us-east}. Tags are high-cardinality. - Retention tiering β full resolution for 15 days, downsampled for 15 months, aggregated for 5 years.
| Requirement | Decision | Why (and what was rejected) | Consistency |
|---|---|---|---|
| Millions of metrics/sec across thousands of tenants | Custom TSDB (not Prometheus/InfluxDB) | Multi-tenant by design. Per-tenant quotas, dynamic tag cardinality, pre-aggregation at write. Single-tenant tools (Prometheus) can't do per-tenant isolation. | β |
| Metrics, logs, and traces are different workloads | Independent pipelines per data type | Metrics pipeline scales independently from logs pipeline. A log volume spike doesn't affect metrics latency. Shared pipeline would create noisy neighbor issues. | β |
| Noisy neighbor isolation | Per-tenant rate limits + quotas at every layer | Ingestion (Kafka partitioned by tenant_id), processing (per-tenant CPU quotas), query (per-tenant concurrency limits). One customer can't degrade service for others. | β |
| 15-month metric retention | Pre-aggregation rollups: 1s β 10s β 60s β 5m | Raw 1s data for recent queries. Rolled up to 5m for old data. 300x storage reduction over 15 months. | β |
| Logs: full-text search + structured queries | Custom columnar log store (not Elasticsearch) | Columnar compression 10x better than ES for structured logs. Full-text index for search. ES struggles at Datadog's ingestion volume. | β |
π DD Agent EDGE
- Runs on every customer host
- Collects OS/container/app metrics
- Tails log files, forwards
- Batches + compresses for efficiency
π₯ Intake API INGESTION
- Regional endpoints (US, EU, AP)
- AuthN via API key
- Rate limiting per tenant
- Route: metrics β Kafka topic, logs β Kafka topic
π Metrics TSDB HOT PATH
- Custom time-series engine
- Inverted index on tags
- Columnar, compressed storage
- Rollup / downsampling
π Log Store VOLUME
- Full-text search + structured fields
- 430 TB/day ingestion
- ClickHouse or custom columnar
π¨ Alerting Engine CRITICAL
- Evaluates 5M rules every 30-60s
- Queries TSDB, evaluates threshold
- Sends PagerDuty, Slack, email
π Dashboard Service READ
- Users compose dashboards with widgets
- Each widget = a TSDB query
- Auto-refresh every 30s
TSDB Architecture (inspired by Gorilla + Prometheus)
- Write path: Kafka β Metrics Consumer β writes to in-memory buffer (WAL-backed). When buffer fills (every 2 hours), flush to compressed on-disk block. Each block: sorted by series ID, delta-of-delta timestamp encoding, XOR float compression. Achieves ~1.4 bytes/data point.
- Tag index: Inverted index mapping tag-value β set of series IDs.
env:prod β {series_1, series_5, series_99, ...}. Stored alongside TSDB blocks. Enables fast filter: "give me all series where env=prod AND service=api". - Query path: Parse query β resolve tag filters via inverted index β fetch matching series' data blocks β decompress β aggregate (avg, sum, max, p99) over requested time range β return.
- Sharding: By tenant_id + metric_name hash. Each shard handles a subset of time series. Replicated for HA.
Retention & Downsampling
- 0β15 days: Full resolution (10s intervals). Stored in hot tier (SSD).
- 15 daysβ15 months: Downsampled to 5-min intervals. Background job aggregates (avg, min, max, count, sum). Stored in warm tier (HDD/S3).
- 15 monthsβ5 years: 1-hour intervals. Cold tier (S3).
- Downsampling reduces storage by ~30Γ for the 15dβ15mo tier and another 12Γ for the 15moβ5y tier.
- Ingestion: Kafka β Log Pipeline β parse structured fields (JSON, key-value) β extract log level, service, timestamp β index β store.
- Processing: Log Pipeline applies customer-defined processing rules: grok parsing (extract fields from unstructured logs), remapping attributes, filtering (drop debug-level logs to save cost), enrichment (add geo data from IP).
- Storage: ClickHouse (columnar, compressed). Each log line stored with: timestamp, message (full text), parsed attributes (service, level, host, etc.), tenant_id. Partitioned by day and tenant. Full-text indexed on the message field.
- Query: Hybrid: structured field filters (ClickHouse columnar scan) + full-text search (inverted index on message). The combination is what makes log search powerful:
service:api-gateway level:error "timeout exceeded".
- Architecture: Alert rules are distributed across a fleet of alert evaluator workers. Each worker owns a partition of rules (sharded by tenant_id). Every 30s, each worker evaluates its rules by querying the TSDB.
- Evaluation: Rule:
avg(cpu.usage){env:prod} > 90% for 5 min. Worker queries TSDB for the last 5 min of data, computes avg, compares to threshold. If breached β fire alert. If recovered β resolve alert. - State management: Each alert has state: OK, WARN, ALERT, NO_DATA. State transitions stored in PostgreSQL. Prevents duplicate notifications (only notify on state change).
- Notification: On state change β publish to notification queue β fan-out to configured channels: PagerDuty, Slack, email, webhook. Retry on failure.
| Data | Store | Scale | Access Pattern |
|---|---|---|---|
| Metrics (hot) | Custom TSDB (SSD) | 10M pts/sec write, 1B series, 15d retention | Write-heavy, aggregation queries by tags |
| Metrics (warm/cold) | S3 + query engine | Downsampled. Years of retention. | Read-only historical queries |
| Logs | ClickHouse | 5M events/sec, 430 TB/day | Full-text search + structured filters |
| Alert Rules & State | PostgreSQL | 5M rules. Low write frequency. | Read rules for evaluation, write state changes |
| Dashboard Definitions | PostgreSQL | ~1M dashboards. User-configured. | CRUD by user. Read on page load. |
| Ingestion Buffer | Kafka | 10M+ events/sec combined | Decouple agents from storage. Absorb spikes. |
| Tenant Metadata | PostgreSQL + cache | 50K tenants. API keys, rate limits, config. | Read at auth time. Cached. |
- Per-tenant rate limiting: At the Intake API, enforce ingestion limits per API key. Excess data β queued, then dropped with warning.
- Query isolation: All queries include tenant_id in the filter. TSDB sharding by tenant ensures one customer's heavy query doesn't starve others.
- Fair scheduling: Alert evaluator and query engine use per-tenant quotas. Large tenants get proportionally more resources but can't monopolize.
| Data | Store | Why This Store |
|---|---|---|
| Metrics (time-series) | Custom TSDB | Purpose-built for high cardinality. Tag-based indexing. Rollup: 1s β 10s β 60s β 5m over retention. |
| Logs | Custom log store | Columnar storage with full-text index. Retention tiers: online (15 days) β archive (S3, years). |
| Traces (APM) | ClickHouse | Distributed traces with spans. Sampled at ingestion (keep 100% of errors, sample OK traces). |
| Alert state | PostgreSQL | Monitor definitions, alert history, notification routing. Transactional for state machine. |
| Long-term archive | S3 | Metrics, logs, and traces archived for compliance. Rehydratable on demand. |
| Ingestion pipeline | Kafka | Partitioned by {tenant_id, data_type}. Decouples intake from processing. Backpressure handling. |
| Extension | Architecture Impact |
|---|---|
| APM / Distributed Tracing | Third storage backend optimized for trace spans. Span β trace assembly. Service dependency map derived from trace data. |
| Anomaly Detection | ML models trained per time series. Run alongside alerting engine. Detect deviations without user-configured thresholds. |
| Log-to-Metrics | Generate metric time series from log patterns (e.g., count ERROR logs per service). Reduces log storage costs while preserving signal. |
| Notebooks / Investigations | Collaborative debugging tool. Combines metrics, logs, traces in a single timeline. Reads from all three storage backends. |
Why build a custom TSDB instead of using Prometheus or InfluxDB?
Scale and multi-tenancy. Prometheus is designed for single-tenant, pull-based monitoring. It works beautifully for one organization's metrics, but it doesn't have: (1) multi-tenant isolation with per-tenant quotas, (2) push-based ingestion at millions of points/second across thousands of tenants, (3) dynamic tag-based queries across arbitrary time ranges (Prometheus PromQL is great but the storage engine struggles above ~10M active time series). InfluxDB has similar single-tenant limitations. Datadog's custom TSDB is optimized for: high cardinality (millions of unique tag combinations), multi-tenant storage with per-tenant compaction, pre-aggregation at write time (reducing query-time work), and configurable retention tiers. The tradeoff: building a custom TSDB is a massive engineering investment (team of 50+ engineers). It only makes sense at Datadog's scale (trillions of data points). For smaller deployments, Prometheus or InfluxDB are the right choice.
How do you handle the noisy neighbor problem in a multi-tenant system?
At every layer with isolation mechanisms. Ingestion: per-tenant rate limits on the Intake API β if a customer suddenly sends 10x their normal volume, excess data is buffered in Kafka (not dropped) but processed at a throttled rate. This prevents one tenant's spike from consuming all Kafka consumer capacity. Processing: per-tenant resource quotas in the pipeline workers β each tenant gets a fair share of CPU/memory. If tenant A's custom log parsing rules are expensive, it slows down tenant A's processing, not tenant B's. Query: per-tenant query concurrency limits and execution time limits. A dashboard with 100 widgets (each a query) from one customer doesn't monopolize the TSDB query workers. Alert evaluation: monitors are distributed across an evaluation cluster β each evaluator handles a fixed number of monitors, and monitors from different tenants are interleaved to prevent hotspots. The key insight: noisy neighbor is not one problem but many problems at different layers, each requiring its own isolation mechanism.