- What outcome does the customer measure? β Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) for production incidents. Splunk exists to make those numbers smaller. This means: ingestion latency matters (faster data in β faster detection), query speed matters (faster search β faster root cause), and alert reliability matters (fewer false positives β faster human response).
- Data sources? β Any machine data: application logs, syslogs, metrics, API events, network flows, security events. Schema-on-read β no predefined schema required.
- Query model? β Ad-hoc full-text search with a custom query language (SPL-like). Supports filtering, aggregation, statistical functions, time-range scoping. Results in seconds even over TB of data.
- Real-time vs. batch? β Both. Real-time alerting (detect anomaly within seconds of ingestion) AND historical search (query data from months ago).
- Retention? β Hot (searchable, fast): 30-90 days. Warm/Cold (searchable, slower): up to years. Frozen (archived, restore-on-demand): unlimited.
- Scale? β ~10 TB/day ingestion, ~50K search queries/day, retention of ~1 PB searchable, ~50K forwarder agents.
| In Scope | Out of Scope |
|---|---|
| Data collection (forwarders, HEC) | SIEM correlation rules engine |
| Parsing & event extraction | Machine learning toolkit internals |
| Indexing engine & storage tiers | Visualization / dashboard rendering |
| Distributed search | Billing / license management |
| Alerting & scheduled searches | App marketplace |
| Data replication & high availability | User authentication (RBAC exists, not deep-dived) |
- UC1: SRE searches
index=web status=500 | stats count by host | sort -countβ within 5 seconds, sees which hosts are throwing the most 500 errors in the last 24 hours. - UC2: Security analyst creates alert: "If more than 100 failed login attempts from the same IP in 5 minutes, fire PagerDuty." Alert evaluates in real-time as data is ingested.
- UC3: Compliance team searches 6 months of access logs for a specific user's activity. Searches across warm/cold data tiers, takes ~30 seconds.
- UC4: 50K forwarder agents continuously stream data β system must never drop events, even during spikes.
- Ingest reliability: ZERO data loss. Every log event that enters the system must be indexed and searchable. Data is the product.
- Time-to-search: Event should be searchable within 1-5 seconds of ingestion (near-real-time).
- Search speed: Ad-hoc queries over 24 hours of data (<5 seconds). Queries over 30 days (<30 seconds). Queries over 1 year (minutes, acceptable).
- Schema-on-read: Data is indexed WITHOUT a predefined schema. Structure (fields, key-value pairs) is extracted at SEARCH TIME, not ingest time. This is the fundamental difference from structured databases.
- Horizontal scalability: Add more indexers for more throughput, more search heads for more concurrent queries.
status=500 are extracted from raw events using configurable extraction rules. This trades query-time CPU for ingest-time simplicity.| Requirement | Decision | Why (and what was rejected) | Consistency |
|---|---|---|---|
| 100K+ events/sec ingest with zero data loss | Custom bucket-based storage (not standard DB) | Append-only buckets with TSIDX index. Bloom filters skip non-matching buckets. Standard DB would need full-text index + time partitioning β slower. | β |
| Search across distributed indexers | MapReduce: scatter to indexers, gather at Search Head | Each indexer searches its local data. Search Head merges results. Parallelism scales with indexer count. | β |
| Long-term retention at reasonable cost | SmartStore: hot (SSD) β warm (SSD) β cold (S3) | Recent data on fast SSD for real-time search. Aged-off data on S3 (10x cheaper). Transparently thawed on search. | β |
| Schema-less data (logs have unpredictable fields) | Schema-on-read (not schema-on-write) | Extract fields at search time, not ingest time. Any log format works without pre-defining columns. Tradeoff: search-time extraction is slower than pre-indexed fields. | β |
| Multi-tenant: each index isolated | Role-based access per index | Analyst sees their indexes only. Raw data stored once, access policies applied at query time. No data duplication per role. | β |
π‘ Universal Forwarder COLLECTION
- Lightweight agent on source machines
- Monitors log files, collects metrics
- Forwards raw data to indexers
- Load-balances across indexer pool
- ~1-2% CPU overhead on host
π‘ Heavy Forwarder COLLECTION
- Full parsing pipeline at the source
- Can filter, mask, route data
- Used for: data masking (PII), API collection
- Higher resource usage than UF
π₯ HTTP Event Collector (HEC) COLLECTION
- REST API endpoint for event ingestion
- Applications POST JSON events directly
- Token-based auth, no agent needed
- Ideal for cloud-native / container workloads
βοΈ Indexer INDEXING
- THE core component. Receives, parses, indexes, stores.
- Breaks raw data into events (line breaking, timestamp extraction)
- Builds inverted index (tsidx) for fast search
- Stores in time-bucketed directories
- Responds to search queries from search heads
π Search Head SEARCH
- User interface + search coordinator
- Parses SPL query, creates search plan
- Dispatches search jobs to indexers in parallel
- Merges results from all indexers
- Runs scheduled searches and alerts
π Cluster Manager MANAGEMENT
- Manages indexer cluster membership
- Coordinates data replication
- Handles bucket fixup (repair under-replicated data)
- Distributes configuration to indexers
The Data Pipeline (4 phases)
| Phase | Where It Runs | What It Does |
|---|---|---|
| 1. Input | Forwarder | Read data from source (file tail, TCP/UDP listen, API poll). Break into 64KB blocks. Annotate with metadata: source, sourcetype, host. Forward to indexer. |
| 2. Parsing | Indexer (or Heavy Forwarder) | Line breaking: split block into individual log events (newline-delimited, or multi-line patterns like Java stack traces). Timestamp extraction: parse timestamp from event text, assign _time. Character encoding normalization. |
| 3. Indexing | Indexer | Tokenize event text β add tokens to inverted index (tsidx). Compress raw event β write to journal. Assign event to current "hot" bucket based on _time. |
| 4. Replication | Indexer Cluster | Replicate bucket data to peer indexers for HA. Replication factor (RF) = number of copies. Search factor (SF) = number of searchable copies. |
sourcetype=apache:access β parse as Apache combined log format, extract timestamp from position 1, line break on newline. sourcetype=log4j β multi-line, look for timestamp pattern yyyy-MM-dd HH:mm:ss. Getting sourcetype wrong means wrong timestamps, wrong events, wrong search results. Best practice: ALWAYS configure sourcetype explicitly rather than relying on auto-detection.Forwarder Load Balancing
- Auto-balancing: Each forwarder distributes data across the indexer pool. Uses a round-robin or volume-based strategy. If an indexer goes down, forwarder automatically redirects to surviving indexers.
- Even distribution is critical: If indexer A has 2Γ the data of indexer B, searches over time ranges where A has data will be 2Γ slower (more data to scan on A). The search head waits for the SLOWEST indexer. Even distribution = even search performance.
- Persistent queue: Forwarder has an on-disk queue. If all indexers are unreachable, events queue locally until connectivity is restored. Zero data loss.
The Bucket Model
Bucket Lifecycle
| State | Storage | Writable? | Searchable? | Typical Duration |
|---|---|---|---|---|
| Hot | SSD (local) | β Yes β actively receiving new events | β Full speed | Minutes to hours |
| Warm | SSD or HDD (local) | β Closed β no new writes | β Full speed | Days to weeks |
| Cold | HDD or network storage | β | β Slower (disk seek) | Weeks to months |
| Frozen | S3 / object store (SmartStore) | β | β οΈ Restore-on-demand | Months to years |
| Deleted | β | β | β | After retention policy |
- Hot bucket: One per index per indexer. Actively written to. When it exceeds a size limit or time range, it "rolls" to warm.
- Rolling: Hot β warm: bucket is closed (no more writes), tsidx is finalized and optimized. This is a local operation, very fast.
- Warm β cold: Bucket moved to cheaper storage. Searchable but slower (HDD random access instead of SSD sequential).
- SmartStore (frozen/archive): Bucket data moved to S3. Only the tsidx metadata stays local (tiny). On search, the cache manager fetches the bucket from S3 to local SSD. LRU eviction of local cache.
The Inverted Index (tsidx)
- Tokenization: Every event is broken into tokens at major delimiters (spaces, punctuation, special chars).
2024-01-15 10:30:22 ERROR [auth] login failed for user=admin ip=10.0.0.1produces tokens:2024, 01, 15, 10, 30, 22, ERROR, auth, login, failed, user, admin, ip, 10.0.0.1 - Inverted index maps: token β list of event offsets within the bucket.
"ERROR" β [offset_42, offset_891],"admin" β [offset_42, offset_1203]. - Search for
"ERROR admin": Intersect posting lists β[offset_42]. One matching event. No need to scan all events in the bucket. - Schema-on-read: The tsidx indexes ALL tokens, not just predefined fields. At search time, if user queries
user=admin, the search engine: (a) uses tsidx to find events containing "admin," then (b) applies a field extraction regex to determine if "admin" is the value of the "user" field. This is slower than pre-extracted fields but infinitely more flexible.
status, host), you can configure "indexed extraction" which creates a dedicated field index at ingest time. This makes queries on those fields 10-100Γ faster but increases storage and reduces ingest throughput. The tradeoff: index the fields you search constantly, leave everything else as search-time.index=web status=500 | stats count by host must scan potentially 1 PB of data across 40 indexers and return results in seconds. The search must parallelize across indexers AND within each indexer.Map-Reduce Search Architecture
Search Head Cluster
- Problem: A single search head can coordinate ~50-100 concurrent searches. With 50K queries/day (~30K scheduled), one search head isn't enough.
- Solution: Search Head Cluster (SHC) β 3+ search heads behind a load balancer. Searches are distributed across members. Scheduled searches and dashboards are replicated so any member can run them.
- Captain election: Majority-based protocol. Odd number of members (3, 5, 7) to prevent split-brain. Captain assigns scheduled searches to members evenly.
- Artifact replication: Search results (for dashboards, saved searches) replicated across SHC members for HA.
| Tier | Storage | Cost (approx) | Search Latency | Use Case |
|---|---|---|---|---|
| Hot/Warm | Local SSD | $$$$ | <1 sec | Recent data. Actively queried. Last 7-30 days. |
| SmartStore Cache | Local SSD (cache) | $$ | <5 sec (cache hit) / ~30s (fetch from S3) | Older data. Cache recently searched buckets. |
| SmartStore Remote | S3 / GCS | $ | ~30-60 sec (fetch + search) | Bulk of retained data. Pay only for storage. |
| Frozen Archive | S3 Glacier / equivalent | Β’ | Minutes to hours (restore) | Compliance data. Rarely accessed. |
| Data | Store | Why This Store |
|---|---|---|
| Hot buckets | Local SSD | Currently being written. One per index per indexer. Raw + TSIDX (time-series index). <24 hours of data typically. |
| Warm buckets | Local SSD | Closed for writing, still on fast storage. Searchable. Rolled from hot when size/time threshold hit. |
| Cold buckets | Cheap disk / NFS | Aged-off from warm. Still searchable but slower. Retention policy driven (30-90 days typical). |
| Frozen / archived | S3 (SmartStore) | Long-term archive. Not searchable without thawing. Compliance retention (1-7 years). Cheapest storage tier. |
| Search artifacts | Local disk | Search job results cached for re-display. TTL-based cleanup. Per-user isolation. |
| Configuration | Filesystem | Splunk configs (.conf files) replicated across cluster. Props, transforms, indexes, inputs. |
| Dimension | How It Scales | Bottleneck |
|---|---|---|
| Ingest throughput | Add indexers. Linear scaling: 40 indexers β 10 TB/day. 80 β 20 TB/day. | Forwarder connection limits, network bandwidth to indexer tier. |
| Search concurrency | Add search heads (SHC). Each member handles ~50-100 concurrent searches. | Indexer CPU β every search head dispatches to the SAME indexer pool. Too many searches saturate indexers. |
| Storage retention | SmartStore: add S3 capacity. Elastic, pay-per-GB. No indexer changes needed. | S3 fetch latency for uncached searches (~30s). |
| Data diversity | Separate indexes for different data types (security, app, infra). Independent retention policies per index. | Index proliferation increases search head overhead. |
- Immutable events: Once indexed, events cannot be modified or deleted individually. Only entire buckets can be deleted (by retention policy). This is a feature for compliance β audit logs can't be tampered with.
- Role-based access: Users see only indexes they have permission to search. Data masking at search time for sensitive fields (PII).
- Data masking at ingest: Heavy forwarders can mask fields before data leaves the customer's network. Credit card numbers replaced with
XXXX-XXXX-XXXX-1234before reaching indexers. - Chain of custody: Bucket metadata tracks: when ingested, from what source, on which indexer, replicated to which peers. Enables forensic verification of log integrity.
- Monitoring Console: Dedicated search head that monitors the Splunk deployment itself. Tracks: ingestion rate per indexer, search concurrency, queue depths, replication status, license usage.
- Key alerts: Ingestion lag > 60s, indexer queue depth > 100K events, search concurrency at limit, replication factor violation (under-replicated buckets), license approaching limit.
| Extension | Architecture Impact |
|---|---|
| Federated Search (across clusters) | Search head dispatches to indexers in multiple independent clusters. Requires cross-cluster auth, result merge across WAN. |
| Metrics Store (beyond logs) | Dedicated time-series store for numeric metrics (CPU, memory, latency). Different storage format than text logs β columnar, compressed, downsampled. Much more efficient for numeric aggregations. |
| SIEM Correlation Engine | Complex event processing: correlate events across multiple indexes and time windows. "If failed login on index=auth AND file access on index=fileserver within 5 minutes, alert." Requires streaming correlation, not just batch search. |
| AI/ML Anomaly Detection | Train models on historical patterns. Score incoming events in real-time. Requires feature extraction at ingest time (departure from pure schema-on-read). |
| Edge Processor | Filter, transform, and route data at the collection tier before it reaches indexers. Reduces ingest volume (and license cost) by dropping noise at the source. Requires a processing engine on/near forwarders. |
Why does Splunk use its own storage format instead of a standard database?
Because log data has a unique access pattern that databases aren't optimized for. Log events are (1) write-once, never updated, (2) always queried by time range, (3) need full-text search across arbitrary fields, (4) arrive at extremely high throughput (100K+ events/second per indexer). A traditional database would need: a time-partitioned table (for range queries), a full-text index (for search), and high write throughput β and it would struggle with all three simultaneously. Splunk's format is purpose-built: each "bucket" is a time-bounded directory containing the raw compressed data plus a TSIDX (time-series index) file that's essentially a bloom filter + sorted term dictionary. Searches first narrow by time range (skip entire buckets), then by term (bloom filter eliminates non-matching buckets), then scan only the matching segments. This means a search for "error" in the last 1 hour might read 0.1% of the data on disk. A general-purpose database would likely need to scan more because it can't exploit the time-partitioned bloom filter architecture.
How does MapReduce-style distributed search work across indexers?
The Search Head is the coordinator, indexers are the workers. When a user runs `search index=web status=500 | stats count by host`, the Search Head: (1) parses the SPL (Search Processing Language), (2) splits it into a "map" phase (filtering + partial aggregation) that runs on each indexer, and (3) a "reduce" phase (final aggregation) that runs on the Search Head. The map phase is parallelized: each indexer searches its local buckets for `status=500` events and computes partial counts by host. The Search Head collects these partial results and merges them. For simple stats (count, sum), the merge is trivial. For complex operations (dedup, sort, transaction), the Search Head may need to process more data. The key optimization: "search-time field extraction" means the filter `status=500` can be applied at the indexer without the Search Head seeing raw events β only matching events are sent back. This reduces network transfer dramatically. For very large result sets, the Search Head can become a bottleneck, which is why Splunk offers Search Head Clustering for horizontal scaling of the reduce phase.