- What outcome are we optimizing for? → Mean Time to Remediate (MTTR) for GenAI-specific failures: hallucinations, retrieval drift, coherence degradation, agent loops. Current state: teams spend dozens of engineer-hours per incident manually correlating traces across 3-5 dashboards. Target state: guided diagnosis in minutes, not days. Secondary: time-to-first-metric (how fast can a team go from "no observability" to "seeing useful data"). This shapes the two-phase product: v0.1 solves time-to-first-metric, v1.0 solves MTTR.
- Why can't Datadog/Splunk solve this? → Three reasons: (1) They track infrastructure metrics (latency, throughput) but not AI-specific quality signals (hallucination rate, retrieval relevance, coherence). (2) GenAI telemetry contains prompts and model outputs, sensitive data that many enterprises CAN'T send to third-party SaaS without compliance violations. (3) GenAI failures are nondeterministic: the same trace pattern can have different root causes, and the same root cause can produce different trace patterns. Traditional alerting rules don't work.
- What AI workloads does this cover? → LLM inference (vLLM, Ollama), RAG pipelines, agentic AI systems (multi-agent orchestration), fine-tuned models, embedding services. Any system that produces OTLP-compatible traces. Not limited to one framework: it's protocol-level (OpenTelemetry), not SDK-level.
- Deployment model? → SELF-HOSTED FIRST. The telemetry pipeline runs entirely in the customer's VPC/infrastructure. Data never leaves. This is a hard requirement for enterprises dealing with regulated AI workloads. The Prove AI control plane (dashboard, case management, remediation) can be SaaS or self-hosted depending on customer tier.
- What's the open-source vs. proprietary split? → Open source: OTel Collector + Prometheus + VictoriaMetrics + Envoy auth proxy = telemetry collection and storage. Proprietary: control plane dashboard, case management, guided troubleshooting, agentic remediation engine, GitHub/Jira integration, audit logging. The open-source layer is the "on-ramp": it earns trust and adoption. The proprietary layer is the revenue engine.
| In Scope | Out of Scope |
|---|---|
| Telemetry ingestion (OTLP traces + metrics) | Model training / fine-tuning infrastructure |
| Trace-to-metric conversion (spanmetrics) | LLM hosting (vLLM/Ollama are external) |
| Self-hosted storage (Prometheus + VictoriaMetrics) | Prompt engineering / prompt management |
| Auth gateway (Envoy proxy) | Model evaluation benchmarks (e.g., MMLU) |
| Control plane: dashboard, case management | Data labeling / annotation tooling |
| Agentic remediation engine (v1.0) | Cost optimization for LLM API spend |
| Data sovereignty architecture | Multi-cloud orchestration of models |
- UC1 (Instrument Once, Observe Twice): Team instruments their RAG pipeline with standard OpenTelemetry SDK → sends traces to Prove AI's OTel Collector → spanmetrics connector automatically derives Prometheus metrics (calls_total, latency histograms) → team sees request rate, error rate, and latency in Prometheus within 10 seconds, WITHOUT writing any metrics instrumentation. Traces and metrics from a single instrumentation path.
- UC2 (Hallucination Detection): RAG pipeline starts returning irrelevant context. Traditional monitoring shows: latency normal, error rate zero, throughput normal; everything looks "green." But output quality has degraded. Prove AI's control plane captures the full execution chain: prompt → retrieved chunks → model output → evaluation score. The remediation engine traces the failure back to a stale embedding index that wasn't refreshed after a document update.
- UC3 (Agentic Workflow Debugging): Multi-agent system: Planning Agent delegates to Research Agent and Synthesis Agent. Research Agent enters a loop (calls the same API 50 times). Traditional tracing shows a long trace with repeated spans, but doesn't tell you WHETHER this is a problem or HOW to fix it. Prove AI's case management creates a case, the remediation engine identifies the loop pattern, correlates it with a recent prompt template change, and suggests reverting the template.
- UC4 (Compliance Audit): Enterprise needs to prove to auditors that their AI system's outputs meet quality thresholds, that failures are detected and remediated within SLA, and that all telemetry is stored in their infrastructure. Prove AI's audit log provides an immutable trail: what happened, when, what was the quality score, who investigated, what was the fix, and when was it verified.
- UC5 (Zero-to-Dashboard): New team, no existing observability. They run `docker compose --profile full up -d` → send a test trace with `otel-cli` → see their first metric in Prometheus within 15 seconds. Total setup time: under 5 minutes. This is the "time-to-first-metric" experience that the v0.1 open-source stack is built for.
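The loop pattern in UC3 can be illustrated with a small sketch. It assumes a hypothetical, simplified `Span` record and a made-up repetition threshold; it is not the remediation engine's actual detection logic, only one plausible way to flag "the same call repeated many times within one trace".

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record (not the OTLP schema)."""
    service: str    # e.g. "research-agent"
    operation: str  # e.g. "tool.call"
    target: str     # e.g. the API or tool being invoked


def find_repeated_calls(spans, threshold=10):
    """Return {(service, operation, target): count} for calls seen >= threshold times."""
    counts = Counter((s.service, s.operation, s.target) for s in spans)
    return {key: n for key, n in counts.items() if n >= threshold}


# Example trace: the Research Agent calls the same API 50 times.
trace = [Span("research-agent", "tool.call", "search_api")] * 50
trace.append(Span("synthesis-agent", "llm.generate", "summarize"))

suspects = find_repeated_calls(trace)
print(suspects)
```

A count alone only says that a repetition exists. Deciding whether it is a problem still needs the context described in UC3, such as a recent prompt template change.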
- Data sovereignty: ALL telemetry data stays in the customer's infrastructure. No phone-home, no cloud dependencies at runtime. Container images pulled at build time; after that, the stack runs air-gapped. GenAI telemetry can contain prompts and model outputs; this is PII-adjacent data that cannot leave the customer's environment.
- Time-to-first-metric: <5 minutes: From `git clone` to seeing live metrics. This is the open-source on-ramp that drives adoption. If setup takes days, teams will defer observability (which is exactly the problem Prove AI exists to solve).
- Ingestion latency: <30 seconds: From trace emission to queryable metric. Prometheus scrapes every 10 seconds. Total pipeline latency: trace → OTel Collector → spanmetrics → Prometheus scrape → queryable. Must be fast enough for near-real-time dashboards.
- 12-month retention: VictoriaMetrics stores metrics for 12 months. Essential for trend analysis, seasonal pattern detection, and compliance audits. Prometheus handles short-term (~15 days) for real-time queries.
- Zero lock-in: OTLP protocol, Prometheus exposition format, standard TSDB storage. If a customer leaves, their instrumentation code and historical data are fully portable. No proprietary data formats anywhere in the telemetry path.
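The ingestion-latency target above can be sanity-checked with simple arithmetic. In the sketch below, only the 10-second scrape interval comes from the text; the batch timeout and the spanmetrics flush interval are assumed values chosen for illustration, and network and processing time are ignored.

```python
def worst_case_ingestion_latency_s(batch_timeout_s, metrics_flush_interval_s, scrape_interval_s):
    """Upper bound on waiting time: a span can just miss each stage's cycle."""
    return batch_timeout_s + metrics_flush_interval_s + scrape_interval_s


SCRAPE_INTERVAL_S = 10.0          # stated in the text
BATCH_TIMEOUT_S = 1.0             # assumption: batch processor timeout
METRICS_FLUSH_INTERVAL_S = 15.0   # assumption: spanmetrics flush interval

worst_case = worst_case_ingestion_latency_s(
    BATCH_TIMEOUT_S, METRICS_FLUSH_INTERVAL_S, SCRAPE_INTERVAL_S
)
within_budget = worst_case < 30.0
print(f"worst case ~{worst_case:.0f}s, within 30s budget: {within_budget}")
```

Under these assumptions, the flush interval is the largest contributor, so it is the first setting to check if the budget is missed.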
| Requirement | Decision | Why (and what was rejected) | Consistency |
|---|---|---|---|
| All telemetry stays in customer infrastructure | Self-hosted Docker Compose (not SaaS) | GenAI traces contain prompts, model outputs: potentially proprietary IP. Zero data egress by design. Control plane reads metrics only (aggregated, not raw). | - |
| Instrument once, observe twice | OTel spanmetrics connector (not dual instrumentation) | Single OTLP trace SDK produces both traces AND derived Prometheus metrics automatically. Dual instrumentation doubles maintenance burden and adoption friction. | - |
| Remediation needs structured reasoning, not just alerts | Agentic LLM analysis (not rule-based) | 8 GenAI failure types require contextual reasoning (prompt regression vs context overflow). Rules can detect anomalies but can't diagnose root cause across failure types. | - |
| Audit trail must be tamper-evident | Append-only log anchored to Hedera (distributed ledger) | SOC 2 / HIPAA require immutable audit trail. Hedera provides cryptographic proof that records haven't been modified. | CP |
| Customers may already have Prometheus/Grafana | Modular deployment profiles (BYO components) | Deployment profiles allow excluding components. "BYO Prometheus" profile skips bundled Prometheus. Reduces footprint and avoids conflicts. | - |
Envoy Auth Proxy (GATEWAY)
- All external traffic enters here
- API Key or Basic Auth (Lua filter)
- 4 listeners: gRPC :4317, HTTP :4318, Prometheus :9090, VM :8428
- Internal traffic unauthenticated (Docker network)
OTel Collector (INGEST)
- Receives OTLP traces (gRPC + HTTP)
- Batch processor for efficiency
- spanmetrics connector (traces → metrics)
- Prometheus exporter on :8889
Prometheus (SHORT-TERM)
- Scrapes Collector :8889 every 10 sec
- PromQL query engine for dashboards
- Remote-writes to VictoriaMetrics
- ~15 day TSDB retention
VictoriaMetrics (LONG-TERM)
- 12-month metric retention
- Prometheus-compatible API on :8428
- Aggressive compression (10-20× vs. raw)
- Receives via Prometheus remote_write
Remediation Engine v1.0 (CORE)
- Agentic root-cause analysis
- Pattern matching across historical failures
- Full execution chain reconstruction
- Suggested fix paths + confidence scores
Case Management (WORKFLOW)
- Automatic case creation on anomaly detection
- Full context: prompts, embeddings, outputs, scores
- GitHub/Jira integration for ticket creation
- Resolution tracking → knowledge base feedback
Dashboard & UI (CONTROL PLANE)
- Pre-built GenAI dashboards (token throughput, TTFT, latency)
- Unified view: infra metrics + AI quality signals
- Custom metric definition per customer
- Reads from Prometheus/VM via PromQL
Audit Logging (COMPLIANCE)
- Immutable event log (SQL + optional Hedera)
- Every detection, investigation, remediation
- SOC 2, HIPAA-compatible audit trail
- Proves AI governance to auditors
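One common way to make an append-only event log tamper-evident is a hash chain, sketched below. The source does not describe how the audit log is built, so the entry layout, the `append_event` and `verify` helpers, and the idea that the latest hash is the value anchored externally are all assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, event):
    # Canonical JSON (sorted keys) so the same event always hashes the same way.
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log, event):
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)}
    log.append(entry)
    return entry["hash"]


def verify(log):
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_event(log, {"case_id": "case-1", "action": "detected"})
append_event(log, {"case_id": "case-1", "action": "remediated"})
intact = verify(log)

log[0]["event"]["action"] = "ignored"  # simulate tampering with an old record
still_valid = verify(log)
print(intact, still_valid)
```

In a design like this, only the most recent hash would need to be published to an external ledger, since it commits to every earlier entry.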
What You Can Derive Without Custom Instrumentation
| Metric | PromQL Query | What It Tells You |
|---|---|---|
| Request rate | rate(llm_traces_span_metrics_calls_total[5m]) | Requests per second by service/operation |
| Error rate | rate(...{status_code="STATUS_CODE_ERROR"}[5m]) / rate(...[5m]) | Fraction of requests failing, broken down by service/operation |
| p50/p95/p99 latency | histogram_quantile(0.95, rate(llm_..._latency_bucket[5m])) | Latency distribution; critical for LLM TTFT monitoring |
| Per-model throughput | rate(...{model="gpt-4"}[5m]) | If model is a span attribute, derived automatically |
| Per-agent calls | rate(...{service_name="research-agent"}[5m]) | In agentic systems: call rate per agent |
| Data | Store | Why This Store |
|---|---|---|
| Raw traces (spans) | OTel Collector → export | OTLP traces stored via configured exporter. 1-5KB per span (10-100x traditional due to prompts/outputs). Retained per policy. |
| Derived metrics | Prometheus → VictoriaMetrics | Auto-generated by spanmetrics connector. calls_total counter + latency histogram. Prometheus 15-day hot, VM 12-month cold. |
| Case data | PostgreSQL (control plane) | Full case lifecycle: context snapshots, remediation suggestions, resolution details. Linked to trace IDs. |
| Knowledge base | Vector DB (control plane) | Resolved case embeddings. (failure_signature, root_cause, fix) tuples. Searched by cosine similarity for new incidents. |
| Audit trail | Append-only store + Hedera | Every case action, every remediation suggestion, every fix applied. Immutable. Anchored to Hedera for tamper evidence. |
| Configuration | Docker Compose .env | API keys, endpoints, profile selection. Template-generated at startup. Lives in customer infrastructure. |
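The knowledge-base row above describes searching resolved cases by cosine similarity. A minimal sketch of that lookup follows, using toy 3-dimensional embeddings and a hypothetical case layout; a real deployment would use a vector database and model-generated embeddings rather than an in-memory list.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar_cases(query_embedding, cases, top_k=3):
    """Return up to top_k (similarity, case) pairs, most similar first."""
    scored = [(cosine_similarity(query_embedding, c["embedding"]), c) for c in cases]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical resolved cases: (failure_signature embedding, root_cause, fix).
resolved_cases = [
    {"embedding": [0.9, 0.1, 0.0], "root_cause": "stale embedding index", "fix": "re-index documents"},
    {"embedding": [0.0, 0.2, 0.9], "root_cause": "prompt template regression", "fix": "revert template"},
]

new_incident = [1.0, 0.0, 0.0]
best_score, best_case = most_similar_cases(new_incident, resolved_cases, top_k=1)[0]
print(round(best_score, 3), best_case["root_cause"])
```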
| Dimension | Traditional (Datadog/Splunk) | GenAI (Prove AI) |
|---|---|---|
| Failure mode | Deterministic: same input → same error | Nondeterministic: same input → different outputs, some correct, some not |
| Detection signal | Error rate, latency, HTTP status codes | Output quality scores, relevance, coherence, groundedness: semantic metrics, not just infra metrics |
| Root cause | Usually singular: a bug, a config error, a resource limit | Often emergent: interaction between prompt + context + model + retrieval. Multiple plausible causes. |
| Trace content | HTTP headers, status codes, stack traces (~500B) | Prompts, retrieved documents, model outputs, eval scores (~5-50KB per span). Sensitive data. |
| Fix verification | "Error rate returned to zero": binary | "Output quality returned to baseline": continuous, probabilistic |
| Debugging effort | 80% observation, 20% remediation | 20% observation, 80% remediation. The failure is obvious; the fix is not. |
| Hosting model | SaaS-first (send data to vendor) | Self-hosted-first (data cannot leave customer infra) |
- The wedge: The open-source telemetry stack (v0.1) is not the product; it's the ON-RAMP. It solves time-to-first-metric (the "easy problem" that teams defer). Once teams have telemetry flowing, they discover the "hard problem" (remediation), which is where the proprietary platform (v1.0) delivers value.
- Trust before revenue: Self-hosted, open-source, zero-lock-in builds trust with security-conscious enterprises. They can inspect every line of code. This trust converts to paid adoption when the remediation engine (proprietary) proves its MTTR reduction.
- Community feedback loop: Open-source users report issues, contribute configurations for different AI frameworks, and surface GenAI-specific failure patterns. This community intelligence feeds the proprietary remediation engine's knowledge base.
- Lock-in avoidance as positioning: Whalen (CTO) explicitly warns against proprietary lock-in for GenAI telemetry. The market is early and evolving; getting locked into a vendor that charges per-GB when GenAI spans are 10-100× larger than traditional spans is an "expensive trap." Open standards (OTLP, PromQL) are the antidote.
| Extension | Architecture Impact |
|---|---|
| Automated Evaluation Pipeline | Move beyond relying on customer-defined eval scores. Build a built-in evaluation layer: LLM-as-judge for relevance/coherence/groundedness, factual consistency checking against retrieved context, and semantic drift detection. Runs as an additional OTel Collector processor that enriches spans with eval scores before they reach spanmetrics. These scores become first-class metric labels, enabling evaluation-based alerting out of the box. |
| Prompt Regression Testing | When a prompt template changes, automatically run the new template against a cached set of recent inputs, compare outputs to the previous template's outputs, and flag regressions BEFORE deployment. This is the GenAI equivalent of unit testing, but for nondeterministic systems. Architecture: a "shadow mode" that runs both old and new templates in parallel and compares. |
| Multi-Cluster Federation | Large enterprises run GenAI across multiple clusters/regions. A federation layer that aggregates metrics and cases across clusters while keeping raw telemetry in each cluster (data sovereignty). The control plane provides a unified view. Architecture: each cluster runs its own telemetry stack; the control plane queries across clusters via PromQL federation. |
| Cost Attribution | Map every LLM API call to its cost (tokens × price per token) and attribute costs to teams, features, or customers. Requires: token counting in the OTel span attributes, price lookup table per model provider, and aggregation by arbitrary dimensions. Output: "The RAG pipeline for customer support cost $4,200 last month, up 30% due to increased context window usage." |
| Compliance Report Generator | Auto-generate audit-ready reports: "In Q4, the AI system processed 2.3M requests, maintained a 94% quality score, experienced 12 incidents (avg MTTR: 43 min), and all telemetry remained within customer infrastructure." Pulls from: case management (incidents), metrics (quality scores), audit log (compliance trail). Output: PDF report suitable for SOC 2 auditors or board presentation. |
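The Cost Attribution extension in the table above reduces to "tokens × price per token, grouped by a dimension". A rough sketch follows; the span field names, the model name, and the prices are invented for illustration and do not reflect any provider's actual pricing or a defined span schema.

```python
from collections import defaultdict

# Hypothetical price table: USD per 1,000 tokens, per model.
PRICE_PER_1K_TOKENS = {
    "model-a": {"input": 0.5, "output": 1.5},
}


def attribute_costs(spans, prices, dimension="team"):
    """Sum the cost of each span's tokens, grouped by the given span field."""
    totals = defaultdict(float)
    for span in spans:
        price = prices[span["model"]]
        cost = (
            span["input_tokens"] * price["input"]
            + span["output_tokens"] * price["output"]
        ) / 1000
        totals[span[dimension]] += cost
    return dict(totals)


# Hypothetical spans carrying token counts as attributes.
spans = [
    {"team": "support", "feature": "rag", "model": "model-a", "input_tokens": 2000, "output_tokens": 1000},
    {"team": "search", "feature": "rerank", "model": "model-a", "input_tokens": 4000, "output_tokens": 0},
]

costs_by_team = attribute_costs(spans, PRICE_PER_1K_TOKENS, dimension="team")
print(costs_by_team)
```

Passing `dimension="feature"` groups the same spans by feature instead, which is the "aggregation by arbitrary dimensions" the table row calls for.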
Why the spanmetrics connector instead of dual instrumentation?
The spanmetrics connector is the architectural wedge that makes the entire system practical. Without it, customers would need to instrument their GenAI applications twice: once with OTel for traces and once with a Prometheus client for metrics. Dual instrumentation means double the maintenance burden, double the risk of configuration drift, and double the chance of one being forgotten. The spanmetrics connector eliminates this: you instrument once (OTLP traces), and the connector automatically derives Prometheus metrics (calls_total counter + latency histogram) for every span. The "instrument once, observe twice" pattern means the barrier to adoption is a single SDK integration. Once traces flow, metrics appear automatically, and time-to-first-metric drops from days to minutes. The tradeoff: derived metrics have less flexibility than custom metrics (you can't create arbitrary counters), but for the observability use case, the auto-derived metrics (request rate, error rate, latency percentiles) cover 80%+ of what teams need.
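Conceptually, deriving metrics from spans means counting spans per label set and sorting their durations into histogram buckets. The sketch below illustrates that idea only; it is not the connector's implementation, and the class name, label set, and bucket boundaries are assumptions.

```python
import bisect
from collections import defaultdict

BUCKET_BOUNDS_MS = [100, 500, 1000, 5000]  # illustrative upper bounds, in milliseconds


class SpanMetricsSketch:
    """Toy aggregation: a call counter plus a latency histogram per label set."""

    def __init__(self, bounds=BUCKET_BOUNDS_MS):
        self.bounds = list(bounds)
        self.calls_total = defaultdict(int)
        # One slot per bound, plus a final slot for durations above the last bound.
        self.latency_buckets = defaultdict(lambda: [0] * (len(self.bounds) + 1))

    def observe(self, service, operation, status, duration_ms):
        key = (service, operation, status)
        self.calls_total[key] += 1
        # bisect_left returns the index of the first bound >= duration_ms.
        idx = bisect.bisect_left(self.bounds, duration_ms)
        self.latency_buckets[key][idx] += 1


metrics = SpanMetricsSketch()
for duration in (80, 450, 7000):
    metrics.observe("rag-api", "retrieve", "OK", duration)

key = ("rag-api", "retrieve", "OK")
print(metrics.calls_total[key], metrics.latency_buckets[key])
```

The buckets here hold per-bucket counts for readability. Prometheus histograms expose cumulative counts, so a real exporter would sum them from the lowest bound upward.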
How does the remediation engine avoid hallucinating root causes?
The engine operates on STRUCTURED telemetry, not open-ended text, which significantly constrains the hallucination space. The analysis pipeline is: (1) Context Assembly gathers hard data: actual trace spans, metric values, config change timestamps. These are facts, not interpretations. (2) Failure Classification uses a fixed taxonomy of 8 GenAI failure types. The engine classifies into known categories; it doesn't invent new ones. (3) Root Cause Analysis scores candidates against factual evidence: temporal correlation (did a change precede the failure?), metric anomaly scores (is there a measurable degradation?), and similarity to resolved past cases. Each hypothesis has a confidence score; if the highest confidence is below 0.5, the engine says "unable to determine root cause" rather than guessing. (4) The knowledge base acts as grounding: suggestions are based on what actually worked for similar failures, not generated from scratch. The irony is using GenAI to debug GenAI, but the remediation engine's nondeterminism is constrained by the structured data it operates on.
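The confidence gate in step (3) can be sketched as follows. The 0.5 threshold and the three evidence signals come from the text; combining them as a weighted sum, the specific weights, and the hypothesis layout are assumptions made for illustration.

```python
# Assumed weights for the three evidence signals named in the text (sum to 1.0).
WEIGHTS = {"temporal_correlation": 0.4, "anomaly_score": 0.3, "case_similarity": 0.3}
UNKNOWN = "unable to determine root cause"


def confidence(evidence, weights=WEIGHTS):
    """Weighted sum of evidence signals, each expected to lie in [0, 1]."""
    return sum(weights[name] * evidence[name] for name in weights)


def diagnose(hypotheses, min_confidence=0.5):
    """Return (root_cause, confidence), refusing to guess below the threshold."""
    if not hypotheses:
        return UNKNOWN, 0.0
    best = max(hypotheses, key=lambda h: confidence(h["evidence"]))
    best_confidence = confidence(best["evidence"])
    if best_confidence < min_confidence:
        return UNKNOWN, best_confidence
    return best["root_cause"], best_confidence


strong = {
    "root_cause": "stale embedding index",
    "evidence": {"temporal_correlation": 0.9, "anomaly_score": 0.8, "case_similarity": 0.7},
}
weak = {
    "root_cause": "context overflow",
    "evidence": {"temporal_correlation": 0.2, "anomaly_score": 0.3, "case_similarity": 0.1},
}

print(diagnose([strong, weak]))  # picks the best-supported hypothesis
print(diagnose([weak]))          # falls back to "unable to determine root cause"
```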