01
Clarify the Problem & Scope5β7 min
"We're designing a universal, future-proof architecture for production GenAI systems. The core insight: every production LLM system is a data transformation pipeline with an LLM as one component β not a magic box you throw problems at. The architecture has five layers β Intake, Generation, Evaluation, Memory & Learning, and Orchestration β each independently swappable. The system must handle structured intake from any modality, composable prompt pipelines, multi-speed evaluation with LLM-as-Judge, five-tier cognitive memory, knowledge graphs with GraphRAG, hybrid retrieval, and adaptive agent orchestration. The architecture survives model changes; the components evolve."
The hospital analogy: no hospital lets a single doctor handle intake, diagnosis, treatment, quality assurance, and record-keeping simultaneously. Layer 1 is triage (unstructured β structured). Layer 2 is the doctor (LLM diagnosis). Layer 3 is the second opinion (evaluation). Layer 4 is the patient chart (memory). Layer 5 is the administrator routing complex cases (orchestration).
The Three Laws of Future-Proof GenAI
- Law 1 β Separate Concerns: The LLM is a reasoning engine, not the whole system. Retrieval, evaluation, memory, and orchestration are independent, swappable layers.
- Law 2 β Measure Everything: If you can't evaluate it, you can't improve it. Traces, evals, and golden sets are first-class citizens.
- Law 3 β Design for Replacement: Any model, any prompt, any component should be replaceable without rewriting the system.
Questions I'd Ask
- What domain? β Domain-agnostic reference architecture. We'll use code review, document intelligence, and customer support as concrete instantiations.
- Input modalities? β All: text, PDFs, images, code diffs, audio. The Intake Layer normalizes chaos into structured representations before any LLM call.
- How do we handle model changes? β Model Abstraction Layer: never call a model directly. A router selects the optimal model per-request based on complexity, latency, cost, and features.
- How does the system improve over time? β Experience Database records every strategy + outcome. Episodic memory recalls what worked. Weekly self-improvement cycles auto-fix weak spots.
- What about retrieval? β Hybrid: vector search (semantic) + knowledge graph traversal (relationships) + keyword/BM25 (exact match). Reciprocal Rank Fusion merges results. GraphRAG for local entity queries and global thematic analysis.
- How do we evaluate at scale? β Five speeds: (1) schema validation on every request, (2) automated metrics, (3) LLM-as-Judge on samples, (4) golden set regression on every deploy, (5) human expert review weekly.
Agreed Scope
| In Scope | Out of Scope |
|---|---|
| 5-layer universal architecture blueprint | Specific model training / fine-tuning |
| Intake: unstructured β structured + parser flywheel | MLOps / model deployment infrastructure |
| Generation: composable prompts + model router | Specific vector DB product selection |
| Evaluation: traces + LLM-as-Judge + golden sets | Prompt engineering details per domain |
| 5-tier cognitive memory architecture | RLHF / preference optimization |
| Knowledge Graphs + GraphRAG + hybrid retrieval | Compliance/regulatory frameworks |
| Orchestration: adaptive complexity + agents | Frontend/UX design |
The defining tension: flexibility vs. complexity. A future-proof system requires abstraction layers, eval infrastructure, memory tiers, and observability from day one. But you must ship incrementally β Week 1 is "prompt β LLM β validate β output." The architecture has slots for all components; the implementation fills them progressively.
02
Back-of-the-Envelope Estimation3β5 min
"Numbers for a mid-scale production GenAI system β say a code review or document intelligence platform. I'll estimate per-request costs, latency budgets, and storage for memory and knowledge graph layers."
Requests / Day
10Kβ500K
Internal tool: ~10K. SaaS platform: ~500K. Burst patterns during business hours.
Latency Budget / Request
2β10s
Intake: ~200ms. Retrieval: ~300ms. LLM: 1β6s (dominates). Evaluation: ~200ms. Total p50: ~2.5s.
Cost / Request
$0.01β0.10
Simple (cheap model): $0.01. Complex (frontier + RAG + refinement): $0.05β0.10. Agent loops: $0.20+.
Context Window Budget
128K tokens
System prompt ~2K. Core memory ~2K. Conversation ~8K. Retrieved context ~20K. Response ~20K. Buffer: ~76K.
Memory Store / User / Year
~1 GB
Core memory: ~2KB always loaded. Archival: ~10KB/session Γ 250/yr. Episodic: ~5KB Γ 1K episodes. KG: varies.
Knowledge Graph
100Kβ10M nodes
Enterprise codebase: ~100K entities. Document corpus: ~1M. Community summaries: 100β500 clusters Γ 3 hierarchy levels.
Golden Set Size
100β1000 cases
Cover edge, common, adversarial cases. Expert-validated. Run on every deploy. Grows from production failures.
Eval Cost / Deploy
$5β50
100 cases Γ $0.05 LLM-as-Judge = $5. 1000 cases = $50. Cheap insurance against regressions.
Parser Flywheel economics: Week 1: 100% LLM-parsed ($0.05/doc). Month 3: 90% deterministic parser ($0.0001/doc), 10% LLM fallback. Month 6: 95%/5%. The more you process, the cheaper it gets.
03
High-Level Design8β12 min
"Every GenAI system maps to five layers. They communicate through typed schemas β never raw strings. Each layer is independently deployable, testable, and replaceable. Data flows bottom-up with a feedback loop from orchestration back to generation and from memory into context assembly."
Universal 5-Layer Architecture
The Abstraction Sandwich
// The architecture that protects you from model churn
ββββββββββββββββββββββββββββββββββββββββββββββββ
β APPLICATION LOGIC (Yours forever) β
β Business rules, workflows, domain knowledge β
ββββββββββββββββββββββββββββββββββββββββββββββββ€
β ABSTRACTION LAYER (Your insurance) β
β Model router, prompt assembler, eval runner β
ββββββββββββββββββββββββββββββββββββββββββββββββ€
β FOUNDATION LAYER (Will change) β
β Specific models, APIs, embedding services β
ββββββββββββββββββββββββββββββββββββββββββββββββ
Key Architectural Decisions
| Decision | Choice | Why Not Alternative |
|---|---|---|
| Model coupling | Abstraction Layer (never call directly) | Direct API calls β vendor lock-in; model deprecations break production |
| Prompt management | Composable versioned components | Monolithic prompts are untestable; can't identify which piece regressed |
| Output format | Schema-first (Pydantic/JSON Schema) | Free-form text is unparseable; format varies by model version |
| Retrieval | Hybrid (vector + graph + keyword) | Vector-only misses exact matches and relationships; keyword-only misses semantics |
| Memory | 5-tier cognitive model | Flat store can't differentiate critical facts from stale trivia |
| Memory control | Self-managed (LLM decides via tools) | External-only management misses semantic nuance of what's worth remembering |
| Evaluation | Multi-speed (schema β metrics β judge β golden β human) | Single eval is either too slow or misses quality issues |
| Orchestration | Adaptive complexity escalation | Always-agent wastes cost on simple queries; always-pipeline under-serves complex ones |
04
Deep Dives25β30 min
Deep Dive 1 β Intake Layer & Parser Flywheel (6 min)
Goal: Transform raw chaos (PDFs, code diffs, images, audio) into typed, validated, structured representations before any LLM call.
Intake: Parse β Classify β Schema β Validate + Parser Flywheel
- Schema-First Extraction: Define your output schema (Pydantic model) BEFORE writing the prompt. Use JSON Schema constraints in the LLM call. Parse and validate with the schema. Cross-validate (e.g., line items sum = subtotal). This eliminates format drift across model versions.
- Parser Flywheel economics: 10,000 invoices/day Γ $0.05/LLM parse = $500/day. After the flywheel generates deterministic parsers: 9,500 Γ $0.0001 + 500 Γ $0.05 = $25.95/day. A 20Γ cost reduction that accelerates over time.
- Real-world use case β legal contracts: 500+ contracts/day. LLM classifies contract type, tries deterministic parser first, falls back to LLM extraction with schema enforcement. When a doc type accumulates 20+ examples, an LLM auto-generates a new parser, tests it, registers if β₯95% accurate.
Deep Dive 2 β Generation Engine: Composable Pipelines (7 min)
Goal: Build a generation layer that's model-agnostic, prompt-versionable, and self-refining.
Generation Pipeline
- Composable Prompts: Each prompt component (system role, context, few-shot, output schema, task instruction) is a named, versioned, testable unit. When you switch from GPT-4 to Claude, you change the generation call β not the prompt assembly. When output requirements change, update the schema component. Each piece iterates independently.
- Model Abstraction Layer: Never call
openai.chat.completions.create()directly. Go throughModelRouter.route()which selects provider based on: task complexity (cheap model for simple, frontier for complex), latency SLA, cost budget, feature needs (vision, structured output). When a model is deprecated: change one config line. - Recursive Refinement: Generate β Critique β Revise loop. A critic LLM evaluates the draft against a rubric (multi-dimensional: accuracy, actionability, completeness, tone). If score < threshold and iterations remain, revise. Each iteration traced. Typically converges in 1-2 iterations. Max iterations β flag for human review.
Deep Dive 3 β Evaluation Layer: Traces, Judges, Golden Sets (7 min)
Goal: Build the "immune system" β detect degradation, prove improvement, build trust. This is where most teams fail.
Multi-Speed Evaluation + Trace Architecture
- LLM-as-Judge: A separate LLM call evaluates output against a multi-dimensional rubric: accuracy (weight 0.3), security (0.25), actionability (0.25), completeness (0.1), tone (0.1). Scores 1-5 per criterion with specific evidence and suggestions. Returns structured JSON. Judge is calibrated weekly against human expert ratings.
- Experience Database β Self-Improvement: Every request's strategy + outcome is recorded. Weekly: identify weak categories (score < threshold), analyze failure patterns, generate fixes, test against golden set, commit if improved, rollback if not. The system gets better every week without manual intervention.
Deep Dive 4 β Memory, Knowledge Graphs & GraphRAG (10 min)
Goal: Solve the amnesia problem. Every LLM call starts from zero. We need a hierarchical, multi-tier memory architecture inspired by cognitive science, plus knowledge graphs for relational reasoning that vector search can't do.
5-Tier Cognitive Memory Architecture
- Context Window = RAM: Fixed budget. You page in from lower tiers via the Memory Controller, which predicts relevance. Recent conversation kept verbatim; older turns progressively summarized. Core memory (~2KB) always loaded. Archival results paged in per-query. Think: desk (context window) vs. filing cabinet (long-term memory).
- Self-Managed Memory: The LLM itself manages its memory via tool calls (
core_memory_update,archival_memory_search,memory_consolidate). The LLM has the semantic understanding to decide what's important, relevant, and contradictory. Application code enforces policies (limits, privacy, retention). - Temporal Knowledge Graph: Bi-temporal model tracks when facts were true in the real world (valid time) AND when the system learned about them (system time). Old facts are never deleted β they're closed and a new version created. Enables: "Who leads auth now?" (current), "Who led auth in Q1?" (point-in-time), "How has auth leadership changed?" (timeline).
- Community Detection for Global Search: Leiden algorithm clusters the KG into hierarchical communities (Level 1: 5-10 major themes, Level 2: 20-50 sub-themes, Level 3: 100+ specific topics). Each community gets a pre-computed LLM summary. Global queries read 10-20 summaries and synthesize, instead of scanning every document.
- Consolidation = Sleep: Background process clusters related memories, synthesizes higher-level insights, promotes high-confidence insights to core memory, prunes redundant entries, and applies temporal decay (5%/month confidence loss for unaccessed, non-pinned memories). Memories below 0.1 confidence move to cold storage.
Deep Dive 5 β Agent Orchestration & Adaptive Complexity (6 min)
Goal: When simple pipelines aren't enough, route requests dynamically. Start with the cheapest approach. Escalate only when quality demands it. Record what works for future routing.
Adaptive Complexity Escalation
- Adaptive Escalation: The complexity estimator routes simple queries (L1-L2) to cheap models, complex multi-step tasks (L3-L4) to pipelines, and open-ended autonomous tasks (L5-L6) to agents. When a level's quality score is insufficient and cost budget remains, the system escalates to the next level. The Experience Database records which level succeeds for which input type, so over time the router skips directly to the optimal level.
- Multi-Agent Collaboration: For the most complex tasks, specialized agents (e.g., code reviewer, security auditor, test coverage analyzer) work under an orchestrator agent. Each agent has its own prompt, tools, and evaluation criteria. The orchestrator decomposes the task, assigns sub-tasks, resolves conflicts between agents, and synthesizes the final output.
- Human-in-the-Loop Escalation: When automated quality is low and confidence is low, escalate to a human expert with a context summary. The human's corrections feed back into the Experience DB and golden sets, improving the system for similar future inputs.
- Self-Improvement Cycle: Weekly: (1) identify weak categories from Experience DB, (2) cluster failure patterns, (3) for each pattern generate a fix (new prompt, new RAG config, new golden entries), (4) test fix against golden set, (5) commit if improved, rollback if regressed. The system gets better every week without manual intervention.
05
Cross-Cutting Concerns10β12 min
Observability: Design for Visibility from Day 1
"Every component must emit standard signals: component name, operation, latency, success/fail, model used, tokens consumed, cost, quality score, confidence, and trace ID. If you can't see it, you can't fix it. Traces reveal exactly what to fix and in what priority."
AdaptiveOrchestrator.COMPLEXITY_LEVELS
Level 1: direct Single LLM call, no retrieval β $0.01
Level 2: rag LLM + retrieval-augmented generation β $0.03
Level 3: pipeline Multi-step decomposition pipeline β $0.05
Level 4: refinement Generate β Critique β Revise loop β $0.08
Level 5: agent Autonomous agent with tool use β $0.15
Level 6: multi_agent Multiple specialized agents β $0.30
Level 7: human Escalate to human expert β $$$
// For each level: execute β evaluate quality β if score β₯ threshold: return
// If score < threshold AND cost < budget: escalate to next level
// Experience DB records which level succeeded for which input type
// Over time, the router learns to skip directly to the right level
Failure Scenarios
Model deprecated overnightAbstraction Layer β change provider config. Model Router auto-routes to next-best. Golden set regression test validates no quality loss before production traffic.
LLM halluccinates outputSchema validation catches format errors. Cross-validation catches arithmetic errors. LLM-as-Judge catches factual errors. Guardrails catch safety issues. Low confidence β human queue.
RAG retrieves irrelevant contextTraces reveal retrieval quality per-request. Hybrid retrieval (vector + graph + BM25) covers different failure modes. Re-ranker cross-encoder scores relevance. Golden set tests retrieval quality.
Memory contradictionConsolidation process detects contradictions. Resolution: newer info wins (default), store both + flag (uncertain), user correction always wins (highest trust). Audit trail for all changes.
Context window overflowContext Window Manager enforces token budgets per section. Older conversation progressively summarized. Retrieved context ranked and truncated. Core memory always fits (~2KB).
Knowledge graph entity resolution errorFuzzy matching + embedding similarity for dedup. Confidence scores on all relationships. Provenance tracking (which source, when extracted). Graph cleanup during consolidation.
Cost spike (agent loop runaway)Per-request cost budget in orchestrator. Max iterations on refinement loops. Adaptive complexity only escalates when quality demands it. Experience DB learns optimal starting level per input type.
Eval drift (LLM judge becomes unreliable)Weekly human expert calibration against LLM judge scores. Track judge-human agreement rate. Re-calibrate judge prompt when agreement drops below threshold.
Anti-Patterns: What Goes Wrong
| Anti-Pattern | What Breaks | Correct Approach |
|---|---|---|
| Stuff everything in context | Token budget exceeded, important info pushed out, cost explodes | Tiered memory with intelligent paging via Memory Controller |
| No memory consolidation | Memory grows forever, contradictions accumulate, retrieval degrades | Periodic consolidation: synthesize, prune, decay (5%/month unaccessed) |
| Treat all memories equally | Critical preferences buried under trivia | Tiered importance: core (always loaded ~2KB) vs archival (searched) |
| No temporal awareness | Old facts override new ones, or vice versa, silently | Bi-temporal model: track valid time + system time. Never delete; close + version |
| No forgetting mechanism | System remembers stale, irrelevant info forever | Confidence decay + archival for unaccessed memories below 0.1 threshold |
| External-only memory mgmt | App code decides what to store; misses semantic nuance | Self-managed: LLM decides what's worth remembering via tool calls |
| Extract everything as entities | Noisy graph, slow traversal, low precision | Focused ontology: define entity and relationship types upfront |
| No entity resolution | Same entity appears as multiple nodes in KG | Fuzzy matching + embedding similarity for deduplication |
| Graph without vector search | Can't handle novel queries or semantic matching | Hybrid retrieval: always combine graph + vector + keyword search |
| No community summaries | Global queries require scanning entire graph | Pre-compute hierarchical community summaries via Leiden algorithm |
| Ignoring provenance | Can't trace why system believes something | Track source document, extraction confidence, timestamp for every fact |
| Monolithic prompts | Untestable, can't identify which piece regressed | Composable versioned prompt components, each iterates independently |
Core Data Schemas
Trace
trace_id UUID
spans[] Span[] -- name, start/end, input/output, metadata, status, error
total_duration_ms float
total_cost_usd float
total_tokens int
GoldenSetEntry
id UUID
version string
input_data dict
expected_output string | null
evaluation_criteria dict[]
metadata dict -- tags, difficulty, category, source
created_by string -- which expert validated
Experience
experience_id UUID
input_hash string -- stable hash of input features
input_category string -- "code_review", "contract_analysis"
strategy dict -- model, prompt version, RAG config
eval_scores dict
outcome string -- "success", "partial", "failure"
lessons string[]
Episode
episode_id UUID
situation string -- what was happening
strategy dict -- what approach was used
eval_scores dict -- how it turned out
lesson string -- what to learn from this
embedding float[] -- for similarity search
TemporalFact
entity_id string
predicate string -- "leads", "depends_on", "has_vulnerability"
object_id string
valid_from datetime -- when true in real world
valid_to datetime? -- null = still true
system_from datetime -- when system learned this
source string
confidence float
Complete Pattern Catalog
| Category | Pattern | What It Does |
|---|---|---|
| Intake | Schema-First Extraction | Define output schema before prompting. Pydantic + JSON Schema constraints. |
| Intake | LLM-as-Parser-Generator | LLM writes deterministic parser code from examples. Deploy parser, save 20Γ cost. |
| Intake | Progressive Structuring | Classify β Segment β Extract β Relate β Validate. For truly unstructured data. |
| Intake | Parser Flywheel | Auto-generate parsers as doc types accumulate. 100% LLM β 5% LLM over 6 months. |
| Generation | Composable Prompts | Versioned, testable prompt components. Each iterates independently. |
| Generation | Model Abstraction Layer | Never call a model directly. Router selects optimal provider per-request. |
| Generation | Recursive Refinement | Generate β Critique β Revise loop. Typically converges in 1-2 iterations. |
| Evaluation | LLM-as-Judge | Scalable quality assessment. Multi-dimensional rubric, structured JSON output. |
| Evaluation | Golden Sets | Expert-validated test suites. CI/CD gate. Grows from production failures. |
| Evaluation | Multi-Speed Testing | Schema β Metrics β Judge β Golden β Human. Five speeds, five cost levels. |
| Memory | 5-Tier Cognitive Model | Working β Short-Term β Long-Term β Episodic β Procedural. Paging architecture. |
| Memory | Self-Managed Memory | LLM decides what to store/retrieve via tool calls. App enforces policies. |
| Memory | Memory Consolidation | Background: cluster, synthesize, prune, decay. Like brain during sleep. |
| Knowledge | Knowledge Graph | Entities + relationships. Entity resolution via fuzzy match + embedding. |
| Knowledge | GraphRAG (Local) | Entity β traverse neighbors β connected chunks. For specific entity queries. |
| Knowledge | GraphRAG (Global) | Community summaries β map-reduce synthesis. For broad thematic queries. |
| Knowledge | Hybrid Retrieval | Vector + Graph + BM25 β Reciprocal Rank Fusion β cross-encoder re-rank. |
| Knowledge | Temporal KG | Bi-temporal facts (valid time + system time). Point-in-time + range queries. |
| Orchestration | Adaptive Escalation | Direct β RAG β Pipeline β Agent β Human. Start cheap, escalate when needed. |
| Architecture | Abstraction Sandwich | App Logic / Abstraction / Foundation. Foundation changes; app logic doesn't. |
| Architecture | Schema-First Design | Typed contracts between all components. Pydantic models, not raw strings. |
| Architecture | Progressive Enhancement | Ship Week 1, add capability each week. System always production-ready. |
06
Wrap-Up β Progressive Enhancement Timeline3β5 min
Build Incrementally β Never Big-Bang
- Week 1: Prompt β LLM β Output + Schema validation + Basic logging. Production-ready for simple use cases.
- Week 2: + RAG retrieval + LLM-as-Judge eval + Trace collection. Now you can measure quality.
- Week 3: + Golden set tests + Experience database + Model routing (cheap vs. expensive). Cost optimization begins.
- Week 4: + Recursive refinement + Confidence-based escalation + Human-in-the-loop. Quality ceiling rises.
- Month 2: + Multi-step pipelines + Parser generation flywheel + Automated improvement cycles. Self-improving system.
- Month 3: + Agent orchestration + Knowledge graph + Memory tiers. Full cognitive architecture.
- Month 4: + GraphRAG (local + global search) + Community detection + Temporal KG. Enterprise-grade knowledge system.
At each stage, the system is production-ready. Each addition is validated against evals before shipping. The architecture has slots for all components; the implementation fills them progressively. Design must be complete from day one β because retrofitting memory onto a stateless system is 10Γ harder than building in the extension points.
Real-World Instantiations
| Domain | Intake | Generation | Evaluation | Memory |
|---|---|---|---|---|
| Code Review | AST-level diff parsing. Classify: bug fix, feature, refactor. | Structured CodeReview schema. Confidence-scored findings. | Resolution rate (did dev fix it?). BugBench golden set. | Experience DB: winning model/prompt per language. KG: service dependencies. |
| Legal Documents | Parser Flywheel for 500+ contracts/day. 95% deterministic by Month 6. | Clause extraction, risk assessment, template comparison. | Lawyer reviews + corrections. Risk accuracy golden set. | Contract KG: parties, obligations, dependencies. Temporal: terms change over time. |
| Customer Support | Intent classification + sentiment analysis. | RAG over knowledge base + past tickets. Confidence-based routing. | CSAT surveys + agent edits of AI drafts. | Weekly self-improvement cycle: failure analysis β prompt refinement β golden set growth. |
The Final Insight
"Models come and go. Prompts get rewritten. APIs change. The patterns β intake, generation, evaluation, memory, orchestration β those are permanent. Build the architecture once. Build it right. Everything else is configuration. When someone asks you to build a second GenAI system, you reuse 70-80% of the architecture. The patterns are durable. The implementations evolve."
β
Interview Q&A β PracticePractice
Q1: Why shouldn't you just send everything to the LLM and parse the output?
"Three reasons. First, cost: running an LLM on every document at $0.05 each adds up to $500/day for 10K documents. The Parser Flywheel pattern generates deterministic parsers from examples, reducing cost to $0.0001/doc for 90-95% of inputs within months. Second, reliability: LLM outputs are nondeterministic β the same invoice might parse differently on Tuesday than Monday, or differently after a model update. Schema-first extraction with Pydantic validation catches format drift. Cross-validation catches arithmetic errors. Third, speed: a deterministic parser runs in milliseconds; an LLM call takes seconds. The insight is: you don't need the LLM to read every document β you need it to teach you how to read documents."
Q2: How does the 5-tier memory architecture differ from just using RAG?
"RAG is one retrieval mechanism. The 5-tier cognitive model is a complete memory system. Working Memory is the context window β your most precious, expensive resource. Short-Term Memory holds the full session state including scratchpad notes and accumulated reasoning, then consolidates durable knowledge at session end. Long-Term Memory has two stores: Core Memory (~2KB always in context β user preferences, key facts) and Archival Memory (unlimited, searched on demand). Episodic Memory stores situation-action-outcome chains β 'last time I saw a similar PR, what worked?' β enabling learning without retraining. Procedural Memory stores codified skills distilled from successful episodes. The key design choice: the LLM itself manages its memory through tool calls, because it has the semantic understanding to decide what's important. RAG is just one retrieval modality feeding into the working memory. The full system also uses graph traversal, keyword search, and re-ranking."
Q3: When would you use GraphRAG versus pure vector search?
"Vector search finds semantically similar text. It's great for 'find me documents about authentication' but fails at 'how does the auth service connect to billing?' β that requires traversing relationships. GraphRAG adds a knowledge graph with entity resolution and community detection. For local queries β specific entity questions like 'who owns the auth service?' β you look up the entity, traverse 1-2 hops of relationships, and retrieve connected source chunks. For global queries β broad themes like 'what are the main risks across our platform?' β you can't answer by retrieving a few chunks. Instead, you use pre-computed community summaries: the Leiden algorithm clusters the graph hierarchically, each cluster gets an LLM summary, and global queries read 10-20 summaries and synthesize. The temporal knowledge graph adds a bi-temporal model so you can query point-in-time facts. In production, you always combine all three: vector + graph + BM25 keyword, merged via Reciprocal Rank Fusion, then re-ranked with a cross-encoder."
Q4: How do you prevent evaluation drift β the LLM judge becoming unreliable?
"The multi-speed testing architecture handles this. Speed 1-2 (schema validation, automated metrics) are deterministic β they never drift. Speed 3 (LLM-as-Judge) is the one that can drift: the judge model may be updated, the rubric may not cover new failure modes, or the judge may develop blind spots. The countermeasure is Speed 5: weekly human expert calibration. Domain experts score a sample of outputs independently, then we compare judge scores against human scores. If agreement drops below threshold, we recalibrate the judge prompt β usually by adding new failure examples or adjusting criteria weights. Speed 4 (golden sets) provides the safety net: even if the judge drifts, the golden set regression tests catch it because golden sets are anchored to expert-validated ground truth, not to the judge. The key insight: every evaluation method has failure modes, so you layer them. Schema validation catches format errors. Automated metrics catch obvious regressions. LLM-as-Judge catches quality issues. Golden sets catch systematic drift. Humans calibrate everything."
Q5: How do you make this architecture cost-effective when frontier models charge $0.01-0.10 per request?
"Four strategies working together. First, Adaptive Complexity Escalation: start with the cheapest approach (single LLM call, small model) and only escalate when the evaluation layer reports insufficient quality. The Experience Database learns which complexity level succeeds for which input type, so over time the router skips directly to the right level instead of always starting at Level 1. Second, Model Routing: simple tasks go to cheap models (Haiku, GPT-4o-mini at $0.001/request), only complex tasks go to frontier models (Opus at $0.10/request). The router learns from the Experience DB which model works best for each input category. Third, the Parser Flywheel: document processing costs drop 20Γ over 6 months as deterministic parsers replace LLM calls. Fourth, caching and reuse: if the Experience DB has a high-quality response for a very similar input, skip generation entirely. In Alice's code review system, the average cost settled at $0.04/review with 94% developer satisfaction β because most TypeScript PRs use a mid-tier model, only complex multi-file refactors get the frontier model."
Q6: What's the hardest thing about adding memory to a production LLM system?
"Contradiction resolution and memory bloat. Without consolidation, memory grows forever: every session adds facts, many of which are redundant or contradictory. In March the system learns 'Alice leads auth.' In July, 'Bob leads auth.' A naive system keeps both, or overwrites without history. The bi-temporal model solves this for facts: it tracks when something was true (valid time) and when the system learned it (system time). The consolidation process β inspired by how the brain consolidates during sleep β runs periodically to cluster related memories, synthesize higher-level insights, prune redundant entries, and apply temporal decay to unaccessed memories. The hardest design decision is self-managed vs. external memory management. We chose self-managed: the LLM calls memory tools because it has the semantic understanding to decide what's worth remembering. But you must enforce policies β storage limits, privacy rules, retention periods β in the application layer. The LLM drives content decisions; the infrastructure enforces constraints. Getting this boundary right is the hardest part."
Q7: Walk me through how the system handles a query that requires both local entity lookup and broad thematic analysis.
"Consider: 'What risks does the auth service face based on patterns across our platform?' This is a hybrid query β it needs local search (auth service entity and its neighborhood) AND global search (risk patterns across communities). The query analyzer detects both intents. Path 1: entity lookup finds 'auth-service' in the knowledge graph, traverses 1-2 hops to find dependencies, owners, recent PRs, and vulnerabilities. Path 2: community summaries for risk-related clusters are retrieved β the Leiden algorithm has already clustered the graph into themes like 'authentication & security,' 'billing & payments,' 'infrastructure.' Each community has a pre-computed LLM summary. The system reads the relevant community summaries and runs a map-reduce synthesis. Both paths run in parallel. Results are merged via Reciprocal Rank Fusion β each result gets a score of weight/(k + rank) and documents appearing in both lists get boosted. A cross-encoder re-ranks the final set. The context assembly step fits everything into the token budget: graph-structured context for the entity neighborhood, plus synthesized community insights for the thematic patterns. The LLM generates an answer grounded in both specific facts about auth-service and broad patterns across the platform."
Q8: How does this architecture compare to just using a framework like LangChain or LangGraph?
"Frameworks are implementations; this is architecture. LangChain gives you chains and agents. LangGraph gives you graph-based state machines. CrewAI gives you multi-agent collaboration. But none of them give you the complete production architecture: intake normalization, composable versioned prompts, multi-speed evaluation with golden sets, five-tier cognitive memory, knowledge graphs with GraphRAG, temporal facts, or the experience database that enables self-improvement. Frameworks handle Layer 2 (generation) and parts of Layer 5 (orchestration). The architecture handles all five layers plus the cross-cutting concerns: traces, evaluation, memory management, and the feedback loops that make the system improve over time. In practice, you'd use a framework as one component inside this architecture. LangGraph might power your agent runtime in the orchestration layer. But the intake layer, evaluation layer, memory controller, knowledge graph, and experience database β those are yours. The framework is the doctor's stethoscope; this architecture is the hospital."