01 Clarify the Problem & Scope (5–7 min)
"We're designing a universal, future-proof architecture for production GenAI systems. The core insight: every production LLM system is a data transformation pipeline with an LLM as one component, not a magic box you throw problems at. The architecture has five layers (Intake, Generation, Evaluation, Memory & Learning, and Orchestration), each independently swappable. The system must handle structured intake from any modality, composable prompt pipelines, multi-speed evaluation with LLM-as-Judge, five-tier cognitive memory, knowledge graphs with GraphRAG, hybrid retrieval, and adaptive agent orchestration. The architecture survives model changes; the components evolve."
The hospital analogy: no hospital lets a single doctor handle intake, diagnosis, treatment, quality assurance, and record-keeping simultaneously. Layer 1 is triage (unstructured → structured). Layer 2 is the doctor (LLM diagnosis). Layer 3 is the second opinion (evaluation). Layer 4 is the patient chart (memory). Layer 5 is the administrator routing complex cases (orchestration).
The Three Laws of Future-Proof GenAI
  • Law 1 (Separate Concerns): The LLM is a reasoning engine, not the whole system. Retrieval, evaluation, memory, and orchestration are independent, swappable layers.
  • Law 2 (Measure Everything): If you can't evaluate it, you can't improve it. Traces, evals, and golden sets are first-class citizens.
  • Law 3 (Design for Replacement): Any model, any prompt, any component should be replaceable without rewriting the system.
Questions I'd Ask
  • What domain? → Domain-agnostic reference architecture. We'll use code review, document intelligence, and customer support as concrete instantiations.
  • Input modalities? → All: text, PDFs, images, code diffs, audio. The Intake Layer normalizes chaos into structured representations before any LLM call.
  • How do we handle model changes? → Model Abstraction Layer: never call a model directly. A router selects the optimal model per request based on complexity, latency, cost, and features.
  • How does the system improve over time? → Experience Database records every strategy + outcome. Episodic memory recalls what worked. Weekly self-improvement cycles auto-fix weak spots.
  • What about retrieval? → Hybrid: vector search (semantic) + knowledge graph traversal (relationships) + keyword/BM25 (exact match). Reciprocal Rank Fusion merges results. GraphRAG for local entity queries and global thematic analysis.
  • How do we evaluate at scale? → Five speeds: (1) schema validation on every request, (2) automated metrics, (3) LLM-as-Judge on samples, (4) golden set regression on every deploy, (5) human expert review weekly.
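The Reciprocal Rank Fusion step named in the retrieval answer above can be sketched as follows. This is a minimal illustration, not the guide's implementation: the function name, the document IDs, and the constant k=60 (a commonly used default) are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not from the source.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from the three retrievers.
vector_hits = ["doc_a", "doc_b", "doc_c"]
graph_hits = ["doc_b", "doc_d"]
bm25_hits = ["doc_b", "doc_a", "doc_e"]

fused = reciprocal_rank_fusion([vector_hits, graph_hits, bm25_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c', 'doc_e']
```

Because RRF uses only ranks, the three retrievers do not need comparable score scales, which is one reason it suits mixing vector, graph, and BM25 results.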
Agreed Scope
In Scope | Out of Scope
5-layer universal architecture blueprint | Specific model training / fine-tuning
Intake: unstructured → structured + parser flywheel | MLOps / model deployment infrastructure
Generation: composable prompts + model router | Specific vector DB product selection
Evaluation: traces + LLM-as-Judge + golden sets | Prompt engineering details per domain
5-tier cognitive memory architecture | RLHF / preference optimization
Knowledge Graphs + GraphRAG + hybrid retrieval | Compliance/regulatory frameworks
Orchestration: adaptive complexity + agents | Frontend/UX design
The defining tension: flexibility vs. complexity. A future-proof system requires abstraction layers, eval infrastructure, memory tiers, and observability from day one. But you must ship incrementally: Week 1 is "prompt → LLM → validate → output." The architecture has slots for all components; the implementation fills them progressively.
02 Back-of-the-Envelope Estimation (3–5 min)
"Numbers for a mid-scale production GenAI system, say a code review or document intelligence platform. I'll estimate per-request costs, latency budgets, and storage for memory and knowledge graph layers."
Requests / Day
10K–500K
Internal tool: ~10K. SaaS platform: ~500K. Burst patterns during business hours.
Latency Budget / Request
2–10s
Intake: ~200ms. Retrieval: ~300ms. LLM: 1–6s (dominates). Evaluation: ~200ms. Total p50: ~2.5s.
Cost / Request
$0.01–0.10
Simple (cheap model): $0.01. Complex (frontier + RAG + refinement): $0.05–0.10. Agent loops: $0.20+.
Context Window Budget
128K tokens
System prompt ~2K. Core memory ~2K. Conversation ~8K. Retrieved context ~20K. Response ~20K. Buffer: ~76K.
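The budget above is simple arithmetic, and a small check makes it easy to revisit when a section grows. The sketch below uses the figures from this card; the dictionary keys and the function name are illustrative assumptions.

```python
CONTEXT_WINDOW_TOKENS = 128_000

# Per-section allocations taken from the estimate above.
budget = {
    "system_prompt": 2_000,
    "core_memory": 2_000,
    "conversation": 8_000,
    "retrieved_context": 20_000,
    "response": 20_000,
}


def remaining_buffer(allocations, window=CONTEXT_WINDOW_TOKENS):
    """Return the unallocated tokens, or raise if the plan exceeds the window."""
    used = sum(allocations.values())
    if used > window:
        raise ValueError(f"budget of {used} tokens exceeds window of {window}")
    return window - used


buffer_tokens = remaining_buffer(budget)
print(buffer_tokens)  # 76000
```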
Memory Store / User / Year
~1 GB
Core memory: ~2KB always loaded. Archival: ~10KB/session × 250/yr ≈ 2.5 MB. Episodic: ~5KB × 1K episodes ≈ 5 MB. KG: varies (the remainder of the estimate).
Knowledge Graph
100K–10M nodes
Enterprise codebase: ~100K entities. Document corpus: ~1M. Community summaries: 100–500 clusters × 3 hierarchy levels.
Golden Set Size
100–1000 cases
Cover edge, common, adversarial cases. Expert-validated. Run on every deploy. Grows from production failures.
Eval Cost / Deploy
$5–50
100 cases × $0.05 LLM-as-Judge = $5. 1000 cases = $50. Cheap insurance against regressions.
Parser Flywheel economics: Week 1: 100% LLM-parsed ($0.05/doc). Month 3: 90% deterministic parser ($0.0001/doc), 10% LLM fallback. Month 6: 95%/5%. The more you process, the cheaper it gets.
03 High-Level Design (8–12 min)
"Every GenAI system maps to five layers. They communicate through typed schemas, never raw strings. Each layer is independently deployable, testable, and replaceable. Data flows bottom-up with a feedback loop from orchestration back to generation and from memory into context assembly."
Figure: Universal 5-Layer Architecture
  • Layer 1, Intake (unstructured → structured): Raw Input, Type Classifier, Schema Detector, Parser / LLM, Validator, Structured Representation.
  • Layer 2, Generation (composable pipelines): Context Assembly, Prompt Pipeline, Model Router → LLM Call, Output Parser, Guardrails, Refinement Loop (Generate → Critique → Revise).
  • Layer 3, Evaluation (traces, judges, golden sets): Structural Validators, LLM-as-Judge, Trace Collector, Golden Set Runner, Human Feedback.
  • Layer 4, Memory & Knowledge (5-tier cognitive + knowledge graph): Working (context window, 128K tokens), Short-Term (session buffer, 1 session), Long-Term (core + archival, permanent), Episodic (experience DB, append-only), Procedural (learned skills, versioned). Knowledge Graph + GraphRAG + Hybrid Retrieval: entities · relationships · communities; vector + graph traversal + BM25 → RRF re-rank.
  • Layer 5, Orchestration (adaptive complexity): Router / Planner, Complexity Estimator, Workflow Engine, Agent Runtime, Human Escalation.
  • Edge labels: feedback loop, context injection. Endpoints: Raw User Input, Response to User.
The Abstraction Sandwich
// The architecture that protects you from model churn
+-----------------------------------------------+
| APPLICATION LOGIC   (Yours forever)           |
| Business rules, workflows, domain knowledge   |
+-----------------------------------------------+
| ABSTRACTION LAYER   (Your insurance)          |
| Model router, prompt assembler, eval runner   |
+-----------------------------------------------+
| FOUNDATION LAYER    (Will change)             |
| Specific models, APIs, embedding services     |
+-----------------------------------------------+
Key Architectural Decisions
Decision | Choice | Why Not Alternative
Model coupling | Abstraction Layer (never call directly) | Direct API calls → vendor lock-in; model deprecations break production
Prompt management | Composable versioned components | Monolithic prompts are untestable; can't identify which piece regressed
Output format | Schema-first (Pydantic/JSON Schema) | Free-form text is unparseable; format varies by model version
Retrieval | Hybrid (vector + graph + keyword) | Vector-only misses exact matches and relationships; keyword-only misses semantics
Memory | 5-tier cognitive model | Flat store can't differentiate critical facts from stale trivia
Memory control | Self-managed (LLM decides via tools) | External-only management misses semantic nuance of what's worth remembering
Evaluation | Multi-speed (schema → metrics → judge → golden → human) | Single eval is either too slow or misses quality issues
Orchestration | Adaptive complexity escalation | Always-agent wastes cost on simple queries; always-pipeline under-serves complex ones
04 Deep Dives (25–30 min)
Deep Dive 1: Intake Layer & Parser Flywheel (6 min)
Goal: Transform raw chaos (PDFs, code diffs, images, audio) into typed, validated, structured representations before any LLM call.
Figure: Intake: Parse → Classify → Schema → Validate + Parser Flywheel
  • Stages: Raw Input; Type Classifier (text · doc · code · image · audio); Schema Detector (known? → map; unknown? → LLM infers); Parser (deterministic or LLM fallback); Validator (schema + cross-validate); Structured Output.
  • Parser Flywheel ("LLM as Parser Generator"): Instead of running an LLM on every document ($0.05/doc), use the LLM to write a deterministic parser from 20 examples. Deploy the parser and process millions for pennies. For the 5% the parser can't handle, fall back to the LLM. Week 1: 100% LLM → Month 3: 90% parser → Month 6: 95% parser. Insight: "We need the LLM to teach us how to read documents."
  • Auto-generate when: the same doc type has been seen 20+ times with 95%+ LLM extraction accuracy. Test the parser against the examples. Register it if ≥95% accurate.
  • Progressive Structuring (for truly unstructured data): ① CLASSIFY type, ② SEGMENT sections, ③ EXTRACT entities, ④ RELATE → knowledge graph, ⑤ VALIDATE quality. Each stage produces structured output feeding the next. If any stage fails validation, you know exactly where the breakdown is.
  • Schema-First Extraction: Define your output schema (Pydantic model) BEFORE writing the prompt. Use JSON Schema constraints in the LLM call. Parse and validate with the schema. Cross-validate (e.g., line items sum = subtotal). This eliminates format drift across model versions.
  • Parser Flywheel economics: 10,000 invoices/day × $0.05/LLM parse = $500/day. After the flywheel generates deterministic parsers: 9,500 × $0.0001 + 500 × $0.05 = $25.95/day. That is roughly a 20× cost reduction, and it grows as parser coverage rises.
  • Real-world use case (legal contracts): 500+ contracts/day. An LLM classifies the contract type, the system tries a deterministic parser first, and it falls back to LLM extraction with schema enforcement. When a doc type accumulates 20+ examples, an LLM auto-generates a new parser, tests it, and registers it if ≥95% accurate.
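The flywheel arithmetic and the "deterministic parser first, LLM fallback" routing described above can be sketched together. The per-document costs come from the text; everything else (function names, the toy `parse_invoice` parser, and the stand-in `fake_llm_extract`) is hypothetical and only illustrates the control flow.

```python
LLM_COST_PER_DOC = 0.05        # from the estimate above
PARSER_COST_PER_DOC = 0.0001   # from the estimate above


def daily_cost(docs_per_day, parser_share):
    """Daily spend when `parser_share` of documents hit a deterministic parser."""
    parsed = round(docs_per_day * parser_share)
    fallback = docs_per_day - parsed
    return parsed * PARSER_COST_PER_DOC + fallback * LLM_COST_PER_DOC


def extract(doc, parsers, llm_extract):
    """Try the registered parser for this doc type; fall back to the LLM."""
    parser = parsers.get(doc["type"])
    if parser is not None:
        try:
            return parser(doc["text"]), "parser"
        except ValueError:
            pass  # layout the parser does not recognize
    return llm_extract(doc["text"]), "llm"


def parse_invoice(text):
    """Toy deterministic parser: expects exactly 'TOTAL: <number>'."""
    prefix = "TOTAL: "
    if not text.startswith(prefix):
        raise ValueError("unrecognized layout")
    return {"total": float(text[len(prefix):])}


def fake_llm_extract(text):
    """Stand-in for a schema-constrained LLM call (no real API is used)."""
    return {"total": None, "raw": text}


parsers = {"invoice": parse_invoice}
print(f"{daily_cost(10_000, 0.0):.2f}")   # 500.00
print(f"{daily_cost(10_000, 0.95):.2f}")  # 25.95
print(extract({"type": "invoice", "text": "TOTAL: 12.50"}, parsers, fake_llm_extract))
print(extract({"type": "invoice", "text": "Amount due 12.50"}, parsers, fake_llm_extract))
```

Returning the route ("parser" or "llm") alongside the result keeps the fallback rate measurable, which is the number the flywheel is trying to drive down.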
Deep Dive 2: Generation Engine: Composable Pipelines (7 min)
Goal: Build a generation layer that's model-agnostic, prompt-versionable, and self-refining.
Figure: Generation Pipeline
  • Prompt assembly (composable, versioned, testable components): System Role v2.1, Context v1.0, Few-Shot v3.2, Output Schema v1.1, Task Instruction v4.0, Custom Rules.
  • Model Router: selects the optimal provider per request. Simple → Haiku/GPT-4o-mini ($0.01) | Complex → Opus/GPT-5 ($0.10) | also weighs vision, structured output, latency SLA.
  • Output Parser + Schema Validate: JSON → Pydantic + cross-validate.
  • Guardrails (safety + policy): PII · toxicity · brand safety · hallucination.
  • Recursive Refinement Loop: Generate → Critic scores rubric → Revise if < threshold.
  • Result: validated, typed, structured output.
  • Composable Prompts: Each prompt component (system role, context, few-shot, output schema, task instruction) is a named, versioned, testable unit. When you switch from GPT-4 to Claude, you change the generation call, not the prompt assembly. When output requirements change, update the schema component. Each piece iterates independently.
  • Model Abstraction Layer: Never call openai.chat.completions.create() directly. Go through ModelRouter.route() which selects provider based on: task complexity (cheap model for simple, frontier for complex), latency SLA, cost budget, feature needs (vision, structured output). When a model is deprecated: change one config line.
  • Recursive Refinement: Generate → Critique → Revise loop. A critic LLM evaluates the draft against a rubric (multi-dimensional: accuracy, actionability, completeness, tone). If score < threshold and iterations remain, revise. Each iteration is traced. Typically converges in 1-2 iterations. Max iterations → flag for human review.
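The Model Abstraction Layer described above can be sketched as a small router that filters a model registry by the request's constraints and picks the cheapest match. This is an illustrative sketch only: `ModelSpec`, `Request`, the 1-5 complexity scale, and the model names "small-fast" and "frontier" are assumptions, not real provider identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str                 # hypothetical identifier, not a real model ID
    cost_per_request: float
    p50_latency_s: float
    max_complexity: int       # highest task complexity (1-5) handled well
    supports_vision: bool = False


@dataclass(frozen=True)
class Request:
    complexity: int
    latency_sla_s: float
    cost_budget: float
    needs_vision: bool = False


class ModelRouter:
    def __init__(self, models):
        self.models = list(models)

    def route(self, request):
        """Return the cheapest model that satisfies every constraint."""
        candidates = [
            m for m in self.models
            if m.max_complexity >= request.complexity
            and m.p50_latency_s <= request.latency_sla_s
            and m.cost_per_request <= request.cost_budget
            and (m.supports_vision or not request.needs_vision)
        ]
        if not candidates:
            raise LookupError("no model satisfies the request constraints")
        return min(candidates, key=lambda m: m.cost_per_request)


# The registry is the "one config line" to edit when a model is deprecated.
router = ModelRouter([
    ModelSpec("small-fast", 0.01, 1.0, max_complexity=2),
    ModelSpec("frontier", 0.10, 6.0, max_complexity=5, supports_vision=True),
])

print(router.route(Request(complexity=1, latency_sla_s=3.0, cost_budget=0.05)).name)
print(router.route(Request(complexity=4, latency_sla_s=10.0, cost_budget=0.20)).name)
```

Application code only ever calls `route()`, so swapping a provider changes the registry and nothing else.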
Deep Dive 3: Evaluation Layer: Traces, Judges, Golden Sets (7 min)
Goal: Build the "immune system": detect degradation, prove improvement, build trust. This is where most teams fail.
Figure: Multi-Speed Evaluation + Trace Architecture
Five Speeds of Testing
  • Speed 1 (ms), Schema Validation: every request · cost ~0 · format, types, ranges.
  • Speed 2 (sec), Automated Metrics: every request · cost minimal · BLEU, exact match, length.
  • Speed 3 (min), LLM-as-Judge: sampled or async · $/eval · quality, accuracy, safety.
  • Speed 4 (hours), Golden Set Regression: every deploy · block if regressed · CI/CD gate.
  • Speed 5 (weekly), Human Expert Review: calibration · cost highest · calibrates the LLM judge.
Traces: The X-Ray of Every Request
Each trace captures the complete lifecycle: every span (classify → retrieve → generate → evaluate) with timing, tokens, cost, and metadata. Example: a 1000-request analysis → p99 latency 23s (LLM generation: 18.7s at p99). 10% failure rate, broken down as 45% insufficient RAG, 30% wrong language, 15% format error, 10% hallucinated API. Without traces, you're guessing. With traces, you know exactly what to fix and in what priority.
Golden Sets + Eval-Driven Development
100+ expert-validated test cases covering edge, common, and adversarial inputs. Run on every deploy and block if critical dimensions regress. Growth cycle: production failures → root cause → new golden entries. The Month 6 golden set is 3× more comprehensive than Month 1. Eval-Driven Dev: (1) Define evals → (2) Run baseline → (3) Make change → (4) Run evals → (5) Compare per-category → (6) Ship or iterate. TDD for AI.
  • LLM-as-Judge: A separate LLM call evaluates output against a multi-dimensional rubric: accuracy (weight 0.3), security (0.25), actionability (0.25), completeness (0.1), tone (0.1). Scores 1-5 per criterion with specific evidence and suggestions. Returns structured JSON. Judge is calibrated weekly against human expert ratings.
  • Experience Database → Self-Improvement: Every request's strategy + outcome is recorded. Weekly: identify weak categories (score < threshold), analyze failure patterns, generate fixes, test against golden set, commit if improved, rollback if not. The system gets better every week without manual intervention.
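The LLM-as-Judge rubric above reduces to a weighted average once the judge has returned its per-criterion scores. The sketch below shows only that aggregation step, using the weights from the text; the function name and the validation rules are assumptions, and the judge call itself is out of scope here.

```python
# Rubric weights from the LLM-as-Judge description above (they sum to 1.0).
RUBRIC_WEIGHTS = {
    "accuracy": 0.30,
    "security": 0.25,
    "actionability": 0.25,
    "completeness": 0.10,
    "tone": 0.10,
}


def weighted_judge_score(scores, weights=RUBRIC_WEIGHTS):
    """Collapse per-criterion 1-5 scores into one weighted score on the same scale."""
    if set(scores) != set(weights):
        raise ValueError("scores must cover exactly the rubric criteria")
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name} score {value} is outside the 1-5 range")
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Hypothetical judge output for one code review.
judge_scores = {
    "accuracy": 4,
    "security": 5,
    "actionability": 3,
    "completeness": 4,
    "tone": 5,
}
print(round(weighted_judge_score(judge_scores), 2))  # 4.1
```

Rejecting missing or out-of-range criteria up front means a malformed judge response fails loudly instead of silently skewing the weekly weak-category analysis.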
Deep Dive 4: Memory, Knowledge Graphs & GraphRAG (10 min)
Goal: Solve the amnesia problem. Every LLM call starts from zero. We need a hierarchical, multi-tier memory architecture inspired by cognitive science, plus knowledge graphs for relational reasoning that vector search can't do.
Figure: 5-Tier Cognitive Memory Architecture
  • Tier 1, Working Memory (context window = RAM): System prompt ~2K | Core memory ~2K | Conversation ~8K | RAG results ~20K | Response ~20K. Lifespan: 1 LLM call | Cost: highest (per-token) | Capacity: 128K tokens | Must be intelligently curated.
  • Tier 2, Short-Term (session buffer): Full conversation | Scratchpad notes-to-self | Accumulated state | Tool cache. Lifespan: 1 session. On end → consolidate: extract facts, record episode, archive raw session.
  • Tier 3, Long-Term (core + archival): Core: ~2KB always in context (user prefs, key facts, active goals). LLM edits via tools. Archival: unlimited, searched on demand. Vector + keyword + graph. Temporal decay for stale entries. Self-managed: the LLM calls core_memory_update() and archival_memory_search(); the app enforces policy.
  • Tier 4, Episodic (Experience Database): Situation → Action → Outcome chains. "Have I seen something like this before?" Enables learning WITHOUT retraining. Recall similar episodes → inject lessons into prompt.
  • Tier 5, Procedural (learned skills): Distilled from successful episodes. "Best known way to do X." Prompt templates, tool sequences. Codified when consolidation finds 5+ similar successful episodes with confidence > 0.9.
Memory Controller (orchestrates all tiers)
  • Before every LLM call: ① Core memory (always in context) → ② Search archival (relevant facts) → ③ Recall episodes (similar experiences) → ④ Find procedure (learned skills) → ⑤ Assemble context window (page in, respect token budget).
  • After every response: extract facts → record episode → update short-term.
  • Background: consolidation (hourly), procedure learning (weekly), decay (daily).
Knowledge Graph + GraphRAG
  • Entities (nodes) + Relationships (edges) + Community Summaries (clusters). Vector search finds similar text. KG finds connected facts.
  • Local search (entity queries, e.g. "Who owns auth-service?"): entity lookup → traverse neighbors 1-2 hops → chunks.
  • Global search (thematic queries, e.g. "What are the main risks?"): community summaries → map-reduce synthesis.
  • Hybrid retrieval: Vector + Graph + BM25 → Reciprocal Rank Fusion. Adaptive weights: tech IDs → 0.8 keyword. Relationships → 0.7 graph. Broad → 0.5 vector.
  • Temporal KG: bi-temporal model (valid time + system time). Point-in-time + range queries.
Tiers page into working memory. Consolidation flows down. Knowledge graph feeds hybrid retrieval.
  • Context Window = RAM: Fixed budget. You page in from lower tiers via the Memory Controller, which predicts relevance. Recent conversation kept verbatim; older turns progressively summarized. Core memory (~2KB) always loaded. Archival results paged in per-query. Think: desk (context window) vs. filing cabinet (long-term memory).
  • Self-Managed Memory: The LLM itself manages its memory via tool calls (core_memory_update, archival_memory_search, memory_consolidate). The LLM has the semantic understanding to decide what's important, relevant, and contradictory. Application code enforces policies (limits, privacy, retention).
  • Temporal Knowledge Graph: Bi-temporal model tracks when facts were true in the real world (valid time) AND when the system learned about them (system time). Old facts are never deleted; they're closed and a new version is created. Enables: "Who leads auth now?" (current), "Who led auth in Q1?" (point-in-time), "How has auth leadership changed?" (timeline).
  • Community Detection for Global Search: Leiden algorithm clusters the KG into hierarchical communities (Level 1: 5-10 major themes, Level 2: 20-50 sub-themes, Level 3: 100+ specific topics). Each community gets a pre-computed LLM summary. Global queries read 10-20 summaries and synthesize, instead of scanning every document.
  • Consolidation = Sleep: Background process clusters related memories, synthesizes higher-level insights, promotes high-confidence insights to core memory, prunes redundant entries, and applies temporal decay (5%/month confidence loss for unaccessed, non-pinned memories). Memories below 0.1 confidence move to cold storage.
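The temporal decay rule in the consolidation bullet can be worked through numerically. The text gives the rate (5%/month) and the cold-storage threshold (0.1) but not whether the loss compounds; the sketch below assumes compounding monthly decay, and the function names are illustrative.

```python
MONTHLY_DECAY = 0.05            # 5%/month, from the consolidation rule above
COLD_STORAGE_THRESHOLD = 0.1    # below this, move to cold storage


def decayed_confidence(confidence, months_unaccessed, pinned=False):
    """Apply compounding monthly decay (an assumption; the text does not specify)."""
    if pinned:
        return confidence  # pinned memories are exempt from decay
    return confidence * (1 - MONTHLY_DECAY) ** months_unaccessed


def should_move_to_cold_storage(confidence, months_unaccessed, pinned=False):
    """True once the decayed confidence falls below the cold-storage threshold."""
    return decayed_confidence(confidence, months_unaccessed, pinned) < COLD_STORAGE_THRESHOLD


print(round(decayed_confidence(1.0, 12), 2))    # 0.54
print(should_move_to_cold_storage(1.0, 44))     # False
print(should_move_to_cold_storage(1.0, 45))     # True
```

Under this assumption a fully confident, never-accessed memory takes 45 months to reach cold storage, so the threshold mostly affects memories that started with low confidence.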
Deep Dive 5: Agent Orchestration & Adaptive Complexity (6 min)
Goal: When simple pipelines aren't enough, route requests dynamically. Start with the cheapest approach. Escalate only when quality demands it. Record what works for future routing.
Figure: Adaptive Complexity Escalation
  • Nodes: User Request, Router / Planner, Complexity Estimator.
  • Levels: L1 Direct (1 LLM call · $0.01); L2 RAG (retrieve + gen · $0.03); L3 Pipeline (multi-step · $0.05); L4 Refinement (Gen → Crit → Rev · $0.08); L5 Agent (autonomous · $0.15); L6 Multi-Agent (specialists · $0.30); Human.
  • Evaluation: score ≥ threshold? YES → return result | NO & budget remains → escalate to next level | NO & max → human.
  • Experience Database: record strategy + outcome. Over time, the router learns to skip directly to the right level for each input type.
  • Weekly self-improvement: identify weak categories → analyze failure patterns → generate fixes → test against golden set → commit if improved, rollback if not.
  • Adaptive Escalation: The complexity estimator routes simple queries (L1-L2) to cheap models, complex multi-step tasks (L3-L4) to pipelines, and open-ended autonomous tasks (L5-L6) to agents. When a level's quality score is insufficient and cost budget remains, the system escalates to the next level. The Experience Database records which level succeeds for which input type, so over time the router skips directly to the optimal level.
  • Multi-Agent Collaboration: For the most complex tasks, specialized agents (e.g., code reviewer, security auditor, test coverage analyzer) work under an orchestrator agent. Each agent has its own prompt, tools, and evaluation criteria. The orchestrator decomposes the task, assigns sub-tasks, resolves conflicts between agents, and synthesizes the final output.
  • Human-in-the-Loop Escalation: When automated quality is low and confidence is low, escalate to a human expert with a context summary. The human's corrections feed back into the Experience DB and golden sets, improving the system for similar future inputs.
  • Self-Improvement Cycle: Weekly: (1) identify weak categories from Experience DB, (2) cluster failure patterns, (3) for each pattern generate a fix (new prompt, new RAG config, new golden entries), (4) test fix against golden set, (5) commit if improved, rollback if regressed. The system gets better every week without manual intervention.
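The escalation rule in this deep dive (run the cheapest level, evaluate, escalate while budget remains, hand off to a human otherwise) can be sketched as a short loop. Level names and costs follow the text; `run_with_escalation`, the default threshold and budget, and the fake `execute`/`evaluate` callables are assumptions for illustration.

```python
# (level name, approximate cost per attempt), cheapest first, from the text.
LEVELS = [
    ("direct", 0.01),
    ("rag", 0.03),
    ("pipeline", 0.05),
    ("refinement", 0.08),
    ("agent", 0.15),
    ("multi_agent", 0.30),
]


def run_with_escalation(request, execute, evaluate,
                        threshold=4.0, cost_budget=0.50, start_level=0):
    """Try levels in order until the quality score meets the threshold.

    Falls back to a human when the budget or the level list is exhausted.
    """
    spent = 0.0
    for name, cost in LEVELS[start_level:]:
        if spent + cost > cost_budget:
            break  # the next level would exceed the per-request budget
        output = execute(name, request)
        spent += cost
        score = evaluate(output)
        if score >= threshold:
            return {"level": name, "output": output, "score": score, "cost": spent}
    return {"level": "human", "output": None, "score": None, "cost": spent}


# Fake components: each level "answers" with its own name, and only the
# pipeline level (or higher) produces an answer the evaluator accepts.
FAKE_SCORES = {"direct": 2.0, "rag": 3.5, "pipeline": 4.2}


def fake_execute(level, request):
    return level


def fake_evaluate(output):
    return FAKE_SCORES.get(output, 5.0)


result = run_with_escalation("review this diff", fake_execute, fake_evaluate)
print(result["level"], round(result["cost"], 2))  # pipeline 0.09
```

The `start_level` parameter is where an Experience Database lookup would plug in: once the router has learned that an input type usually succeeds at a given level, it can skip the cheaper attempts.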
05 Cross-Cutting Concerns (10–12 min)
Observability: Design for Visibility from Day 1
"Every component must emit standard signals: component name, operation, latency, success/fail, model used, tokens consumed, cost, quality score, confidence, and trace ID. If you can't see it, you can't fix it. Traces reveal exactly what to fix and in what priority."
AdaptiveOrchestrator.COMPLEXITY_LEVELS
  Level 1: direct        Single LLM call, no retrieval             → $0.01
  Level 2: rag           LLM + retrieval-augmented generation      → $0.03
  Level 3: pipeline      Multi-step decomposition pipeline         → $0.05
  Level 4: refinement    Generate → Critique → Revise loop         → $0.08
  Level 5: agent         Autonomous agent with tool use            → $0.15
  Level 6: multi_agent   Multiple specialized agents               → $0.30
  Level 7: human         Escalate to human expert                  → $$$
// For each level: execute → evaluate quality → if score ≥ threshold: return
// If score < threshold AND cost < budget: escalate to next level
// Experience DB records which level succeeded for which input type
// Over time, the router learns to skip directly to the right level
Failure Scenarios
Model deprecated overnight: Abstraction Layer → change provider config. Model Router auto-routes to next-best. Golden set regression test validates no quality loss before production traffic.
LLM hallucinates output: Schema validation catches format errors. Cross-validation catches arithmetic errors. LLM-as-Judge catches factual errors. Guardrails catch safety issues. Low confidence → human queue.
RAG retrieves irrelevant context: Traces reveal retrieval quality per request. Hybrid retrieval (vector + graph + BM25) covers different failure modes. A cross-encoder re-ranker scores relevance. Golden set tests retrieval quality.
Memory contradiction: Consolidation process detects contradictions. Resolution: newer info wins (default), store both + flag (uncertain), user correction always wins (highest trust). Audit trail for all changes.
Context window overflow: Context Window Manager enforces token budgets per section. Older conversation progressively summarized. Retrieved context ranked and truncated. Core memory always fits (~2KB).
Knowledge graph entity resolution error: Fuzzy matching + embedding similarity for dedup. Confidence scores on all relationships. Provenance tracking (which source, when extracted). Graph cleanup during consolidation.
Cost spike (agent loop runaway): Per-request cost budget in orchestrator. Max iterations on refinement loops. Adaptive complexity only escalates when quality demands it. Experience DB learns optimal starting level per input type.
Eval drift (LLM judge becomes unreliable): Weekly human expert calibration against LLM judge scores. Track judge-human agreement rate. Re-calibrate judge prompt when agreement drops below threshold.
Anti-Patterns: What Goes Wrong
Anti-Pattern | What Breaks | Correct Approach
Stuff everything in context | Token budget exceeded, important info pushed out, cost explodes | Tiered memory with intelligent paging via Memory Controller
No memory consolidation | Memory grows forever, contradictions accumulate, retrieval degrades | Periodic consolidation: synthesize, prune, decay (5%/month unaccessed)
Treat all memories equally | Critical preferences buried under trivia | Tiered importance: core (always loaded ~2KB) vs archival (searched)
No temporal awareness | Old facts override new ones, or vice versa, silently | Bi-temporal model: track valid time + system time. Never delete; close + version
No forgetting mechanism | System remembers stale, irrelevant info forever | Confidence decay + archival for unaccessed memories below 0.1 threshold
External-only memory mgmt | App code decides what to store; misses semantic nuance | Self-managed: LLM decides what's worth remembering via tool calls
Extract everything as entities | Noisy graph, slow traversal, low precision | Focused ontology: define entity and relationship types upfront
No entity resolution | Same entity appears as multiple nodes in KG | Fuzzy matching + embedding similarity for deduplication
Graph without vector search | Can't handle novel queries or semantic matching | Hybrid retrieval: always combine graph + vector + keyword search
No community summaries | Global queries require scanning entire graph | Pre-compute hierarchical community summaries via Leiden algorithm
Ignoring provenance | Can't trace why system believes something | Track source document, extraction confidence, timestamp for every fact
Monolithic prompts | Untestable, can't identify which piece regressed | Composable versioned prompt components, each iterates independently
Core Data Schemas
Trace
  trace_id             UUID
  spans[]              Span[]         -- name, start/end, input/output, metadata, status, error
  total_duration_ms    float
  total_cost_usd       float
  total_tokens         int
GoldenSetEntry
  id                   UUID
  version              string
  input_data           dict
  expected_output      string | null
  evaluation_criteria  dict[]
  metadata             dict           -- tags, difficulty, category, source
  created_by           string         -- which expert validated
Experience
  experience_id        UUID
  input_hash           string         -- stable hash of input features
  input_category       string         -- "code_review", "contract_analysis"
  strategy             dict           -- model, prompt version, RAG config
  eval_scores          dict
  outcome              string         -- "success", "partial", "failure"
  lessons              string[]
Episode
  episode_id           UUID
  situation            string         -- what was happening
  strategy             dict           -- what approach was used
  eval_scores          dict           -- how it turned out
  lesson               string         -- what to learn from this
  embedding            float[]        -- for similarity search
TemporalFact
  entity_id            string
  predicate            string         -- "leads", "depends_on", "has_vulnerability"
  object_id            string
  valid_from           datetime       -- when true in real world
  valid_to             datetime?      -- null = still true
  system_from          datetime       -- when system learned this
  source               string
  confidence           float
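As an illustration of how the TemporalFact schema supports point-in-time queries, here is a minimal sketch using a dataclass with the same fields. The helper `who_as_of`, the half-open validity interval, and the sample people and dates are all hypothetical; a production system would more likely use Pydantic models and a graph store.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class TemporalFact:
    entity_id: str
    predicate: str
    object_id: str
    valid_from: datetime            # when true in the real world
    valid_to: Optional[datetime]    # None = still true
    system_from: datetime           # when the system learned this
    source: str
    confidence: float


def who_as_of(facts, predicate, object_id, when):
    """Entities for which (entity, predicate, object) held at time `when`.

    Validity is treated as the half-open interval [valid_from, valid_to),
    an assumption so that a handover date belongs to the new fact.
    """
    return [
        f.entity_id for f in facts
        if f.predicate == predicate
        and f.object_id == object_id
        and f.valid_from <= when
        and (f.valid_to is None or when < f.valid_to)
    ]


# Hypothetical history: the old fact is closed, never deleted.
facts = [
    TemporalFact("alice", "leads", "auth-service",
                 datetime(2023, 1, 1), datetime(2024, 4, 1),
                 datetime(2023, 1, 5), "org-chart", 0.95),
    TemporalFact("bob", "leads", "auth-service",
                 datetime(2024, 4, 1), None,
                 datetime(2024, 4, 3), "org-chart", 0.95),
]

print(who_as_of(facts, "leads", "auth-service", datetime(2024, 2, 15)))  # ['alice']
print(who_as_of(facts, "leads", "auth-service", datetime(2024, 6, 1)))   # ['bob']
```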
Complete Pattern Catalog
Category | Pattern | What It Does
Intake | Schema-First Extraction | Define output schema before prompting. Pydantic + JSON Schema constraints.
Intake | LLM-as-Parser-Generator | LLM writes deterministic parser code from examples. Deploy parser, save 20× cost.
Intake | Progressive Structuring | Classify → Segment → Extract → Relate → Validate. For truly unstructured data.
Intake | Parser Flywheel | Auto-generate parsers as doc types accumulate. 100% LLM → 5% LLM over 6 months.
Generation | Composable Prompts | Versioned, testable prompt components. Each iterates independently.
Generation | Model Abstraction Layer | Never call a model directly. Router selects optimal provider per request.
Generation | Recursive Refinement | Generate → Critique → Revise loop. Typically converges in 1-2 iterations.
Evaluation | LLM-as-Judge | Scalable quality assessment. Multi-dimensional rubric, structured JSON output.
Evaluation | Golden Sets | Expert-validated test suites. CI/CD gate. Grows from production failures.
Evaluation | Multi-Speed Testing | Schema → Metrics → Judge → Golden → Human. Five speeds, five cost levels.
Memory | 5-Tier Cognitive Model | Working → Short-Term → Long-Term → Episodic → Procedural. Paging architecture.
Memory | Self-Managed Memory | LLM decides what to store/retrieve via tool calls. App enforces policies.
Memory | Memory Consolidation | Background: cluster, synthesize, prune, decay. Like the brain during sleep.
Knowledge | Knowledge Graph | Entities + relationships. Entity resolution via fuzzy match + embedding.
Knowledge | GraphRAG (Local) | Entity → traverse neighbors → connected chunks. For specific entity queries.
Knowledge | GraphRAG (Global) | Community summaries → map-reduce synthesis. For broad thematic queries.
Knowledge | Hybrid Retrieval | Vector + Graph + BM25 → Reciprocal Rank Fusion → cross-encoder re-rank.
Knowledge | Temporal KG | Bi-temporal facts (valid time + system time). Point-in-time + range queries.
Orchestration | Adaptive Escalation | Direct → RAG → Pipeline → Agent → Human. Start cheap, escalate when needed.
Architecture | Abstraction Sandwich | App Logic / Abstraction / Foundation. Foundation changes; app logic doesn't.
Architecture | Schema-First Design | Typed contracts between all components. Pydantic models, not raw strings.
Architecture | Progressive Enhancement | Ship Week 1, add capability each week. System always production-ready.
06 Wrap-Up: Progressive Enhancement Timeline (3–5 min)
Build Incrementally, Never Big-Bang
  • Week 1: Prompt → LLM → Output + Schema validation + Basic logging. Production-ready for simple use cases.
  • Week 2: + RAG retrieval + LLM-as-Judge eval + Trace collection. Now you can measure quality.
  • Week 3: + Golden set tests + Experience database + Model routing (cheap vs. expensive). Cost optimization begins.
  • Week 4: + Recursive refinement + Confidence-based escalation + Human-in-the-loop. Quality ceiling rises.
  • Month 2: + Multi-step pipelines + Parser generation flywheel + Automated improvement cycles. Self-improving system.
  • Month 3: + Agent orchestration + Knowledge graph + Memory tiers. Full cognitive architecture.
  • Month 4: + GraphRAG (local + global search) + Community detection + Temporal KG. Enterprise-grade knowledge system.
At each stage, the system is production-ready. Each addition is validated against evals before shipping. The architecture has slots for all components; the implementation fills them progressively. Design must be complete from day one, because retrofitting memory onto a stateless system is 10× harder than building in the extension points.
Real-World Instantiations
Domain | Intake | Generation | Evaluation | Memory
Code Review | AST-level diff parsing. Classify: bug fix, feature, refactor. | Structured CodeReview schema. Confidence-scored findings. | Resolution rate (did dev fix it?). BugBench golden set. | Experience DB: winning model/prompt per language. KG: service dependencies.
Legal Documents | Parser Flywheel for 500+ contracts/day. 95% deterministic by Month 6. | Clause extraction, risk assessment, template comparison. | Lawyer reviews + corrections. Risk accuracy golden set. | Contract KG: parties, obligations, dependencies. Temporal: terms change over time.
Customer Support | Intent classification + sentiment analysis. | RAG over knowledge base + past tickets. Confidence-based routing. | CSAT surveys + agent edits of AI drafts. | Weekly self-improvement cycle: failure analysis → prompt refinement → golden set growth.
The Final Insight
"Models come and go. Prompts get rewritten. APIs change. The patterns (intake, generation, evaluation, memory, orchestration) are permanent. Build the architecture once. Build it right. Everything else is configuration. When someone asks you to build a second GenAI system, you reuse 70-80% of the architecture. The patterns are durable. The implementations evolve."
★ Interview Q&A: Practice
Q1: Why shouldn't you just send everything to the LLM and parse the output?
"Three reasons. First, cost: running an LLM on every document at $0.05 each adds up to $500/day for 10K documents. The Parser Flywheel pattern generates deterministic parsers from examples, reducing cost to $0.0001/doc for 90-95% of inputs within months. Second, reliability: LLM outputs are nondeterministic. The same invoice might parse differently on Tuesday than Monday, or differently after a model update. Schema-first extraction with Pydantic validation catches format drift. Cross-validation catches arithmetic errors. Third, speed: a deterministic parser runs in milliseconds; an LLM call takes seconds. The insight is: you don't need the LLM to read every document; you need it to teach you how to read documents."
Q2: How does the 5-tier memory architecture differ from just using RAG?
"RAG is one retrieval mechanism. The 5-tier cognitive model is a complete memory system. Working Memory is the context window: your most precious, expensive resource. Short-Term Memory holds the full session state, including scratchpad notes and accumulated reasoning, then consolidates durable knowledge at session end. Long-Term Memory has two stores: Core Memory (~2KB always in context: user preferences, key facts) and Archival Memory (unlimited, searched on demand). Episodic Memory stores situation-action-outcome chains ('last time I saw a similar PR, what worked?'), enabling learning without retraining. Procedural Memory stores codified skills distilled from successful episodes. The key design choice: the LLM itself manages its memory through tool calls, because it has the semantic understanding to decide what's important. RAG is just one retrieval modality feeding into the working memory. The full system also uses graph traversal, keyword search, and re-ranking."
Q3: When would you use GraphRAG versus pure vector search?
"Vector search finds semantically similar text. It's great for 'find me documents about authentication' but fails at 'how does the auth service connect to billing?', which requires traversing relationships. GraphRAG adds a knowledge graph with entity resolution and community detection. For local queries (specific entity questions like 'who owns the auth service?') you look up the entity, traverse 1-2 hops of relationships, and retrieve connected source chunks. Global queries (broad themes like 'what are the main risks across our platform?') can't be answered by retrieving a few chunks. Instead, you use pre-computed community summaries: the Leiden algorithm clusters the graph hierarchically, each cluster gets an LLM summary, and global queries read 10-20 summaries and synthesize. The temporal knowledge graph adds a bi-temporal model so you can query point-in-time facts. In production, you always combine all three: vector + graph + BM25 keyword, merged via Reciprocal Rank Fusion, then re-ranked with a cross-encoder."
Q4: How do you prevent evaluation drift, where the LLM judge becomes unreliable?
"The multi-speed testing architecture handles this. Speeds 1 and 2 (schema validation, automated metrics) are deterministic, so they never drift. Speed 3 (LLM-as-Judge) is the one that can drift: the judge model may be updated, the rubric may not cover new failure modes, or the judge may develop blind spots. The countermeasure is Speed 5: weekly human expert calibration. Domain experts score a sample of outputs independently, then we compare judge scores against human scores. If agreement drops below the threshold, we recalibrate the judge prompt, usually by adding new failure examples or adjusting criteria weights. Speed 4 (golden sets) provides the safety net: even if the judge drifts, the golden set regression tests catch it, because golden sets are anchored to expert-validated ground truth, not to the judge. The key insight: every evaluation method has failure modes, so you layer them. Schema validation catches format errors. Automated metrics catch obvious regressions. LLM-as-Judge catches quality issues. Golden sets catch systematic drift. Humans calibrate everything."
Q5: How do you make this architecture cost-effective when frontier models charge $0.01-0.10 per request?
"Four strategies working together. First, Adaptive Complexity Escalation: start with the cheapest approach (single LLM call, small model) and only escalate when the evaluation layer reports insufficient quality. The Experience Database learns which complexity level succeeds for which input type, so over time the router skips directly to the right level instead of always starting at Level 1. Second, Model Routing: simple tasks go to cheap models (Haiku, GPT-4o-mini at $0.001/request), and only complex tasks go to frontier models (Opus at $0.10/request). The router learns from the Experience DB which model works best for each input category. Third, the Parser Flywheel: document processing costs drop 20× over 6 months as deterministic parsers replace LLM calls. Fourth, caching and reuse: if the Experience DB has a high-quality response for a very similar input, skip generation entirely. In Alice's code review system, the average cost settled at $0.04/review with 94% developer satisfaction, because most TypeScript PRs use a mid-tier model and only complex multi-file refactors get the frontier model."
Q6: What's the hardest thing about adding memory to a production LLM system?
"Contradiction resolution and memory bloat. Without consolidation, memory grows forever: every session adds facts, many of which are redundant or contradictory. In March the system learns 'Alice leads auth.' In July, 'Bob leads auth.' A naive system keeps both, or overwrites without history. The bi-temporal model solves this for facts: it tracks when something was true (valid time) and when the system learned it (system time). The consolidation process, inspired by how the brain consolidates during sleep, runs periodically to cluster related memories, synthesize higher-level insights, prune redundant entries, and apply temporal decay to unaccessed memories. The hardest design decision is self-managed vs. external memory management. We chose self-managed: the LLM calls memory tools because it has the semantic understanding to decide what's worth remembering. But you must enforce policies (storage limits, privacy rules, retention periods) in the application layer. The LLM drives content decisions; the infrastructure enforces constraints. Getting this boundary right is the hardest part."
Q7: Walk me through how the system handles a query that requires both local entity lookup and broad thematic analysis.
"Consider: 'What risks does the auth service face based on patterns across our platform?' This is a hybrid query: it needs local search (auth service entity and its neighborhood) AND global search (risk patterns across communities). The query analyzer detects both intents. Path 1: entity lookup finds 'auth-service' in the knowledge graph, traverses 1-2 hops to find dependencies, owners, recent PRs, and vulnerabilities. Path 2: community summaries for risk-related clusters are retrieved. The Leiden algorithm has already clustered the graph into themes like 'authentication & security,' 'billing & payments,' 'infrastructure.' Each community has a pre-computed LLM summary. The system reads the relevant community summaries and runs a map-reduce synthesis. Both paths run in parallel. Results are merged via Reciprocal Rank Fusion: each result gets a score of weight/(k + rank), and documents appearing in both lists get boosted. A cross-encoder re-ranks the final set. The context assembly step fits everything into the token budget: graph-structured context for the entity neighborhood, plus synthesized community insights for the thematic patterns. The LLM generates an answer grounded in both specific facts about auth-service and broad patterns across the platform."
Q8: How does this architecture compare to just using a framework like LangChain or LangGraph?
"Frameworks are implementations; this is architecture. LangChain gives you chains and agents. LangGraph gives you graph-based state machines. CrewAI gives you multi-agent collaboration. But none of them give you the complete production architecture: intake normalization, composable versioned prompts, multi-speed evaluation with golden sets, five-tier cognitive memory, knowledge graphs with GraphRAG, temporal facts, or the experience database that enables self-improvement. Frameworks handle Layer 2 (generation) and parts of Layer 5 (orchestration). The architecture handles all five layers plus the cross-cutting concerns: traces, evaluation, memory management, and the feedback loops that make the system improve over time. In practice, you'd use a framework as one component inside this architecture. LangGraph might power your agent runtime in the orchestration layer. But the intake layer, evaluation layer, memory controller, knowledge graph, and experience database are yours. The framework is the doctor's stethoscope; this architecture is the hospital."