- What kind of memory? Short-term (within session), long-term (across sessions), or both? β Focus on long-term cross-session memory. Session context is the LLM's job; Mem0 handles what persists after.
- What's the unit of memory? Raw chat chunks (RAG-style), or extracted facts? β Extracted facts: "User is vegetarian," not "User said 'I don't eat meat' on Jan 5." This is the key difference from RAG.
- Multi-tenant? Multiple apps, each with their own users? β Yes. Scoped by org β project β user/agent/session.
- Graph memory in scope? β Yes β the graph variant (Mem0α΅) stores entity-relationship triples for relational reasoning.
- Self-hosted or managed? β Both. Open-source (BYO vector DB + graph DB) and managed platform (fully hosted).
- Scale? β Millions of users across thousands of tenants. Billions of memories total. Sub-50ms retrieval latency.
| In Scope | Out of Scope |
|---|---|
| Memory extraction from conversations (LLM-based) | The LLM serving layer itself (OpenAI, Claude) |
| Memory CRUD: add, search, update, delete | Full RAG pipeline (document ingestion, chunking) |
| Hybrid store: vector + graph + key-value | Prompt engineering / agent orchestration |
| Multi-tenant isolation (org/project/user/agent/session) | Fine-tuning models per user |
| Graph memory (entity-relationship extraction) | Multimodal memory (images, audio) |
| Memory consolidation (dedup, conflict resolution) | Real-time streaming conversations |
- UC1: memory.add(messages, user_id) β After a conversation turn, extract salient facts and store them. Deduplicate against existing memories. Resolve conflicts (user changed preference).
- UC2: memory.search(query, user_id) β Before generating a response, retrieve relevant memories to inject into the system prompt. Must return in <50ms.
- UC3: memory.update(memory_id, data) β Explicitly modify a stored memory (admin correction, user request to forget).
- UC4: memory.delete(user_id) β GDPR right-to-be-forgotten. Delete all memories for a user across all stores.
- UC5: Graph traversal β "What does this user know about Python?" β traverse entity graph: User β knows β Python β related_to β Django, Flask.
- Memory retrieval <50ms p95: Memory lookup is on the hot path of every LLM call. 200ms retrieval + 2s LLM = unacceptable. Must be fast enough that adding memory doesn't noticeably increase response time.
- Memory extraction is async and can tolerate latency: Extracting facts from a conversation takes 1-3s (LLM call). This happens AFTER the response is sent to the user β not on the critical path.
- Multi-tenant isolation is non-negotiable: Tenant A's memories must NEVER leak to Tenant B. User X's memories must NEVER leak to User Y. This is PHI/PII data (medical history, dietary preferences, financial info).
- SOC 2 / HIPAA compliance: Audit logs, encryption at rest, encryption in transit, BYOK (bring your own encryption key).
- Eventual consistency for memory writes is acceptable: A memory extracted 2 seconds ago doesn't need to be immediately searchable. 5-second propagation delay is fine.
- Memory quality > memory quantity: 10 precise, deduplicated facts are better than 500 raw chat chunks. The extraction LLM must be ruthlessly selective.
| Requirement | Decision | Why (and what was rejected) | Consistency |
|---|---|---|---|
| Semantic search over facts (<50ms) | Vector DB (Qdrant / pgvector) | Embedding-based similarity search returns semantically relevant memories, not just keyword matches. Full-text search can't find "vegetarian" when query is "dinner suggestions." | AP |
| Relational reasoning ("who knows whom") | Graph DB (Neo4j / Neptune) | Entity-relationship triples enable multi-hop traversal. Vector search can't answer "what topics is the user's manager interested in?" β requires graph path traversal. | AP |
| Memory metadata: CRUD, TTL, audit trail | PostgreSQL for metadata | Memory records need ACID: created_at, updated_at, version history, tenant isolation. Vector DBs lack rich querying and transactions. | CP |
| Memory extraction from conversations | LLM with tool-calling (not regex/NER) | LLM understands context: "I stopped eating meat" β UPDATE existing memory "User eats chicken" to "User is vegetarian." NER/regex can't do conflict resolution. | β |
| Multi-tenant isolation (PHI/PII) | Namespace partitioning (not DB-per-tenant) | Every query includes mandatory tenant_id + user_id filter. Row-level security in PostgreSQL, namespace filtering in vector DB. Separate DBs don't scale to 10K+ tenants. | β |
| Extraction latency off critical path | Async write pipeline (Kafka/SQS) | memory.add() returns immediately. Extraction happens async. User never waits for the LLM extraction call β it fires after the response is sent. | Eventual |
| Operation | When | Example |
|---|---|---|
| ADD | New fact, no existing similar memory | "User has a 3-year-old daughter named Emma" β first mention, no related memory exists. |
| UPDATE | New info augments or modifies existing memory | Existing: "User lives in NYC." New: "User moved to SF last week." β UPDATE to "User lives in SF (moved from NYC recently)." |
| DELETE | New info contradicts existing memory entirely | Existing: "User is single." New: "I just got married!" β DELETE old memory, ADD new one. |
| NOOP | Fact already captured or not worth storing | "Nice weather today" β not a persistent user fact. Or "I'm vegetarian" when memory already says "User is vegetarian." |
| Store | What It Holds | Access Pattern | Why This Store |
|---|---|---|---|
| Vector DB (Qdrant) | Memory text + 1536-dim embedding | ANN search: "find memories similar to this query" | Semantic similarity at sub-10ms. Can't do this with PostgreSQL full-text search β "dinner suggestions" wouldn't match "vegetarian." |
| Graph DB (Neo4j) | Entity nodes + relationship edges | Traversal: "User β knows β Python β related_to β ?" | Multi-hop reasoning. Vector search returns isolated facts; graph connects them relationally. |
| PostgreSQL | Memory metadata, tenants, audit log | CRUD: list all memories, filter by date, version history | ACID transactions for tenant management. Rich filtering (created_at > X AND user_id = Y). Audit trail for compliance. |
| Redis | Recent memory cache, rate limit counters | Hot path cache for frequently accessed user memories | Sub-ms reads for users with active sessions. Avoids vector DB round-trip for recent memories. |
Returns: 202 Accepted + { extraction_id } β async, fires extraction pipeline
Returns: [{ id, memory, score, created_at, metadata }] β <50ms
| Data | Store | Why This Store |
|---|---|---|
| Memory embeddings | Qdrant / pgvector | ANN search at <10ms. Filtered by user_id namespace. HNSW index for approximate nearest neighbor. |
| Entity-relationship graph | Neo4j / Neptune | Cypher queries for multi-hop traversal. Property graph model. Indexed by entity name + user_id. |
| Memory metadata + audit | PostgreSQL | ACID for CRUD, version history, tenant management. Rich SQL filtering for admin dashboards. |
| Hot memory cache | Redis (with TTL) | Recently accessed user memories cached for sub-ms reads. Invalidated on memory update. |
| Extraction queue | Kafka / SQS | Async write path. Decouples API response from expensive LLM extraction. At-least-once delivery. |
| Conversation context | Redis (session-scoped TTL) | Recent 10 messages for extraction context. Expires with session. Not durable β reconstructable. |
- Memory decay & relevance scoring: Memories accessed frequently retain high relevance. Memories not accessed in 6 months decay in ranking. Analogous to human forgetting curve β keeps the memory store lean and relevant.
- Multimodal memory: Extract facts from images ("User shared a photo of their golden retriever named Max") and audio. Store visual embeddings alongside text embeddings for cross-modal retrieval.
- Hierarchical memory (episodic β semantic): Short-term: raw conversation events (episodic). Long-term: consolidated facts (semantic). Procedural: learned task patterns ("User always wants code in Python"). Mirrors human memory taxonomy.
- On-device memory: Privacy-sensitive use cases: memories stored on-device (phone, laptop), never sent to cloud. Smaller embedding models (384-dim) for edge inference. Syncs with cloud store when user opts in.
- Memory-aware fine-tuning: Use accumulated memories as training data to fine-tune a per-user LoRA adapter. The model itself learns the user's style and preferences, not just the prompt. Orders of magnitude more efficient than injecting 100 memories into every prompt.
- Collaborative memory: Team-level memories shared across an organization. "Our team uses Next.js for frontend" β shared fact available to all team members' AI agents. Conflict resolution when team members provide contradictory information.
How is Mem0 different from RAG? When would you use each?
RAG retrieves chunks of existing documents β it's stateless. You ingest a PDF, chunk it, embed the chunks, and retrieve relevant passages. The documents don't change based on user interaction. Mem0 is stateful: it extracts facts FROM conversations, consolidates them (dedup, conflict resolution), and builds a persistent knowledge base PER USER that evolves over time. The key difference: RAG answers "what does this document say?" Mem0 answers "what do I know about this user?" Use RAG for: knowledge bases, documentation search, customer support with fixed answers. Use Mem0 for: personalization, remembering user preferences, multi-session context, AI assistants that improve over time. They're complementary β a healthcare agent might use RAG for medical literature and Mem0 for patient history.
Why use an LLM for extraction instead of just embedding raw conversation chunks?
Three reasons: (1) Compression: a 20-turn conversation about dinner preferences compresses to "User is vegetarian, avoids dairy, loves Italian food" β 3 atomic facts instead of 20 raw chunks. This means 90%+ token savings when injecting memories into prompts. (2) Deduplication: if the user mentions being vegetarian in 10 different conversations, embedding raw chunks gives you 10 near-duplicate vectors. The LLM recognizes it's the same fact and stores it once. (3) Conflict resolution: raw chunks can't resolve "I love steak" from January and "I became vegetarian" from March. The LLM understands temporal ordering and updates the memory. The cost: $0.001-0.01 per extraction call. But the savings β fewer tokens per prompt, cleaner retrieval, no duplicates β more than compensate at scale. The paper shows 90% token cost reduction vs. full-context approaches.
What happens if the LLM hallucinates during extraction β stores a fact the user never said?
This is the biggest risk in the architecture. Mitigations: (1) Extraction prompt engineering: instruct the LLM to only extract explicitly stated facts, never infer. "User mentioned they have a dog" is valid; "User probably likes dogs" is not. (2) Confidence scoring: the extraction LLM assigns a confidence score per fact. Low-confidence facts are stored but flagged β they're deprioritized in retrieval scoring. (3) Source linking: every memory stores the source_conversation_id. On retrieval, the application can verify the memory against the original conversation. (4) User memory dashboard: users can view, edit, and delete their memories. This is both a product feature and a hallucination correction mechanism. (5) Feedback loop: when a user corrects or deletes a memory, that signal feeds back into extraction quality metrics. The honest answer: hallucinated memories will occasionally happen. The system is designed to make them discoverable, correctable, and low-impact (ranked lower than high-confidence memories).
How do you handle memory search at 10B+ memories without every query scanning the entire vector index?
Tenant-scoped partitioning eliminates the 10B problem entirely. Every query includes a mandatory user_id + org_id. In Qdrant, each org gets its own collection (or partition). A user with 500 memories searches 500 vectors β an HNSW lookup over 500 points is microseconds, not milliseconds. The 10B total is distributed across millions of users. Even the largest individual user (say, a power user with 10K memories) searches 10K vectors β still sub-10ms. The only time cross-user search is needed is for admin analytics ("how many users mention competitor X?"), which is a batch job, not a real-time query. For the managed platform: Qdrant clusters are sharded by org_id. Large enterprise tenants get dedicated shards. Small tenants are co-located with namespace isolation. The scaling unit is the tenant, not the total memory count.
Why do you need both a vector DB and a graph DB? Can't you just use one?
They solve different retrieval problems. Vector search answers: "find memories semantically similar to this query." It's great for fuzzy, open-ended retrieval β "anything about food preferences?" returns dietary memories even if the word "food" never appears in them. Graph traversal answers: "navigate relationships between entities." It's great for structured queries β "what did my manager recommend?" requires following the edge (manager)βrecommendedβ(thing). Vector search would return YOUR recommendations (semantically similar to "recommend"), not your manager's. You could use only vector β that's what base Mem0 does, and it works well. Graph adds ~2% accuracy on the LOCOMO benchmark, with the biggest gains on multi-hop questions. For simple use cases (chatbot personalization), vector-only is sufficient. For complex agent workflows (enterprise assistants that model org relationships), graph memory is essential. The architecture supports both: graph is optional and can be enabled per project.