Agent Memory
Watch or read first
- Daily Dose DS, "Memory Types in AI Agents" and "Importance of Memory for Agentic Systems" in the AI Engineering Guidebook (2025, paid): https://www.dailydoseofds.com/ai-engineering-guidebook/
- MemGPT / Letta paper: https://arxiv.org/abs/2310.08560 (Packer et al.)
- Zep (https://help.getzep.com/), Mem0 (https://docs.mem0.ai/), Letta (https://docs.letta.com/) docs - the three main memory frameworks in 2026.
TL;DR
Memory turns a stateless LLM into a stateful agent. Short-term memory = conversation window. Long-term memory persists across sessions. Episodic = past events. Semantic = learned facts. Procedural = learned how-to. Without memory, every interaction is a blank slate. With memory, agents personalize, learn, and accumulate knowledge.
The historical problem
An LLM call is a pure function: prompt in, text out. No state between calls. For one-shot tasks, this is fine. For assistants, coaches, tutors, companions, it is a disaster:
- User: "My name is David."
- (5 minutes later) Agent: "Who is this? What should I call you?"
In 2022-2023, teams hacked around this by dumping the entire conversation history into the prompt on every call. It worked up to ~8k tokens, then broke. Summarization workflows helped but lost detail.
As Daily Dose DS puts it, memory is not a property of the model itself; it is a system design problem, not a model feature.
How it works: the memory hierarchy
Inspired by human cognition, agent memory follows a tiered structure.
Short-term memory
- Exists only during one execution or session
- Implemented as a conversation buffer
- Includes recent messages, tool observations, scratch work
- Bounded by context window
Examples: the current chat turn, ReAct loop's running log, the system prompt for this session.
Long-term memory
- Persists across sessions
- Stored in an external system (vector DB, graph DB, SQL, files)
- Must be retrieved on demand (it does not fit fully in the prompt)
Sub-types (inspired by human memory):
Semantic memory
Facts and knowledge. "The company's return policy allows refunds within 30 days." These are indexed and retrieved when relevant.
Episodic memory
Past events and experiences. "Last Tuesday, David asked about refund policy X and we resolved it with case #12345." Useful for personalization and learning from past interactions.
Procedural memory
Learned how-to. Skills, instructions, workflows the agent internalized. Often stored as updated system prompts or learned tool usage patterns.
Entity memory
Tracks specific subjects (users, products, orders) with structured attributes. Like a CRM for the agent. "User David: location Israel, language FR, preferred channel email."
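A minimal sketch of an entity-memory record as a plain in-memory store. The field names (`entity_id`, `attributes`) and the last-write-wins update policy are illustrative assumptions, not any framework's actual schema.

```python
# Sketch of an entity-memory record backed by a plain dict store.
# Field names and update policy are illustrative, not a framework's schema.
from dataclasses import dataclass, field

@dataclass
class EntityRecord:
    entity_id: str              # e.g. "user:david"
    entity_type: str            # "user", "product", "order"
    attributes: dict = field(default_factory=dict)

    def update(self, **attrs):
        # Last write wins; a real system would version or timestamp attributes.
        self.attributes.update(attrs)

store: dict[str, EntityRecord] = {}

def upsert(entity_id: str, entity_type: str, **attrs) -> EntityRecord:
    rec = store.setdefault(entity_id, EntityRecord(entity_id, entity_type))
    rec.update(**attrs)
    return rec

upsert("user:david", "user", location="Israel", language="FR")
upsert("user:david", "user", preferred_channel="email")
```

Structured attributes, rather than free text, are what make entity memory queryable like a CRM.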
Contextual memory
Daily Dose DS's catch-all for keeping relevant context available across a session. Overlaps with short-term.
User memory
Specifically about the current user: preferences and interaction history. A specialization of entity memory.
The simulation problem
LLMs do not "remember" in a biological sense. The system simulates memory by:
- Deciding what to keep (not everything fits)
- Storing it externally
- Retrieving relevant pieces before each new model call
- Inserting them in the prompt
Every memory framework (Zep, Letta, Mem0) is a specific implementation of "what to keep, when to retrieve, how to insert".
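The four steps above can be sketched end to end. This is a toy: the "keep" policy is a hard-coded heuristic, retrieval is naive keyword overlap rather than vector search, and there is no actual model call.

```python
# Minimal sketch of the keep -> store -> retrieve -> insert loop.
# The retriever is naive keyword overlap, not a real vector search.
memory_store: list[str] = []

def keep(turn: str) -> bool:
    # Toy policy: persist only turns that state something about the user.
    return turn.lower().startswith(("my ", "i "))

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank stored items by word overlap with the query.
    overlap = lambda m: len(set(m.lower().split()) & set(query.lower().split()))
    return sorted(memory_store, key=overlap, reverse=True)[:k]

def build_prompt(query: str) -> str:
    # Insert retrieved facts ahead of the new user message.
    facts = retrieve(query)
    return "Known facts:\n" + "\n".join(facts) + f"\n\nUser: {query}"

for turn in ["My name is David.", "What's the weather?"]:
    if keep(turn):
        memory_store.append(turn)

print(build_prompt("What is my name?"))
```

Every framework listed above swaps in smarter versions of `keep`, `retrieve`, and `build_prompt`, but the loop shape is the same.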
Architecture patterns
Pattern 1: conversation buffer (short-term only)
[system prompt] + [all messages so far] -> LLM
Simplest. Breaks when conversation exceeds context window.
Pattern 2: summarizing buffer
[system prompt] + [summary of old turns] + [recent turns verbatim] -> LLM
A cheap LLM summarizes older turns. Keeps the prompt small. Loses detail.
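A sketch of the summarizing buffer, with the summarizer stubbed out: in production the summary step would be a cheap LLM call, here it just keeps the first clause of each evicted turn. Buffer size and truncation are toy choices.

```python
# Sketch of a summarizing buffer (Pattern 2). The "summary" is a stub for
# what would be a cheap LLM call in production.
from collections import deque

RECENT_TURNS = 4  # how many turns stay verbatim

class SummarizingBuffer:
    def __init__(self):
        self.summary = ""
        self.recent: deque[str] = deque(maxlen=RECENT_TURNS)

    def add(self, turn: str):
        if len(self.recent) == RECENT_TURNS:
            # Evict the oldest verbatim turn into the (lossy) summary.
            evicted = self.recent.popleft()
            self.summary = (self.summary + " | " + evicted.split(".")[0]).strip(" |")
        self.recent.append(turn)

    def prompt(self, system: str) -> str:
        return f"{system}\n[summary] {self.summary}\n" + "\n".join(self.recent)

buf = SummarizingBuffer()
for i in range(1, 7):
    buf.add(f"Turn {i}.")
print(buf.summary)   # "Turn 1 | Turn 2"
```

The loss of detail the text mentions is visible here: evicted turns survive only as truncated fragments.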
Pattern 3: RAG over conversation history
User query -> embed -> search past conversation -> retrieve relevant turns
[system prompt] + [retrieved turns] + [query] -> LLM
Scales to arbitrarily long history. Misses very recent turns if not indexed yet.
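Pattern 3 in miniature, with a toy bag-of-words "embedding" and cosine similarity standing in for a real embedding model plus vector DB:

```python
# Sketch of RAG over conversation history (Pattern 3). The bag-of-words
# "embedding" is a stand-in for a real embedding model + vector DB.
import math
from collections import Counter

history: list[str] = []

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(history, key=lambda turn: cosine(embed(turn), q), reverse=True)[:k]

history += ["user: my favorite color is blue",
            "user: book a flight to Paris",
            "assistant: flight booked"]
print(retrieve("what is my favorite color?", k=1))
```

Because retrieval is per-query, the prompt stays small no matter how long `history` grows, which is exactly the scaling property the pattern buys.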
Pattern 4: tiered memory (MemGPT / Letta)
RAM-like: current working context (in-prompt)
Disk-like: long-term store (vector DB)
The LLM itself issues commands like "save to long-term memory" or "recall from long-term memory".
The LLM manages its own memory. Brilliant, complex. Letta (open-source MemGPT) is the reference.
Pattern 5: external memory with background updates
Main loop: agent runs, actions happen.
Background: a "memory agent" reads actions and updates stores (user profile, facts, episodes).
Retrieval: per query, memory agent fetches relevant items and injects.
Zep, Mem0 use variants of this.
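A synchronous sketch of Pattern 5: the main loop only logs actions, and a separate "memory agent" pass distills them into a profile afterwards. The regex extraction stands in for the LLM-based extraction that Zep and Mem0 actually perform.

```python
# Sketch of Pattern 5: the main loop logs; a background "memory agent" pass
# extracts durable facts. Regexes stand in for LLM-based extraction.
import re

action_log: list[str] = []
user_profile: dict[str, str] = {}

def main_loop(turn: str):
    action_log.append(turn)   # no memory work inline: the agent stays fast

def memory_agent_pass():
    # Background job: scan logged actions, extract facts, update the store.
    for turn in action_log:
        if m := re.search(r"my name is (\w+)", turn, re.I):
            user_profile["name"] = m.group(1)
        if m := re.search(r"i live in (\w+)", turn, re.I):
            user_profile["location"] = m.group(1)
    action_log.clear()

main_loop("Hi, my name is David and I live in Israel.")
memory_agent_pass()
print(user_profile)   # {'name': 'David', 'location': 'Israel'}
```

Decoupling extraction from the main loop keeps per-turn latency flat; the trade-off is that very fresh facts are invisible until the next background pass.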
Relevance today (2026)
Memory is the bottleneck for "real" assistants
A 2026 AI assistant that forgets your name by turn 3 is a dealbreaker. ChatGPT, Claude, Gemini all launched memory features in 2024-2025 for this reason. The ecosystem followed.
Frameworks are still immature
Contrast with vector DBs (mature) and function calling (mature). Memory in 2026 is where vector DBs were in 2022: many options, no clear winner.
Main players:
- Zep - graph-based memory, strong for entities and relationships
- Letta (ex-MemGPT) - self-editing memory, from the UC Berkeley MemGPT research team
- Mem0 - simple API, popular for quick starts
- LangMem (LangChain) - integrated with LangGraph
- OpenAI Memory / Anthropic Memory beta - built into the API for specific products
Each makes different trade-offs. Benchmark on your use case.
RAG and memory converge
Daily Dose DS makes the arc clear:
- RAG (2020-2023): read-only, one-shot
- Agentic RAG: read-only via tool calls
- Agent Memory: read-write via tool calls
Modern agents treat the memory store as just another tool: search_memory, save_to_memory, update_memory. It is RAG with writes.
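What "memory as a tool" looks like in practice, using the function names from the text. The schemas follow the common JSON-Schema tool format; the dispatch table and in-memory list are toy stand-ins for a real store.

```python
# Sketch of memory exposed as ordinary tools. Schemas use the common
# JSON-Schema tool shape; the backing store is a toy in-memory list.
memory: list[str] = []

TOOLS = [
    {"name": "save_to_memory",
     "description": "Persist a durable fact about the user.",
     "parameters": {"type": "object",
                    "properties": {"fact": {"type": "string"}},
                    "required": ["fact"]}},
    {"name": "search_memory",
     "description": "Retrieve stored facts matching a query.",
     "parameters": {"type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"]}},
]

def dispatch(tool_name: str, args: dict):
    # The harness routes the model's tool calls to the memory store.
    if tool_name == "save_to_memory":
        memory.append(args["fact"])
        return "saved"
    if tool_name == "search_memory":
        return [f for f in memory if args["query"].lower() in f.lower()]
    raise ValueError(f"unknown tool: {tool_name}")

dispatch("save_to_memory", {"fact": "David's favorite color is blue"})
print(dispatch("search_memory", {"query": "color"}))
```

Seen this way, the only difference from agentic RAG is that one of the tools writes.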
Graph RAG and memory
Graph-based memory (Zep, Neo4j + LLMs) captures relationships: "David works at Geta.Team, lives in Israel, speaks FR/EN/HE". Graph queries unlock reasoning that flat vector search cannot.
Personalization is the killer app
Users do not care about "multi-agent orchestration". They care that the assistant remembers their kid's name, their project, their preferences. Memory, properly done, is the step from "cool demo" to "I use this every day".
Privacy
Memory is a privacy tightrope. Storing user messages long-term means regulatory exposure (GDPR, HIPAA). Serious apps:
- Encrypt at rest
- Support user-initiated deletion
- Redact PII before storage
- Separate per-tenant
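Two of the measures above sketched together: regex PII redaction before storage, and strict per-tenant scoping on read. The patterns are illustrative, nowhere near a complete PII detector.

```python
# Sketch of PII redaction before storage and per-tenant read scoping.
# The regex patterns are illustrative, not a complete PII detector.
import re

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<PHONE>"),
]

stores: dict[str, list[str]] = {}   # one isolated list per tenant

def redact(text: str) -> str:
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

def save(tenant_id: str, text: str):
    stores.setdefault(tenant_id, []).append(redact(text))

def search(tenant_id: str, query: str) -> list[str]:
    # Reads never cross tenants: only this tenant's store is visible.
    return [t for t in stores.get(tenant_id, []) if query.lower() in t.lower()]

save("acme", "Contact me at david@example.com about the refund")
print(search("acme", "refund"))     # email already redacted at write time
print(search("globex", "refund"))   # other tenants see nothing: []
```

Redacting at write time, not read time, means raw PII never reaches the store at all.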
Critical questions
- Should long-term memory be managed by the LLM or by the system? (Trade-off: LLM-managed = more flexibility, less predictable cost; system-managed = bounded cost, less adaptive. Hybrid wins.)
- What happens when memory contradicts itself? (User changes preference. Old and new fact both stored. Retrieval surfaces both. Agent confused. Need conflict resolution: prefer newest, flag contradictions, or surface choice to user.)
- Does memory reduce the need for fine-tuning? (Partially. Memory adapts behavior via retrieval, no weight changes. But it does not teach new styles; fine-tuning still fits for that.)
- How much memory is enough? (Depends on use case. Personal assistant: thousands of facts. Enterprise support bot: millions. Start small.)
- Memory vs prompt caching, same thing? (No. Prompt caching is KV cache re-use for identical prefixes. Memory is a retrieval-based injection of relevant facts per query. Orthogonal, both useful.)
- Should memory be shared across users? (No for private info. Yes for anonymized learnings. Be explicit.)
Production pitfalls
- Unbounded growth. Every turn adds to memory. A year later, the store is enormous and retrieval is slow. Add eviction, summarization, archival.
- Stale facts override fresh ones. "User prefers X" from 6 months ago beats "user changed to Y" last week. Time-weight retrieval.
- Memory leakage between users. One user's fact retrieved for another. Scope every query strictly by user_id / tenant.
- No ground truth. You cannot tell if memory retrieved the right thing without eval. Build a memory eval suite.
- Hallucinated memory. LLM thinks it "remembers" something that was never stored. Always ground responses in actual retrieved content.
- Over-caching vs under-caching. Cached memory is stale; fresh memory is expensive. Tune the refresh cadence per memory type.
- Privacy blast radius. Memory stores contain raw user text. Treat like PII. Encryption, access control, retention limits.
- Compliance. GDPR "right to be forgotten" means per-user delete must work across all memory stores.
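The time-weighting fix for the "stale facts" pitfall can be sketched as an exponential decay on age. The half-life and scores are toy values; the shape of the fix is what matters.

```python
# Sketch of time-weighted retrieval: raw relevance is discounted by an
# exponential decay on age. Half-life and scores are toy values.
HALF_LIFE_DAYS = 30.0

def decayed_score(relevance: float, age_days: float) -> float:
    # Score halves every HALF_LIFE_DAYS.
    return relevance * 0.5 ** (age_days / HALF_LIFE_DAYS)

# (raw relevance, age in days, text): equal relevance, very different ages.
candidates = [
    (0.9, 180.0, "user prefers X"),      # 6 months old
    (0.9, 7.0,   "user changed to Y"),   # last week
]
ranked = sorted(candidates, key=lambda c: decayed_score(c[0], c[1]), reverse=True)
print(ranked[0][2])   # "user changed to Y" wins despite equal raw relevance
```

Picking the half-life per memory type (short for preferences, long for stable facts) is the same tuning problem as the over-caching vs under-caching pitfall above.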
Alternatives / Comparisons
| Approach | Write? | Across sessions? | Scales? | Complexity |
|---|---|---|---|---|
| No memory | No | No | - | Trivial |
| Conversation buffer | No | No | Small | Low |
| Summarizing buffer | No | No | Medium | Low |
| RAG over history | No | Yes | High | Medium |
| Entity memory | Yes | Yes | Medium | Medium |
| MemGPT / Letta | Yes | Yes | High | High |
| Zep / Mem0 | Yes | Yes | High | Medium |
| Provider-managed memory (ChatGPT, Claude memory) | Managed | Yes | High | Managed |
Mental parallels (non-AI)
- Human memory architecture: working memory (7 plus or minus 2 items), short-term (minutes), long-term (semantic, episodic, procedural). Agent memory maps directly.
- Note-taking: the LLM is the person. Short-term = scratch paper on the desk. Long-term = filed notebooks. You cannot fit the notebooks on the desk, so you fetch relevant pages.
- Database transactions: writes go to persistent storage; reads pull back into working memory. Classic OLTP pattern.
- CRM for a salesperson: the salesperson's memory is bad, so every customer interaction is logged. Before a call, the CRM surfaces relevant history. Same pattern as agent memory.
- The Pensieve (Harry Potter): extract a memory into an external vessel, revisit later. Agent memory externalizes what would not fit in the model's head.
Mini-lab
labs/agent-memory/ (to create):
- Build a personal assistant agent with three memory types:
- Short-term: conversation buffer (last 20 turns)
- Semantic: facts about the user (Mem0 or Zep)
- Episodic: past conversations (vector DB of embedded summaries)
- Simulate 10 sessions across 2 weeks.
- Test recall: "what did I ask you about last week?" "what is my favorite color?"
- Measure:
- Recall accuracy
- Memory store size growth
- Per-query cost
- Add a forgetting mechanism (decay weight on old episodes) and compare.
Stack: uv, langchain, mem0ai or zep-python, anthropic.
Further reading
Canonical
- Daily Dose DS, "Memory Types in AI Agents" (2025, paid): https://www.dailydoseofds.com/ai-engineering-guidebook/
- Packer et al., "MemGPT: Towards LLMs as Operating Systems" (2023) - https://arxiv.org/abs/2310.08560
- Zep blog: https://www.getzep.com/blog
- Mem0 docs: https://docs.mem0.ai
Related in this KB
Frameworks
- Zep - https://github.com/getzep/zep
- Letta (ex-MemGPT) - https://github.com/letta-ai/letta
- Mem0 - https://github.com/mem0ai/mem0
- LangMem (LangChain): https://langchain-ai.github.io/langmem/
- OpenAI Memory (product-level feature in ChatGPT and Assistants API): https://help.openai.com/en/articles/8590148-memory-faq
- Anthropic Memory (beta): https://www.anthropic.com/news