Agent Memory and Context Management

Guide · updated 2026-06-15 · Markdown variant

Architecture reference for agent memory: types (working, long-term, episodic, semantic, procedural), context-management techniques (summarization, RAG, sliding windows, prompt caching), storage substrates, and memory frameworks — with security notes and cross-links to related guides.

Agent "memory" is not the model remembering — it is engineering around a stateless model. Every inference call starts fresh; what looks like memory is context you explicitly assemble and inject. Understanding this distinction is the foundation of every memory architecture decision.

The core split: context window vs persistent storage

Context window (working memory) — tokens currently visible to the model during one inference call. Ephemeral: discarded after the call completes. Token-limited: each model has a nominal window (e.g. 128K, 1M tokens) but usable performance degrades before the limit is reached (see "context rot" below). Everything that reaches the model must pass through this window.

Persistent storage — facts, histories, and knowledge held outside the model and selectively retrieved into future context windows. Survives session boundaries. Unlimited in principle, but retrieval quality and token cost are the binding constraints.

Five types of memory in agent systems

Type	What it holds	Typical substrate
Working / short-term	Current conversation buffer, active task state	Context window directly
Long-term	Persisted user facts, preferences, entity relationships	Key-value store, vector DB, graph DB
Episodic	Past task trajectories, conversation summaries	Vector DB (semantic retrieval) or append log
Semantic	World knowledge, domain facts, reference documents	Vector DB (RAG)
Procedural	Learned skills, reusable tool patterns, updated system prompts	Prompt store; model fine-tune in extreme cases

Practical note: most production agents implement working memory plus one or two others. Full five-tier implementations exist but are rare outside research.

Context management techniques

Summarization / compaction — compress older conversation turns into a shorter summary and drop the originals. Simple, widely used. Lossy: fine-grained details in compressed turns are irretrievable. Summarize on a sliding trigger (e.g. every N turns, or when context exceeds a threshold).

Retrieval-Augmented Generation (RAG) — embed documents or memories and store them in a vector index. At inference time, embed the current query and pull only the top-k most relevant chunks into context. Keeps the context window lean; works best when the memory corpus is large and any single query needs only a small slice of it.

Sliding window — keep only the most recent N turns in context, dropping the oldest. Predictable cost; loses long-range dependencies. Combine with summarization for a hybrid: summarize the dropped segment rather than discarding it.

Prompt caching (stable-prefix cost reduction) — if the first part of your context (system prompt, large reference document, few-shot examples) is identical across many calls, cache it at the provider level and pay a reduced rate on cache hits. Anthropic, OpenAI, and Google Gemini all offer this: Anthropic's cache reads cost ~10% of fresh input tokens; OpenAI caches automatically on prompts >1,024 tokens (50–90% discount on cached tokens); Gemini 2.5 enables implicit caching by default. Prompt caching is a cost technique, not a memory technique — it does not persist state across sessions.

Context rot: nominal window vs usable performance

Long-context models advertise large nominal windows, but attention is not uniform. The "Lost in the Middle" effect (Liu et al., arXiv:2307.03172, TACL 2024) showed that LLMs perform worst on information placed in the middle of a long context, with performance drops of >30% at middle positions compared to the start or end. Practical implication: placing critical instructions or retrieved facts near the start or end of the context is safer than relying on the model to attend to them uniformly at any position. When in doubt, prefer retrieval (smaller, targeted context) over stuffing (large undifferentiated context).

Storage substrates

Vector databases (semantic retrieval)

Store embeddings alongside source text; query by cosine or dot-product similarity. Three verifiable open-source options:

pgvector — PostgreSQL extension adding vector similarity search. Keeps memory in your existing Postgres instance; supports HNSW and IVFFlat indexes. Source: github.com/pgvector/pgvector.
Chroma — embeddable Python/JS vector database; runs in-process for development or as a server for production; Apache 2.0. Source: github.com/chroma-core/chroma.
Qdrant — high-performance vector database written in Rust; RESTful and gRPC APIs; supports dense, sparse, and multi-vector search; available self-hosted or as a managed service. Source: github.com/qdrant/qdrant.

Key-value and document stores

For structured memory (user preferences, entity facts, session state), standard key-value stores (Redis, DynamoDB, a Postgres table) are often simpler and faster than a vector database. Use vector retrieval when the access pattern is semantic ("what do I know about this topic?"); use KV when it is exact ("what is this user's timezone?").

Memory frameworks

Three verifiable open-source frameworks address the engineering overhead of memory management:

mem0 — framework-agnostic memory layer combining vector embeddings, a knowledge graph, and a key-value store. Works with LangChain, LlamaIndex, CrewAI, and raw APIs. Source: github.com/mem0ai/mem0.
LangMem — LangChain's memory SDK for LangGraph agents; supports episodic, semantic, and procedural memory with native integration into LangGraph's storage layer. Source: github.com/langchain-ai/langmem.
Letta (formerly MemGPT) — platform for stateful agents with tiered memory inspired by OS virtual memory (MemGPT paper: arXiv:2310.08560); agents manage their own memory via self-editing tools. Source: github.com/letta-ai/letta.

Security: retrieved memory as an untrusted surface

Vector store content is not inherently safe. An adversary who can write to your memory store (via a poisoned document, a prompt-injection in a past conversation, or a compromised ingestion pipeline) can inject instructions that surface in future retrievals. Mitigations:

Validate retrieved chunks against a known-good schema before injecting into context.
Apply the same distrust to retrieved memory as to raw web content (see /resources/agentic-security-checklist, section 8 — Memory & context poisoning).
Separate long-term persistent store (high trust, hard to update) from short-term episodic store (lower trust, frequently updated); apply stricter validation before promoting content from episodic to long-term.
Periodically audit long-term stores for injected instructions.

Cross-links

For tool-call reliability in agent pipelines: /resources/reliable-tool-calling
For frameworks with built-in memory support: /resources/agent-frameworks-compared
For memory & context poisoning as a security threat: /resources/agentic-security-checklist

Verified sources

pgvector (open-source vector search for Postgres): https://github.com/pgvector/pgvector
Chroma (AI-native open-source vector database): https://github.com/chroma-core/chroma
Qdrant (high-performance open-source vector database): https://github.com/qdrant/qdrant
mem0 (universal memory layer for AI agents): https://github.com/mem0ai/mem0
LangMem (LangChain memory SDK for LangGraph): https://github.com/langchain-ai/langmem
Letta / MemGPT (stateful agents platform): https://github.com/letta-ai/letta
MemGPT paper — Towards LLMs as Operating Systems (arXiv:2310.08560): https://arxiv.org/abs/2310.08560
"Lost in the Middle" paper (Liu et al., TACL 2024, arXiv:2307.03172): https://arxiv.org/abs/2307.03172
Anthropic prompt caching docs: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
OpenAI prompt caching announcement: https://openai.com/index/api-prompt-caching/
Google Gemini context caching docs: https://ai.google.dev/gemini-api/docs/caching

#memory #context-window #rag #vector-databases #agents #architecture

Category: Guide