ChangeGamer

← All resources

Agent Memory and Context Management

Guide · updated 2026-06-15 · Markdown variant

Architecture reference for agent memory: types (working, long-term, episodic, semantic, procedural), context-management techniques (summarization, RAG, sliding windows, prompt caching), storage substrates, and memory frameworks — with security notes and cross-links to related guides.


Agent "memory" is not the model remembering — it is engineering around a stateless model. Every inference call starts fresh; what looks like memory is context you explicitly assemble and inject. Understanding this distinction is the foundation of every memory architecture decision.

The core split: context window vs persistent storage

Context window (working memory) — tokens currently visible to the model during one inference call. Ephemeral: discarded after the call completes. Token-limited: each model has a nominal window (e.g. 128K, 1M tokens) but usable performance degrades before the limit is reached (see "context rot" below). Everything that reaches the model must pass through this window.

Persistent storage — facts, histories, and knowledge held outside the model and selectively retrieved into future context windows. Survives session boundaries. Unlimited in principle, but retrieval quality and token cost are the binding constraints.

Five types of memory in agent systems

Type What it holds Typical substrate
Working / short-term Current conversation buffer, active task state Context window directly
Long-term Persisted user facts, preferences, entity relationships Key-value store, vector DB, graph DB
Episodic Past task trajectories, conversation summaries Vector DB (semantic retrieval) or append log
Semantic World knowledge, domain facts, reference documents Vector DB (RAG)
Procedural Learned skills, reusable tool patterns, updated system prompts Prompt store; model fine-tune in extreme cases

Practical note: most production agents implement working memory plus one or two others. Full five-tier implementations exist but are rare outside research.

Context management techniques

Summarization / compaction — compress older conversation turns into a shorter summary and drop the originals. Simple, widely used. Lossy: fine-grained details in compressed turns are irretrievable. Summarize on a sliding trigger (e.g. every N turns, or when context exceeds a threshold).

Retrieval-Augmented Generation (RAG) — embed documents or memories and store them in a vector index. At inference time, embed the current query and pull only the top-k most relevant chunks into context. Keeps the context window lean; works best when the memory corpus is large and any single query needs only a small slice of it.

Sliding window — keep only the most recent N turns in context, dropping the oldest. Predictable cost; loses long-range dependencies. Combine with summarization for a hybrid: summarize the dropped segment rather than discarding it.

Prompt caching (stable-prefix cost reduction) — if the first part of your context (system prompt, large reference document, few-shot examples) is identical across many calls, cache it at the provider level and pay a reduced rate on cache hits. Anthropic, OpenAI, and Google Gemini all offer this: Anthropic's cache reads cost ~10% of fresh input tokens; OpenAI caches automatically on prompts >1,024 tokens (50–90% discount on cached tokens); Gemini 2.5 enables implicit caching by default. Prompt caching is a cost technique, not a memory technique — it does not persist state across sessions.

Context rot: nominal window vs usable performance

Long-context models advertise large nominal windows, but attention is not uniform. The "Lost in the Middle" effect (Liu et al., arXiv:2307.03172, TACL 2024) showed that LLMs perform worst on information placed in the middle of a long context, with performance drops of >30% at middle positions compared to the start or end. Practical implication: placing critical instructions or retrieved facts near the start or end of the context is safer than relying on the model to attend to them uniformly at any position. When in doubt, prefer retrieval (smaller, targeted context) over stuffing (large undifferentiated context).

Storage substrates

Vector databases (semantic retrieval)

Store embeddings alongside source text; query by cosine or dot-product similarity. Three verifiable open-source options:

Key-value and document stores

For structured memory (user preferences, entity facts, session state), standard key-value stores (Redis, DynamoDB, a Postgres table) are often simpler and faster than a vector database. Use vector retrieval when the access pattern is semantic ("what do I know about this topic?"); use KV when it is exact ("what is this user's timezone?").

Memory frameworks

Three verifiable open-source frameworks address the engineering overhead of memory management:

Security: retrieved memory as an untrusted surface

Vector store content is not inherently safe. An adversary who can write to your memory store (via a poisoned document, a prompt-injection in a past conversation, or a compromised ingestion pipeline) can inject instructions that surface in future retrievals. Mitigations:

Cross-links

Verified sources

#memory #context-window #rag #vector-databases #agents #architecture

Category: Guide