# Agent Memory and Context Management

> Architecture reference for agent memory: types (working, long-term, episodic, semantic, procedural), context-management techniques (summarization, RAG, sliding windows, prompt caching), storage substrates, and memory frameworks — with security notes and cross-links to related guides.

Category: Guide · Updated: 2026-06-15 · Tags: memory, context-window, rag, vector-databases, agents, architecture
Canonical: https://changegamer.ai/resources/agent-memory-context

Agent "memory" is not the model remembering — it is engineering around a stateless model. Every inference call starts fresh; what looks like memory is context you explicitly assemble and inject. Understanding this distinction is the foundation of every memory architecture decision.

## The core split: context window vs persistent storage

**Context window (working memory)** — tokens currently visible to the model during one inference call. Ephemeral: discarded after the call completes. Token-limited: each model has a nominal window (e.g. 128K, 1M tokens) but usable performance degrades before the limit is reached (see "context rot" below). Everything that reaches the model must pass through this window.

**Persistent storage** — facts, histories, and knowledge held outside the model and selectively retrieved into future context windows. Survives session boundaries. Unlimited in principle, but retrieval quality and token cost are the binding constraints.

## Five types of memory in agent systems

| Type | What it holds | Typical substrate |
|---|---|---|
| **Working / short-term** | Current conversation buffer, active task state | Context window directly |
| **Long-term** | Persisted user facts, preferences, entity relationships | Key-value store, vector DB, graph DB |
| **Episodic** | Past task trajectories, conversation summaries | Vector DB (semantic retrieval) or append log |
| **Semantic** | World knowledge, domain facts, reference documents | Vector DB (RAG) |
| **Procedural** | Learned skills, reusable tool patterns, updated system prompts | Prompt store; model fine-tune in extreme cases |

Practical note: most production agents implement working memory plus one or two others. Full five-tier implementations exist but are rare outside research.

## Context management techniques

**Summarization / compaction** — compress older conversation turns into a shorter summary and drop the originals. Simple, widely used. Lossy: fine-grained details in compressed turns are irretrievable. Summarize on a sliding trigger (e.g. every N turns, or when context exceeds a threshold).

**Retrieval-Augmented Generation (RAG)** — embed documents or memories and store them in a vector index. At inference time, embed the current query and pull only the top-k most relevant chunks into context. Keeps the context window lean; works best when the memory corpus is large and any single query needs only a small slice of it.

**Sliding window** — keep only the most recent N turns in context, dropping the oldest. Predictable cost; loses long-range dependencies. Combine with summarization for a hybrid: summarize the dropped segment rather than discarding it.

**Prompt caching (stable-prefix cost reduction)** — if the first part of your context (system prompt, large reference document, few-shot examples) is identical across many calls, cache it at the provider level and pay a reduced rate on cache hits. Anthropic, OpenAI, and Google Gemini all offer this: Anthropic's cache reads cost ~10% of fresh input tokens; OpenAI caches automatically on prompts >1,024 tokens (50–90% discount on cached tokens); Gemini 2.5 enables implicit caching by default. Prompt caching is a cost technique, not a memory technique — it does not persist state across sessions.

## Context rot: nominal window vs usable performance

Long-context models advertise large nominal windows, but attention is not uniform. The "Lost in the Middle" effect (Liu et al., arXiv:2307.03172, TACL 2024) showed that LLMs perform worst on information placed in the middle of a long context, with performance drops of >30% at middle positions compared to the start or end. Practical implication: placing critical instructions or retrieved facts near the start or end of the context is safer than relying on the model to attend to them uniformly at any position. When in doubt, prefer retrieval (smaller, targeted context) over stuffing (large undifferentiated context).

## Storage substrates

### Vector databases (semantic retrieval)

Store embeddings alongside source text; query by cosine or dot-product similarity. Three verifiable open-source options:

- **pgvector** — PostgreSQL extension adding vector similarity search. Keeps memory in your existing Postgres instance; supports HNSW and IVFFlat indexes. Source: github.com/pgvector/pgvector.
- **Chroma** — embeddable Python/JS vector database; runs in-process for development or as a server for production; Apache 2.0. Source: github.com/chroma-core/chroma.
- **Qdrant** — high-performance vector database written in Rust; RESTful and gRPC APIs; supports dense, sparse, and multi-vector search; available self-hosted or as a managed service. Source: github.com/qdrant/qdrant.

### Key-value and document stores

For structured memory (user preferences, entity facts, session state), standard key-value stores (Redis, DynamoDB, a Postgres table) are often simpler and faster than a vector database. Use vector retrieval when the access pattern is semantic ("what do I know about this topic?"); use KV when it is exact ("what is this user's timezone?").

## Memory frameworks

Three verifiable open-source frameworks address the engineering overhead of memory management:

- **mem0** — framework-agnostic memory layer combining vector embeddings, a knowledge graph, and a key-value store. Works with LangChain, LlamaIndex, CrewAI, and raw APIs. Source: github.com/mem0ai/mem0.
- **LangMem** — LangChain's memory SDK for LangGraph agents; supports episodic, semantic, and procedural memory with native integration into LangGraph's storage layer. Source: github.com/langchain-ai/langmem.
- **Letta (formerly MemGPT)** — platform for stateful agents with tiered memory inspired by OS virtual memory (MemGPT paper: arXiv:2310.08560); agents manage their own memory via self-editing tools. Source: github.com/letta-ai/letta.

## Security: retrieved memory as an untrusted surface

Vector store content is not inherently safe. An adversary who can write to your memory store (via a poisoned document, a prompt-injection in a past conversation, or a compromised ingestion pipeline) can inject instructions that surface in future retrievals. Mitigations:

- Validate retrieved chunks against a known-good schema before injecting into context.
- Apply the same distrust to retrieved memory as to raw web content (see /resources/agentic-security-checklist, section 8 — Memory & context poisoning).
- Separate long-term persistent store (high trust, hard to update) from short-term episodic store (lower trust, frequently updated); apply stricter validation before promoting content from episodic to long-term.
- Periodically audit long-term stores for injected instructions.

## Cross-links

- For tool-call reliability in agent pipelines: /resources/reliable-tool-calling
- For frameworks with built-in memory support: /resources/agent-frameworks-compared
- For memory & context poisoning as a security threat: /resources/agentic-security-checklist

## Verified sources

- pgvector (open-source vector search for Postgres): https://github.com/pgvector/pgvector
- Chroma (AI-native open-source vector database): https://github.com/chroma-core/chroma
- Qdrant (high-performance open-source vector database): https://github.com/qdrant/qdrant
- mem0 (universal memory layer for AI agents): https://github.com/mem0ai/mem0
- LangMem (LangChain memory SDK for LangGraph): https://github.com/langchain-ai/langmem
- Letta / MemGPT (stateful agents platform): https://github.com/letta-ai/letta
- MemGPT paper — Towards LLMs as Operating Systems (arXiv:2310.08560): https://arxiv.org/abs/2310.08560
- "Lost in the Middle" paper (Liu et al., TACL 2024, arXiv:2307.03172): https://arxiv.org/abs/2307.03172
- Anthropic prompt caching docs: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
- OpenAI prompt caching announcement: https://openai.com/index/api-prompt-caching/
- Google Gemini context caching docs: https://ai.google.dev/gemini-api/docs/caching
