RAG and Retrieval for Agents

Guide · updated 2026-06-15 · Markdown variant

End-to-end practitioner reference for Retrieval-Augmented Generation: pipeline stages, chunking strategies, dense/sparse/hybrid retrieval, reranking, agentic retrieval patterns, quality failure modes, and evaluation — with verified sources for every named technique.


Retrieval quality is the primary bottleneck for grounded agents. A model with perfect reasoning over wrong or missing context still produces wrong answers. This guide synthesizes the techniques agent builders actually need — concepts first, hype omitted.

Pipeline stages

A RAG pipeline has a fixed skeleton. Each stage has distinct failure modes.

| Stage | What happens | Failure mode |
| --- | --- | --- |
| Ingest / parse | Documents loaded and converted to plain text | Garbled tables, lost structure, wrong encoding |
| Chunk | Text split into retrieval units | Boundaries break mid-concept; size/overlap wrong for query patterns |
| Embed | Chunks encoded to dense vectors | Domain mismatch degrades recall; stale model on new vocabulary |
| Index | Vectors stored in ANN index; optional BM25 index built in parallel | Index staleness; missing incremental update path |
| Retrieve | Query embedded; top-k fetched by ANN similarity or lexical score | Irrelevant chunks pass threshold; relevant chunks below cutoff |
| Rerank (optional) | Cross-encoder scores and reorders top-k | Reranker latency budget exceeded; reranker domain mismatch |
| Assemble context | Top chunks formatted and injected into prompt | Lost-in-the-middle; token budget overflow; contradictory chunks |
| Generate | LLM produces answer grounded in context | Hallucination despite context; faithful-but-wrong summary |

Chunking strategies

Fixed-size + overlap: split every N tokens with an M-token overlap between adjacent chunks. Simple and fast. Overlap keeps facts that straddle a boundary from disappearing, but does not guarantee semantic coherence at cut points. A common starting point is ~512 tokens with ~50–100 tokens of overlap; framework defaults vary, and some count characters rather than tokens.
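A minimal sketch of fixed-size chunking with overlap. The function name `chunk_fixed` and the whitespace tokenizer are assumptions for illustration; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_fixed(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Whitespace tokens stand in for a real tokenizer (assumption).
tokens = "a b c d e f g h i j".split()
chunks = chunk_fixed(tokens, size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts `size - overlap` tokens after the previous one, so the last token of one chunk reappears as the first token of the next in this example.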

Recursive / structural splitting — split on document structure in priority order (section headings → paragraphs → sentences → characters). Preserves logical boundaries. Preferred over fixed-size when documents have reliable structure (Markdown, HTML, code).
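A rough sketch of the priority-order idea, assuming a character budget and a hypothetical separator list (blank line, newline, sentence end, space). It packs pieces greedily and only falls back to a finer separator when a piece is still too large. Framework splitters differ in details such as how separators are preserved.

```python
def split_recursive(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; recurse with finer ones when needed."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No structure left to exploit: hard cut by character count.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_chars:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current.strip():
            chunks.append(current)
        if len(part) <= max_chars:
            current = part
        else:
            # This part alone exceeds the budget: split it on finer boundaries.
            chunks.extend(split_recursive(part, max_chars, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks

doc = "Intro paragraph.\n\n" + " ".join(["word"] * 30) + "\n\nTail."
pieces = split_recursive(doc, max_chars=60)
print([len(p) for p in pieces])
```

In this example the short first and last paragraphs survive intact, and only the oversized middle paragraph is broken at word boundaries.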

Semantic chunking — embed consecutive sentences; measure cosine similarity between adjacent embeddings; start a new chunk when similarity drops below a threshold. Groups semantically coherent content regardless of character count. Higher ingestion cost than fixed-size; similarity threshold requires tuning.
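The boundary rule can be sketched as follows. `toy_embed` is a bag-of-words stand-in for a real embedding model, and the 0.2 threshold is an arbitrary value chosen for this toy data; both are assumptions, not recommendations.

```python
import math
from collections import Counter

def toy_embed(sentence):
    """Stand-in for an embedding model (assumption): bag-of-words counts."""
    return Counter(sentence.lower().replace(".", "").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk when adjacent-sentence similarity falls below `threshold`."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine(prev, vec) < threshold:
            chunks.append([])  # similarity dropped: treat as a topic boundary
        chunks[-1].append(sentence)
        prev = vec
    return chunks

sentences = [
    "The index stores chunk vectors.",
    "The index also stores chunk metadata.",
    "Bananas are rich in potassium.",
    "Bananas ripen faster in warm rooms.",
]
chunks = semantic_chunks(sentences)
print(chunks)
```

With a real embedding model the threshold would be tuned on held-out documents, for example by picking a percentile of the observed adjacent-sentence similarities.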

Late chunking (Jina AI, arXiv:2409.04701) — embed the full document first using a long-context embedding model, then chunk the resulting token embeddings via pooling. Each chunk embedding captures the full document context rather than only local context. Works without retraining; requires a long-context embedding model.
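The pooling step can be illustrated without a model. In the sketch below, `token_embeddings` is a made-up stand-in for the per-token output of a long-context embedding model run once over the whole document, and `spans` are hypothetical chunk boundaries expressed as token offsets.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def late_chunk(token_embeddings, spans):
    """Pool contextualized token embeddings over each chunk span [start, end)."""
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]

# Made-up token embeddings (assumption): 5 tokens, 2 dimensions.
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
spans = [(0, 2), (2, 5)]  # chunk boundaries in token offsets
chunk_vectors = late_chunk(token_embeddings, spans)
print(chunk_vectors)
```

The point of the ordering is that each token vector was computed with attention over the whole document before any boundary was applied, so the pooled chunk vectors inherit document-level context.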

Contextual retrieval (Anthropic, September 2024) — prepend a short LLM-generated context summary (typically 50–100 tokens) to each chunk before embedding and before building the BM25 index. The summary situates the chunk within the source document. Anthropic reported up to 49% reduction in failed retrievals; up to 67% when combined with reranking. Uses prompt caching to keep per-chunk generation cost low.
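A sketch of the ingestion-side change, assuming a `summarize(document, chunk)` callable that wraps an LLM call. `stub_summarize` below is a placeholder so the example runs offline; it is not Anthropic's prompt or API.

```python
def contextualize(document, chunks, summarize):
    """Prepend a generated situating context to each chunk before indexing."""
    return [f"{summarize(document, chunk)}\n\n{chunk}" for chunk in chunks]

def stub_summarize(document, chunk):
    # Placeholder for an LLM call (assumption): reuse the document's first line.
    title = document.splitlines()[0]
    return f"From the document titled '{title}'."

document = "Q2 Revenue Report\nRevenue grew 3% over the previous quarter."
chunks = ["Revenue grew 3% over the previous quarter."]
indexed_texts = contextualize(document, chunks, stub_summarize)
print(indexed_texts[0])
```

The same contextualized text would feed both the embedding model and the BM25 index, while the original chunk can still be what is shown to the generator.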

Chunk size and boundary placement both affect retrieval quality. Chunks that are too large dilute the relevant signal in the embedding; chunks that are too small lose the context needed to interpret them. Boundary placement determines whether a concept is split across two chunks (bad) or sits within one (good). There is no universal optimum, so calibrate against your query distribution.

Retrieval methods

Dense retrieval (embeddings + ANN) — embed the query; find the nearest chunk vectors in an approximate nearest neighbor (ANN) index (typically HNSW). Fast at scale; captures semantic similarity even when exact query words are absent. Quality depends on the embedding model's domain fit.

Sparse / lexical retrieval (BM25): rank chunks by term frequency and inverse document frequency, normalized by document length, using the Okapi BM25 function (Robertson & Spärck Jones, 1970s–1990s). No embedding required; exact keyword matches score high. Fails for paraphrases or terminology gaps between query and document.
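For intuition, here is a compact BM25 scorer over pre-tokenized documents. It uses the non-negative IDF variant and the commonly cited parameter values k1=1.2 and b=0.75; both choices are assumptions, and search engines differ in the exact variant they implement.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more hits to score as high.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    ["hnsw", "index", "build"],
    ["bm25", "ranks", "exact", "keyword", "matches"],
    ["keyword", "search"],
]
scores = bm25_scores(["keyword", "matches"], docs)
print(scores)
```

The first document shares no terms with the query and scores zero, which is exactly the paraphrase blind spot that dense retrieval covers.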

Hybrid search: run dense and sparse retrieval in parallel, then merge the ranked lists. Covers both semantic and exact-match signals. The standard fusion algorithm is Reciprocal Rank Fusion (RRF, Cormack et al., SIGIR 2009): each document's score is the sum of 1/(k + rank) across ranked lists (k=60 is the conventional default). RRF ignores raw scores and works on ranks only, sidestepping the score-normalization problem that makes direct score combination fragile. Elasticsearch and OpenSearch both ship built-in RRF support.
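The RRF formula is small enough to sketch directly. The chunk IDs and the two input rankings below are made up for illustration; ties are broken by ID so the output is deterministic.

```python
from collections import defaultdict

def rrf(ranked_lists, k=60):
    """Fuse ranked lists of IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense = ["c2", "c7", "c1"]   # hypothetical dense-retrieval ranking
sparse = ["c7", "c9", "c2"]  # hypothetical BM25 ranking
fused = rrf([dense, sparse])
print(fused)  # ['c7', 'c2', 'c9', 'c1']
```

Chunks that appear in both lists rise to the top even when neither list ranked them first, and no score normalization between the two retrievers is needed.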

Reranking

First-stage retrieval trades accuracy for speed: query and chunk are embedded independently, and ANN search is approximate by design. A cross-encoder reranker takes the top-k results and scores each query–chunk pair with full joint attention, which is much more accurate than embedding similarity but too slow to run over the full index. The workflow: retrieve the top 50 or top 100 cheaply, then rerank down to the top 10 expensively.
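The two-stage shape can be sketched independently of any particular model. `score_pair` is assumed to be a callable that scores one query and chunk jointly; `overlap_scorer` is a toy stand-in for a cross-encoder so the example runs without a model.

```python
def retrieve_then_rerank(query, candidates, score_pair, top_n=10):
    """Rescore first-stage candidates pairwise and keep the best `top_n`."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def overlap_scorer(query, chunk):
    # Toy stand-in for a cross-encoder (assumption): share of query words in the chunk.
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)

candidates = [  # pretend these came back from a cheap top-k retrieval
    "reset your password in settings",
    "billing cycles explained",
    "password reset email not arriving",
]
top = retrieve_then_rerank("password reset email", candidates, overlap_scorer, top_n=2)
print(top)
```

The cost of the second stage grows with the number of candidates, which is why the first-stage k is the main knob for staying inside a latency budget.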

Verified rerankers available as APIs or open weights:

Agentic retrieval patterns

Query rewriting / decomposition — before retrieval, use an LLM to rewrite the user query into a better retrieval query, or decompose a complex question into multiple sub-queries, each retrieved independently. Addresses vocabulary mismatch and multi-part questions that no single chunk answers.

Multi-hop retrieval: run a chain of retrieval steps in which each hop's result informs the next query. Required when the answer depends on facts that are linked only via an intermediate entity ("Who founded the company that acquired X?" → retrieve X's acquirer → retrieve the acquirer's founder).
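A toy version of the hop chain, assuming a `retrieve` callable that returns an extracted answer for a query. The lookup table and the names in it are invented purely to make the example runnable.

```python
def multi_hop(first_query, hops, retrieve):
    """Run retrieval hops in order; each hop builds its query from the previous result."""
    result = retrieve(first_query)
    trail = [result]
    for make_query in hops:
        result = retrieve(make_query(result))
        trail.append(result)
    return trail

# Invented facts standing in for a retriever plus answer extraction (assumption).
facts = {"acquirer of X": "Acme", "founder of Acme": "J. Doe"}
trail = multi_hop(
    "acquirer of X",
    [lambda company: f"founder of {company}"],
    facts.get,
)
print(trail)  # ['Acme', 'J. Doe']
```

In an agent, the query-building step for each hop would usually be an LLM call, and the loop would stop early when a hop returns nothing usable.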

Retrieval as an MCP tool — expose the full RAG pipeline (embed query → ANN search → rerank → return chunks) as a single MCP tool. The agent calls it explicitly when it needs grounded context rather than having retrieval injected automatically. This is the "RAG-as-a-tool" pattern: the agent decides when to retrieve and with what query, enabling conditional retrieval and multi-hop chains. Contrast with classic single-shot RAG where retrieval is always triggered before generation.

Self-correction — after retrieval, have the agent (or a separate verification step) check whether the retrieved context actually supports the planned answer before generating it. If context is insufficient or contradictory, re-query with a refined query or surface uncertainty explicitly. Treat retrieved content as untrusted input, not ground truth (see /resources/agentic-security-checklist for context poisoning risks).
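One possible shape for the check-and-retry loop. The callables `retrieve`, `supports`, and `refine` are hypothetical hooks; in practice `supports` and `refine` would typically be LLM or verifier calls, and the stubs below exist only so the control flow can run.

```python
def answer_with_check(query, retrieve, supports, refine, max_attempts=2):
    """Retrieve, verify support, and re-query with a refined query before giving up."""
    for _ in range(max_attempts):
        chunks = retrieve(query)
        if supports(query, chunks):
            return {"status": "grounded", "query": query, "chunks": chunks}
        query = refine(query, chunks)
    # Surface uncertainty explicitly instead of answering from weak context.
    return {"status": "insufficient_context", "query": query, "chunks": []}

# Stubs for illustration only (assumptions).
retrieve = lambda q: ["refund window is 30 days"] if "refund" in q else []
supports = lambda q, chunks: bool(chunks)
refine = lambda q, chunks: q + " refund policy"

result = answer_with_check("money back?", retrieve, supports, refine)
print(result["status"], result["query"])
```

Capping the number of attempts keeps the loop from burning latency on queries the corpus cannot answer.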

See /resources/agent-memory-context for how RAG-based semantic memory relates to other memory types in agents.

Quality and failure modes

| Failure mode | Cause | Mitigation |
| --- | --- | --- |
| Irrelevant chunks | Embedding domain mismatch; threshold too loose | Fine-tune or swap embedding model; tighten top-k cutoff; add reranker |
| Contradictory chunks | Multiple source versions in index | Dedup at ingest; metadata-filter by source recency; surface contradictions explicitly |
| Lost-in-the-middle | Critical chunk placed in middle of long context | Place highest-scored chunks at start/end; see Liu et al. arXiv:2307.03172 (TACL 2024); also covered in /resources/agent-memory-context |
| Stale index | Source documents updated after ingestion | Incremental re-ingestion pipeline; TTL-based invalidation; updated metadata on chunks |
| Embedding-domain mismatch | General-purpose embedder on specialized domain | Domain-adaptive fine-tuning; switch to a specialized embedding model |
| Context poisoning | Adversarial content in the retrieval corpus | Validate chunks before injection; treat retrieved text as untrusted data (see /resources/agentic-security-checklist) |
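The lost-in-the-middle mitigation in the table (highest-scored chunks at the start and end) can be sketched as a reordering step applied just before context assembly. The alternating scheme below is one simple way to do it, not a prescribed algorithm.

```python
def order_for_context(chunks_best_first):
    """Put the strongest chunks at the edges of the context and the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        # Alternate: ranks 1, 3, 5... fill the front; ranks 2, 4... fill the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ordered = order_for_context(["r1", "r2", "r3", "r4", "r5"])
print(ordered)  # ['r1', 'r3', 'r5', 'r4', 'r2']
```

Here `r1` and `r2` (the two best chunks) land at the first and last positions, while the lowest-ranked chunk sits in the middle.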

For reliable structured outputs from the generation step, see /resources/reliable-tool-calling.

Evaluation

Retrieval metrics — precision@k (fraction of retrieved chunks that are relevant); recall@k (fraction of relevant chunks retrieved). Measure both: high precision with low recall means you miss facts; high recall with low precision floods the context window.
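Both retrieval metrics are a few lines once relevance labels exist. The chunk IDs and labels below are made up; the sketch assumes binary relevance.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for a ranked ID list and a set of relevant IDs."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    # Divides by the number actually returned, which equals k when k results exist.
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["c3", "c8", "c1", "c5"]  # ranked retrieval output (made up)
relevant = {"c1", "c3", "c9"}         # labeled relevant chunks (made up)
p, r = precision_recall_at_k(retrieved, relevant, k=3)
print(round(p, 3), round(r, 3))
```

Two of the top three results are relevant and two of the three relevant chunks were found, so both metrics come out at about 0.667 for this example.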

Generation metrics — faithfulness / groundedness (does the answer follow from the retrieved context, with no hallucinated claims?); answer relevance (does the answer address the question?). These require either human judges or LLM-as-judge scoring.

RAGAS (github.com/explodinggradients/ragas) is a widely used open-source framework for RAG evaluation. The metrics defined in its paper (faithfulness, answer relevance, context relevance) are reference-free, meaning they need no ground-truth annotations. The library also offers context precision and context recall; context recall is scored against a reference answer. It integrates with LangChain and LlamaIndex. Paper: arXiv:2309.15217.

Verified sources
