RAG and Retrieval for Agents

Guide · updated 2026-06-15 · Markdown variant

End-to-end practitioner reference for Retrieval-Augmented Generation: pipeline stages, chunking strategies, dense/sparse/hybrid retrieval, reranking, agentic retrieval patterns, quality failure modes, and evaluation — with verified sources for every named technique.

Retrieval quality is the primary bottleneck for grounded agents. A model with perfect reasoning over wrong or missing context still produces wrong answers. This guide synthesizes the techniques agent builders actually need — concepts first, hype omitted.

Pipeline stages

A RAG pipeline has a fixed skeleton. Each stage has distinct failure modes.

Stage	What happens	Failure mode
Ingest / parse	Documents loaded and converted to plain text	Garbled tables, lost structure, wrong encoding
Chunk	Text split into retrieval units	Boundaries break mid-concept; size/overlap wrong for query patterns
Embed	Chunks encoded to dense vectors	Domain mismatch degrades recall; stale model on new vocabulary
Index	Vectors stored in ANN index; optional BM25 index built in parallel	Index staleness; missing incremental update path
Retrieve	Query embedded; top-k fetched by ANN similarity or lexical score	Irrelevant chunks pass threshold; relevant chunks below cutoff
Rerank (optional)	Cross-encoder scores and reorders top-k	Reranker latency budget exceeded; reranker domain mismatch
Assemble context	Top chunks formatted and injected into prompt	Lost-in-the-middle; token budget overflow; contradictory chunks
Generate	LLM produces answer grounded in context	Hallucination despite context; faithful-but-wrong summary

Chunking strategies

Fixed-size + overlap: split every N tokens with an M-token overlap between adjacent chunks. Simple and fast. Overlap keeps facts that straddle a boundary from disappearing, but does not guarantee semantic coherence at cut points. A common starting point is roughly 512 tokens with 50–100 tokens of overlap; framework defaults vary, so check the splitter you use.
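
A minimal sketch of fixed-size chunking with overlap. It assumes whitespace-separated words stand in for tokens; a real pipeline would count tokens with the embedding model's own tokenizer. The function name and defaults are illustrative, not taken from any framework.

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace split is an assumption standing in for a real tokenizer.
tokens = "a b c d e f g h i j".split()
chunks = chunk_fixed(tokens, size=4, overlap=1)
# Each chunk starts with the last token of the previous one.
```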

Recursive / structural splitting — split on document structure in priority order (section headings → paragraphs → sentences → characters). Preserves logical boundaries. Preferred over fixed-size when documents have reliable structure (Markdown, HTML, code).
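
The priority-order idea can be sketched as a short recursive function. This is a simplified illustration: the separator list and the character limit are assumptions, separators are dropped from the output, and the usual follow-up step of merging small neighbouring pieces back up to the limit is omitted.

```python
# Coarsest to finest: paragraphs, lines, sentences, words (an assumed ordering).
SEPARATORS = ["\n\n", "\n", ". ", " "]


def split_recursive(text: str, max_chars: int, seps: list[str] = SEPARATORS) -> list[str]:
    """Split on the coarsest separator first; recurse only into pieces still too long."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not seps:
        # No structure left to exploit: fall back to a hard character cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    head, rest = seps[0], seps[1:]
    pieces = []
    for part in text.split(head):
        pieces.extend(split_recursive(part, max_chars, rest))
    return pieces


doc = "Intro paragraph.\n\nSecond paragraph is a little longer. It has two sentences."
pieces = split_recursive(doc, max_chars=40)
```

The short first paragraph survives intact; only the second paragraph, which exceeds the limit, is split further at the sentence level.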

Semantic chunking — embed consecutive sentences; measure cosine similarity between adjacent embeddings; start a new chunk when similarity drops below a threshold. Groups semantically coherent content regardless of character count. Higher ingestion cost than fixed-size; similarity threshold requires tuning.
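
A sketch of the breakpoint logic, under a loud assumption: `embed` here is a toy bag-of-words counter so the example runs without a model. In practice it would be a call to a sentence embedding model, and the threshold of 0.2 is an arbitrary value that would need tuning.

```python
import math
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(sentence.lower().replace(".", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk wherever adjacent-sentence similarity drops below the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])      # similarity dropped: topic boundary
        else:
            chunks[-1].append(cur)    # still on the same topic
    return chunks


sentences = [
    "The index stores chunk vectors.",
    "The index returns the nearest chunk vectors.",
    "Billing runs monthly.",
    "Billing invoices are sent monthly.",
]
groups = semantic_chunks(sentences)
```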

Late chunking (Jina AI, arXiv:2409.04701) — embed the full document first using a long-context embedding model, then chunk the resulting token embeddings via pooling. Each chunk embedding captures the full document context rather than only local context. Works without retraining; requires a long-context embedding model.
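
The pooling step can be illustrated without a model. The sketch below assumes the token embeddings have already been produced by one forward pass of a long-context embedding model over the whole document, and that chunk boundaries are given as token spans; the tiny two-dimensional vectors are made up for illustration.

```python
def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average a list of equal-length vectors dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def late_chunk(token_embeddings: list[list[float]], spans: list[tuple[int, int]]) -> list[list[float]]:
    """Pool document-level token embeddings over each chunk's [start, end) token span."""
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]


# Hypothetical output of a single forward pass over a 4-token document.
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
spans = [(0, 2), (2, 4)]  # two chunks of two tokens each
chunk_vectors = late_chunk(token_embeddings, spans)
```

The difference from ordinary chunking is only in the order of operations: the model sees the whole document before the split, so each token embedding being pooled already reflects its surrounding context.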

Contextual retrieval (Anthropic, September 2024) — prepend a short LLM-generated context summary (typically 50–100 tokens) to each chunk before embedding and before building the BM25 index. The summary situates the chunk within the source document. Anthropic reported up to 49% reduction in failed retrievals; up to 67% when combined with reranking. Uses prompt caching to keep per-chunk generation cost low.
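
The indexing-time transformation is small enough to sketch. The `summarize` callable is an assumption standing in for an LLM call that receives the full document and one chunk; `fake_summarize` is a placeholder so the example runs, and does not reflect Anthropic's actual prompt.

```python
from typing import Callable


def contextualize(document: str, chunks: list[str],
                  summarize: Callable[[str, str], str]) -> list[str]:
    """Prepend a generated situating summary to each chunk before embedding and BM25 indexing."""
    return [f"{summarize(document, chunk)}\n\n{chunk}" for chunk in chunks]


def fake_summarize(document: str, chunk: str) -> str:
    # Placeholder for an LLM call; it just reuses the document's first line as a title.
    title = document.splitlines()[0]
    return f"From '{title}'."


document = "VPN Handbook\nSection 3 covers client setup."
chunks = ["Import the profile, then connect."]
contextual_chunks = contextualize(document, chunks, fake_summarize)
```

The contextualized strings, not the raw chunks, are what get embedded and indexed; the original chunk text can still be stored separately for prompt assembly.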

Chunk size and boundary choice matter because retrieval is tuned for a size distribution: chunks too large dilute signal; chunks too small lose context. Boundary placement determines whether a concept spans two chunks (bad) or sits within one (good). There is no universal optimum — calibrate against your query distribution.

Retrieval methods

Dense retrieval (embeddings + ANN) — embed the query; find the nearest chunk vectors in an approximate nearest neighbor (ANN) index (typically HNSW). Fast at scale; captures semantic similarity even when exact query words are absent. Quality depends on the embedding model's domain fit.
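
To make the scoring concrete, here is a brute-force version of the lookup. It computes exact cosine similarity against every vector, which is what an ANN index such as HNSW approximates at scale; the two-dimensional vectors and chunk ids are invented for illustration.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[tuple[float, str]]:
    """Exact nearest-neighbour search: score every chunk, keep the k best."""
    scored = ((cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items())
    return heapq.nlargest(k, scored)


# Hypothetical chunk embeddings keyed by chunk id.
index = {"a": [1.0, 0.0], "b": [0.6, 0.8], "c": [0.0, 1.0]}
hits = top_k([1.0, 0.0], index, k=2)
hit_ids = [chunk_id for _, chunk_id in hits]
```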

Sparse / lexical retrieval (BM25): rank chunks with the Okapi BM25 function, which combines term frequency, inverse document frequency, and document-length normalization (developed by Robertson, Spärck Jones, and colleagues from the 1970s through the 1990s). No embedding required; exact keyword matches score high. Fails on paraphrases and on terminology gaps between query and document.
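
A compact sketch of BM25 scoring over pre-tokenized chunks. Two assumptions to note: it uses the non-negative IDF variant with `+ 1` inside the logarithm (the form used by Lucene-style implementations), and `k1 = 1.5`, `b = 0.75` are typical values rather than tuned ones. It expects a non-empty corpus.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the query terms with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalization: long documents need more occurrences to score the same.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["error", "code", "e42", "timeout"],
    ["network", "timeout", "retry"],
    ["billing", "invoice"],
]
scores = bm25_scores(["e42", "timeout"], docs)
```

The rare exact token `e42` dominates the first document's score, which is the behaviour that makes lexical retrieval useful for identifiers and error codes.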

Hybrid search — run dense and sparse retrieval in parallel; merge ranked lists. Covers both semantic and exact-match signals. The standard fusion algorithm is Reciprocal Rank Fusion (RRF, Cormack et al., SIGIR 2009): each document's score is the sum of 1/(k + rank) across ranked lists (k=60 is the conventional default). RRF ignores raw scores and works on ranks only, sidestepping the score-normalization problem that makes direct score combination fragile. Elasticsearch and OpenSearch both ship native RRF retrievers.
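
The fusion formula above translates almost directly into code. This sketch assumes each retriever returns a list of chunk ids ordered best first; the ids are illustrative.

```python
from collections import defaultdict


def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) for each id across all ranked lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense = ["c1", "c2", "c3"]   # order from the embedding retriever
sparse = ["c3", "c1", "c4"]  # order from BM25
fused = rrf([dense, sparse])
fused_ids = [doc_id for doc_id, _ in fused]
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks found by only one retriever, even though no raw scores were compared.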

Reranking

ANN retrieval is approximate by design, and embedding similarity compresses each chunk into a single vector that is encoded independently of the query. A cross-encoder reranker takes the top-k results and scores each query–chunk pair with full joint attention, which is more accurate than embedding similarity but too slow to run over the full index. The workflow: retrieve the top 50 or 100 cheaply, then rerank down to the top 10 expensively.
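
The two-stage workflow can be sketched with pluggable scorers. Both scoring functions below are toy stand-ins chosen so the example runs: `overlap` plays the role of the cheap first-stage retriever and `phrase_match` the role of a cross-encoder. Neither reflects a real reranker API.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(query: str, candidates: list[str], cheap_score: Scorer,
                         expensive_score: Scorer, fetch_k: int = 50, final_k: int = 10) -> list[str]:
    """Shortlist with the cheap scorer, then reorder only the shortlist with the expensive one."""
    shortlist = sorted(candidates, key=lambda c: cheap_score(query, c), reverse=True)[:fetch_k]
    return sorted(shortlist, key=lambda c: expensive_score(query, c), reverse=True)[:final_k]


def overlap(query: str, chunk: str) -> float:
    # Stand-in for first-stage retrieval: number of shared words.
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def phrase_match(query: str, chunk: str) -> float:
    # Stand-in for a cross-encoder: rewards the exact phrase appearing in the chunk.
    return float(query.lower() in chunk.lower())


candidates = [
    "reset your password in settings",
    "password policy requires reset every year",
    "how to reset password from the login page",
    "billing faq",
]
best = retrieve_then_rerank("reset password", candidates, overlap, phrase_match, fetch_k=3, final_k=1)
```

The expensive scorer only ever sees `fetch_k` candidates, which is what keeps the reranking stage inside a latency budget.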

Verified rerankers available as APIs or open weights:

Cohere Rerank (API): rerank-v4.0-pro and rerank-v4.0-fast; supports structured data formatted as YAML strings. Docs: docs.cohere.com/docs/rerank-overview.
BGE reranker (BAAI / open-weight): BAAI/bge-reranker-v2-m3 (multilingual, M3 backbone); BAAI/bge-reranker-v2.5-gemma2-lightweight (token compression for efficiency). Available on Hugging Face; used via the FlagEmbedding library.
Jina reranker (Jina AI / API + open-weight): jina-reranker-v3 is the current API model; jinaai/jina-reranker-v2-base-multilingual is the open-weight cross-encoder (100+ languages, 1,024-token context, CC-BY-NC 4.0). Docs: jina.ai/reranker.

Agentic retrieval patterns

Query rewriting / decomposition — before retrieval, use an LLM to rewrite the user query into a better retrieval query, or decompose a complex question into multiple sub-queries, each retrieved independently. Addresses vocabulary mismatch and multi-part questions that no single chunk answers.

Multi-hop retrieval: answer a question through a chain of retrieval steps in which each hop's result informs the next query. Required when the answer depends on facts that are only linked via an intermediate entity ("Who founded the company that acquired X?" → retrieve X's acquirer → retrieve acquirer's founder).
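
The hop-by-hop control flow can be sketched as a loop. In this illustration, query templates stand in for the LLM step that would normally read each hop's result and write the next query, and the fact table (including the names in it) is entirely made up.

```python
from typing import Callable


def multi_hop(hops: list[str], retrieve: Callable[[str], str], seed: str) -> list[tuple[str, str]]:
    """Run queries in sequence, substituting the previous hop's result into the next query."""
    trace = []
    current = seed
    for template in hops:
        query = template.format(prev=current)
        current = retrieve(query)          # the result becomes input to the next hop
        trace.append((query, current))
    return trace


# Hypothetical corpus: a dict lookup stands in for the retrieval pipeline.
FACTS = {
    "acquirer of X": "Acme Corp",
    "founder of Acme Corp": "Jane Doe",
}

trace = multi_hop(["acquirer of {prev}", "founder of {prev}"], FACTS.__getitem__, seed="X")
answer = trace[-1][1]
```

Keeping the full trace, not just the final answer, makes it possible to show which intermediate fact each hop relied on.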

Retrieval as an MCP tool — expose the full RAG pipeline (embed query → ANN search → rerank → return chunks) as a single MCP tool. The agent calls it explicitly when it needs grounded context rather than having retrieval injected automatically. This is the "RAG-as-a-tool" pattern: the agent decides when to retrieve and with what query, enabling conditional retrieval and multi-hop chains. Contrast with classic single-shot RAG where retrieval is always triggered before generation.
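
A framework-free sketch of what such a tool might look like. It mirrors the general shape of an MCP tool definition (a name, a description, and a JSON Schema for inputs) and of a text result, but it is written with plain dictionaries rather than an MCP SDK. The tool name `search_docs`, the `top_k` parameter, and the canned corpus are all hypothetical.

```python
import json

SEARCH_TOOL = {
    "name": "search_docs",  # hypothetical tool name
    "description": "Retrieve grounded context chunks for a natural-language query.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "top_k": {"type": "integer", "minimum": 1, "default": 5},
        },
        "required": ["query"],
    },
}


def run_pipeline(query: str, top_k: int) -> list[dict]:
    """Placeholder for embed -> ANN search -> rerank; returns canned chunks."""
    corpus = [
        {"id": "doc1#0", "text": "RRF sums 1/(k + rank) across ranked lists."},
        {"id": "doc2#3", "text": "BM25 rewards exact keyword matches."},
    ]
    return corpus[:top_k]


def handle_tool_call(name: str, arguments: dict) -> dict:
    """Dispatch one tool call and wrap the chunks as a text content item."""
    if name != SEARCH_TOOL["name"]:
        raise ValueError(f"unknown tool: {name}")
    chunks = run_pipeline(arguments["query"], arguments.get("top_k", 5))
    return {"content": [{"type": "text", "text": json.dumps(chunks)}]}


result = handle_tool_call("search_docs", {"query": "how does RRF work", "top_k": 1})
returned = json.loads(result["content"][0]["text"])
```

Returning chunk ids alongside the text lets the agent cite sources and lets later steps treat the payload as untrusted data rather than instructions.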

Self-correction — after retrieval, have the agent (or a separate verification step) check whether the retrieved context actually supports the planned answer before generating it. If context is insufficient or contradictory, re-query with a refined query or surface uncertainty explicitly. Treat retrieved content as untrusted input, not ground truth (see /resources/agentic-security-checklist for context poisoning risks).
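
One way to structure the check-then-retry loop is sketched below. The three callables are assumptions: `retrieve` is the retrieval pipeline, `is_supported` is the verification step (an LLM judge in practice), and `refine` rewrites the query. The lambdas are trivial placeholders so the example runs.

```python
from typing import Callable


def answer_with_check(query: str,
                      retrieve: Callable[[str], list[str]],
                      is_supported: Callable[[str, list[str]], bool],
                      refine: Callable[[str, list[str]], str],
                      max_attempts: int = 2) -> dict:
    """Retrieve, verify the context, and re-query with a refined query if it falls short."""
    for _ in range(max_attempts):
        chunks = retrieve(query)
        if is_supported(query, chunks):
            return {"status": "grounded", "query": query, "chunks": chunks}
        query = refine(query, chunks)
    # Out of attempts: surface uncertainty instead of answering from weak context.
    return {"status": "insufficient_context", "query": query, "chunks": []}


corpus = {"vpn setup": ["Install the VPN client, then import the profile."]}
outcome = answer_with_check(
    "how do I get on the vpn",
    retrieve=lambda q: corpus.get(q, []),
    is_supported=lambda q, chunks: bool(chunks),   # placeholder for a real verifier
    refine=lambda q, chunks: "vpn setup",          # placeholder for an LLM rewrite
)
```

Bounding the number of attempts matters: without it, a query the corpus cannot answer would loop indefinitely.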

See /resources/agent-memory-context for how RAG-based semantic memory relates to other memory types in agents.

Quality and failure modes

Failure mode	Cause	Mitigation
Irrelevant chunks	Embedding domain mismatch; threshold too loose	Fine-tune or swap embedding model; tighten top-k cutoff; add reranker
Contradictory chunks	Multiple source versions in index	Dedup at ingest; metadata-filter by source recency; surface contradictions explicitly
Lost-in-the-middle	Critical chunk placed in middle of long context	Place highest-scored chunks at start/end; see Liu et al. arXiv:2307.03172 (TACL 2024) — also covered in /resources/agent-memory-context
Stale index	Source documents updated after ingestion	Incremental re-ingestion pipeline; TTL-based invalidation; `updated` metadata on chunks
Embedding-domain mismatch	General-purpose embedder on specialized domain	Domain-adaptive fine-tuning; switch to a specialized embedding model
Context poisoning	Adversarial content in the retrieval corpus	Validate chunks before injection; treat retrieved text as untrusted data (see /resources/agentic-security-checklist)
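
The lost-in-the-middle mitigation in the table (highest-scored chunks at the start and end) can be sketched as a simple reordering. This assumes the input list is already sorted best first; the alternating scheme is one reasonable choice, not a prescribed algorithm.

```python
def order_for_context(chunks_best_first: list[str]) -> list[str]:
    """Alternate chunks between the front and the back so the weakest land in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ranked = ["c1", "c2", "c3", "c4", "c5"]  # c1 is the highest-scored chunk
ordered = order_for_context(ranked)
```

The best chunk opens the context, the second best closes it, and the lowest-scored chunk ends up in the middle position.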

For reliable structured outputs from the generation step, see /resources/reliable-tool-calling.

Evaluation

Retrieval metrics: precision@k (fraction of the top-k retrieved chunks that are relevant); recall@k (fraction of all relevant chunks that appear in the top k). Measure both: high precision with low recall means you miss facts; high recall with low precision floods the context window.
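
Both metrics are a few lines each. The sketch assumes you have a labelled set of relevant chunk ids per query; the ids below are illustrative.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk ids that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunk ids that appear in the top k."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


retrieved = ["c1", "c7", "c3", "c9"]  # ranked output of the retriever
relevant = {"c1", "c3", "c5"}         # labelled ground truth for this query
p4 = precision_at_k(retrieved, relevant, k=4)
r4 = recall_at_k(retrieved, relevant, k=4)
```

Here two of the four retrieved chunks are relevant, and two of the three relevant chunks were found; `c5` is the missed fact that recall exposes and precision does not.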

Generation metrics — faithfulness / groundedness (does the answer follow from the retrieved context, with no hallucinated claims?); answer relevance (does the answer address the question?). These require either human judges or LLM-as-judge scoring.

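RAGAS (github.com/explodinggradients/ragas) is a widely used open-source framework for RAG evaluation. It computes faithfulness, answer relevance, and context precision/recall with LLM-based scoring, and integrates with LangChain and LlamaIndex. The metrics introduced in the paper are reference-free; some metrics in the library, such as context recall, compare against a reference answer, so check each metric's inputs before assuming no annotations are needed. Paper: arXiv:2309.15217.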

Verified sources

Late Chunking paper (Jina AI / Weaviate, arXiv:2409.04701): https://arxiv.org/abs/2409.04701
Late Chunking GitHub (Jina AI): https://github.com/jina-ai/late-chunking
Contextual Retrieval (Anthropic engineering blog, September 2024): https://www.anthropic.com/engineering/contextual-retrieval
Reciprocal Rank Fusion — hybrid retrieval analysis (arXiv:2210.11934): https://arxiv.org/abs/2210.11934
RRF in OpenSearch: https://opensearch.org/blog/introducing-reciprocal-rank-fusion-hybrid-search/
ColBERT — late interaction retrieval (arXiv:2004.12832, SIGIR 2020): https://arxiv.org/abs/2004.12832
Cohere Rerank overview: https://docs.cohere.com/docs/rerank-overview
Cohere Rerank API reference: https://docs.cohere.com/reference/rerank
BGE reranker-v2-m3 (BAAI, Hugging Face): https://huggingface.co/BAAI/bge-reranker-v2-m3
BGE reranker docs: https://bge-model.com/bge/bge_reranker_v2.html
Jina reranker API: https://jina.ai/reranker/
jina-reranker-v3 (Jina AI): https://jina.ai/models/jina-reranker-v3/
jina-reranker-v2-base-multilingual (Hugging Face): https://huggingface.co/jinaai/jina-reranker-v2-base-multilingual
RAGAS framework (GitHub, explodinggradients): https://github.com/explodinggradients/ragas
RAGAS paper (arXiv:2309.15217, EACL 2024): https://aclanthology.org/2024.eacl-demo.16/
Okapi BM25 (Wikipedia reference): https://en.wikipedia.org/wiki/Okapi_BM25
"Lost in the Middle" — context position effects (Liu et al., TACL 2024, arXiv:2307.03172): https://arxiv.org/abs/2307.03172
