Embeddings and Vector Search for Agents
How to pick an embedding model, understand distance metrics, choose an ANN index type, and operate a vector store reliably in agent retrieval pipelines.
Embeddings map text (or other content) to dense numeric vectors so that semantic similarity can be measured by vector distance. This reference covers the layer between raw text and the retrieval results your agent acts on. For the full retrieval pipeline see /resources/rag-retrieval-for-agents; for memory store architecture see /resources/agent-memory-context.
Embedding model dimensions and selection criteria
Key parameters to evaluate before choosing a model:
- Dimensionality — typical range 256–3072. Higher dimensions can capture more nuance but cost more storage and compute. Matryoshka Representation Learning (MRL, Kusupati et al., NeurIPS 2022, arXiv:2205.13147) trains embeddings so that truncating to fewer dimensions still yields useful representations. With an MRL-trained model you can right-size vectors without re-embedding, provided stored and query vectors are truncated to the same length and renormalized.
- Max input tokens — limits differ widely, from about 256 to 128k tokens. A model that silently truncates long documents embeds only the leading portion, so the resulting vector misrepresents the full document.
- Language coverage — English-only models underperform on multilingual corpora. Verify coverage for your target languages.
- Domain fit — general-purpose MTEB scores may not reflect performance on code, legal, medical, or financial text. Benchmark on your own data.
- Cost / quality / latency tradeoff — API-hosted models charge per token; open-weight models have no per-token fee but require you to provision and operate the compute. Quantized open-weight models reduce memory at the cost of some recall.
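The dimensionality tradeoff can be made concrete with a small sketch. It assumes the vectors come from an MRL-trained model (truncating other embeddings is not expected to work well) and uses a toy 4-dimensional vector as a stand-in for real model output. The helper names `truncate_and_renormalize` and `index_size_bytes` are hypothetical, and the size estimate counts raw float32 storage only, ignoring index overhead.

```python
import math

def truncate_and_renormalize(vec, dims):
    """Keep the first `dims` components and rescale to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # degenerate all-zero prefix; nothing to rescale
    return [x / norm for x in head]

def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage only (float32 by default); ignores index overhead."""
    return num_vectors * dims * bytes_per_value

full = [0.6, 0.0, 0.8, 0.0]   # toy unit vector, stand-in for a model output
short = truncate_and_renormalize(full, 2)
print(short)                  # [1.0, 0.0]

# Storage for one million vectors at two common dimensionalities.
print(index_size_bytes(1_000_000, 3072) / 1e9, "GB")  # roughly 12.3 GB
print(index_size_bytes(1_000_000, 256) / 1e9, "GB")   # roughly 1.0 GB
```

Apply the same truncation length to both stored and query vectors. Mixing lengths makes the distance computation meaningless.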
Standard benchmark: MTEB / MMTEB
The Massive Text Embedding Benchmark (MTEB) — maintained by the embeddings-benchmark organization on HuggingFace (huggingface.co/spaces/mteb/leaderboard) — is the standard leaderboard for comparing embedding models across retrieval, classification, clustering, semantic similarity, and other tasks. It covers 112+ languages and 5,000+ submissions.
MMTEB (Massive Multilingual Text Embedding Benchmark, arXiv:2502.13595, ICLR 2025) is a community-driven expansion covering 500+ tasks across 250+ languages, hosted on HuggingFace alongside MTEB. Use MMTEB scores when multilingual recall matters.
Do not treat leaderboard rankings as permanent. Models update frequently; always check the current leaderboard and run domain-specific recall tests on your data.
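One way to run the domain-specific recall test mentioned above is to label a small set of queries with their relevant document ids and compute recall@k for each candidate model. The sketch below assumes you already have ranked results from your retriever. The query ids, document ids, and function names are hypothetical placeholders.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant id")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(results, labels, k):
    """Average recall@k over queries; both dicts are keyed by query id."""
    scores = [recall_at_k(results[q], labels[q], k) for q in labels]
    return sum(scores) / len(scores)

# Hypothetical labelled queries: query id -> relevant document ids.
labels = {"q1": ["d1", "d4"], "q2": ["d7"]}
# Hypothetical ranked output from the retriever under test.
results = {"q1": ["d1", "d2", "d4"], "q2": ["d3", "d5", "d7"]}

print(mean_recall_at_k(results, labels, k=2))  # 0.25
print(mean_recall_at_k(results, labels, k=3))  # 1.0
```

Running the same labelled set against each candidate model gives a comparison grounded in your own corpus rather than in leaderboard averages.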
Notable embedding model families (as of mid-2026)
Listed as available/notable — rankings shift; verify current MTEB position before choosing:
API / hosted:
- OpenAI text-embedding-3-small / text-embedding-3-large — 1536 and 3072 default dimensions; both support MRL-style dimension reduction via the `dimensions` parameter (openai.com).
- Cohere Embed v4 — multimodal (text + images), 128k token context, supports MRL dimensions (256/512/1024/1536), 100+ languages (cohere.com).
- Voyage AI (voyage-3.5 / voyage-3.5-lite) — strong retrieval scores; MRL dimensions (256/512/1024/2048); multilingual support (voyageai.com).
- Google Gemini Embedding (gemini-embedding-001) — up to 3072 dimensions, MRL-truncatable, 100+ languages; text-embedding-004 deprecated as of Jan 2026 (ai.google.dev).
Open-weight:
- BGE / BAAI (e.g., BGE-M3) — multi-functionality (dense, sparse, multi-vector), multilingual, widely deployed, top open-weight download counts on HuggingFace (huggingface.co/BAAI).
- E5 / multilingual-E5 (Microsoft) — strong multilingual retrieval; multiple sizes available (huggingface.co/intfloat).
- Nomic Embed — long-context, Apache 2.0, reproducibly trained (nomic.ai; arxiv:2402.01613).
- Jina Embeddings — extended context; self-hostable (jina.ai).
- Qwen3 Embedding (Alibaba/QwenLM) — 0.6B/4B/8B sizes; Apache 2.0; ranked highly on MTEB multilingual as of June 2025 (github.com/QwenLM).
Cross-reference open-weight model infrastructure at /resources/open-weight-models-for-agents.
Distance metrics
| Metric | Formula basis | Best for |
|---|---|---|
| Cosine similarity | Cosine of the angle between vectors | Default for most retrieval; ignores magnitude |
| Dot product | Product of magnitudes × cosine of the angle | Equivalent to cosine when vectors are unit-normalized; cheaper to compute |
| Euclidean (L2) | Straight-line distance between vectors | Useful when magnitude carries information (e.g., sparse embeddings) |
Normalization caveat: most retrieval models produce unit-normalized vectors by default, in which case cosine similarity and dot product give identical scores and rankings, and dot product is cheaper because it skips the norm computation. Verify your model's output normalization before choosing a metric.
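The equivalence can be checked directly. This is a minimal sketch using toy 2-dimensional vectors in place of real embeddings. The helper names and the tolerance in `is_unit_norm` are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def is_unit_norm(v, tol=1e-3):
    """Quick check to run on a sample of model outputs."""
    return abs(math.sqrt(dot(v, v)) - 1.0) <= tol

a = [3.0, 4.0]
b = [4.0, 3.0]
print(cosine(a, b))  # 0.96
print(dot(a, b))     # 24.0, differs because neither vector has unit length

ua, ub = l2_normalize(a), l2_normalize(b)
print(is_unit_norm(a), is_unit_norm(ua))        # False True
print(math.isclose(dot(ua, ub), cosine(a, b)))  # True
```

Running `is_unit_norm` over a sample of embeddings from your model is a cheap way to confirm whether an explicit normalization step is needed before choosing dot product.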
ANN index types
Approximate Nearest Neighbor (ANN) indexes trade some recall for speed. The main types, with exact flat search included as the baseline:
- Flat (brute-force) — exact exhaustive scan; 100% recall; scales linearly with corpus size. Use only for small corpora (<100k vectors) or as a recall baseline.
- IVF (Inverted File Index) — k-means clusters vectors at build time; at query time only the clusters nearest the query are searched, and the number probed is tunable (nprobe in FAISS, ivfflat.probes in pgvector). Faster build than HNSW; lower memory; requires a training step. Implemented in FAISS (github.com/facebookresearch/faiss) and pgvector (IVFFlat).
- HNSW (Hierarchical Navigable Small World graphs) — multi-layer proximity graph; logarithmic search complexity; superior speed-recall tradeoff at query time; slower to build and uses more memory than IVF; no training step required. See Malkov & Yashunin, arXiv:1603.09320. Implemented in FAISS and pgvector.
- Product Quantization (PQ) / Scalar Quantization (SQ) — compression schemes that reduce vector storage. PQ splits each vector into sub-vectors and encodes each with a codebook; SQ stores each dimension at lower precision (for example, 8-bit integers instead of 32-bit floats). Typically combined with IVF (IVFPQ) or HNSW. Reduces memory 4–32× at the cost of some recall. Use when the corpus exceeds available RAM.
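To make the flat versus IVF distinction concrete, the sketch below implements both in plain Python on random data. It is a teaching toy under stated assumptions, not how FAISS or pgvector are implemented: the function names, the random Gaussian data, and the parameters `n_lists` and `nprobe` are all chosen for illustration.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_search(vectors, query, k):
    """Exact search: score every vector, return the k nearest ids."""
    order = sorted(range(len(vectors)), key=lambda i: sq_dist(vectors[i], query))
    return order[:k]

def train_centroids(vectors, n_lists, iters=10, seed=0):
    """Plain k-means (Lloyd's algorithm); this is the IVF training step."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_lists)]
    for _ in range(iters):
        buckets = [[] for _ in range(n_lists)]
        for v in vectors:
            nearest = min(range(n_lists), key=lambda c: sq_dist(v, centroids[c]))
            buckets[nearest].append(v)
        for c, members in enumerate(buckets):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_ivf(vectors, centroids):
    """Inverted lists: centroid index -> ids of the vectors assigned to it."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        lists[nearest].append(i)
    return lists

def ivf_search(vectors, centroids, lists, query, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    candidates.sort(key=lambda i: sq_dist(vectors[i], query))
    return candidates[:k]

rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]
query = [rng.gauss(0, 1) for _ in range(8)]

centroids = train_centroids(data, n_lists=10)
lists = build_ivf(data, centroids)
exact = flat_search(data, query, k=5)
for nprobe in (1, 3, 10):
    approx = ivf_search(data, centroids, lists, query, k=5, nprobe=nprobe)
    overlap = len(set(approx) & set(exact)) / len(exact)
    print(f"nprobe={nprobe}: recall@5 vs flat = {overlap:.2f}")
```

With `nprobe` equal to the number of lists, every vector is scanned and the result matches flat search. Lower values scan fewer candidates and may miss true neighbors that fell into unprobed clusters, which is the speed-recall tradeoff in miniature.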
Practical guidance
- Match query and document models. Embeddings from different models occupy different vector spaces — they are not comparable. Always embed queries with the same model used to index documents.
- Re-embed on model change. Embeddings are not portable across model versions. Upgrading an embedding model requires re-indexing the full corpus.
- Normalize before cosine search. If your vector store does not auto-normalize, do it at index and query time to avoid misleading distance scores.
- Use quantization at scale. PQ or SQ cuts memory 4–32× with modest recall loss. Benchmark recall on your own data before and after.
- Benchmark recall on your own data. Leaderboard scores are averages. Your domain, query style, and document length distribution all affect real-world recall.
- Start with HNSW for online search; use IVFPQ for large memory-constrained corpora.
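The advice to benchmark recall before and after quantization can be sketched with 8-bit scalar quantization, the simplest of the compression schemes above. This is an illustrative toy on random data. The function names are hypothetical, the float32 baseline is an assumed storage format, and production libraries use their own encodings.

```python
import random

def fit_sq8(vectors):
    """Per-dimension minimum and step size for 8-bit scalar quantization."""
    dims = len(vectors[0])
    lows = [min(v[d] for v in vectors) for d in range(dims)]
    highs = [max(v[d] for v in vectors) for d in range(dims)]
    scales = []
    for lo, hi in zip(lows, highs):
        step = (hi - lo) / 255
        scales.append(step if step > 0 else 1.0)  # guard constant dimensions
    return lows, scales

def encode_sq8(v, lows, scales):
    """Map each float to a code in 0..255, one byte per dimension."""
    return bytes(min(255, max(0, round((x - lo) / s)))
                 for x, lo, s in zip(v, lows, scales))

def decode_sq8(code, lows, scales):
    return [lo + c * s for c, lo, s in zip(code, lows, scales)]

def top_k(vectors, query, k):
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, query))
    return sorted(range(len(vectors)), key=lambda i: dist(vectors[i]))[:k]

rng = random.Random(7)
dims = 16
data = [[rng.gauss(0, 1) for _ in range(dims)] for _ in range(300)]
query = [rng.gauss(0, 1) for _ in range(dims)]

lows, scales = fit_sq8(data)
codes = [encode_sq8(v, lows, scales) for v in data]
decoded = [decode_sq8(c, lows, scales) for c in codes]

float32_bytes = len(data) * dims * 4     # assumed uncompressed baseline
sq8_bytes = sum(len(c) for c in codes)   # one byte per dimension
print(float32_bytes / sq8_bytes)         # 4.0

exact = top_k(data, query, 10)
approx = top_k(decoded, query, 10)
print(len(set(exact) & set(approx)) / 10)  # recall@10 after quantization
```

The 4× figure is the low end of the memory range quoted above; product quantization reaches higher ratios at a larger recall cost. Measure the recall number on your own vectors and queries before adopting either scheme.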
Verified sources
- MTEB leaderboard (HuggingFace): https://huggingface.co/spaces/mteb/leaderboard
- MMTEB paper (arXiv:2502.13595): https://arxiv.org/abs/2502.13595
- Matryoshka Representation Learning (Kusupati et al., NeurIPS 2022, arXiv:2205.13147): https://arxiv.org/abs/2205.13147
- HNSW paper (Malkov & Yashunin, arXiv:1603.09320): https://arxiv.org/abs/1603.09320
- OpenAI embedding models (openai.com): https://openai.com/index/new-embedding-models-and-api-updates/
- Cohere Embed v4 announcement: https://cohere.com/blog/embed-4
- Voyage AI embedding docs: https://docs.voyageai.com/docs/embeddings
- Google Gemini Embedding GA (Google Developers Blog): https://developers.googleblog.com/gemini-embedding-available-gemini-api/
- Qwen3 Embedding (QwenLM GitHub): https://github.com/QwenLM/Qwen3-Embedding
- Nomic Embed paper (arXiv:2402.01613): https://arxiv.org/abs/2402.01613
- FAISS (Meta AI): https://github.com/facebookresearch/faiss
- pgvector (PostgreSQL vector extension): https://github.com/pgvector/pgvector