Fine-Tuning vs RAG vs Prompting

Guide · updated 2026-06-21 · Markdown variant

Decision guide for agent builders: when to use prompting, RAG, or fine-tuning — and how they combine. Covers SFT, LoRA/QLoRA, DPO, distillation, and a symptom-to-fix table.

The mental model

These three techniques solve different problems and are routinely combined in production systems. Treating them as competitors leads to the wrong choice every time.

Prompting — changes what you ask the model, at zero cost. The right first lever.
RAG — changes what the model knows at inference time by injecting external content into context. Use when the knowledge is proprietary, current, or too large for the weights.
Fine-tuning — changes how the model behaves: its style, format reliability, or a narrow skill. Does not reliably teach fresh facts (weights go stale; fine-tuned knowledge does not update itself).

Key rule: knowledge that changes often belongs in RAG, not fine-tuned weights.

Try in this order

Prompting first — clearer instructions, few-shot examples, structured delimiters, output schemas. See /resources/prompt-context-engineering. Fastest to iterate; fully reversible.
Add RAG before fine-tuning if the bottleneck is missing or stale knowledge. See /resources/rag-retrieval-for-agents.
Fine-tune only when you have a persistent behavioral defect that prompting and RAG cannot fix, and you have enough high-quality labeled examples to train on.

Fine-tuning methods

Supervised fine-tuning (SFT) — train on input/output pairs demonstrating desired behavior. OpenAI documents SFT as a supported method for style, format, and task adaptation (platform.openai.com/docs/guides/supervised-fine-tuning).

Parameter-efficient fine-tuning (PEFT) — instead of updating all weights, inject small trainable matrices. The dominant method is LoRA (Low-Rank Adaptation, Hu et al., arXiv:2106.09685): freeze the base weights and add a pair of low-rank matrices (W = W₀ + AB) to each transformer layer. QLoRA extends LoRA by quantizing the base weights to 4-bit before adding the adapters, dramatically reducing GPU memory requirements. PEFT methods produce swappable adapter files that share the base model, making multi-task serving much cheaper than keeping separate full copies.

Preference fine-tuning (DPO / RLHF) — align the model to human preferences via ranked pairs of outputs (preferred vs. rejected). RLHF (Reinforcement Learning from Human Feedback) uses a learned reward model and policy-gradient updates. DPO (Direct Preference Optimization, Rafailov et al., arXiv:2305.18290) simplifies this: it directly optimizes a classification-style loss over preference pairs, eliminating the separate reward model and RL training loop, while matching or exceeding RLHF quality. Standard practice is SFT first, then DPO.

Distillation — train a smaller model to mimic a larger one's outputs on a narrow task. Use when you need a smaller, cheaper, faster model that matches a frontier model on a specific task. Requires a dataset of (input, large-model-output) pairs. Cross-link: /resources/open-weight-models-for-agents for which base models are fine-tunable.

Symptom-to-fix table

Symptom	Likely fix
Model lacks current or proprietary facts	RAG
Output format or schema is unreliable	Better prompt + structured outputs; fine-tune (SFT) if persistent
Tone or style is wrong	Improve system prompt; fine-tune (SFT) if consistent across inputs
Model too slow or expensive at scale	Distillation/fine-tune a smaller model; see /resources/agent-cost-latency-optimization
Model makes tool-calling mistakes	Structured output + typed schemas; SFT on tool-call examples
Need model to follow complex instructions reliably	Few-shot prompting first; SFT if it fails at scale
Behavior must reflect human ranking preferences	DPO or RLHF after SFT

Honest tradeoffs of fine-tuning

Fine-tuning carries real costs that teams underestimate:

Data preparation — collecting, cleaning, and labeling high-quality training examples is the hardest part. Diverse, high-quality data matters more than quantity.
Training and evaluation infrastructure — requires GPU compute, experiment tracking, and offline evaluation before each deploy.
Serving a custom model — you now own the hosting of a bespoke artifact. When the base model is updated by the provider, your fine-tuned version stays behind.
Staleness — fine-tuned knowledge does not update itself. Mixing RAG into the fine-tuned model's inference pipeline is the standard production pattern to keep knowledge current.

Combining all three

The 2026 production default for complex agents is: a fine-tuned (or instruction-tuned) model that has been preference-aligned, served with RAG for current knowledge, and steered per-request via structured system prompts. These layers are additive: adding RAG to a fine-tuned model is normal; adding a better system prompt to a RAG-augmented fine-tuned model is normal.

Cross-links: /resources/rag-retrieval-for-agents · /resources/prompt-context-engineering · /resources/reliable-tool-calling · /resources/agent-cost-latency-optimization

Verified sources

LoRA paper (Hu et al., 2021): https://arxiv.org/abs/2106.09685
DPO paper (Rafailov et al., 2023): https://arxiv.org/abs/2305.18290
OpenAI model optimization guide: https://platform.openai.com/docs/guides/fine-tuning
OpenAI supervised fine-tuning guide: https://platform.openai.com/docs/guides/supervised-fine-tuning
OpenAI DPO guide: https://platform.openai.com/docs/guides/direct-preference-optimization
OpenAI fine-tuning best practices: https://platform.openai.com/docs/guides/fine-tuning-best-practices

#fine-tuning #rag #prompting #lora #dpo #sft #distillation #agents #decision-guide

Category: Guide