# Fine-Tuning vs RAG vs Prompting

> Decision guide for agent builders: when to use prompting, RAG, or fine-tuning — and how they combine. Covers SFT, LoRA/QLoRA, DPO, distillation, and a symptom-to-fix table.

Category: Guide · Updated: 2026-06-21 · Tags: fine-tuning, rag, prompting, lora, dpo, sft, distillation, agents, decision-guide
Canonical: https://changegamer.ai/resources/fine-tuning-vs-rag

## The mental model

These three techniques solve different problems and are routinely combined in production systems. Treating them as competitors leads to the wrong choice every time.

- **Prompting** — changes what you *ask* the model, at zero cost. The right first lever.
- **RAG** — changes what the model *knows* at inference time by injecting external content into context. Use when the knowledge is proprietary, current, or too large for the weights.
- **Fine-tuning** — changes how the model *behaves*: its style, format reliability, or a narrow skill. Does not reliably teach fresh facts (weights go stale; fine-tuned knowledge does not update itself).

Key rule: **knowledge that changes often belongs in RAG, not fine-tuned weights.**

## Try in this order

1. **Prompting first** — clearer instructions, few-shot examples, structured delimiters, output schemas. See /resources/prompt-context-engineering. Fastest to iterate; fully reversible.
2. **Add RAG** before fine-tuning if the bottleneck is missing or stale knowledge. See /resources/rag-retrieval-for-agents.
3. **Fine-tune** only when you have a persistent behavioral defect that prompting and RAG cannot fix, and you have enough high-quality labeled examples to train on.

## Fine-tuning methods

**Supervised fine-tuning (SFT)** — train on input/output pairs demonstrating desired behavior. OpenAI documents SFT as a supported method for style, format, and task adaptation (platform.openai.com/docs/guides/supervised-fine-tuning).

**Parameter-efficient fine-tuning (PEFT)** — instead of updating all weights, inject small trainable matrices. The dominant method is LoRA (Low-Rank Adaptation, Hu et al., arXiv:2106.09685): freeze the base weights and add a pair of low-rank matrices (W = W₀ + AB) to each transformer layer. QLoRA extends LoRA by quantizing the base weights to 4-bit before adding the adapters, dramatically reducing GPU memory requirements. PEFT methods produce swappable adapter files that share the base model, making multi-task serving much cheaper than keeping separate full copies.

**Preference fine-tuning (DPO / RLHF)** — align the model to human preferences via ranked pairs of outputs (preferred vs. rejected). RLHF (Reinforcement Learning from Human Feedback) uses a learned reward model and policy-gradient updates. DPO (Direct Preference Optimization, Rafailov et al., arXiv:2305.18290) simplifies this: it directly optimizes a classification-style loss over preference pairs, eliminating the separate reward model and RL training loop, while matching or exceeding RLHF quality. Standard practice is SFT first, then DPO.

**Distillation** — train a smaller model to mimic a larger one's outputs on a narrow task. Use when you need a smaller, cheaper, faster model that matches a frontier model on a specific task. Requires a dataset of (input, large-model-output) pairs. Cross-link: /resources/open-weight-models-for-agents for which base models are fine-tunable.

## Symptom-to-fix table

| Symptom | Likely fix |
|---|---|
| Model lacks current or proprietary facts | RAG |
| Output format or schema is unreliable | Better prompt + structured outputs; fine-tune (SFT) if persistent |
| Tone or style is wrong | Improve system prompt; fine-tune (SFT) if consistent across inputs |
| Model too slow or expensive at scale | Distillation/fine-tune a smaller model; see /resources/agent-cost-latency-optimization |
| Model makes tool-calling mistakes | Structured output + typed schemas; SFT on tool-call examples |
| Need model to follow complex instructions reliably | Few-shot prompting first; SFT if it fails at scale |
| Behavior must reflect human ranking preferences | DPO or RLHF after SFT |

## Honest tradeoffs of fine-tuning

Fine-tuning carries real costs that teams underestimate:

- **Data preparation** — collecting, cleaning, and labeling high-quality training examples is the hardest part. Diverse, high-quality data matters more than quantity.
- **Training and evaluation infrastructure** — requires GPU compute, experiment tracking, and offline evaluation before each deploy.
- **Serving a custom model** — you now own the hosting of a bespoke artifact. When the base model is updated by the provider, your fine-tuned version stays behind.
- **Staleness** — fine-tuned knowledge does not update itself. Mixing RAG into the fine-tuned model's inference pipeline is the standard production pattern to keep knowledge current.

## Combining all three

The 2026 production default for complex agents is: a fine-tuned (or instruction-tuned) model that has been preference-aligned, served with RAG for current knowledge, and steered per-request via structured system prompts. These layers are additive: adding RAG to a fine-tuned model is normal; adding a better system prompt to a RAG-augmented fine-tuned model is normal.

Cross-links: /resources/rag-retrieval-for-agents · /resources/prompt-context-engineering · /resources/reliable-tool-calling · /resources/agent-cost-latency-optimization

## Verified sources

- LoRA paper (Hu et al., 2021): https://arxiv.org/abs/2106.09685
- DPO paper (Rafailov et al., 2023): https://arxiv.org/abs/2305.18290
- OpenAI model optimization guide: https://platform.openai.com/docs/guides/fine-tuning
- OpenAI supervised fine-tuning guide: https://platform.openai.com/docs/guides/supervised-fine-tuning
- OpenAI DPO guide: https://platform.openai.com/docs/guides/direct-preference-optimization
- OpenAI fine-tuning best practices: https://platform.openai.com/docs/guides/fine-tuning-best-practices
