# Testing AI Agents in CI

> How to write deterministic, fast, CI-friendly tests for non-deterministic agents: the three-layer test pyramid, LLM mocking, cassette/VCR-style replay, snapshot testing of tool-call trajectories, pass@k thresholds, and verified tooling.

Category: Guide · Updated: 2026-06-21 · Tags: testing, CI, agents, mocking, determinism, tool-calling, pytest
Canonical: https://changegamer.ai/resources/testing-ai-agents

The core tension: agents are non-deterministic, but CI pipelines need tests that are deterministic, fast, cheap, and reliable. You can get both — by being deliberate about which layer of the stack you test at which level.

## The three-layer test pyramid for agents

### Layer 1 — deterministic unit tests (run on every commit)

Test the code *around* the model: tool functions, parsers, prompt-template renderers, schema validators, retry logic, and output-format coercers. Mock or stub the LLM client entirely. These are ordinary unit tests: no API calls, no network, fast and free. They tend to catch a large share of regressions, because much of the breakage lives in the glue rather than in the model.
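
As a minimal sketch of a Layer 1 test, the block below exercises a tool-call parser with no model in the loop. `parse_tool_call`, the `ToolCall` shape, and the JSON wire format are all hypothetical, chosen for illustration rather than taken from any SDK.

```python
import json
import unittest
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict


def parse_tool_call(raw: str) -> ToolCall:
    """Parse a model-emitted JSON tool call (hypothetical wire format)."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"tool call is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict) or not isinstance(payload.get("name"), str):
        raise ValueError("tool call must be an object with a string 'name'")
    arguments = payload.get("arguments", {})
    if not isinstance(arguments, dict):
        raise ValueError("'arguments' must be an object")
    return ToolCall(name=payload["name"], arguments=arguments)


class ParseToolCallTest(unittest.TestCase):
    def test_valid_call(self):
        call = parse_tool_call('{"name": "search", "arguments": {"q": "ci"}}')
        self.assertEqual(call, ToolCall("search", {"q": "ci"}))

    def test_rejects_free_text(self):
        # The model answered in prose instead of emitting a tool call.
        with self.assertRaises(ValueError):
            parse_tool_call("Sure! I'll call search now.")

    def test_rejects_missing_name(self):
        with self.assertRaises(ValueError):
            parse_tool_call('{"arguments": {}}')
```

pytest collects `unittest.TestCase` subclasses, so a class like this runs unchanged under either runner.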

### Layer 2 — recorded/replayed LLM interactions (run on every commit)

Use cassette/VCR-style fixtures: the first time a test runs, it hits the real API and serialises the full HTTP exchange to a YAML file. Every subsequent run replays that cassette instead of making a live call, so it is fast, free, and network-independent. Commit cassettes to version control. Re-record only when prompt templates or schemas change. In CI, pass `--record-mode=none` (pytest-recording) so a missing cassette is a test failure, not a live API call.

**Important**: scrub credentials and sensitive headers from cassettes before committing. VCR.py and pytest-recording both support `filter_headers` and `filter_post_data_parameters` for this.
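
To make the record/replay mechanics concrete, here is a toy cassette written with the standard library only. It is a sketch of the idea, not how VCR.py is implemented: the `Cassette` class, the `transport` callable, the JSON storage format (VCR.py defaults to YAML), and the header list are all assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical list of credential-bearing headers to keep out of cassettes.
SENSITIVE_HEADERS = {"authorization", "x-api-key"}


class CassetteMiss(RuntimeError):
    """Raised in replay-only mode when no recording exists for a request."""


def scrub_headers(headers: dict) -> dict:
    """Drop credential-bearing headers before anything is written to disk."""
    return {k: v for k, v in headers.items() if k.lower() not in SENSITIVE_HEADERS}


def request_key(method: str, url: str, body: str) -> str:
    """Match requests on method, URL, and body; headers are ignored."""
    return hashlib.sha256(f"{method}\n{url}\n{body}".encode()).hexdigest()[:16]


class Cassette:
    def __init__(self, path: Path, record_mode: str = "none"):
        self.path = path
        # "none" replays only; "once" records when a request is missing.
        self.record_mode = record_mode
        self.entries = json.loads(path.read_text()) if path.exists() else {}

    def send(self, method, url, headers, body, transport):
        key = request_key(method, url, body)
        if key in self.entries:
            return self.entries[key]["response"]  # replay, no network
        if self.record_mode == "none":
            raise CassetteMiss(f"no recording for {method} {url}")
        response = transport(method, url, headers, body)  # the one live call
        self.entries[key] = {
            "request": {"method": method, "url": url,
                        "headers": scrub_headers(headers)},
            "response": response,
        }
        self.path.write_text(json.dumps(self.entries, indent=2, sort_keys=True))
        return response
```

With the real tools you would not write this yourself: pytest-recording reads the equivalent settings (such as `filter_headers`) from a `vcr_config` fixture, and tests opt in with `@pytest.mark.vcr`.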

### Layer 3 — live smoke / eval tests (nightly or pre-release, NOT per-commit)

A small, hand-curated set of end-to-end tasks run against the real model. Gate these on a separate CI job (nightly, or triggered manually for releases). They are expensive and inherently flaky, so keep the set small and treat failures as signals rather than as hard per-PR blockers.
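
One way to keep live tests out of the per-commit run is an explicit opt-in switch. The sketch below assumes a hypothetical `RUN_LIVE_EVALS` environment variable that only the nightly or release job sets; `run_agent_live` and the expected answer are placeholders, not a real API.

```python
import os
import unittest

# Hypothetical opt-in switch: export RUN_LIVE_EVALS=1 only in the nightly job.
LIVE_EVALS_ENABLED = os.environ.get("RUN_LIVE_EVALS") == "1"


def run_agent_live(task: str) -> str:
    """Placeholder for the real, network-bound agent call."""
    raise NotImplementedError("wire this to your agent entry point")


@unittest.skipUnless(LIVE_EVALS_ENABLED, "live evals run only when RUN_LIVE_EVALS=1")
class LiveSmokeTest(unittest.TestCase):
    def test_refund_window_question(self):
        answer = run_agent_live("What is the refund window for annual plans?")
        # Illustrative expectation; replace with a fact from your own domain.
        self.assertIn("30 days", answer)
```

On a normal PR run the whole class is reported as skipped, which keeps the skip visible in the test report instead of silently omitting the suite.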

## Key techniques

**Mocking the LLM client** — use `unittest.mock.patch` or a dependency-injection seam to replace the model client with a fixture that returns a pre-canned structured response. This is the cheapest form of Layer 1 testing.
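
A minimal sketch of the dependency-injection variant, using `unittest.mock.Mock` as the stand-in client. The `classify_ticket` function and the client's `complete(prompt=...)` method are hypothetical; substitute whatever call your SDK wrapper exposes.

```python
from unittest.mock import Mock

ALLOWED_LABELS = {"billing", "bug", "other"}


def classify_ticket(client, ticket_text: str) -> str:
    """Hypothetical agent step: ask the model for a label, then normalise it."""
    reply = client.complete(prompt=f"Classify this support ticket:\n{ticket_text}")
    label = reply.strip().lower()
    return label if label in ALLOWED_LABELS else "other"


def test_label_is_normalised():
    client = Mock()
    client.complete.return_value = "  Billing\n"  # pre-canned model reply
    assert classify_ticket(client, "I was charged twice") == "billing"
    client.complete.assert_called_once()
    # The ticket text must actually reach the prompt.
    assert "charged twice" in client.complete.call_args.kwargs["prompt"]


def test_unknown_label_falls_back():
    client = Mock()
    client.complete.return_value = "Probably a refund request?"
    assert classify_ticket(client, "hello") == "other"
```

Passing the client as an argument avoids patching import paths, which tends to make the test survive refactors of the module layout.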

**temperature=0 and seeds** — setting temperature to zero and a fixed seed reduces variance but does not guarantee bit-for-bit identical outputs across runs. Floating-point addition is not associative, so when GPU kernels reduce in a different order depending on batch size, or MoE routing shifts with batch composition, the same prompt can yield different tokens. Never rely on temperature=0 as a substitute for proper mocking or cassette replay.
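
The floating-point effect is easy to reproduce on a CPU. The toy example below sums the same three numbers in two groupings and shows that a near-tied greedy choice can flip as a result; the two-token "vocabulary" and the logit values are made up for illustration and are not taken from any real model.

```python
# Same three addends, two different groupings.
grouped_a = (0.1 + 0.2) + 0.3
grouped_b = 0.1 + (0.2 + 0.3)

print(grouped_a)               # 0.6000000000000001
print(grouped_b)               # 0.6
print(grouped_a == grouped_b)  # False


def greedy_pick(logits: dict) -> str:
    """Pick the highest-scoring token; ties resolve to the first key."""
    return max(logits, key=logits.get)


# A rival token sits at exactly 0.6, so the rounding difference decides.
print(greedy_pick({"no": 0.6, "yes": grouped_a}))  # yes
print(greedy_pick({"no": 0.6, "yes": grouped_b}))  # no
```

Once one token differs, everything generated after it can diverge, which is why exact-output assertions against a live model are brittle even at temperature zero.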

**Snapshot testing of tool-call sequences** — store the expected sequence of tool calls (names + arguments) as a snapshot. Assert on structure and argument values, not on the free-text reasoning. A tool-call diff in CI surfaces unintended trajectory changes before they reach production. Cross-link: /resources/reliable-tool-calling.
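
A sketch of such a snapshot check using only the standard library. It assumes each recorded tool call is a dict with `name` and `arguments` keys (a hypothetical shape); any other keys, such as free-text reasoning or call ids, are dropped before comparison.

```python
import difflib
import json


def normalise_trajectory(tool_calls: list[dict]) -> str:
    """Keep only tool names and arguments, in a stable JSON rendering."""
    slim = [{"name": c["name"], "arguments": c["arguments"]} for c in tool_calls]
    return json.dumps(slim, indent=2, sort_keys=True)


def diff_against_snapshot(tool_calls: list[dict], snapshot: str) -> str:
    """Return a unified diff; an empty string means the trajectory is unchanged."""
    actual = normalise_trajectory(tool_calls)
    return "\n".join(
        difflib.unified_diff(
            snapshot.splitlines(),
            actual.splitlines(),
            fromfile="snapshot",
            tofile="actual",
            lineterm="",
        )
    )


def test_trajectory_matches_snapshot():
    # In a real suite the snapshot would be read from a committed file.
    snapshot = normalise_trajectory([
        {"name": "search_orders", "arguments": {"customer_id": "c-42"}},
        {"name": "issue_refund", "arguments": {"order_id": "o-7", "amount": 19.99}},
    ])
    observed = [
        {"name": "search_orders", "arguments": {"customer_id": "c-42"},
         "reasoning": "Look up the order first."},
        {"name": "issue_refund", "arguments": {"amount": 19.99, "order_id": "o-7"}},
    ]
    diff = diff_against_snapshot(observed, snapshot)
    assert diff == "", diff  # the diff text becomes the failure message
```

Sorting keys makes argument order irrelevant, so only a changed tool, a changed argument value, or a reordered call shows up in the diff.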

**Structured-output assertions** — if your agent emits JSON or a Pydantic schema, assert against the schema and the key field values, not against the exact prose. This tolerates benign rephrasing while catching real regressions.
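
A sketch of this style of assertion with the standard library; a Pydantic model could play the same role as `check_structured_output`. The field names and the refund scenario are hypothetical.

```python
import json

# Hypothetical contract for the agent's JSON output.
REQUIRED_FIELDS = {
    "intent": str,
    "order_id": str,
    "refund_amount": (int, float),
    "explanation": str,
}


def check_structured_output(raw: str) -> dict:
    """Parse the output and verify that required fields exist with the right types."""
    data = json.loads(raw)
    for field, expected_type in REQUIRED_FIELDS.items():
        assert field in data, f"missing field: {field}"
        assert isinstance(data[field], expected_type), f"wrong type for {field}"
    return data


def test_refund_decision():
    raw = json.dumps({
        "intent": "refund",
        "order_id": "A-1001",
        "refund_amount": 19.99,
        "explanation": "The customer was charged twice for the same order.",
    })
    data = check_structured_output(raw)
    # Exact checks on the fields that drive behaviour.
    assert data["intent"] == "refund"
    assert data["order_id"] == "A-1001"
    assert abs(data["refund_amount"] - 19.99) < 1e-9
    # Free text: require that it is present, not that it matches word for word.
    assert data["explanation"].strip()
```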

**LLM-as-judge in tests** — a second model grades the output against a rubric. Useful for Layer 3 smoke tests but carries its own flakiness: the judge model can disagree with itself across runs. Treat judge scores as soft signals; set wide pass/fail thresholds and aggregate over multiple runs. See /resources/evaluating-ai-agents for benchmark-grade eval methodology.
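
One simple way to aggregate over multiple judge runs is to take the median score, which a single outlier cannot move much. In this sketch `judge` is any callable that returns a score between 0 and 1; the function name, the default of five runs, and the 0.6 threshold are illustrative assumptions.

```python
import statistics
from typing import Callable


def judged_pass(
    output: str,
    judge: Callable[[str], float],
    runs: int = 5,
    threshold: float = 0.6,
) -> bool:
    """Score the same output several times and compare the median to a threshold."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    scores = [judge(output) for _ in range(runs)]
    return statistics.median(scores) >= threshold
```

For example, judge scores of 0.9, 0.4, 0.8, 0.7, and 0.3 have a median of 0.7 and pass at a 0.6 threshold, even though two individual runs would have failed on their own.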

## Handling flakiness

- **Separate suites** — keep deterministic (Layers 1–2) and probabilistic (Layer 3) suites in distinct test files and CI jobs. Never let a non-deterministic test gate a PR.
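- **pass@k thresholds** — for probabilistic tests, run k trials and assert that at least m succeed (e.g., k=5 with m=4). In its strict sense pass@k only requires one success in k attempts, so state m explicitly when you report results. An m-of-k threshold is more honest than a single run and absorbs natural variance without hiding real regressions.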
- **Retries vs quarantine** — automatic retries mask real failures; prefer quarantining a flaky test into the nightly suite until its failure mode is understood.
- **Cost controls** — set token-budget limits per test job. Tag each Layer 3 job with expected cost and alert when actual cost drifts more than 20%. Cross-links: /resources/agent-cost-latency-optimization, /resources/agent-observability (traces from production runs can seed new cassettes and test cases).
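
The threshold and cost rules in the list above can be sketched as two small helpers. Both names are hypothetical, `trial` stands for any zero-argument callable that runs one attempt and returns whether it succeeded, and the 20% tolerance simply mirrors the figure quoted above.

```python
from typing import Callable


def passes_m_of_k(trial: Callable[[], bool], k: int = 5, m: int = 4) -> bool:
    """Run up to k trials and pass if at least m succeed, stopping early when decided."""
    if not 0 < m <= k:
        raise ValueError("require 0 < m <= k")
    successes = 0
    for attempt in range(k):
        successes += bool(trial())
        remaining = k - attempt - 1
        if successes >= m:
            return True  # threshold already met
        if successes + remaining < m:
            return False  # threshold can no longer be met
    return False


def cost_drift_exceeded(expected_usd: float, actual_usd: float,
                        tolerance: float = 0.20) -> bool:
    """Flag a job whose actual cost is more than `tolerance` away from its estimate."""
    if expected_usd <= 0:
        raise ValueError("expected_usd must be positive")
    return abs(actual_usd - expected_usd) / expected_usd > tolerance
```

Stopping early keeps the cost of a clearly failing or clearly passing test below k full trials, which matters when each trial is a paid model call.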

## Verified tooling

**VCR.py** (`vcrpy`) — Python HTTP record/replay. Intercepts HTTP at the library level; serialises to YAML cassettes. Works with any HTTP-based LLM SDK.

**pytest-recording** — a pytest plugin wrapping VCR.py. Adds the `--record-mode` flag and the `@pytest.mark.vcr` marker. Maintained by kiwicom on GitHub.

**promptfoo** — YAML-driven test runner for prompts and agents with native CI/CD integration (GitHub Action available). Supports structured assertions (`is-json`, `contains-json`, `llm-rubric`), cost/latency thresholds, and red-teaming. MIT-licensed; acquired by OpenAI in March 2026 but remains open-source.

**DeepEval** — pytest-based LLM evaluation framework. `assert_test()` and `deepeval test run` plug directly into CI pipelines; supports parallel execution via `-n` flag. Maintained by Confident AI.

## Verified sources

- VCR.py repo (kevin1024/vcrpy): https://github.com/kevin1024/vcrpy
- VCR.py docs: https://vcrpy.readthedocs.io/en/latest/
- pytest-recording repo (kiwicom): https://github.com/kiwicom/pytest-recording
- pytest-recording on PyPI: https://pypi.org/project/pytest-recording/
- promptfoo CI/CD integration docs: https://www.promptfoo.dev/docs/integrations/ci-cd/
- promptfoo GitHub: https://github.com/promptfoo/promptfoo
- DeepEval unit testing in CI/CD: https://deepeval.com/docs/evaluation-unit-testing-in-ci-cd
- DeepEval GitHub (confident-ai): https://github.com/confident-ai/deepeval
- temperature=0 non-determinism explained: https://www.zansara.dev/posts/2026-03-24-temp-0-llm/
