ChangeGamer


Testing AI Agents in CI

Guide · updated 2026-06-21 · Markdown variant

How to write deterministic, fast, CI-friendly tests for non-deterministic agents: the three-layer test pyramid, LLM mocking, cassette/VCR-style replay, snapshot testing of tool-call trajectories, pass@k thresholds, and verified tooling.


The core tension: agents are non-deterministic, but CI pipelines need tests that are deterministic, fast, cheap, and reliable. You can have both by being deliberate about which layer of the stack you test and how often each layer runs.

The three-layer test pyramid for agents

Layer 1 — deterministic unit tests (run on every commit)

Test the code around the model: tool functions, parsers, prompt-template renderers, schema validators, retry logic, and output-format coercers. Mock or stub the LLM client entirely. These are ordinary unit tests: no API calls, no network, fast and free. They tend to catch a large share of regressions, because much of what breaks is the glue rather than the model.
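As a minimal sketch of a Layer 1 test, the example below exercises a parser for model output without touching any model. The function parse_tool_call and the payload shape (a JSON object with "name" and "arguments") are hypothetical stand-ins for whatever glue code your agent has.

```python
import json
import unittest


def parse_tool_call(raw: str) -> dict:
    """Hypothetical glue code: turn raw model output into a tool call."""
    payload = json.loads(raw)  # invalid JSON raises a ValueError subclass
    if not isinstance(payload, dict):
        raise ValueError("tool call must be a JSON object")
    if not isinstance(payload.get("name"), str):
        raise ValueError("tool call needs a string 'name'")
    args = payload.get("arguments", {})
    if not isinstance(args, dict):
        raise ValueError("'arguments' must be an object")
    return {"name": payload["name"], "arguments": args}


class ParseToolCallTest(unittest.TestCase):
    def test_valid_call(self):
        raw = '{"name": "search", "arguments": {"query": "ci"}}'
        self.assertEqual(
            parse_tool_call(raw),
            {"name": "search", "arguments": {"query": "ci"}},
        )

    def test_missing_arguments_default_to_empty(self):
        self.assertEqual(
            parse_tool_call('{"name": "ping"}'),
            {"name": "ping", "arguments": {}},
        )

    def test_rejects_non_object(self):
        with self.assertRaises(ValueError):
            parse_tool_call('["search"]')
```

The inputs are fixed strings, so the tests are fully deterministic and need no network access.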

Layer 2 — recorded/replayed LLM interactions (run on every commit)

Use cassette/VCR-style fixtures: the first time a test runs in a recording mode, it hits the real API and serialises the full HTTP exchange to a YAML file. Every subsequent run replays that cassette instead of making a live call, which keeps it fast, free, and network-independent. Commit cassettes to version control, and re-record only when prompt templates or schemas change. In CI, pass --record-mode=none (pytest-recording; the separate pytest-vcr plugin spells this --vcr-record=none) so that a missing cassette is a test failure, not a live API call.

Important: scrub credentials and sensitive headers from cassettes before committing. VCR.py and pytest-recording both support filter_headers and filter_post_data_parameters for this.
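One way the scrubbing might be configured with pytest-recording is through its vcr_config fixture, whose contents are forwarded to VCR.py. This is a sketch of a conftest.py fragment; the header names and the api_key parameter are assumptions about your provider, so adjust them to whatever your SDK actually sends.

```python
# conftest.py
import pytest


@pytest.fixture(scope="module")
def vcr_config():
    # Options returned here are passed through to VCR.py by pytest-recording.
    return {
        # Drop credential-bearing headers before the cassette is written.
        # Header names below are assumptions; check what your SDK sends.
        "filter_headers": ["authorization", "x-api-key"],
        # Drop secrets sent in the request body (assumed parameter name).
        "filter_post_data_parameters": ["api_key"],
    }
```

Tests marked with @pytest.mark.vcr pick this configuration up. After recording, it is still worth opening the cassette and checking by eye that no secret survived.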

Layer 3 — live smoke / eval tests (nightly or pre-release, NOT per-commit)

A small, hand-curated set of end-to-end tasks run against the real model. Gate these on a separate CI job (nightly, or triggered manually for releases). They are expensive and inherently flaky — keep the set small and treat failures as signals, not hard blockers per PR.
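A simple way to keep live tests out of per-commit runs is an opt-in environment variable that only the nightly or release job sets. The sketch below uses the standard library's unittest; the variable name RUN_LIVE_EVALS and the run_live_task function are hypothetical.

```python
import os
import unittest

# Hypothetical opt-in switch: only the nightly/release job would export it.
RUN_LIVE = os.environ.get("RUN_LIVE_EVALS") == "1"


def run_live_task(prompt: str) -> str:
    """Placeholder for a call into the real agent (hypothetical)."""
    raise NotImplementedError("connect this to the real agent")


@unittest.skipUnless(RUN_LIVE, "live model tests only run when RUN_LIVE_EVALS=1")
class LiveSmokeTest(unittest.TestCase):
    def test_agent_answers_known_question(self):
        answer = run_live_task("What is the capital of France?")
        # Assert on a loose property of the answer, not its exact wording.
        self.assertIn("Paris", answer)
```

On ordinary commits the class is reported as skipped, so the suite stays green and makes no API calls.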

Key techniques

Mocking the LLM client — use unittest.mock.patch or a dependency-injection seam to replace the model client with a fixture that returns a pre-canned structured response. This is the cheapest form of Layer 1 testing.
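The dependency-injection variant can be sketched with unittest.mock.Mock. The function answer_question and the client's complete(prompt=...) method are hypothetical; substitute the call your SDK actually exposes.

```python
from unittest.mock import Mock


def answer_question(client, question: str) -> str:
    """Hypothetical agent step: ask the model, then tidy up the reply."""
    reply = client.complete(prompt=f"Answer briefly: {question}")
    return reply["text"].strip()


def test_answer_question_strips_whitespace():
    # The fake client returns a canned response; no network is involved.
    fake_client = Mock()
    fake_client.complete.return_value = {"text": "  42\n"}

    assert answer_question(fake_client, "What is 6 x 7?") == "42"
    # Also pin the prompt that was sent, so template changes show up in CI.
    fake_client.complete.assert_called_once_with(
        prompt="Answer briefly: What is 6 x 7?"
    )
```

Passing the client as an argument avoids patching import paths, which tends to be less brittle than unittest.mock.patch when modules are reorganised.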

temperature=0 and seeds — setting temperature to zero and a fixed seed reduces variance but does not guarantee bit-for-bit identical outputs across runs. Floating-point non-associativity from GPU batching and MoE routing means the same prompt can yield different tokens in different batch contexts. Never rely on temperature=0 as a substitute for proper mocking or cassette replay.

Snapshot testing of tool-call sequences — store the expected sequence of tool calls (names + arguments) as a snapshot. Assert on structure and argument values, not on the free-text reasoning. A tool-call diff in CI surfaces unintended trajectory changes before they reach production. Cross-link: /resources/reliable-tool-calling.
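A hand-rolled version of trajectory snapshotting might look like the sketch below. It keeps only tool names and arguments, serialises them in a stable order, and fails with a diff when they change. The helper names and the snapshot file layout are illustrative assumptions, not a specific library's API.

```python
import difflib
import json
from pathlib import Path


def trajectory_snapshot(tool_calls: list[dict]) -> str:
    """Serialise only tool names and arguments, with a stable key order."""
    slim = [{"name": c["name"], "arguments": c["arguments"]} for c in tool_calls]
    return json.dumps(slim, indent=2, sort_keys=True)


def assert_matches_snapshot(
    tool_calls: list[dict], path: Path, update: bool = False
) -> None:
    actual = trajectory_snapshot(tool_calls)
    if update:
        # Explicit re-record; never triggered implicitly in CI.
        path.write_text(actual, encoding="utf-8")
        return
    if not path.exists():
        raise AssertionError(f"no snapshot at {path}; record one with update=True")
    expected = path.read_text(encoding="utf-8")
    if actual != expected:
        diff = "\n".join(
            difflib.unified_diff(
                expected.splitlines(),
                actual.splitlines(),
                fromfile="snapshot",
                tofile="current",
                lineterm="",
            )
        )
        raise AssertionError("tool-call trajectory changed:\n" + diff)
```

Because free-text fields such as reasoning are dropped before comparison, rephrased reasoning does not break the test, while a renamed tool or a changed argument does.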

Structured-output assertions — if your agent emits JSON or a Pydantic schema, assert against the schema and the key field values, not against the exact prose. This tolerates benign rephrasing while catching real regressions.
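To illustrate, the sketch below validates a hypothetical "ticket" payload using only the standard library; a Pydantic model would play the same role as check_ticket. The field names and the 200-character limit are assumptions made for the example.

```python
import json


def check_ticket(raw: str) -> dict:
    """Validate a hypothetical ticket payload: required fields and types."""
    data = json.loads(raw)
    required = {"priority": str, "component": str, "summary": str}
    for field, expected_type in required.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"{field!r} missing or not {expected_type.__name__}")
    return data


def test_ticket_output():
    # Canned agent output; in a real suite this comes from a mock or cassette.
    raw = (
        '{"priority": "high", "component": "auth", '
        '"summary": "Login fails after the token expires."}'
    )
    ticket = check_ticket(raw)
    # Pin the decision fields exactly...
    assert ticket["priority"] == "high"
    assert ticket["component"] == "auth"
    # ...but only sanity-check the free-text field.
    assert 0 < len(ticket["summary"]) <= 200
```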

LLM-as-judge in tests — a second model grades the output against a rubric. Useful for Layer 3 smoke tests but carries its own flakiness: the judge model can disagree with itself across runs. Treat judge scores as soft signals; set wide pass/fail thresholds and aggregate over multiple runs. See /resources/evaluating-ai-agents for benchmark-grade eval methodology.
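Aggregating judge scores over several runs can be sketched as below. The judge is any callable returning a score between 0 and 1; the run count, the threshold, and the use of a plain mean are illustrative choices, not recommendations from a particular framework.

```python
from statistics import mean
from typing import Callable


def judged_pass(
    judge: Callable[[str], float],
    output: str,
    runs: int = 5,
    threshold: float = 0.6,
) -> bool:
    """Call the judge several times and compare the mean score to a threshold."""
    scores = [judge(output) for _ in range(runs)]
    return mean(scores) >= threshold


def test_noisy_judge_still_passes():
    # Stub judge that disagrees with itself across runs (hypothetical scores).
    canned = iter([0.9, 0.5, 0.8, 0.7, 0.6])
    assert judged_pass(lambda _output: next(canned), "agent answer")
```

Here a single run scoring 0.5 would have failed a 0.6 threshold on its own, while the mean over five runs passes, which is the point of aggregating.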

Handling flakiness

Verified tooling

VCR.py (vcrpy) — Python HTTP record/replay. Intercepts HTTP at the library level; serialises to YAML cassettes. Works with any HTTP-based LLM SDK.

pytest-recording — a pytest plugin wrapping VCR.py. Adds the --record-mode flag and the @pytest.mark.vcr marker. Maintained by kiwicom on GitHub.

promptfoo — YAML-driven test runner for prompts and agents with native CI/CD integration (GitHub Action available). Supports structured assertions (is-json, contains-json, llm-rubric), cost/latency thresholds, and red-teaming. MIT-licensed; acquired by OpenAI in March 2026 but remains open-source.

DeepEval — pytest-based LLM evaluation framework. assert_test() and deepeval test run plug directly into CI pipelines; supports parallel execution via -n flag. Maintained by Confident AI.

