Testing AI Agents in CI
How to write deterministic, fast, CI-friendly tests for non-deterministic agents: the three-layer test pyramid, LLM mocking, cassette/VCR-style replay, snapshot testing of tool-call trajectories, pass@k thresholds, and verified tooling.
The core tension: agents are non-deterministic, but CI pipelines need tests that are deterministic, fast, cheap, and reliable. You can get both by being deliberate about what you test at each layer of the stack and how often each layer runs.
The three-layer test pyramid for agents
Layer 1 — deterministic unit tests (run on every commit)
Test the code around the model: tool functions, parsers, prompt-template renderers, schema validators, retry logic, and output-format coercers. Mock or stub the LLM client entirely. These tests are ordinary unit tests — no API calls, no network, fast and free. They catch the majority of regressions because most bugs live in the glue, not the model.
Layer 2 — recorded/replayed LLM interactions (run on every commit)
Use cassette/VCR-style fixtures: the first time a test runs, it hits the real API and serialises the full HTTP exchange to a YAML file. Every subsequent run replays that cassette instead of making a live call, so it is fast, free, and network-independent. Commit cassettes to version control. Re-record whenever prompt templates or schemas change: VCR.py's default request matching compares method and URI but not the request body, so a changed prompt will otherwise replay the old response without complaint. In CI, pass --record-mode=none (pytest-recording) so a missing cassette is a test failure, not a live API call.
Important: scrub credentials and sensitive headers from cassettes before committing. VCR.py and pytest-recording both support filter_headers and filter_post_data_parameters for this.
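The record-once, replay-afterwards mechanism and the credential scrubbing can be sketched with the standard library alone. This is a hand-rolled stand-in to show the idea, not the API of VCR.py or pytest-recording: `call_with_cassette`, `CassetteMissing`, `scrub`, the header list, and the JSON cassette layout are all hypothetical, and real cassettes are YAML files written by the library.

```python
import json
from pathlib import Path
from typing import Callable

# Assumption: these are the credential-bearing headers for your provider.
SENSITIVE_HEADERS = {"authorization", "x-api-key"}


class CassetteMissing(RuntimeError):
    """Raised in replay-only mode when no recording exists."""


def scrub(headers: dict) -> dict:
    """Drop credential-bearing headers before anything is written to disk."""
    return {k: v for k, v in headers.items() if k.lower() not in SENSITIVE_HEADERS}


def call_with_cassette(
    path: Path,
    request: dict,
    send: Callable[[dict], dict],
    record_mode: str = "none",
) -> dict:
    """Replay the stored response if one exists; otherwise record once or fail."""
    if path.exists():
        return json.loads(path.read_text())["response"]
    if record_mode == "none":
        raise CassetteMissing(f"no cassette at {path}; refusing to call the live API")
    response = send(request)  # the only place a live call can happen
    stored_request = {**request, "headers": scrub(request.get("headers", {}))}
    path.write_text(json.dumps({"request": stored_request, "response": response}, indent=2))
    return response
```

With record_mode="none", a missing cassette raises instead of reaching the network, which is the behaviour you want in CI. Recording happens only when a developer opts in locally with record_mode="once".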
Layer 3 — live smoke / eval tests (nightly or pre-release, NOT per-commit)
A small, hand-curated set of end-to-end tasks run against the real model. Gate these on a separate CI job (nightly, or triggered manually for releases). They are expensive and inherently flaky — keep the set small and treat failures as signals, not hard blockers per PR.
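One way to keep live tests out of the per-commit run is an explicit opt-in switch that only the nightly or release job sets. The sketch below uses the standard library's unittest; the environment variable name RUN_LIVE_EVALS, the `run_agent` entry point, and the task are assumptions made for illustration.

```python
import os
import unittest


def run_agent(task: str) -> dict:
    """Hypothetical entry point that would call the real model."""
    raise NotImplementedError("wire this to your agent")


class LiveSmokeTests(unittest.TestCase):
    def setUp(self):
        # Skip unless the CI job has explicitly opted in.
        if os.environ.get("RUN_LIVE_EVALS") != "1":
            self.skipTest("live evals run only when RUN_LIVE_EVALS=1")

    def test_refund_task_completes(self):
        result = run_agent("Refund order A-1001")
        self.assertEqual(result["status"], "completed")
```

On an ordinary commit the test is reported as skipped rather than silently absent, so it stays visible in the test report without costing anything.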
Key techniques
Mocking the LLM client — use unittest.mock.patch or a dependency-injection seam to replace the model client with a fixture that returns a pre-canned structured response. This is the cheapest form of Layer 1 testing.
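A minimal sketch of the dependency-injection variant, using unittest.mock from the standard library. The `summarize` function and the client's `complete(prompt=...)` method are hypothetical; substitute whatever interface your SDK wrapper exposes.

```python
from unittest.mock import MagicMock


def summarize(client, text: str) -> str:
    """Hypothetical glue code: build the prompt, call the model, clean the reply."""
    response = client.complete(prompt=f"Summarize in one sentence:\n{text}")
    return response["text"].strip()


def test_summarize_cleans_reply_and_builds_prompt():
    fake_client = MagicMock()
    # Pre-canned response: no network, no API key, no cost.
    fake_client.complete.return_value = {"text": "  A short summary.\n"}

    assert summarize(fake_client, "long document") == "A short summary."

    # The assertions target the glue, not the model: one call, correct prompt.
    fake_client.complete.assert_called_once()
    assert "long document" in fake_client.complete.call_args.kwargs["prompt"]
```

Passing the client as an argument means the test needs no patching at all. unittest.mock.patch is the fallback when the client is created inside the function under test.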
temperature=0 and seeds — setting temperature to zero and a fixed seed reduces variance but does not guarantee bit-for-bit identical outputs across runs. Floating-point non-associativity from GPU batching and MoE routing means the same prompt can yield different tokens in different batch contexts. Never rely on temperature=0 as a substitute for proper mocking or cassette replay.
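The arithmetic property behind this is easy to see in isolation. The toy example below only illustrates floating-point non-associativity; it does not model any particular inference stack, where the order of a reduction can depend on how requests are batched.

```python
# The same three numbers, summed in two groupings, round differently.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6
```

A difference in the last bit is harmless on its own, but when two candidate tokens have nearly equal scores it can be enough to change which one is selected.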
Snapshot testing of tool-call sequences — store the expected sequence of tool calls (names + arguments) as a snapshot. Assert on structure and argument values, not on the free-text reasoning. A tool-call diff in CI surfaces unintended trajectory changes before they reach production. Cross-link: /resources/reliable-tool-calling.
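A sketch of one way to do this with the standard library, assuming the agent's trajectory is available as a list of step dicts with a "type" field. That trajectory shape and the function names are assumptions for illustration, not a standard format.

```python
import difflib
import json


def tool_call_snapshot(trajectory: list[dict]) -> str:
    """Keep only tool names and arguments; drop free-text reasoning steps."""
    calls = [
        {"name": step["name"], "arguments": step["arguments"]}
        for step in trajectory
        if step.get("type") == "tool_call"
    ]
    # sort_keys makes the serialisation stable regardless of argument order.
    return json.dumps(calls, indent=2, sort_keys=True)


def diff_against_snapshot(expected: str, actual: str) -> str:
    """Return a unified diff; an empty string means the trajectory is unchanged."""
    return "\n".join(
        difflib.unified_diff(
            expected.splitlines(),
            actual.splitlines(),
            "snapshot",
            "current",
            lineterm="",
        )
    )
```

In a test, the committed snapshot file plays the role of `expected`. A non-empty diff fails the test and is printed in the CI log, so a reviewer sees exactly which call or argument changed.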
Structured-output assertions — if your agent emits JSON or a Pydantic schema, assert against the schema and the key field values, not against the exact prose. This tolerates benign rephrasing while catching real regressions.
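As an illustration, the test below checks types and key values while ignoring the wording of a free-text field. The field names and the refund scenario are hypothetical, and the hand-written type check stands in for whatever schema library you use (a Pydantic model would normally replace `parse_agent_output`).

```python
import json

# Hypothetical output contract for a refund agent.
REQUIRED_FIELDS = {"intent": str, "order_id": str, "refund_amount": float}


def parse_agent_output(raw: str) -> dict:
    """Parse the agent's JSON and verify required fields and their types."""
    data = json.loads(raw)
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise TypeError(f"{field} should be {expected_type.__name__}")
    return data


def test_refund_output_checks_fields_not_prose():
    # Two benign rephrasings of the user-facing message must both pass.
    for message in ("Your refund is on its way.", "We have refunded your order."):
        raw = json.dumps(
            {"intent": "refund", "order_id": "A-1001", "refund_amount": 12.5, "message": message}
        )
        out = parse_agent_output(raw)
        assert out["intent"] == "refund"
        assert out["refund_amount"] == 12.5
```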
LLM-as-judge in tests — a second model grades the output against a rubric. Useful for Layer 3 smoke tests but carries its own flakiness: the judge model can disagree with itself across runs. Treat judge scores as soft signals; set wide pass/fail thresholds and aggregate over multiple runs. See /resources/evaluating-ai-agents for benchmark-grade eval methodology.
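Aggregating over several judge runs might look like the sketch below. The `judge` callable, the 0 to 1 score scale, the default threshold, and the choice of the median as the aggregate are all assumptions; the median is used here only because it is insensitive to a single outlier verdict.

```python
from statistics import median
from typing import Callable


def judge_verdict(
    judge: Callable[[str], float],
    output: str,
    runs: int = 5,
    threshold: float = 0.6,
) -> tuple[bool, list[float]]:
    """Score the same output several times and pass on the median score."""
    scores = [judge(output) for _ in range(runs)]
    return median(scores) >= threshold, scores
```

Returning the raw scores alongside the verdict lets the CI log show how much the judge disagreed with itself, which is useful when deciding whether a failure is a regression or judge noise.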
Handling flakiness
- Separate suites — keep deterministic (Layers 1–2) and probabilistic (Layer 3) suites in distinct test files and CI jobs. Never let a non-deterministic test gate a PR.
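- pass@k thresholds — for probabilistic tests, run k trials and assert that at least m succeed (e.g., k=5 with m=4). Strictly speaking, pass@k means that at least one of k attempts succeeds, so an m-of-k threshold is the stricter variant; state which one you report. Either is more honest than a single run and absorbs natural variance without hiding real regressions.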
- Retries vs quarantine — automatic retries mask real failures; prefer quarantining a flaky test into the nightly suite until its failure mode is understood.
- Cost controls — set token-budget limits per test job. Tag each Layer 3 job with expected cost and alert when actual cost drifts more than 20%. Cross-links: /resources/agent-cost-latency-optimization, /resources/agent-observability (traces from production runs can seed new cassettes and test cases).
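Two of the rules above reduce to a few lines of arithmetic: the at-least-m-of-k threshold and the 20% cost-drift alert. The sketch below assumes a `trial` callable that runs the agent once and returns whether it succeeded, and costs expressed in US dollars; both function names are hypothetical.

```python
from typing import Callable


def passes_m_of_k(trial: Callable[[], bool], k: int = 5, m: int = 4) -> bool:
    """Run k independent trials and pass if at least m succeed."""
    successes = sum(1 for _ in range(k) if trial())
    return successes >= m


def cost_drifted(expected_usd: float, actual_usd: float, tolerance: float = 0.20) -> bool:
    """True when actual cost differs from the expected cost by more than tolerance."""
    return abs(actual_usd - expected_usd) / expected_usd > tolerance
```

All k trials run even after the outcome is decided, which keeps the cost per job predictable. Stopping early once m successes or k - m + 1 failures are reached is a reasonable optimisation if cost matters more than uniformity.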
Verified tooling
VCR.py (vcrpy) — Python HTTP record/replay. Intercepts HTTP at the library level; serialises to YAML cassettes. Works with any HTTP-based LLM SDK.
pytest-recording — a pytest plugin wrapping VCR.py. Adds the --record-mode and --block-network flags and the @pytest.mark.vcr marker. Maintained by kiwicom on GitHub.
promptfoo — YAML-driven test runner for prompts and agents with native CI/CD integration (GitHub Action available). Supports structured assertions (is-json, contains-json, llm-rubric), cost/latency thresholds, and red-teaming. MIT-licensed; acquired by OpenAI in March 2026 but remains open-source.
DeepEval — pytest-based LLM evaluation framework. assert_test() and deepeval test run plug directly into CI pipelines; supports parallel execution via -n flag. Maintained by Confident AI.
Verified sources
- VCR.py repo (kevin1024/vcrpy): https://github.com/kevin1024/vcrpy
- VCR.py docs: https://vcrpy.readthedocs.io/en/latest/
- pytest-recording repo (kiwicom): https://github.com/kiwicom/pytest-recording
- pytest-recording on PyPI: https://pypi.org/project/pytest-recording/
- promptfoo CI/CD integration docs: https://www.promptfoo.dev/docs/integrations/ci-cd/
- promptfoo GitHub: https://github.com/promptfoo/promptfoo
- DeepEval unit testing in CI/CD: https://deepeval.com/docs/evaluation-unit-testing-in-ci-cd
- DeepEval GitHub (confident-ai): https://github.com/confident-ai/deepeval
- temperature=0 non-determinism explained: https://www.zansara.dev/posts/2026-03-24-temp-0-llm/