Your agent passes Monday, fails Wednesday. Same prompt, same code. agentrial runs it N times — statistics, not guesswork.
We tested 5 agents 400 times each: the expensive one costs 10× more per success for 3.5 fewer accuracy points.
Pass rates with confidence intervals, cost per success, step-level failure attribution, CI/CD regression detection.
pip install agentrial — MIT, local-first.
Why agentrial exists
AI agents are non-deterministic. Running a test once tells you nothing. We ran 5 agent archetypes 400 times each:
| Agent | Pass Rate | 95% CI | Cost/Success |
|---|---|---|---|
| Reliable RAG | 91.0% | [87.8%, 93.4%] | $0.016 |
| Expensive Multi-Model | 87.5% | [83.9%, 90.4%] | $0.161 |
| Inconsistent | 69.2% | [64.6%, 73.6%] | $0.052 |
| Flaky Coding | 65.5% | [60.7%, 70.0%] | $0.079 |
| Fast-But-Wrong | 45.2% | [40.4%, 50.1%] | $0.007 |
Every agent fails at a specific step. The Flaky Coding agent: 71% execution failures, 29% planning failures. Knowing WHERE it fails changes WHAT you fix.
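Those intervals are Wilson score intervals (the same ones listed under Key Features below), and they are small enough to reproduce by hand. A minimal sketch, independent of agentrial's implementation, assuming the Reliable RAG row corresponds to 364 successes in 400 trials:

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% at z=1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

# 364/400 = 91.0% -> roughly (0.878, 0.934), matching the table row above
print(wilson_ci(364, 400))
```

Cost per success reads the same way: total spend divided by successful trials, which is how Fast-But-Wrong can post both the worst pass rate and the lowest cost per success.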
Key Features
- Multi-trial execution — run N times (default 10), get confidence intervals instead of pass/fail
- Wilson score intervals — accurate even with small samples (N=10)
- Step-level failure attribution — Fisher exact test + Benjamini-Hochberg correction (sketched after this list)
- Cost per success — the metric that matters in production, not cost per call
- 6 framework adapters — LangGraph, CrewAI, AutoGen, Pydantic AI, OpenAI Agents, smolagents
- CI/CD integration — GitHub Action blocks PRs when reliability drops below a set threshold
- Snapshot testing — statistical regression detection across versions
- LLM-as-judge — calibrated with Krippendorff's alpha, rule-based fallback included (agreement check sketched below)
- MCP security scanner — prompt injection, tool shadowing detection
- Production monitoring — CUSUM + Page-Hinkley drift detection (see the drift sketch below)
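For step-level attribution, pairing Fisher's exact test with Benjamini-Hochberg is a standard recipe: test each step for association between "step errored" and "trial failed", then correct for testing many steps at once so false culprits stay rare. How agentrial builds its contingency tables isn't shown here, so the framing and counts below are illustrative, using scipy and statsmodels:

```python
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

# Hypothetical counts from 100 trials of a flaky agent. Per step:
# (step errored & trial failed, step errored & trial passed,
#  step clean  & trial failed, step clean  & trial passed)
steps = {
    "plan":     (10, 2, 25, 63),
    "retrieve": (4, 5, 31, 60),
    "execute":  (24, 1, 11, 64),
}

p_values = []
for ef, ep, cf, cp in steps.values():
    _, p = fisher_exact([[ef, ep], [cf, cp]])  # 2x2: errored/clean x failed/passed
    p_values.append(p)

# Benjamini-Hochberg caps the false discovery rate across all steps
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for name, p, flagged in zip(steps, p_adj, reject):
    print(f"{name}: adjusted p = {p:.4g}" + ("  <- implicated" if flagged else ""))
```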
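Krippendorff's alpha quantifies how well the LLM judge agrees with a reference rater, and it handles missing ratings natively. A minimal check using the third-party `krippendorff` package; the ratings here are made up, and agentrial's actual calibration flow may differ:

```python
import numpy as np
import krippendorff  # third-party: pip install krippendorff

# Rows = raters (LLM judge vs. human spot-check), columns = judged trials;
# 1 = pass, 0 = fail, np.nan = not rated. All values here are hypothetical.
ratings = np.array([
    [1, 0, 1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 1, 0, np.nan],
])
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"alpha = {alpha:.2f}")  # >= 0.8 is a common bar for trusting the judge
```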
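And for production drift, Page-Hinkley is a cumulative test: track how far observed failures have drifted above the running mean and alarm once the deviation clears a threshold. A minimal sketch with illustrative parameters; agentrial's actual detector and defaults are not shown here:

```python
class PageHinkley:
    """Minimal Page-Hinkley detector for an upward drift in failure rate.

    delta = tolerated per-sample deviation, threshold = alarm level;
    both values here are illustrative, not agentrial's defaults.
    """

    def __init__(self, delta: float = 0.005, threshold: float = 5.0):
        self.delta, self.threshold = delta, threshold
        self.n, self.mean = 0, 0.0
        self.cum, self.cum_min = 0.0, 0.0

    def update(self, failed: int) -> bool:
        self.n += 1
        self.mean += (failed - self.mean) / self.n    # running failure rate
        self.cum += failed - self.mean - self.delta   # cumulative deviation
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold


# 200 trials at a ~5% failure rate, then a jump to ~50%
stream = ([0] * 19 + [1]) * 10 + [1, 0] * 30
detector = PageHinkley()
for i, failed in enumerate(stream):
    if detector.update(failed):
        print(f"drift detected at trial {i}")  # fires shortly after the shift
        break
```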
Quick start
```bash
pip install agentrial
agentrial init
agentrial run --trials 100
```
Who it's for
Developers building AI agents with LangGraph, CrewAI, AutoGen, or similar frameworks who need to know if their agent still works after a prompt change, model swap, or code update — before it hits production.
MIT licensed. Everything runs locally. 349 tests, zero external dependencies required.