Your agent passes Monday, fails Wednesday. Same prompt, same code. agentrial runs it N times — statistics, not guesswork.
We tested 5 agents 400 times each: the expensive one costs 10× more per success for 3.5 fewer accuracy points.
Pass rates with confidence intervals, cost per success, step-level failure attribution, CI/CD regression detection.
pip install agentrial — MIT, local-first.
Why agentrial exists
AI agents are non-deterministic. Running a test once tells you nothing. We ran 5 agent archetypes 400 times each:
| Agent | Pass Rate | 95% CI | Cost/Success |
|---|---|---|---|
| Reliable RAG | 91.0% | [87.8%, 93.4%] | $0.016 |
| Expensive Multi-Model | 87.5% | [83.9%, 90.4%] | $0.161 |
| Inconsistent | 69.2% | [64.6%, 73.6%] | $0.052 |
| Flaky Coding | 65.5% | [60.7%, 70.0%] | $0.079 |
| Fast-But-Wrong | 45.2% | [40.4%, 50.1%] | $0.007 |
Every agent fails at a specific step. The Flaky Coding agent: 71% execution failures, 29% planning failures. Knowing WHERE it fails changes WHAT you fix.
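Those intervals are Wilson score intervals (the same ones listed under Key Features below), and they are small enough to reproduce by hand. A minimal sketch, independent of agentrial's implementation, assuming the Reliable RAG row corresponds to 364 successes in 400 trials:

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% at z=1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

# 364/400 = 91.0% -> roughly (0.878, 0.934), matching the table row above
print(wilson_ci(364, 400))
```

Cost per success reads the same way: total spend divided by successful trials, which is how Fast-But-Wrong can post both the worst pass rate and the lowest cost per success.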
Key Features
- Multi-trial execution — run N times (default 10), get confidence intervals instead of pass/fail
- Wilson score intervals — accurate even with small samples (N=10)
- Step-level failure attribution — Fisher exact test + Benjamini-Hochberg correction (sketched after this list)
- Cost per success — the metric that matters in production, not cost per call
- 6 framework adapters — LangGraph, CrewAI, AutoGen, Pydantic AI, OpenAI Agents, smolagents
- CI/CD integration — GitHub Action blocks PRs when reliability drops below a set threshold
- Snapshot testing — statistical regression detection across versions
- LLM-as-judge — calibrated with Krippendorff's alpha, rule-based fallback included (agreement check sketched below)
- MCP security scanner — prompt injection, tool shadowing detection
- Production monitoring — CUSUM + Page-Hinkley drift detection (see the drift sketch below)
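For step-level attribution, pairing Fisher's exact test with Benjamini-Hochberg is a standard recipe: test each step for association between "step errored" and "trial failed", then correct for testing many steps at once so false culprits stay rare. How agentrial builds its contingency tables isn't shown here, so the framing and counts below are illustrative, using scipy and statsmodels:

```python
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

# Hypothetical counts from 100 trials of a flaky agent. Per step:
# (step errored & trial failed, step errored & trial passed,
#  step clean  & trial failed, step clean  & trial passed)
steps = {
    "plan":     (10, 2, 25, 63),
    "retrieve": (4, 5, 31, 60),
    "execute":  (24, 1, 11, 64),
}

p_values = []
for ef, ep, cf, cp in steps.values():
    _, p = fisher_exact([[ef, ep], [cf, cp]])  # 2x2: errored/clean x failed/passed
    p_values.append(p)

# Benjamini-Hochberg caps the false discovery rate across all steps
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for name, p, flagged in zip(steps, p_adj, reject):
    print(f"{name}: adjusted p = {p:.4g}" + ("  <- implicated" if flagged else ""))
```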
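Krippendorff's alpha quantifies how well the LLM judge agrees with a reference rater, and it handles missing ratings natively. A minimal check using the third-party `krippendorff` package; the ratings here are made up, and agentrial's actual calibration flow may differ:

```python
import numpy as np
import krippendorff  # third-party: pip install krippendorff

# Rows = raters (LLM judge vs. human spot-check), columns = judged trials;
# 1 = pass, 0 = fail, np.nan = not rated. All values here are hypothetical.
ratings = np.array([
    [1, 0, 1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 1, 0, np.nan],
])
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"alpha = {alpha:.2f}")  # >= 0.8 is a common bar for trusting the judge
```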
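And for production drift, Page-Hinkley is a cumulative test: track how far observed failures have drifted above the running mean and alarm once the deviation clears a threshold. A minimal sketch with illustrative parameters; agentrial's actual detector and defaults are not shown here:

```python
class PageHinkley:
    """Minimal Page-Hinkley detector for an upward drift in failure rate.

    delta = tolerated per-sample deviation, threshold = alarm level;
    both values here are illustrative, not agentrial's defaults.
    """

    def __init__(self, delta: float = 0.005, threshold: float = 5.0):
        self.delta, self.threshold = delta, threshold
        self.n, self.mean = 0, 0.0
        self.cum, self.cum_min = 0.0, 0.0

    def update(self, failed: int) -> bool:
        self.n += 1
        self.mean += (failed - self.mean) / self.n    # running failure rate
        self.cum += failed - self.mean - self.delta   # cumulative deviation
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold


# 200 trials at a ~5% failure rate, then a jump to ~50%
stream = ([0] * 19 + [1]) * 10 + [1, 0] * 30
detector = PageHinkley()
for i, failed in enumerate(stream):
    if detector.update(failed):
        print(f"drift detected at trial {i}")  # fires shortly after the shift
        break
```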
Quick start
```bash
pip install agentrial
agentrial init
agentrial run --trials 100
```
Who it's for
Developers building AI agents with LangGraph, CrewAI, AutoGen, or similar frameworks who need to know if their agent still works after a prompt change, model swap, or code update — before it hits production.
MIT licensed. Everything runs locally. 349 tests, zero external dependencies required.