pytest for AI agents. Statistics, not anecdotes.
Pitch

Your agent passes Monday, fails Wednesday. Same prompt, same code. agentrial runs it N times — statistics, not guesswork.

We tested 5 agents 400 times each: the expensive one costs 10× more per success for 3.5 fewer accuracy points.

Pass rates with confidence intervals, cost per success, step-level failure attribution, CI/CD regression detection.

pip install agentrial — MIT, local-first.

Description

Why agentrial exists

AI agents are non-deterministic. Running a test once tells you nothing. We ran 5 agent archetypes 400 times each:

| Agent | Pass Rate | 95% CI | Cost/Success |
| --- | --- | --- | --- |
| Reliable RAG | 91.0% | [87.8%, 93.4%] | $0.016 |
| Expensive Multi-Model | 87.5% | [83.9%, 90.4%] | $0.161 |
| Inconsistent | 69.2% | [64.6%, 73.6%] | $0.052 |
| Flaky Coding | 65.5% | [60.7%, 70.0%] | $0.079 |
| Fast-But-Wrong | 45.2% | [40.4%, 50.1%] | $0.007 |

Every agent fails at a specific step. The Flaky Coding agent: 71% execution failures, 29% planning failures. Knowing WHERE it fails changes WHAT you fix.
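Step-level attribution can be sketched as the Key Features list describes it: a Fisher exact test per step, then Benjamini-Hochberg correction across all steps tested. The counts, step names, and thresholds below are hypothetical, and this stdlib-only sketch is not agentrial's implementation:

```python
from math import comb

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    def hyper(x):  # P(top-left cell = x) with all margins fixed
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    p_obs = hyper(a)
    # Two-sided: sum every table at least as extreme as the observed one.
    return sum(p for p in (hyper(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (step-up FDR correction)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical counts from 100 trials of one agent: did the "execute"
# step error, split by whether the whole run ultimately failed?
#                 step errored   step ok
# run failed            40          10
# run passed             5          45
p_execute = fisher_exact_p(40, 10, 5, 45)

# Correct across every step tested (the other p-values are made up too).
p_per_step = {"plan": 0.21, "retrieve": 0.04, "execute": p_execute}
adjusted = bh_adjust(list(p_per_step.values()))
```

The correction matters because an agent with many steps gets many tests: without it, some step would look "significantly broken" by chance alone.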

Key Features

  • Multi-trial execution — run N times (default 10), get confidence intervals instead of pass/fail
  • Wilson score intervals — accurate even with small samples (N=10)
  • Step-level failure attribution — Fisher exact test + Benjamini-Hochberg correction
  • Cost per success — the metric that matters in production, not cost per call
  • 6 framework adapters — LangGraph, CrewAI, AutoGen, Pydantic AI, OpenAI Agents, smolagents
  • CI/CD integration — GitHub Action blocks PRs when reliability drops below threshold
  • Snapshot testing — statistical regression detection across versions
  • LLM-as-judge — calibrated with Krippendorff's alpha, rule-based fallback included
  • MCP security scanner — prompt injection, tool shadowing detection
  • Production monitoring — CUSUM + Page-Hinkley drift detection
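The CUSUM half of the drift detection listed above can be illustrated in a few lines. This is a generic one-sided CUSUM over a pass/fail stream, with made-up parameters and data, not agentrial's monitor:

```python
def cusum_drop_alarm(outcomes, target=0.9, slack=0.05, threshold=2.0):
    """One-sided CUSUM: alarm when the pass rate drifts below `target`.

    Accumulates evidence of a downward shift; `slack` absorbs normal
    trial-to-trial noise and `threshold` sets sensitivity. Returns the
    index of the first alarm, or None if no drift is detected.
    """
    s = 0.0
    for i, x in enumerate(outcomes):  # x is 1 (pass) or 0 (fail)
        s = max(0.0, s + (target - x) - slack)
        if s > threshold:
            return i
    return None

# 50 passing trials, then the agent silently breaks.
stream = [1] * 50 + [0] * 50
print(cusum_drop_alarm(stream))  # → 52, a few trials after the break
```

Because the statistic resets at zero while things are healthy, a genuine shift is flagged within a handful of trials instead of waiting for a long-run average to move.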

Quick start

pip install agentrial
agentrial init
agentrial run --trials 100
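Cost per success, the headline metric in the table above, is just total spend divided by successful trials. A toy sketch with hypothetical trial records (not agentrial's data model):

```python
def cost_per_success(trials):
    """trials: list of (passed: bool, cost_usd: float) records."""
    total = sum(cost for _, cost in trials)
    wins = sum(1 for passed, _ in trials if passed)
    return total / wins if wins else float("inf")

# Hypothetical run: 10 trials at $0.002 each, 8 passed.
records = [(True, 0.002)] * 8 + [(False, 0.002)] * 2
print(f"${cost_per_success(records):.4f} per success")  # → $0.0025 per success
```

Failed trials still add to the numerator, which is why cost per success pulls away from cost per call as reliability drops, exactly the gap the benchmark table exposes.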

Who it's for

Developers building AI agents with LangGraph, CrewAI, AutoGen, or similar frameworks who need to know if their agent still works after a prompt change, model swap, or code update — before it hits production.

MIT licensed. Everything runs locally. 349 tests, zero external dependencies required.
