agentrial
The pytest for AI agents. Statistics, not luck.
Statistical evaluation and reliability scoring for AI agents.
Summary: agentrial evaluates AI agents by running each test multiple times, reporting statistical confidence intervals and step-level failure analysis instead of single-run pass/fail. It calculates an Agent Reliability Score, tracks cost-per-correct-answer across 45+ models, and integrates with developer tools to detect regressions and production drift.
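To make the statistics concrete, here is a minimal sketch in plain Python (not agentrial's API) of the two headline numbers: a 95% Wilson confidence interval over repeated runs, and cost-per-correct-answer. The run counts and spend are invented for illustration.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate observed over n runs."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

# Hypothetical data: 10 runs of one agent task, 8 passed, $0.12 total spend.
passes, runs, total_cost_usd = 8, 10, 0.12
lo, hi = wilson_interval(passes, runs)
print(f"pass rate {passes/runs:.0%}, 95% CI [{lo:.0%}, {hi:.0%}]")  # ~[49%, 94%]
print(f"cost per correct answer: ${total_cost_usd / passes:.3f}")   # $0.015
```

The width of that interval is the point: 8/10 looks like 80%, but with only ten runs the true pass rate could plausibly sit anywhere from roughly 49% to 94%.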
What it does
agentrial runs each agent N times, quantifying pass-rate uncertainty with Wilson confidence intervals and attributing failures to individual steps with Fisher exact tests. It ships a GitHub Action that blocks regressions in CI, a VS Code extension, and support for multiple AI frameworks and models.
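Step-level attribution with a Fisher exact test can be sketched the same way: for each step, cross-tabulate whether the step misbehaved against whether the run failed overall. The 2x2 counts below are hypothetical, and this illustrates the statistical idea rather than agentrial's internals.

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table for one step across 40 runs:
# rows = step failed / step succeeded, cols = run failed / run passed.
table = [
    [9, 1],    # step failed in 10 runs; 9 of those runs failed overall
    [3, 27],   # step succeeded in 30 runs; only 3 failed overall
]
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"odds ratio={odds_ratio:.1f}, p={p_value:.2g}")
# A small p-value flags this step as strongly associated with run failure.
```

Repeating this test per step and ranking by p-value is one way to single out the step most responsible for end-to-end failures.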
Who it's for
Developers and teams who build and test AI agents and need statistically grounded evaluation and monitoring tools.
Why it matters
LLM agent performance varies heavily from run to run, so a single passing run proves little. By repeating trials and applying statistical tests, agentrial separates genuine regressions from run-to-run noise and diagnoses where failures originate.