
agentrial


The pytest for AI agents. Statistics, not luck.

#Open Source #Developer Tools #Artificial Intelligence #GitHub

agentrial – Statistical evaluation and reliability scoring for AI agents

Summary: agentrial evaluates AI agents by running each test multiple times, yielding statistical confidence intervals and step-level failure analysis. It computes an Agent Reliability Score, tracks cost-per-correct-answer across 45+ models, and integrates with developer tools to catch regressions and production drift.
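The "statistical confidence intervals" here refer to intervals around a pass rate measured over repeated runs. A minimal sketch of the Wilson score interval for a pass rate is below; this is an illustration of the underlying statistic, not agentrial's actual implementation, and the function name is hypothetical:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate observed over n trials.

    Unlike the naive normal approximation, the Wilson interval stays
    inside [0, 1] and behaves sensibly for small n or extreme rates,
    which matters when an agent is run only a handful of times.
    """
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# Example: an agent passes 8 of 10 trials. The point estimate is 0.8,
# but the 95% interval is wide, roughly (0.49, 0.94), which is the
# "statistics, not luck" point: 10 runs cannot certify 80% reliability.
lo, hi = wilson_interval(8, 10)
```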

What it does

agentrial runs agents N times to measure performance variability, using Wilson confidence intervals for pass rates and Fisher exact tests for step-level failure attribution. It ships a GitHub Action to block regressions in CI, a VS Code extension, and support for multiple AI frameworks and models.
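Step-level failure attribution with a Fisher exact test boils down to asking, for each step, whether runs that failed that step also failed the task more often than chance would predict. A self-contained sketch of the two-sided test on a 2x2 contingency table is below (a stdlib-only illustration of the statistic, not agentrial's code; table layout and names are assumptions):

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Example layout for step-level attribution:
        a = runs where step X failed AND the task failed
        b = runs where step X failed AND the task passed
        c = runs where step X passed AND the task failed
        d = runs where step X passed AND the task passed
    A small p-value suggests step X's failures are associated with
    overall task failure, attributing blame to that step.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c
    def pmf(x: int) -> float:
        # Hypergeometric probability of x under fixed margins.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    p_obs = pmf(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    # Two-sided: sum all tables at least as extreme as the observed one.
    return sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= p_obs + 1e-12)

# Step X failed in 12 runs (11 of which failed the task) and passed in 12
# (only 3 of which failed): p ≈ 0.0028, so step X is a likely culprit.
p = fisher_exact_two_sided(11, 1, 3, 9)
```

In practice one would apply this per step across all recorded runs and flag steps whose p-value survives a multiple-comparison correction.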

Who it's for

Developers and teams building and testing AI agents who need reliable, statistically grounded evaluation and monitoring tools.

Why it matters

It addresses the high variance in LLM agent performance by providing statistically robust metrics and failure diagnostics to improve reliability and detect regressions.