EvalView is an open-source testing framework for AI agents that brings pytest-style conventions to agent evaluation. Developers define tests as readable, maintainable YAML files, run them in CI/CD, and block deployments that regress behavior, cost, or latency.
Key Features
- Intuitive YAML Test Cases: Specify inputs, expected tool calls, and acceptance thresholds in a straightforward format that stays readable and maintainable.
- Automated Regression Testing: Convert real agent conversations into regression suites that automatically re-verify behavior after every change.
- Continuous Integration Compatibility: Enforce quality gates on behavior and performance metrics so regressions are blocked before they reach production (a sample CI workflow follows this list).
- Framework Adaptability: Works with LangGraph, CrewAI, OpenAI Assistants, and Anthropic Claude, so it fits a range of agent stacks.
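As a concrete illustration of the CI gate, a GitHub Actions job might look like the sketch below. This is a minimal, hypothetical example: the `evalview run tests/` command, the `tests/` directory, and the workflow name are assumptions made for illustration and are not documented EvalView interfaces; only `pip install evalview` and the `OPENAI_API_KEY` requirement come from the quick start later in this page.

# .github/workflows/agent-tests.yml -- hypothetical CI gate for agent tests
name: agent-tests
on: [pull_request]

jobs:
  evalview:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      # Install EvalView exactly as in the quick start below.
      - run: pip install evalview
      # "evalview run tests/" is an assumed runner command, not taken from the
      # EvalView docs; substitute the real invocation. The idea is that a
      # non-zero exit code on a threshold violation blocks the merge.
      - run: evalview run tests/
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}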
Advantages Over Manual Testing
| Feature | Manual Testing | EvalView |
|---|---|---|
| Catches hallucinations | No | Yes |
| Tracks token cost | No | Automatic |
| CI/CD Integration | Difficult | Built-in |
| Detects regressions | No | Automatic |
| Tests tool calls | Manual inspection | Automated |
| Latency tracking | No | Per-test thresholds |
| Handles flaky LLMs | No | Statistical mode |
Usage Examples
Basic Budget Regression Test
Ensure cost does not exceed a defined threshold:
name: "Cost check"
input:
query: "Summarize this document"
thresholds:
min_score: 70
max_cost: 0.05
Mandatory Tool Usage Test
Verify that a specific tool is invoked:
name: "Must use search"
input:
query: "What's the weather in NYC?"
expected:
tools:
- web_search
thresholds:
min_score: 80
Hallucination Detection
Prevent the agent from fabricating information:
name: "No hallucinations"
input:
query: "What's our refund policy?"
expected:
tools:
- retriever
thresholds:
min_score: 80
checks:
hallucination: true
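These checks compose: a single test case can pin the expected tools, a quality floor, a cost ceiling, and a hallucination check at once. The sketch below only combines keys already shown in the examples above and should be read as illustrative rather than a complete schema reference.

# Hypothetical combined test using only the keys documented above
name: "Refund policy - grounded and cheap"
input:
  query: "What's our refund policy?"
expected:
  tools:
    - retriever
thresholds:
  min_score: 80
  max_cost: 0.05
checks:
  hallucination: true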
Quick Start
Getting started is simple and does not require a database or complex infrastructure setup:
pip install evalview
export OPENAI_API_KEY='your-key-here'
evalview quickstart
This creates a demo agent and a sample test case, then runs the tests and prints a full report.
Evaluation Reports
EvalView produces JSON and interactive HTML reports that turn test runs into clear, actionable insights, and continuous evaluation keeps an up-to-date picture of agent capabilities and performance.
EvalView measures behavior coverage rather than traditional line coverage, reporting on the correctness, safety, and efficiency of agent responses. It is adaptable and extensible, built to keep pace with a fast-moving AI landscape, and aims to give teams confidence that their agents are reliable enough to deploy.