Pytest-style testing framework designed for AI agents.
Pitch

EvalView offers a streamlined testing solution for AI agents built around pytest-style conventions. Developers write clear, maintainable test cases that run in CI/CD and guard against regressions in behavior, cost, and latency, making it a practical choice for teams deploying AI agents.

Description

EvalView is an open-source testing framework tailored for AI agents, bringing a readable, pytest-style workflow to agent evaluation. Developers define tests in YAML files and run them automatically, blocking deployments when behavior, cost, or latency regresses.

Key Features

  • Intuitive YAML Test Cases: Write tests that specify inputs, expected tool calls, and grading thresholds in a straightforward format that stays readable and maintainable.

  • Automated Regression Testing: Convert real agent conversations into regression suites that automatically verify behavior after every change.

  • Continuous Integration Compatibility: Integrate with CI/CD pipelines to enforce quality gates on behavior and performance metrics and block undesired changes before they reach production (see the workflow sketch after this list).

  • Framework Adaptability: Compatible with various platforms including LangGraph, CrewAI, OpenAI Assistants, and Anthropic Claude, making it versatile for different AI workflows.
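
As a sketch of the CI integration, a GitHub Actions job could install EvalView and run the suite on every pull request. The workflow below is illustrative only: the evalview run subcommand and the tests/ directory are assumptions rather than documented CLI details, and the API key is read from a repository secret.

# Illustrative GitHub Actions workflow -- not an official EvalView template.
# Assumes a hypothetical `evalview run` subcommand and tests stored in tests/.
name: agent-evals
on: [pull_request]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install evalview
      - name: Run agent regression tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: evalview run tests/   # hypothetical subcommand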

Advantages Over Manual Testing

Feature                   Manual Testing      EvalView
Catches hallucinations    No                  Yes
Tracks token cost         No                  Automatic
CI/CD Integration         Difficult           Built-in
Detects regressions       No                  Automatic
Tests tool calls          Manual inspection   Automated
Latency tracking          No                  Per-test thresholds
Handles flaky LLMs        No                  Statistical mode

Usage Examples

Basic Budget Regression Test

Ensure cost does not exceed a defined threshold:

name: "Cost check"
input:
  query: "Summarize this document"
thresholds:
  min_score: 70
  max_cost: 0.05

Mandatory Tool Usage Test

Verify that a specific tool is invoked:

name: "Must use search"
input:
  query: "What's the weather in NYC?"
expected:
  tools:
    - web_search
thresholds:
  min_score: 80

Hallucination Detection

Prevent the agent from fabricating information:

name: "No hallucinations"
input:
  query: "What's our refund policy?"
expected:
  tools:
    - retriever
thresholds:
  min_score: 80
checks:
  hallucination: true
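
Latency and Flaky-LLM Thresholds (Illustrative)

The comparison table above mentions per-test latency thresholds and a statistical mode for flaky LLMs. The sketch below shows how these could sit alongside the documented thresholds; the runs and max_latency fields are illustrative assumptions, not confirmed EvalView keys:

name: "Latency and stability check"
input:
  query: "Summarize this document"
runs: 5                 # assumption: repeat the test to average out flaky LLM output
thresholds:
  min_score: 70
  max_cost: 0.05
  max_latency: 3.0      # assumption: maximum seconds per run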

Quick Start

Getting started is simple and does not require a database or complex infrastructure setup:

pip install evalview
export OPENAI_API_KEY='your-key-here'
evalview quickstart

This creates a demo agent and a sample test case, then runs the tests with full reporting.

Evaluation Reports

EvalView provides clear, actionable insight into test performance through JSON and interactive HTML reports. Continuous evaluation keeps the team's understanding of agent capabilities and performance metrics up to date.
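
To give a sense of what the JSON output might contain, a per-test record could carry the score, cost, latency, and tool calls that the thresholds and checks above refer to. The structure below is purely illustrative (written in YAML for readability) and is not EvalView's actual report schema:

# Illustrative per-test report record -- not the actual EvalView schema
test: "Must use search"
passed: true
score: 86               # compared against min_score
cost_usd: 0.012         # compared against max_cost
latency_seconds: 1.8
tools_called:
  - web_search
checks:
  hallucination: pass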

EvalView emphasizes behavior coverage rather than traditional line coverage, giving detailed insight into agent responses across correctness, safety, and efficiency. Adaptable and extensible, it is built to keep pace with the evolving AI landscape and to help developers ship reliable agent deployments.
