EvalLens is an evaluation tool designed specifically for structured outputs generated by large language models (LLMs). Users upload datasets containing expected and actual outputs, and EvalLens analyzes each entry in depth, identifying what passed, what failed, and why, backed by a clear failure taxonomy that goes beyond simple pass/fail judgments.
Key Features
- Prompt Regression Testing: Ensure consistency in model outputs through rigorous testing of prompts.
- Extraction Pipeline Validation: Validate the accuracy of value extractions from structured data.
- Classification Output Benchmarking: Compare and verify classification outputs against established criteria.
- Schema Conformance Checks: Ensure model outputs adhere to expected schemas.
- Debugging Support: Determine specific reasons for incorrect structured outputs.
Modes of Operation
EvalLens can function in two distinct modes:
| | Hosted | Self-Hosted |
|---|---|---|
| Functionality | Upload CSV/JSONL containing expected and actual columns and receive evaluations. | Similar functionality, with the ability to generate actual outputs using AI providers before evaluation. |
| AI calls | None; focuses on direct file comparison. | Integrates with OpenAI, Anthropic, and Gemini (via environment variables). |
| Setup | Simply visit rendonarango.com/eval-lens. | Clone the repository, configure API keys, and run locally or in Docker. |
Evaluation Process
- Upload: Drag and drop a CSV or JSONL file for evaluation.
- Validate: EvalLens verifies the data columns and infers the schema from the `expected` column.
- Generate (self-hosted only): Select a provider and model to generate missing `actual` values if necessary.
- Evaluate: Each entry is inspected for schema integrity and value correctness.
- Inspect: Filter and review evaluation results, with detailed insights on each row's performance.
- Export: Evaluation results can be saved in various formats including CSV, JSON, Markdown, or branded PDF reports.
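EvalLens's actual inference rules aren't documented here, but the "Validate" step can be sketched: walk the parsed `expected` objects and record a type name per field. The helper name and conflict handling below are assumptions for illustration only.

```python
import json

def infer_schema(expected_values):
    """Infer a simple field -> type-name schema from `expected` entries.

    Hypothetical sketch: EvalLens's real inference rules are not shown here.
    Accepts raw JSON strings or already-parsed dicts.
    """
    schema = {}
    for raw in expected_values:
        obj = json.loads(raw) if isinstance(raw, str) else raw
        for field, value in obj.items():
            type_name = type(value).__name__
            # First type seen wins; later disagreements are flagged.
            schema.setdefault(field, type_name)
            if schema[field] != type_name:
                schema[field] = "conflict"
    return schema
```

For example, rows whose `expected` values are `{"vendor": "Acme", "total": 1250}` would yield a schema mapping `vendor` to a string type and `total` to an integer type, which later rows are checked against.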
Dataset Requirements
Files must conform to the following structure:
| Column | Required | Description |
|---|---|---|
| `id` | Yes | Unique identifier for the row |
| `prompt` | Yes | Input prompt submitted to the model |
| `expected` | Yes | Expected structured output (in JSON format) |
| `actual` | Hosted: Yes / Self-hosted: Optional | Model's actual output (in JSON format) |
Sample CSV and JSONL Formats:
CSV Example:

```csv
id,prompt,expected,actual
1,"Extract the name and role","{\"name\": \"Alice\", \"role\": \"engineer\"}","{\"name\": \"Alice\", \"role\": \"engineer\"}"
2,"Extract the name and role","{\"name\": \"Bob\", \"role\": \"designer\"}","{\"name\": \"Bob\", \"role\": \"developer\"}"
```
JSONL Example:

```jsonl
{"id": "1", "prompt": "Extract the vendor and total", "expected": {"vendor": "Acme", "total": 1250}, "actual": {"vendor": "Acme", "total": 1250}}
{"id": "2", "prompt": "Extract the vendor and total", "expected": {"vendor": "TechParts", "total": 3400}, "actual": {"vendor": "TechParts", "total": "3400"}}
```
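To make the column requirements concrete, here is a minimal loader that checks a JSONL dataset against the table above. It is an illustrative sketch, not EvalLens code; the function name and error format are invented, and the `require_actual` flag stands in for the hosted-vs-self-hosted distinction.

```python
import json

REQUIRED = ("id", "prompt", "expected")  # `actual` is optional when self-hosted

def load_jsonl(text, require_actual=True):
    """Parse JSONL rows and verify the required columns are present.

    Hypothetical helper; column names follow the Dataset Requirements table.
    """
    rows = []
    for lineno, line in enumerate(text.strip().splitlines(), start=1):
        row = json.loads(line)
        missing = [col for col in REQUIRED if col not in row]
        if require_actual and "actual" not in row:
            missing.append("actual")
        if missing:
            raise ValueError(f"line {lineno}: missing columns {missing}")
        rows.append(row)
    return rows
```

Passing `require_actual=False` mirrors the self-hosted mode, where missing `actual` values can be generated by a provider before evaluation.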
Comprehensive Failure Taxonomy
EvalLens goes beyond standard evaluations by providing reasons for each failure, categorized as follows:
| Failure Reason | Description |
|---|---|
| `SCHEMA_MISMATCH` | The output structure diverges from the expected schema. |
| `MISSING_FIELD` | A mandatory field is absent in the output. |
| `WRONG_TYPE` | The field is present but holds an incorrect data type. |
| `WRONG_VALUE` | The field is of the correct type but has an erroneous value. |
| `EXTRA_FIELD` | An unexpected field is found in the output. |
| `UNPARSEABLE` | The output fails to parse as valid JSON. |
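The taxonomy above can be sketched as a simple classifier that compares an `expected` object against a raw `actual` output. This is a minimal illustration of the categories, not EvalLens's actual implementation; the function name and return shape are assumptions.

```python
import json

def classify(expected, actual_raw):
    """Return (reason, field) pairs per the failure taxonomy above.

    Hypothetical sketch: an empty list means the row passes.
    """
    try:
        actual = json.loads(actual_raw) if isinstance(actual_raw, str) else actual_raw
    except (json.JSONDecodeError, TypeError):
        return [("UNPARSEABLE", None)]
    if not isinstance(actual, dict):
        return [("SCHEMA_MISMATCH", None)]
    failures = []
    for field, want in expected.items():
        if field not in actual:
            failures.append(("MISSING_FIELD", field))
        elif type(actual[field]) is not type(want):
            failures.append(("WRONG_TYPE", field))
        elif actual[field] != want:
            failures.append(("WRONG_VALUE", field))
    for field in actual:
        if field not in expected:
            failures.append(("EXTRA_FIELD", field))
    return failures
```

Applied to row 2 of the JSONL example above, where `total` arrives as the string `"3400"` instead of the number `3400`, this would report a `WRONG_TYPE` failure on `total`.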
Export Options
Results can be exported in several formats:
- CSV: For raw data analysis in spreadsheets.
- JSON: For programmatic manipulation.
- Markdown: For human-readable reporting.
- PDF: A branded report suitable for presentation.
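The three text formats can be sketched with the standard library alone (PDF generation is omitted). This is an illustrative serializer, not EvalLens's export code; the function name and row shape are assumptions.

```python
import csv
import io
import json

def export_results(rows, fmt):
    """Serialize evaluation rows (list of dicts) to a supported text format.

    Hypothetical sketch; the branded PDF export is not reproduced here.
    """
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    if fmt == "markdown":
        headers = list(rows[0].keys())
        lines = ["| " + " | ".join(headers) + " |",
                 "|" + "---|" * len(headers)]
        for row in rows:
            lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
        return "\n".join(lines)
    raise ValueError(f"unsupported format: {fmt}")
```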
EvalLens combines a user-friendly interface with robust evaluation capabilities, making it an essential tool for developers and data scientists seeking to ensure the quality and reliability of structured outputs from language models.