Evaluate LLM outputs with precision, row by row.
Pitch

EvalLens is an evaluation tool for structured LLM outputs. Upload a dataset and systematically assess model performance using a detailed failure taxonomy that supports scenarios from prompt regression testing to debugging, giving precise insight into model accuracy and schema conformance.

Description

EvalLens is an evaluation tool designed specifically for structured outputs generated by large language models (LLMs). Users upload datasets containing expected and actual outputs, and EvalLens analyzes each entry in depth: what passed, what failed, and why, using a clear failure taxonomy that goes beyond simple pass/fail verdicts.

Key Features

  • Prompt Regression Testing: Ensure consistency in model outputs through rigorous testing of prompts.
  • Extraction Pipeline Validation: Validate the accuracy of value extractions from structured data.
  • Classification Output Benchmarking: Compare and verify classification outputs against established criteria.
  • Schema Conformance Checks: Ensure model outputs adhere to expected schemas.
  • Debugging Support: Determine specific reasons for incorrect structured outputs.

Modes of Operation

EvalLens can function in two distinct modes:

| | Hosted | Self-Hosted |
| --- | --- | --- |
| Functionality | Upload a CSV/JSONL containing expected and actual columns and receive evaluations. | Same, plus the ability to generate actual outputs using AI providers before evaluation. |
| AI calls | None; focuses on direct file comparison. | Integrates with OpenAI, Anthropic, and Gemini (via environment variables). |
| Setup | Simply visit rendonarango.com/eval-lens. | Clone the repository, configure API keys, and run locally or in Docker. |

Evaluation Process

  1. Upload: Drag and drop a CSV or JSONL file for evaluation.
  2. Validate: EvalLens verifies the data columns and infers the schema from the expected column.
  3. Generate (self-hosted only): Select a provider and model to generate missing actual values if necessary.
  4. Evaluate: Each entry is meticulously inspected for schema integrity and value correctness.
  5. Inspect: Filter and review evaluation results, with detailed insights on each row's performance.
  6. Export: Evaluation results can be saved in various formats including CSV, JSON, Markdown, or branded PDF reports.
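Step 2 of the process above infers a schema from the expected column. As a rough illustration of how that inference might work (a hypothetical sketch, not EvalLens's actual algorithm, which may handle nesting and optional fields), each top-level key can be mapped to the JSON type first observed for it:

```python
import json

def infer_schema(expected_rows):
    """Infer a field -> type-name map from the expected column.

    Sketch only: maps each top-level key to the type seen in the
    first row that contains it.
    """
    schema = {}
    for raw in expected_rows:
        obj = json.loads(raw) if isinstance(raw, str) else raw
        for key, value in obj.items():
            schema.setdefault(key, type(value).__name__)
    return schema

rows = [
    '{"vendor": "Acme", "total": 1250}',
    '{"vendor": "TechParts", "total": 3400}',
]
print(infer_schema(rows))  # {'vendor': 'str', 'total': 'int'}
```

With a schema like this in hand, step 4 can check each actual output against both the expected structure and the expected values.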

Dataset Requirements

Files must conform to the following structure:

| Column | Required | Description |
| --- | --- | --- |
| id | Yes | Unique identifier for the row |
| prompt | Yes | Input prompt submitted to the model |
| expected | Yes | Expected structured output (in JSON format) |
| actual | Hosted: Yes / Self-hosted: Optional | Model's actual output (in JSON format) |

Sample CSV and JSONL Formats:

CSV Example:

id,prompt,expected,actual
1,"Extract the name and role","{""name"": ""Alice"", ""role"": ""engineer""}","{""name"": ""Alice"", ""role"": ""engineer""}"
2,"Extract the name and role","{""name"": ""Bob"", ""role"": ""designer""}","{""name"": ""Bob"", ""role"": ""developer""}"

JSONL Example:

{"id": "1", "prompt": "Extract the vendor and total", "expected": {"vendor": "Acme", "total": 1250}, "actual": {"vendor": "Acme", "total": 1250}}
{"id": "2", "prompt": "Extract the vendor and total", "expected": {"vendor": "TechParts", "total": 3400}, "actual": {"vendor": "TechParts", "total": "3400"}}
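A JSONL file in the shape above can be loaded and checked for the required columns with a few lines of Python. This is a minimal sketch of the validation step, not EvalLens's own loader; its actual rules may differ:

```python
import json

REQUIRED = ("id", "prompt", "expected")  # "actual" is optional when self-hosted

def load_jsonl(path):
    """Load a JSONL dataset, raising on rows missing required columns."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            missing = [c for c in REQUIRED if c not in row]
            if missing:
                raise ValueError(f"line {lineno}: missing columns {missing}")
            rows.append(row)
    return rows
```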

Comprehensive Failure Taxonomy

EvalLens goes beyond standard evaluations by providing reasons for each failure, categorized as follows:

| Failure Reason | Description |
| --- | --- |
| SCHEMA_MISMATCH | The output structure diverges from the expected schema. |
| MISSING_FIELD | A mandatory field is absent from the output. |
| WRONG_TYPE | The field is present but holds an incorrect data type. |
| WRONG_VALUE | The field is of the correct type but has an erroneous value. |
| EXTRA_FIELD | An unexpected field is found in the output. |
| UNPARSEABLE | The output fails to parse as valid JSON. |
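To make the taxonomy concrete, here is a minimal sketch of how such failure reasons could be assigned for flat JSON objects. It is illustrative only, not EvalLens's implementation (which also reports SCHEMA_MISMATCH for deeper structural divergence):

```python
import json

def classify(expected, actual_raw):
    """Return (field, reason) pairs for one row, per the taxonomy above."""
    try:
        actual = json.loads(actual_raw) if isinstance(actual_raw, str) else actual_raw
    except (json.JSONDecodeError, TypeError):
        return [(None, "UNPARSEABLE")]
    failures = []
    for field, want in expected.items():
        if field not in actual:
            failures.append((field, "MISSING_FIELD"))
        elif type(actual[field]) is not type(want):
            failures.append((field, "WRONG_TYPE"))
        elif actual[field] != want:
            failures.append((field, "WRONG_VALUE"))
    for field in actual:
        if field not in expected:
            failures.append((field, "EXTRA_FIELD"))
    return failures

# Row 2 of the JSONL example: total comes back as a string.
print(classify({"vendor": "TechParts", "total": 3400},
               {"vendor": "TechParts", "total": "3400"}))
# [('total', 'WRONG_TYPE')]
```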

Export Options

Results can be exported in several formats:

  • CSV: For raw data analysis in spreadsheets.
  • JSON: For programmatic manipulation.
  • Markdown: For human-readable reporting.
  • PDF: A branded report suitable for presentation.
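The three text-based formats can be produced from the same result rows with the standard library alone. A hypothetical sketch (assuming each result is a dict with `id`, `passed`, and `reason` keys; the branded PDF export is omitted):

```python
import csv
import io
import json

def export_results(results, fmt):
    """Serialize evaluation rows as CSV, JSON, or a Markdown table."""
    if fmt == "json":
        return json.dumps(results, indent=2)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=["id", "passed", "reason"])
        writer.writeheader()
        writer.writerows(results)
        return buf.getvalue()
    if fmt == "markdown":
        lines = ["| id | passed | reason |", "| --- | --- | --- |"]
        lines += [f"| {r['id']} | {r['passed']} | {r['reason']} |" for r in results]
        return "\n".join(lines)
    raise ValueError(f"unsupported format: {fmt}")
```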

EvalLens combines a user-friendly interface with robust evaluation capabilities, making it an essential tool for developers and data scientists seeking to ensure the quality and reliability of structured outputs from language models.
