Evaluate LLM outputs with precision, row by row.
Pitch

EvalLens is an evaluation tool for structured LLM outputs. Upload a dataset and systematically assess model performance using a detailed failure taxonomy that supports scenarios from prompt regression testing to debugging, giving precise insight into model accuracy and schema conformance.

Description

EvalLens is an evaluation tool designed specifically for structured outputs generated by large language models (LLMs). Users upload datasets containing expected and actual outputs, and EvalLens analyzes each entry in depth: what passed, what failed, and why, using a clear failure taxonomy that goes beyond simple pass/fail verdicts.

Key Features

  • Prompt Regression Testing: Ensure consistency in model outputs through rigorous testing of prompts.
  • Extraction Pipeline Validation: Validate the accuracy of value extractions from structured data.
  • Classification Output Benchmarking: Compare and verify classification outputs against established criteria.
  • Schema Conformance Checks: Ensure model outputs adhere to expected schemas.
  • Debugging Support: Determine specific reasons for incorrect structured outputs.

Modes of Operation

EvalLens can function in two distinct modes:

| | Hosted | Self-Hosted |
| --- | --- | --- |
| Functionality | Upload a CSV/JSONL containing expected and actual columns and receive evaluations. | Same, plus the ability to generate actual outputs using AI providers before evaluation. |
| AI calls | None; focuses on direct file comparison. | Integrates with OpenAI, Anthropic, and Gemini (via environment variables). |
| Setup | Simply visit rendonarango.com/eval-lens. | Clone the repository, configure API keys, and run locally or in Docker. |

Evaluation Process

  1. Upload: Drag and drop a CSV or JSONL file for evaluation.
  2. Validate: EvalLens verifies the data columns and infers the schema from the expected column.
  3. Generate (self-hosted only): Select a provider and model to generate missing actual values if necessary.
  4. Evaluate: Each entry is meticulously inspected for schema integrity and value correctness.
  5. Inspect: Filter and review evaluation results, with detailed insights on each row's performance.
  6. Export: Evaluation results can be saved in various formats including CSV, JSON, Markdown, or branded PDF reports.
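Step 2 of the process above infers a schema from the expected column. As a rough illustration of how that inference might work (a hypothetical sketch, not EvalLens's actual algorithm, which may handle nesting and optional fields), each top-level key can be mapped to the JSON type first observed for it:

```python
import json

def infer_schema(expected_rows):
    """Infer a field -> type-name map from the expected column.

    Sketch only: maps each top-level key to the type seen in the
    first row that contains it.
    """
    schema = {}
    for raw in expected_rows:
        obj = json.loads(raw) if isinstance(raw, str) else raw
        for key, value in obj.items():
            schema.setdefault(key, type(value).__name__)
    return schema

rows = [
    '{"vendor": "Acme", "total": 1250}',
    '{"vendor": "TechParts", "total": 3400}',
]
print(infer_schema(rows))  # {'vendor': 'str', 'total': 'int'}
```

With a schema like this in hand, step 4 can check each actual output against both the expected structure and the expected values.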

Dataset Requirements

Files must conform to the following structure:

| Column | Required | Description |
| --- | --- | --- |
| id | Yes | Unique identifier for the row |
| prompt | Yes | Input prompt submitted to the model |
| expected | Yes | Expected structured output (in JSON format) |
| actual | Hosted: Yes / Self-hosted: Optional | Model's actual output (in JSON format) |

Sample CSV and JSONL Formats:

CSV Example:

id,prompt,expected,actual
1,"Extract the name and role","{""name"": ""Alice"", ""role"": ""engineer""}","{""name"": ""Alice"", ""role"": ""engineer""}"
2,"Extract the name and role","{""name"": ""Bob"", ""role"": ""designer""}","{""name"": ""Bob"", ""role"": ""developer""}"

JSONL Example:

{"id": "1", "prompt": "Extract the vendor and total", "expected": {"vendor": "Acme", "total": 1250}, "actual": {"vendor": "Acme", "total": 1250}}
{"id": "2", "prompt": "Extract the vendor and total", "expected": {"vendor": "TechParts", "total": 3400}, "actual": {"vendor": "TechParts", "total": "3400"}}
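A JSONL file in the shape above can be loaded and checked for the required columns with a few lines of Python. This is a minimal sketch of the validation step, not EvalLens's own loader; its actual rules may differ:

```python
import json

REQUIRED = ("id", "prompt", "expected")  # "actual" is optional when self-hosted

def load_jsonl(path):
    """Load a JSONL dataset, raising on rows missing required columns."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            missing = [c for c in REQUIRED if c not in row]
            if missing:
                raise ValueError(f"line {lineno}: missing columns {missing}")
            rows.append(row)
    return rows
```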

Comprehensive Failure Taxonomy

EvalLens goes beyond standard evaluations by providing reasons for each failure, categorized as follows:

| Failure Reason | Description |
| --- | --- |
| SCHEMA_MISMATCH | The output structure diverges from the expected schema. |
| MISSING_FIELD | A mandatory field is absent from the output. |
| WRONG_TYPE | The field is present but holds an incorrect data type. |
| WRONG_VALUE | The field is of the correct type but has an erroneous value. |
| EXTRA_FIELD | An unexpected field is found in the output. |
| UNPARSEABLE | The output fails to parse as valid JSON. |
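To make the taxonomy concrete, here is a minimal sketch of how such failure reasons could be assigned for flat JSON objects. It is illustrative only, not EvalLens's implementation (which also reports SCHEMA_MISMATCH for deeper structural divergence):

```python
import json

def classify(expected, actual_raw):
    """Return (field, reason) pairs for one row, per the taxonomy above."""
    try:
        actual = json.loads(actual_raw) if isinstance(actual_raw, str) else actual_raw
    except (json.JSONDecodeError, TypeError):
        return [(None, "UNPARSEABLE")]
    failures = []
    for field, want in expected.items():
        if field not in actual:
            failures.append((field, "MISSING_FIELD"))
        elif type(actual[field]) is not type(want):
            failures.append((field, "WRONG_TYPE"))
        elif actual[field] != want:
            failures.append((field, "WRONG_VALUE"))
    for field in actual:
        if field not in expected:
            failures.append((field, "EXTRA_FIELD"))
    return failures

# Row 2 of the JSONL example: total comes back as a string.
print(classify({"vendor": "TechParts", "total": 3400},
               {"vendor": "TechParts", "total": "3400"}))
# [('total', 'WRONG_TYPE')]
```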

Export Options

Results can be exported in several formats:

  • CSV: For raw data analysis in spreadsheets.
  • JSON: For programmatic manipulation.
  • Markdown: For human-readable reporting.
  • PDF: A branded report suitable for presentation.
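The three text-based formats can be produced from the same result rows with the standard library alone. A hypothetical sketch (assuming each result is a dict with `id`, `passed`, and `reason` keys; the branded PDF export is omitted):

```python
import csv
import io
import json

def export_results(results, fmt):
    """Serialize evaluation rows as CSV, JSON, or a Markdown table."""
    if fmt == "json":
        return json.dumps(results, indent=2)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=["id", "passed", "reason"])
        writer.writeheader()
        writer.writerows(results)
        return buf.getvalue()
    if fmt == "markdown":
        lines = ["| id | passed | reason |", "| --- | --- | --- |"]
        lines += [f"| {r['id']} | {r['passed']} | {r['reason']} |" for r in results]
        return "\n".join(lines)
    raise ValueError(f"unsupported format: {fmt}")
```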

EvalLens combines a user-friendly interface with robust evaluation capabilities, making it an essential tool for developers and data scientists seeking to ensure the quality and reliability of structured outputs from language models.
