spark-llm-eval is a distributed LLM evaluation framework built on Apache Spark. Unlike traditional single-machine tools, it handles millions of examples, reports rigorous statistics (confidence intervals, significance tests), and integrates natively with Delta Lake, MLflow, and Unity Catalog, making it well suited to enterprises that need both scale and statistical confidence in their LLM evaluations.
Distributed LLM Evaluation Framework for Apache Spark
The spark-llm-eval repository provides an evaluation framework for large language models (LLMs) built specifically for Apache Spark, so evaluation jobs run at cluster scale. Unlike conventional evaluation tools that are limited to single-machine execution, it is designed to process millions of examples, such as customer support tickets or large document collections.
Key Features and Benefits
| Feature | Traditional Tools | spark-llm-eval |
|---|---|---|
| Scale | ~10K examples | Millions of examples |
| Statistical Rigor | Point estimates | Confidence intervals, significance tests |
| Reproducibility | Manual tracking | Delta Lake versioning + MLflow |
| Cost Efficiency | Naive API calls | Caching, batching, rate limiting |
| Enterprise Readiness | Limited | Unity Catalog, governance, audit |
The framework includes features such as:
- Distributed Inference: Spark-native execution via Pandas UDFs, so throughput scales roughly linearly with the number of executors (see the sketch after this list).
- Multi-Provider Support: Compatible with multiple LLM providers, including OpenAI, Anthropic Claude, and Google Gemini.
- Statistical Rigor: Incorporates advanced metrics like bootstrap confidence intervals and various statistical tests (e.g., paired t-tests, McNemar's test).
- Smart Rate Limiting: Token-bucket rate limiting keeps request volume within provider limits (see the rate-limiter sketch after this list).
- Comprehensive Metrics: Offers a broad range of metrics covering both lexical and semantic evaluations, alongside LLM-as-judge functionalities.
- MLflow Integration: Facilitates experiment tracking and model comparison to support reproducibility and thorough analysis.
- Delta Lake Native: Utilizes versioned datasets and ACID transactions for robust data handling.
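To make the execution model concrete, here is a minimal sketch of batched inference with a scalar Pandas UDF. It is illustrative only: the `call_llm` helper, the column names, and the overall wiring are assumptions for this example, not spark-llm-eval's actual internals.

```python
# Minimal sketch of Pandas-UDF-based distributed inference (illustrative only;
# spark-llm-eval's internal implementation may differ).
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-inference-sketch").getOrCreate()

def call_llm(prompt: str) -> str:
    # Hypothetical helper: call your LLM provider here (OpenAI, Claude, Gemini, ...).
    return "model response for: " + prompt

@pandas_udf("string")
def generate(prompts: pd.Series) -> pd.Series:
    # Each executor receives a batch of prompts as a pandas Series,
    # so API calls can be batched and rate-limited per executor.
    return prompts.apply(call_llm)

df = spark.createDataFrame([("What is Apache Spark?",)], ["question"])
predictions = df.withColumn("prediction", generate("question"))
predictions.show(truncate=False)
```

Because each batch is processed independently on an executor, adding executors increases throughput without changing the evaluation code.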
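Similarly, the token-bucket idea behind the rate limiting can be sketched in a few lines. This is a simplified single-process illustration with made-up parameters, not the library's implementation, which would also need to coordinate limits across executors.

```python
# Simplified token-bucket rate limiter (illustrative sketch, not the library's implementation).
import time


class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def acquire(self, tokens: int = 1) -> None:
        """Block until enough tokens are available, then consume them."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return
            time.sleep((tokens - self.tokens) / self.rate)


bucket = TokenBucket(rate_per_sec=5, capacity=10)  # e.g. roughly 5 requests/second
bucket.acquire()  # call before each API request
```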
Quick Start Guide
A typical evaluation sets up a Spark session, loads the evaluation dataset, configures the model, defines the evaluation task, and then executes everything through the runner:
```python
from pyspark.sql import SparkSession

from spark_llm_eval.core.config import ModelConfig, ModelProvider, MetricConfig, StatisticsConfig
from spark_llm_eval.core.task import EvalTask
from spark_llm_eval.orchestrator import EvaluationRunner, RunnerConfig
from spark_llm_eval.datasets import load_dataset

# Initialize Spark
spark = SparkSession.builder.appName("llm-eval").getOrCreate()

# Load evaluation dataset
data = load_dataset(
    spark,
    table_path="/mnt/delta/datasets/qa_test",
    input_column="question",
    reference_column="answer",
)

# Configure model
model_config = ModelConfig(
    provider=ModelProvider.OPENAI,
    model_name="gpt-4o",
    api_key_secret="secrets/openai_key",
)

# Define evaluation task (constructor arguments elided)
task = EvalTask(...)

# Configure and execute runner (RunnerConfig arguments elided)
runner_config = RunnerConfig(...)
runner = EvaluationRunner(spark, runner_config)
result = runner.run(data, task)
```
Supported Metrics
Lexical metrics include exact match, token-level F1, BLEU, ROUGE-L, substring containment, and response length ratio. Semantic metrics include BERTScore and sentence-embedding cosine similarity, while LLM-as-judge metrics provide customizable evaluation options.
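For reference, token-level F1 on a single prediction/reference pair can be computed roughly as follows. This is a generic sketch with naive whitespace tokenization; spark-llm-eval's implementation may normalize and tokenize differently.

```python
# Generic token-level F1 sketch (whitespace tokenization, lowercase normalization only).
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a prediction and a reference string."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat on the mat", "a cat sat on a mat"))  # ~0.67
```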
Statistical Features
Every metric is reported with a confidence interval, standard error, and sample size. Model comparisons use paired t-tests, McNemar's test, and related methods so that observed differences between models can be checked for statistical significance.
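For intuition, a percentile-bootstrap confidence interval over per-example scores can be computed along these lines. This is a generic NumPy sketch with assumed defaults (10,000 resamples, 95% interval), not the framework's exact procedure.

```python
# Generic percentile-bootstrap CI sketch for the mean of per-example metric scores.
import numpy as np

def bootstrap_ci(scores, n_resamples: int = 10_000, alpha: float = 0.05, seed: int = 0):
    """Return the mean score and a (1 - alpha) percentile bootstrap CI."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_resamples)
    ])
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lower, upper)

mean, (lo, hi) = bootstrap_ci([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])  # e.g. exact-match scores
print(f"mean={mean:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```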
By combining scalability, statistical rigor, and comprehensive experiment tracking, the framework aims to make large-scale LLM evaluation practical for both practitioners and researchers.