spark-llm-eval is a distributed LLM evaluation framework built on Apache Spark. Unlike traditional single-machine tools, it handles millions of examples, reports rigorous statistics (confidence intervals, significance tests), and integrates natively with Delta Lake, MLflow, and Unity Catalog, making it well suited to enterprises that need both scale and statistical confidence in their LLM evaluations.
Distributed LLM Evaluation Framework for Apache Spark
The spark-llm-eval repository provides an evaluation framework for large language models (LLMs) built specifically for Apache Spark, so evaluation jobs run at cluster scale. Unlike conventional evaluation tools that are limited to single-machine execution, it is designed to process millions of examples, such as customer support tickets or large document collections.
Key Features and Benefits
| Feature | Traditional Tools | spark-llm-eval |
|---|---|---|
| Scale | ~10K examples | Millions of examples |
| Statistical Rigor | Point estimates | Confidence intervals, significance tests |
| Reproducibility | Manual tracking | Delta Lake versioning + MLflow |
| Cost Efficiency | Naive API calls | Caching, batching, rate limiting |
| Enterprise Readiness | Limited | Unity Catalog, governance, audit |
The framework includes features such as:
- Distributed Inference: Spark-native execution via Pandas UDFs, so throughput scales roughly linearly with the number of executors (see the sketch after this list).
- Multi-Provider Support: Compatible with multiple LLM providers, including OpenAI, Anthropic Claude, and Google Gemini.
- Statistical Rigor: Incorporates advanced metrics like bootstrap confidence intervals and various statistical tests (e.g., paired t-tests, McNemar's test).
- Smart Rate Limiting: Token-bucket rate limiting keeps request volume within provider limits (see the rate-limiter sketch after this list).
- Comprehensive Metrics: Offers a broad range of metrics covering both lexical and semantic evaluations, alongside LLM-as-judge functionalities.
- MLflow Integration: Facilitates experiment tracking and model comparison to support reproducibility and thorough analysis.
- Delta Lake Native: Utilizes versioned datasets and ACID transactions for robust data handling.
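To make the execution model concrete, here is a minimal sketch of batched inference with a scalar Pandas UDF. It is illustrative only: the `call_llm` helper, the column names, and the overall wiring are assumptions for this example, not spark-llm-eval's actual internals.

```python
# Minimal sketch of Pandas-UDF-based distributed inference (illustrative only;
# spark-llm-eval's internal implementation may differ).
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-inference-sketch").getOrCreate()

def call_llm(prompt: str) -> str:
    # Hypothetical helper: call your LLM provider here (OpenAI, Claude, Gemini, ...).
    return "model response for: " + prompt

@pandas_udf("string")
def generate(prompts: pd.Series) -> pd.Series:
    # Each executor receives a batch of prompts as a pandas Series,
    # so API calls can be batched and rate-limited per executor.
    return prompts.apply(call_llm)

df = spark.createDataFrame([("What is Apache Spark?",)], ["question"])
predictions = df.withColumn("prediction", generate("question"))
predictions.show(truncate=False)
```

Because each batch is processed independently on an executor, adding executors increases throughput without changing the evaluation code.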
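Similarly, the token-bucket idea behind the rate limiting can be sketched in a few lines. This is a simplified single-process illustration with made-up parameters, not the library's implementation, which would also need to coordinate limits across executors.

```python
# Simplified token-bucket rate limiter (illustrative sketch, not the library's implementation).
import time


class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def acquire(self, tokens: int = 1) -> None:
        """Block until enough tokens are available, then consume them."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return
            time.sleep((tokens - self.tokens) / self.rate)


bucket = TokenBucket(rate_per_sec=5, capacity=10)  # e.g. roughly 5 requests/second
bucket.acquire()  # call before each API request
```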
Quick Start Guide
A typical evaluation sets up a Spark session, loads the evaluation dataset, configures the model, defines the evaluation task, and then executes everything through the runner:
```python
from pyspark.sql import SparkSession

from spark_llm_eval.core.config import ModelConfig, ModelProvider, MetricConfig, StatisticsConfig
from spark_llm_eval.core.task import EvalTask
from spark_llm_eval.orchestrator import EvaluationRunner, RunnerConfig
from spark_llm_eval.datasets import load_dataset

# Initialize Spark
spark = SparkSession.builder.appName("llm-eval").getOrCreate()

# Load evaluation dataset
data = load_dataset(
    spark,
    table_path="/mnt/delta/datasets/qa_test",
    input_column="question",
    reference_column="answer",
)

# Configure model
model_config = ModelConfig(
    provider=ModelProvider.OPENAI,
    model_name="gpt-4o",
    api_key_secret="secrets/openai_key",
)

# Define evaluation task (constructor arguments elided)
task = EvalTask(...)

# Configure and execute runner (RunnerConfig arguments elided)
runner_config = RunnerConfig(...)
runner = EvaluationRunner(spark, runner_config)
result = runner.run(data, task)
```
Supported Metrics
Lexical metrics include exact match, token-level F1, BLEU, ROUGE-L, substring containment, and response length ratio. Semantic metrics include BERTScore and sentence-embedding cosine similarity, while LLM-as-judge metrics provide customizable evaluation options.
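For reference, token-level F1 on a single prediction/reference pair can be computed roughly as follows. This is a generic sketch with naive whitespace tokenization; spark-llm-eval's implementation may normalize and tokenize differently.

```python
# Generic token-level F1 sketch (whitespace tokenization, lowercase normalization only).
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a prediction and a reference string."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat on the mat", "a cat sat on a mat"))  # ~0.67
```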
Statistical Features
Every metric is reported with a confidence interval, standard error, and sample size. Model comparisons use paired t-tests, McNemar's test, and related methods so that observed differences between models can be checked for statistical significance.
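For intuition, a percentile-bootstrap confidence interval over per-example scores can be computed along these lines. This is a generic NumPy sketch with assumed defaults (10,000 resamples, 95% interval), not the framework's exact procedure.

```python
# Generic percentile-bootstrap CI sketch for the mean of per-example metric scores.
import numpy as np

def bootstrap_ci(scores, n_resamples: int = 10_000, alpha: float = 0.05, seed: int = 0):
    """Return the mean score and a (1 - alpha) percentile bootstrap CI."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_resamples)
    ])
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lower, upper)

mean, (lo, hi) = bootstrap_ci([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])  # e.g. exact-match scores
print(f"mean={mean:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```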
By combining scalability, statistical rigor, and comprehensive experiment tracking, the framework aims to make large-scale LLM evaluation practical for both practitioners and researchers.