RetrievalCI is a retrieval-quality measurement harness for benchmarking hosted Retrieval-Augmented Generation (RAG) services, such as Vertex AI RAG Engine and OpenAI File Search, and local retrieval architectures against a shared corpus. Currently in an early preview stage (bench-v0), it consolidates results across systems into a single scorecard, plugs into automated CI workflows with a consistent scoring mechanism, and keeps costs capped throughout the benchmark lifecycle.
Key Features
- Assessment of Retrieval Quality: evaluate hosted RAG services, including Vertex AI RAG Engine, Bedrock Knowledge Bases, Azure AI Search, and OpenAI File Search, against your own local retrieval setups. A standard set of enterprise questions and documents keeps evaluations consistent across systems.
- Dual Workflow Support: the tool supports two primary workflows:
  - Hosted-RAG Comparison: index your corpus with different hosted services and score the retrieved chunks against the same ground-truth citations, which helps identify the hosted RAG service best suited to a given deployment.
  - Local RAG Architecture Evaluation: compare local retrieval architectures such as BM25, dense retrieval, and hybrid methods on your own datasets, and establish regression gates so retrieval quality does not slip during iterative development (see the sketch after this list).
- Detailed Scorecard Generation: the scorecard reports recall, precision, and retrieval speed for each system, plus an overall score computed as
  score = 100 * (0.7 * retrieval_source_recall + 0.3 * retrieval_source_precision)
  This makes the strengths and weaknesses of each system under review clearly visible; a minimal sketch of the score computation appears after this list.
- Research Findings: RetrievalCI has produced insights into the factors affecting retrieval performance, for example how strongly the choice of local embedder model can swing results relative to hosted services. The tool keeps the focus on optimizing retrieval rather than generation quality, which is typically evaluated by other frameworks.
- Cost-Capped Operations: the system is designed with financial safety in mind, placing caps on the costs incurred during operation so it remains viable for testing without overspending.
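To make the scoring formula and the regression-gate idea concrete, here is a minimal Python sketch. The function names, the `tolerance` parameter, and the example numbers are illustrative assumptions, not part of RetrievalCI's actual API; only the 0.7/0.3 weighting comes from the formula above.

```python
def scorecard_score(recall: float, precision: float) -> float:
    """Overall score per the bench-v0 formula: 70% source recall, 30% source precision, scaled to 0-100."""
    return 100 * (0.7 * recall + 0.3 * precision)


def regression_gate(candidate_score: float, baseline_score: float, tolerance: float = 1.0) -> None:
    """Fail a CI run if the candidate retrieval architecture scores meaningfully below the baseline."""
    if candidate_score < baseline_score - tolerance:
        raise SystemExit(
            f"Retrieval regression: {candidate_score:.1f} is below baseline {baseline_score:.1f}"
        )


# Hypothetical example: a hybrid BM25 + dense run with recall 0.82 and precision 0.64 scores 76.6,
# which passes a gate set against a baseline score of 75.
candidate = scorecard_score(recall=0.82, precision=0.64)
regression_gate(candidate_score=candidate, baseline_score=75.0)
```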
Methodological Insights
Bench-v0 uses synthetic data from EnterpriseRAG-Bench, with more corpora and provider adapters planned as the project evolves. Rather than assessing generation quality, which existing tools already cover, RetrievalCI provides insight into the retrieval step itself.
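To illustrate what a provider adapter could look like, here is a hypothetical sketch; the `RetrievalAdapter` protocol, the `RetrievedChunk` type, and all method names are assumptions for illustration, not RetrievalCI's actual interfaces.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class RetrievedChunk:
    """A retrieved passage together with the source document it cites."""
    source_id: str
    text: str
    score: float


class RetrievalAdapter(Protocol):
    """Hypothetical shape of a per-provider adapter: index a corpus once, then answer questions."""

    def index(self, corpus_dir: str) -> None:
        """Upload or index the documents under corpus_dir with the provider."""
        ...

    def retrieve(self, question: str, top_k: int = 5) -> list[RetrievedChunk]:
        """Return the top_k chunks the provider retrieves for a question."""
        ...
```

A design along these lines would let the same questions and ground-truth citations be scored identically across hosted services and local baselines.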
Quick Start Example
To start benchmarking with RetrievalCI, clone the repository and set up the environment:
git clone https://github.com/colon-md/retrievalci.git
cd retrievalci
python -m venv .venv && .venv/bin/pip install -e '.[dev,providers,hosted-aws]'
make bench-v0-mock # Validates the harness without cost
Next, add API keys for the providers you want to test, then run the benchmarking script for each service:
python scripts/run_bench_v0_vertex.py run --questions ... --corpus-dir ... --output ...
Conclusion
RetrievalCI fills a gap in the evaluation of RAG services by enabling side-by-side retrieval comparisons that conventional vendor benchmarks do not offer. It gives organizations rigorous, reproducible results on their own corpora, supporting informed decisions about RAG deployment strategies.