Krira Chunker - High-performance chunking engine for fast data processing.


Pitch

Krira Chunker is a production-grade RAG chunking engine built in Rust that processes large volumes of data in formats such as CSV, PDF, and JSON. It runs up to 40 times faster than alternatives and uses minimal memory, so gigabytes of data can be processed quickly.

Description

Krira Chunker is a high-performance Rust-based chunking engine designed for Retrieval-Augmented Generation (RAG) pipelines. It handles gigabytes of CSV, PDF, JSON, JSONL, DOCX, XLSX, and URL inputs, among other formats.

Key Features

Ultra-Fast Processing: Up to 40 times faster than LangChain, with O(1) (constant) memory usage.
Rapid Data Handling: Process gigabytes of text in seconds, making it suitable for applications requiring quick data manipulation and retrieval.

Example Usage

Here's a quick example of how to use Krira Chunker:

from krira_augment.krira_chunker import Pipeline, PipelineConfig, SplitStrategy

# Configure the chunk size, split strategy, and text cleaning options
config = PipelineConfig(
    chunk_size=512,
    strategy=SplitStrategy.SMART,
    clean_html=True,
    clean_unicode=True,
)

# Build the pipeline from the configuration
pipeline = Pipeline(config=config)

# Chunk the input file and write the chunks to a JSONL file
result = pipeline.process("sample.csv", output_path="output.jsonl")

# Report summary statistics from the result
print(f"Chunks Created: {result.chunks_created}")
print(f"Execution Time: {result.execution_time:.2f}s")
print(f"Throughput: {result.mb_per_second:.2f} MB/s")
print(f"Preview: {result.preview_chunks[:3]}")

Comprehensive Performance Benchmark

In this benchmark, the engine processed 42.4 million chunks in under 114 seconds, a throughput of 47.51 MB/s. On completion, it prints a summary of the chunks created, execution time, and throughput:

============================================================
✅ KRIRA AUGMENT - Processing Complete
============================================================
📊 Chunks Created:  42,448,765
⏱️  Execution Time:  113.79 seconds
🚀 Throughput:      47.51 MB/s
📁 Output File:     output.jsonl
============================================================

📝 Preview (Top 3 Chunks):
------------------------------------------------------------
[1] event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
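
The reported figures can be cross-checked with a little arithmetic. The sketch below assumes that the throughput is measured over the input size and that 1 MB means 10^6 bytes; neither is stated in the output above, so the derived values are estimates only. Under those assumptions the run covers roughly 5.4 GB of input.

```python
# Figures copied from the benchmark output above
chunks_created = 42_448_765
execution_time_s = 113.79
throughput_mb_s = 47.51

# Assumption: throughput refers to input data, and 1 MB = 1_000_000 bytes
data_processed_mb = throughput_mb_s * execution_time_s
chunks_per_second = chunks_created / execution_time_s
avg_chunk_bytes = data_processed_mb * 1_000_000 / chunks_created

print(f"Data processed:  {data_processed_mb:,.0f} MB")
print(f"Chunks per sec:  {chunks_per_second:,.0f}")
print(f"Avg chunk size:  {avg_chunk_bytes:.0f} bytes")
```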

Architecture

The architecture emphasizes efficient chunk processing, operational stability, and scalability. The architecture diagram illustrates the workflow of the Krira Chunker.

[Figure: Krira Chunker Architecture]

Streaming Mode

For applications that require real-time processing without saving output to disk, streaming mode processes data on the fly while consuming minimal resources.
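
The general idea behind streaming can be sketched with a plain Python generator. This is not Krira Chunker's API or its Rust implementation: `stream_chunks` and its line-based splitting rule are hypothetical, chosen only to show how chunks can be produced one at a time so that memory use depends on the chunk size rather than the input size.

```python
import io
from typing import Iterator, TextIO


def stream_chunks(source: TextIO, chunk_size: int = 512) -> Iterator[str]:
    """Hypothetical helper: lazily yield chunks built from whole lines.

    A chunk holds at most chunk_size characters, unless a single line
    is longer than that, in which case the line becomes its own chunk.
    """
    buffer = []
    buffered = 0
    for line in source:  # text streams are read lazily, one line at a time
        if buffer and buffered + len(line) > chunk_size:
            yield "".join(buffer)
            buffer, buffered = [], 0
        buffer.append(line)
        buffered += len(line)
    if buffer:  # flush whatever is left at end of input
        yield "".join(buffer)


# Usage: chunks are consumed as they are produced; nothing is written to disk
sample = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
for chunk in stream_chunks(sample, chunk_size=8):
    print(repr(chunk))
```

Because only the current buffer is held in memory, the same loop works for an open file of any size.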

Supported File Formats

The Krira Chunker supports multiple file formats, making it versatile for various applications:

Format	Extension	Method
CSV	`.csv`	Direct processing
Text	`.txt`	Direct processing
JSONL	`.jsonl`	Direct processing
JSON	`.json`	Auto-flattening
PDF	`.pdf`	pdfplumber extraction
Word	`.docx`	python-docx extraction
Excel	`.xlsx`	openpyxl extraction
XML	`.xml`	ElementTree parsing
URLs	`http://`	BeautifulSoup scraping
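
The table above amounts to a lookup from input type to handling method. A minimal sketch of that lookup follows; `method_for` and `METHODS` are hypothetical names for illustration, not part of the library, and treating `https://` like `http://` is an assumption that the table does not state.

```python
from pathlib import PurePosixPath

# Extension-to-method mapping copied from the table above
METHODS = {
    ".csv": "Direct processing",
    ".txt": "Direct processing",
    ".jsonl": "Direct processing",
    ".json": "Auto-flattening",
    ".pdf": "pdfplumber extraction",
    ".docx": "python-docx extraction",
    ".xlsx": "openpyxl extraction",
    ".xml": "ElementTree parsing",
}


def method_for(source: str) -> str:
    """Hypothetical helper: return the table's method for a path or URL."""
    # Assumption: https:// is handled the same way as http://
    if source.startswith(("http://", "https://")):
        return "BeautifulSoup scraping"
    suffix = PurePosixPath(source).suffix.lower()
    if suffix not in METHODS:
        raise ValueError(f"unsupported format: {source!r}")
    return METHODS[suffix]


print(method_for("sample.csv"))
print(method_for("data/report.PDF"))
print(method_for("http://example.com/page"))
```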

Conclusion

Krira Chunker is a fast, memory-efficient solution for chunking large datasets in RAG pipelines. It is ideal for those looking to expand their data processing capabilities while saving time and computational resources.
