Krira Chunker is a production-grade, high-performance chunking engine built in Rust for Retrieval-Augmented Generation (RAG) pipelines. It processes gigabytes of data in formats including CSV, PDF, JSON, JSONL, DOCX, XLSX, and URLs, with reported speeds up to 40 times faster than alternatives and minimal memory usage.
Key Features
- Ultra-Fast Processing: Up to 40 times faster than LangChain, with O(1) memory usage.
- Rapid Data Handling: Process gigabytes of text in seconds, making it suitable for applications requiring quick data manipulation and retrieval.
Example Usage
The following example configures a pipeline, processes a CSV file, and prints the run statistics:
from krira_augment.krira_chunker import Pipeline, PipelineConfig, SplitStrategy
config = PipelineConfig(
    chunk_size=512,
    strategy=SplitStrategy.SMART,
    clean_html=True,
    clean_unicode=True,
)
pipeline = Pipeline(config=config)
result = pipeline.process("sample.csv", output_path="output.jsonl")
print(f"Chunks Created: {result.chunks_created}")
print(f"Execution Time: {result.execution_time:.2f}s")
print(f"Throughput: {result.mb_per_second:.2f} MB/s")
print(f"Preview: {result.preview_chunks[:3]}")
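The example above writes its chunks to `output.jsonl`. As a minimal sketch of how that file could be inspected afterwards, the helper below reads a JSONL file one line at a time using only the standard library. The function names `iter_jsonl` and `count_records` are hypothetical and not part of Krira Chunker, and the only assumption about the output is that each non-empty line holds one JSON value; the fields inside each record are not documented here.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_jsonl(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_records(path: str) -> int:
    """Count records lazily, without loading the whole file into memory."""
    return sum(1 for _ in iter_jsonl(path))
```

Calling `count_records("output.jsonl")` after a run should, under that assumption, agree with the reported `chunks_created` value.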
Comprehensive Performance Benchmark
In this benchmark, the engine processed 42.4 million chunks in under 114 seconds, a throughput of 47.51 MB/s. The output summarizes chunk creation, execution time, and throughput, followed by a preview of the first chunks:
============================================================
✅ KRIRA AUGMENT - Processing Complete
============================================================
📊 Chunks Created: 42,448,765
⏱️ Execution Time: 113.79 seconds
🚀 Throughput: 47.51 MB/s
📁 Output File: output.jsonl
============================================================
📝 Preview (Top 3 Chunks):
------------------------------------------------------------
[1] event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
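The figures in this report can be cross-checked with a little arithmetic. The sketch below derives the implied input size, chunk rate, and average chunk size from the three reported numbers. It assumes that throughput is measured as input megabytes over execution time and that 1 MB is 10^6 bytes; neither is stated in the report, so the derived values are estimates only.

```python
# Values copied from the benchmark report above.
chunks_created = 42_448_765
execution_time_s = 113.79
throughput_mb_s = 47.51

# Assumption: throughput = input size / execution time.
implied_input_mb = throughput_mb_s * execution_time_s

# Chunks emitted per second of wall-clock time.
chunks_per_second = chunks_created / execution_time_s

# Assumption: 1 MB = 1_000_000 bytes.
avg_bytes_per_chunk = implied_input_mb * 1_000_000 / chunks_created

print(f"Implied input size: {implied_input_mb:,.0f} MB")
print(f"Chunk rate: {chunks_per_second:,.0f} chunks/s")
print(f"Average chunk size: {avg_bytes_per_chunk:.1f} bytes")
```

Under these assumptions the run corresponds to roughly 5.4 GB of input, about 373,000 chunks per second, and an average of about 127 bytes per chunk.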
Architecture
The architecture emphasizes efficient chunk processing, operational stability, and scalability. The architecture diagram illustrates the workflow of the Krira Chunker.
Streaming Mode
For applications that need real-time processing without saving results to disk, streaming mode hands chunks to the caller as they are produced, which keeps resource consumption minimal.
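The general idea behind streaming can be sketched with a plain Python generator. This is an illustration of the technique only, not Krira Chunker's API or its Rust implementation: `stream_chunks` is a hypothetical name, and the splitting rule (group whole lines until a character budget is reached) is an assumption chosen for simplicity.

```python
from typing import Iterable, Iterator


def stream_chunks(lines: Iterable[str], chunk_size: int = 512) -> Iterator[str]:
    """Yield chunks of whole lines, each at most chunk_size characters.

    A single line longer than chunk_size is emitted as its own oversize chunk.
    Only the current chunk is held in memory at any time.
    """
    buffer = []
    buffered = 0
    for line in lines:
        # Flush the current chunk if adding this line would exceed the budget.
        if buffer and buffered + len(line) > chunk_size:
            yield "".join(buffer)
            buffer, buffered = [], 0
        buffer.append(line)
        buffered += len(line)
    if buffer:  # emit whatever is left at end of input
        yield "".join(buffer)
```

Because an open file object is itself an iterable of lines, it can be passed directly as `lines`, and each chunk can be sent to an embedding or indexing step as soon as it is yielded, with nothing written to disk.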
Supported File Formats
The Krira Chunker supports multiple file formats, making it versatile for various applications:
| Format | Extension | Method |
|---|---|---|
| CSV | .csv | Direct processing |
| Text | .txt | Direct processing |
| JSONL | .jsonl | Direct processing |
| JSON | .json | Auto-flattening |
| PDF | .pdf | pdfplumber extraction |
| Word | .docx | python-docx extraction |
| Excel | .xlsx | openpyxl extraction |
| XML | .xml | ElementTree parsing |
| URLs | http:// | BeautifulSoup scraping |
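The table can be read as a dispatch rule from source type to handling method. The sketch below expresses that rule as a lookup keyed on file extension. It is a hypothetical illustration rather than the library's own code: `METHOD_BY_EXTENSION` and `method_for` are invented names, and treating `https://` the same as `http://` is an assumption, since the table lists only `http://`.

```python
from pathlib import Path
from urllib.parse import urlparse

# Extension-to-method mapping, transcribed from the table above.
METHOD_BY_EXTENSION = {
    ".csv": "Direct processing",
    ".txt": "Direct processing",
    ".jsonl": "Direct processing",
    ".json": "Auto-flattening",
    ".pdf": "pdfplumber extraction",
    ".docx": "python-docx extraction",
    ".xlsx": "openpyxl extraction",
    ".xml": "ElementTree parsing",
}


def method_for(source: str) -> str:
    """Return the handling method the table lists for a path or URL."""
    # Assumption: https URLs are handled like http URLs.
    if urlparse(source).scheme in ("http", "https"):
        return "BeautifulSoup scraping"
    suffix = Path(source).suffix.lower()  # case-insensitive extension match
    try:
        return METHOD_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"Unsupported source: {source!r}") from None
```

For example, `method_for("sample.csv")` returns `"Direct processing"`, while an unlisted extension raises `ValueError`.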
Conclusion
The Krira Chunker presents a powerful and efficient solution for handling large datasets, significantly optimizing the chunking process necessary for RAG pipelines. It is ideal for those looking to enhance their data processing capabilities while saving on time and computational resources.