Loclean is an all-in-one local AI data cleaning library built for privacy and efficiency. It runs Small Language Models (SLMs) such as Phi-3 and Llama-3 locally, with no GPU or cloud API keys required, so sensitive data stays within your infrastructure. Outputs are deterministic, structured, and validated against predefined schemas.
Key Features
Privacy-First Approach
Because inference runs locally, Loclean is suitable for handling PII, medical records, and other sensitive data. This makes it a good fit for production environments where data protection is critical.
Deterministic Data Output
Loclean addresses common problems with large language model (LLM) output, such as unpredictable formats and "hallucinations." It uses GBNF grammars and Pydantic V2 so that outputs are valid, type-safe JSON. Output that does not conform to the expected schema is rejected.
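To make the grammar idea concrete, here is a minimal sketch of how a flat schema could be turned into a GBNF grammar (the grammar format used by llama.cpp). The helper `schema_to_gbnf` and the plain field-to-type mapping are hypothetical illustrations, not Loclean's actual implementation, and only `str` and `int` fields are handled.

```python
# Hypothetical sketch: not Loclean's API. Maps Python types to GBNF rule names.
GBNF_TYPES = {str: "string", int: "integer"}

# Shared primitive rules, written as raw strings so backslashes reach the grammar.
PRIMITIVE_RULES = [
    r'string ::= "\"" [^"\\]* "\""',
    r'integer ::= "-"? [0-9]+',
    r'ws ::= [ \t\n]*',
]


def schema_to_gbnf(fields: dict[str, type]) -> str:
    """Build a GBNF grammar for a flat JSON object with fixed, ordered keys."""
    parts = []
    for name, py_type in fields.items():
        rule = GBNF_TYPES[py_type]  # KeyError for unsupported types
        # Each member is a quoted key, a colon, and a typed value.
        parts.append(f'"\\"{name}\\"" ws ":" ws {rule}')
    members = ' ws "," ws '.join(parts)
    root = f'root ::= "{{" ws {members} ws "}}"'
    return "\n".join([root, *PRIMITIVE_RULES])


grammar = schema_to_gbnf({"name": str, "price": int, "color": str})
print(grammar)
```

A grammar like this restricts generation to objects with exactly these keys and value shapes, which is why malformed structures cannot be produced in the first place. Schema validation is still useful afterwards for constraints a grammar does not express.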
Structured Data Extraction
With Loclean, extracting structured information from unstructured text is straightforward and reliable:
from pydantic import BaseModel
import loclean

class Product(BaseModel):
    name: str
    price: int
    color: str

# Extract from text
item = loclean.extract("Selling red t-shirt for 50k", schema=Product)
print(item.name)   # Outputs: 't-shirt'
print(item.price)  # Outputs: 50000
Loclean's extract() function ensures compliance with your Pydantic schema using:
- Dynamic GBNF Grammar Generation: Converts Pydantic schemas to GBNF grammars automatically.
- JSON Repair: Fixes malformed JSON outputs from LLMs.
- Retry Logic: Automatically retries prompts when validation fails.
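The repair and retry steps listed above can be sketched as follows. This is an illustration using only the Python standard library, not Loclean's internals: `repair_json`, `validate`, `extract_with_retry`, and the stubbed `fake_model` are hypothetical names, and the plain type check stands in for Pydantic validation.

```python
import json
import re
from typing import Callable


def repair_json(raw: str) -> str:
    """Naive repair: keep the outermost {...} span and drop trailing commas."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    candidate = raw[start : end + 1]
    # Remove a comma that directly precedes a closing brace or bracket.
    # (Naive: this would also touch such sequences inside string values.)
    return re.sub(r",\s*([}\]])", r"\1", candidate)


def validate(data: dict, fields: dict[str, type]) -> dict:
    """Reject objects with missing, extra, or wrongly typed fields."""
    if set(data) != set(fields):
        raise ValueError(f"expected keys {sorted(fields)}, got {sorted(data)}")
    for name, expected in fields.items():
        if type(data[name]) is not expected:
            raise ValueError(f"field {name!r} is not {expected.__name__}")
    return data


def extract_with_retry(
    generate: Callable[[str], str],
    prompt: str,
    fields: dict[str, type],
    max_attempts: int = 3,
) -> dict:
    """Call the model, repair and validate its output, retry on failure."""
    last_error = None
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = generate(current_prompt)
        try:
            return validate(json.loads(repair_json(raw)), fields)
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            last_error = exc
            # Feed the error back so the next attempt can correct itself.
            current_prompt = f"{prompt}\nPrevious output was invalid: {exc}"
    raise ValueError(f"no valid output after {max_attempts} attempts") from last_error


# Stub standing in for a local model: first reply has a wrong type,
# second reply has a repairable trailing comma.
_replies = iter([
    'Sure! {"name": "t-shirt", "price": "50k"}',
    '{"name": "t-shirt", "price": 50000,}',
])


def fake_model(prompt: str) -> str:
    return next(_replies)


result = extract_with_retry(
    fake_model, "Selling red t-shirt for 50k", {"name": str, "price": int}
)
print(result)
```

The first stubbed reply fails validation because `price` is a string, so the loop retries; the second reply parses once the trailing comma is removed.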
Backend Agnostic Design
Built on Narwhals, Loclean works with Pandas, Polars, and PyArrow data without heavy dependencies or backend lock-in, so you can keep using the dataframe library you already have.
Quick Start Guides
To assist users in getting started, Loclean provides comprehensive Jupyter notebooks covering core features and advanced data cleaning techniques. Example notebooks include:
- Quick Start: Introduction to core Loclean functionalities and structured extraction.
- Data Cleaning: In-depth strategies for effective data cleaning.
- Privacy Scrubbing: Insights on PII redaction.
Explore these resources in the examples directory to quickly become proficient with Loclean's functionalities.
For further information, visit the official documentation at nxank4.github.io/loclean.
Contributions
Loclean is an open-source project. Developers interested in contributing can refer to the Contributing Guide for setup instructions and development practices.
Loclean is an efficient and privacy-centric library that simplifies the data cleaning process while ensuring compliance and security.