Loclean
An efficient, privacy-first AI data cleaning tool for local use.
Pitch

Loclean is a comprehensive local AI data cleaning library designed to enhance data integrity without compromising privacy. By leveraging small language models locally, it ensures sensitive information remains secure while providing deterministic, structured outputs that comply with predefined schemas.

Description

Loclean is an all-in-one local AI data cleaning library designed with a focus on privacy and efficiency. It enables users to leverage the power of Small Language Models (SLMs), such as Phi-3 and Llama-3, locally without requiring a GPU or cloud API keys, ensuring that sensitive data remains within your infrastructure.

Key Features

Privacy-First Approach

Utilizing local inference, Loclean provides a privacy-focused solution suitable for handling PII, medical records, and other sensitive data types. This makes it ideal for production environments where data protection is critical.

Deterministic Data Output

Loclean addresses a common problem with large language models (LLMs): output that is unpredictable in form, or "hallucinated" content that drifts from the requested structure. By constraining generation with GBNF grammars and validating results with Pydantic V2, it ensures that outputs are valid, type-safe JSON. Output that does not conform to the expected schema is rejected.
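
To make the grammar idea concrete, here is a minimal sketch of how a flat schema could be turned into a GBNF grammar (GBNF is the grammar notation used by llama.cpp). This is not Loclean's implementation: the function name schema_to_gbnf, the rule names, and the restriction to str and int fields are all assumptions made for illustration.

```python
# Hypothetical sketch: build a GBNF grammar for a flat JSON object.
# Not Loclean's code; every name below is illustrative only.

# GBNF rule bodies for the two primitive types this sketch supports.
PRIMITIVE_RULES = {
    "string": r'"\"" [^"\\]* "\""',  # a quoted string without escapes
    "integer": r'"-"? [0-9]+',       # an optional minus sign, then digits
}

# Maps Python annotations to the rule names above.
TYPE_TO_RULE = {str: "string", int: "integer"}


def schema_to_gbnf(fields: dict[str, type]) -> str:
    """Return a grammar accepting one JSON object with these keys, in order."""
    parts = []
    for name, py_type in fields.items():
        rule = TYPE_TO_RULE[py_type]  # raises KeyError for unsupported types
        # A quoted JSON key is written in GBNF as the literal "\"key\"".
        # Keys are assumed to need no escaping (plain identifiers).
        parts.append(f'"\\"{name}\\"" ws ":" ws {rule}')
    body = ' ws "," ws '.join(parts)
    lines = [f'root ::= "{{" ws {body} ws "}}"']
    used = sorted({TYPE_TO_RULE[t] for t in fields.values()})
    lines += [f"{rule} ::= {PRIMITIVE_RULES[rule]}" for rule in used]
    lines.append("ws ::= [ \\t\\n]*")  # optional whitespace between tokens
    return "\n".join(lines)


grammar = schema_to_gbnf({"name": str, "price": int, "color": str})
print(grammar)
```

A grammar like this only constrains the shape of the output (keys, order, and value types). Whether the values are correct is a separate question, which is why schema validation is still applied afterwards.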

Structured Data Extraction

With Loclean, extracting structured information from unstructured text is straightforward and reliable:

from pydantic import BaseModel
import loclean

class Product(BaseModel):
    name: str
    price: int
    color: str

# Extract from text
item = loclean.extract("Selling red t-shirt for 50k", schema=Product)
print(item.name)  # Outputs: 't-shirt'
print(item.price)  # Outputs: 50000

Loclean's extract() function ensures compliance with your Pydantic schema using:

  • Dynamic GBNF Grammar Generation: Converts Pydantic schemas to GBNF grammars automatically.
  • JSON Repair: Fixes malformed JSON outputs from LLMs.
  • Retry Logic: Automatically retries prompts when validation fails.
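
The repair and retry steps listed above can be sketched roughly as follows. This is an assumption-laden illustration, not Loclean's internals: repair_json, validate, extract_with_retry, and fake_model are hypothetical names, the repairs are deliberately naive, and a plain dictionary check stands in for Pydantic validation so the sketch has no third-party dependencies.

```python
import json
import re
from typing import Callable, Optional

# Hypothetical sketch of a repair-validate-retry loop; not Loclean's code.

# Stand-in for the Pydantic schema: field name -> required Python type.
REQUIRED_FIELDS = {"name": str, "price": int, "color": str}


def repair_json(raw: str) -> str:
    """Apply two naive repairs: trim text around the braces, drop trailing commas."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return raw  # nothing that looks like an object; leave it to the parser
    candidate = raw[start : end + 1]
    # Naive on purpose: this would also alter ",}" inside a string value.
    return re.sub(r",\s*([}\]])", r"\1", candidate)


def validate(payload: object) -> bool:
    """Check that payload is a dict with exactly the required, correctly typed fields."""
    if not isinstance(payload, dict) or set(payload) != set(REQUIRED_FIELDS):
        return False
    return all(type(payload[k]) is t for k, t in REQUIRED_FIELDS.items())


def extract_with_retry(
    generate: Callable[[int], str], max_attempts: int = 3
) -> Optional[dict]:
    """Call generate(attempt) until its repaired output validates, or give up."""
    for attempt in range(max_attempts):
        try:
            payload = json.loads(repair_json(generate(attempt)))
        except json.JSONDecodeError:
            continue  # unrepairable output: retry
        if validate(payload):
            return payload
    return None


# Canned responses standing in for a local model call.
CANNED = [
    "Sure! Here is the JSON: {name: t-shirt}",  # unquoted keys, not repairable here
    'Result: {"name": "t-shirt", "price": 50000, "color": "red",}',  # trailing comma
]


def fake_model(attempt: int) -> str:
    return CANNED[min(attempt, len(CANNED) - 1)]


result = extract_with_retry(fake_model)
print(result)  # {'name': 't-shirt', 'price': 50000, 'color': 'red'}
```

In this sketch the first response fails to parse even after repair, so the loop retries; the second response parses once its trailing comma is removed and passes validation.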

Backend Agnostic Design

Built on Narwhals, a lightweight compatibility layer for dataframe libraries, Loclean works with Pandas, Polars, and PyArrow data without locking you into heavy dependencies. You can keep processing data in whichever of these backends your project already uses.

Quick Start Guides

To assist users in getting started, Loclean provides comprehensive Jupyter notebooks covering core features and advanced data cleaning techniques. Example notebooks include:

  • Quick Start: Introduction to core Loclean functionalities and structured extraction.
  • Data Cleaning: In-depth strategies for effective data cleaning.
  • Privacy Scrubbing: Insights on PII redaction.

Explore these resources in the examples directory to quickly become proficient with Loclean's functionalities.

For further information, visit the official documentation at nxank4.github.io/loclean.

Contributions

Loclean is an open-source project. Developers interested in contributing can refer to the Contributing Guide for setup instructions and development practices.

Loclean is an efficient and privacy-centric library that simplifies the data cleaning process while ensuring compliance and security.
