Open Vernacular AI Kit - An SDK and CLI for cleaning Indian vernacular-English mixed text.

Pitch

Open Vernacular AI Kit is an open-source SDK and CLI that normalizes messy vernacular-English code-mixed text before it reaches AI applications for Indian languages. It handles text written in native scripts, in Romanized form, or in both at once, and is intended as a foundation for vernacular AI workflows, with support for languages beyond India planned for the future.

Description

open-vernacular-ai-kit is an open-source software development kit (SDK) and command-line interface (CLI) for cleaning Indian vernacular-English code-mixed text. The project is India-first and integrates with Sarvam AI, with expansion to languages outside India planned for future releases. It accepts input ranging from native-script to Romanized text.

Key Features

Multi-Input Compatibility: The toolkit normalizes messy WhatsApp-style messages that contain vernacular text in native script (e.g., ગુજરાતી), Romanized vernacular (e.g., Gujlish), or both in a single sentence.
Downstream Model Optimization: Text is normalized before it reaches downstream models such as Sarvam-M, Mayura, and Sarvam-Translate, which improves the quality of their output.
Community Contributions: The project welcomes pull requests, and additional languages and provider adapters are planned.
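
To make the "native, Romanized, or both" distinction concrete, here is a minimal sketch of per-token script detection. It is not the kit's implementation or API; the function names are hypothetical. The only fact it relies on is that the Gujarati Unicode block spans U+0A80 to U+0AFF.

```python
import re

# Gujarati Unicode block: U+0A80 to U+0AFF.
GUJARATI = re.compile(r"[\u0A80-\u0AFF]")
LATIN = re.compile(r"[A-Za-z]")


def token_script(token):
    """Label a token by the scripts its characters use (hypothetical helper)."""
    has_gujarati = bool(GUJARATI.search(token))
    has_latin = bool(LATIN.search(token))
    if has_gujarati and has_latin:
        return "mixed"
    if has_gujarati:
        return "gujarati"
    if has_latin:
        return "latin"
    return "other"  # digits, punctuation, emoji, and so on


def script_profile(text):
    """Return (token, script label) pairs for a whitespace-split sentence."""
    return [(token, token_script(token)) for token in text.split()]


for token, label in script_profile("મારું business plan ready છે"):
    print(label, token)
```

Script detection alone cannot tell an English word from a Romanized Gujarati one, since both come back as "latin". Separating the two needs a lexicon or a model on top of this step.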

Workflow Example

An example CLI invocation on a Romanized Gujarati sentence:

 gck codemix "maru business plan ready chhe!!!"  
 # -> મારું business plan ready છે!!

In this example the Romanized Gujarati words (maru, chhe) are converted to Gujarati script, the English words are left as they are, and the run of exclamation marks is shortened.

Purpose and Challenges Solved

open-vernacular-ai-kit is a production-oriented normalization layer for AI applications that focus on India. It cleans up noisy, mixed-script chat data so that multilingual large language models (LLMs), retrieval workflows, and customer support systems receive more consistent input.

Example of Transformation

Here's a sample transformation showcasing how it handles messy inputs:

Input (messy)	Output (canonical code-mix)
`maru mobile number 123 chhe`	`મારું mobile number 123 છે`
`aaje maru kaam ready chhe`	`આજે મારું કામ ready છે`
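
The rows above can be reproduced with a toy lookup, shown here only to illustrate the idea of token-level normalization. This is a sketch under strong assumptions: the lexicon is hypothetical and contains just the four words that appear in this document, and the function is not part of the kit's API. A real normalizer needs far broader coverage and must handle spelling variants.

```python
# Hypothetical toy lexicon: Romanized Gujarati -> Gujarati script.
# The entries come only from the examples in this document.
LEXICON = {
    "maru": "મારું",
    "chhe": "છે",
    "aaje": "આજે",
    "kaam": "કામ",
}


def normalize_codemix(text):
    """Replace known Romanized tokens; leave all other tokens unchanged."""
    tokens = []
    for token in text.split():
        # Lookup is case-insensitive; unknown tokens (English words,
        # numbers) pass through as they are.
        tokens.append(LEXICON.get(token.lower(), token))
    return " ".join(tokens)


for messy in ("maru mobile number 123 chhe", "aaje maru kaam ready chhe"):
    print(messy, "->", normalize_codemix(messy))
```

English words and digits survive because they are simply absent from the lexicon, which matches the behaviour shown in the table.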

Evaluation and Metrics

The repository includes an evaluation harness that tracks baseline metrics so that output quality stays consistent across releases. It provides commands to run the evaluation and generate reports on how well normalization performs. The current version, 1.0.2, reports strong transliteration and dialect accuracy.
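
As a rough idea of what such a harness might measure, the sketch below computes two simple scores against reference outputs. It is an assumption about the kind of metric involved, not the repository's actual harness or commands. The function names are hypothetical, and the predictions are made up for illustration, with one deliberate error.

```python
def exact_match_rate(predictions, references):
    """Fraction of sentences whose prediction equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


def token_accuracy(predictions, references):
    """Fraction of reference tokens reproduced at the same position."""
    correct = 0
    total = 0
    for prediction, reference in zip(predictions, references):
        ref_tokens = reference.split()
        total += len(ref_tokens)
        correct += sum(p == r for p, r in zip(prediction.split(), ref_tokens))
    return correct / total if total else 0.0


# Reference outputs taken from the table in this document.
references = ["મારું mobile number 123 છે", "આજે મારું કામ ready છે"]
# Illustrative predictions (not real results): the second leaves "maru" as is.
predictions = ["મારું mobile number 123 છે", "આજે maru કામ ready છે"]

print("exact match:", exact_match_rate(predictions, references))
print("token accuracy:", token_accuracy(predictions, references))
```

Storing scores like these as a baseline and comparing each new run against them is one common way to catch regressions.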

Language Support

open-vernacular-ai-kit focuses first on Indian vernaculars. It currently offers full support for Gujarati, with plans to add other scheduled languages of India such as Hindi, Tamil, and Marathi, depending on community contributions.

Contributing to the Project

Developers are invited to contribute by opening issues or submitting pull requests, particularly to extend language support and improve existing functionality. The repository includes governance documentation to guide contributors.

Conclusion

open-vernacular-ai-kit provides a normalization foundation for vernacular AI workflows in India, giving downstream systems cleaner and more consistent text to work with.
