Outrageous Voice Assistant

A local AI voice assistant with seamless speech-to-text and text-to-speech.

Pitch

Outrageous Voice Assistant offers a fully local voice assistant experience, combining advanced ASR, LLM, and TTS models without internet dependency. Designed for affordability and ethical usage, it showcases how easy it is to deploy an AI assistant on standard hardware while retaining privacy and control.

Description

The Outrageous Voice Assistant is a fully local voice assistant demonstration built with a FastAPI backend and a plain HTML frontend. The project uses open-weight models for Automatic Speech Recognition (ASR), the Large Language Model (LLM), and Text-to-Speech (TTS), all running locally. No data is transmitted over the Internet, which keeps conversations private and secure.

Key Features

Fully Local Operation: Enjoy the benefits of a voice assistant without compromising data privacy. The entire setup operates locally on affordable hardware.
Speedy Performance: On a system with an RTX 5070 GPU (12 GiB of VRAM), round-trip audio processing takes about one second when using the Kokoro TTS model.
Voice Cloning: The project includes support for voice cloning using simple audio clips with corresponding transcriptions, allowing for personalized interaction.
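
One way to check a round-trip figure like the one above on your own hardware is to time each stage separately. The following is a minimal sketch, not the project's actual instrumentation: the StageTimer class and the stage names are hypothetical, and it uses only the Python standard library.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def stage(self, name):
        # perf_counter is a monotonic, high-resolution clock suited to short intervals.
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def total(self):
        """Sum of all recorded stage durations, in seconds."""
        return sum(self.durations.values())
```

Wrapping each step as `with timer.stage("asr"): ...`, `with timer.stage("llm"): ...`, and `with timer.stage("tts"): ...` would show which model dominates the total.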

How It Works

The frontend captures audio input from the user and sends the data to the backend via the /chat endpoint.
The backend processes the audio, extracting the sample rate and channel information.
The audio is transcribed to text using the ASR model.
The transcribed text is processed by the LLM, generating a response.
The response is converted to speech by the TTS model, and the resulting audio is normalized and sent back to the frontend for playback.
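
The steps above can be sketched as a single function. This is an illustration under stated assumptions rather than the project's code: it assumes the upload is a 16-bit PCM WAV file, it treats "normalized" as peak normalization, and the asr, llm, and tts callables are hypothetical stand-ins for the real models. The helper names (read_wav, write_wav, peak_normalize, handle_chat) are also invented for this example.

```python
import io
import struct
import wave
from typing import Callable, List, Tuple


def read_wav(data: bytes) -> Tuple[int, int, List[int]]:
    """Return (sample_rate, channels, samples) from 16-bit PCM WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.readframes(wav.getnframes())
    samples = list(struct.unpack(f"<{len(frames) // 2}h", frames))
    return rate, channels, samples


def write_wav(samples: List[int], rate: int, channels: int = 1) -> bytes:
    """Encode 16-bit samples as WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buf.getvalue()


def peak_normalize(samples: List[int], target_peak: float = 0.9) -> List[int]:
    """Scale samples so the loudest one sits at target_peak of full scale."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    scale = target_peak * 32767 / peak
    return [max(-32768, min(32767, round(s * scale))) for s in samples]


def handle_chat(
    wav_bytes: bytes,
    asr: Callable[[List[int], int, int], str],
    llm: Callable[[str], str],
    tts: Callable[[str], Tuple[List[int], int]],
) -> bytes:
    """Audio in, audio out: parse, transcribe, respond, synthesize, normalize."""
    rate, channels, samples = read_wav(wav_bytes)  # sample rate and channel info
    text = asr(samples, rate, channels)            # speech to text
    reply = llm(text)                              # text to response
    out_samples, out_rate = tts(reply)             # response to speech
    return write_wav(peak_normalize(out_samples), out_rate)
```

In a FastAPI backend, a function like handle_chat would sit behind the /chat route, with the three callables bound to the loaded models.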

Models Utilized

ASR: NVIDIA parakeet-tdt-0.6b-v3 (600M parameters)
LLM: Mistral ministral-3 3b (4-bit quantized)
TTS (Simple): Hexgrad Kokoro (82M parameters)
TTS (With Voice Cloning): Qwen3-TTS

Demos

Voice assistant using Dua Lipa's voice clone: Demo Video
Voice assistant using the default voice: Demo Video

Future Improvements

Add support for Apple Silicon (MLX).
Implement client-side Voice Activity Detection (VAD).
Enhance voice cloning capabilities to eliminate the need for prepared transcripts.
Introduce orchestration for more complex task detection and tool calls.
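
To illustrate what Voice Activity Detection involves, here is a minimal energy-threshold sketch. It is an assumption-laden illustration, not the planned implementation: a client-side version would run in the browser, the function names are hypothetical, and the frame size (320 samples, or 20 ms at 16 kHz) and the threshold are arbitrary values that would need tuning. Production VADs typically use trained models instead of a fixed energy threshold.

```python
import math
from typing import List, Sequence


def frame_rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech(
    samples: Sequence[int],
    frame_size: int = 320,     # assumed: 20 ms at 16 kHz
    threshold: float = 500.0,  # assumed: tune for the microphone and room
) -> List[bool]:
    """Return one flag per frame: True where energy reaches the threshold."""
    flags = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        flags.append(frame_rms(frame) >= threshold)
    return flags
```

With flags like these, a client could send audio to the backend only while speech is detected, instead of relying on a manual start and stop.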

Disclaimer & Ethical Considerations

This project is a proof of concept and is provided "as is" without warranties. It is intended for educational and experimental use only. The voice cloning feature is meant for learning, and users are encouraged to seek permission from the person whose voice is cloned before any real-world use. The project also raises important ethical questions about the potential misuse of voice cloning, which can be done easily and quickly from a small amount of audio, and this makes responsible use essential.
