Outrageous Voice Assistant offers a fully local voice assistant experience, combining advanced ASR, LLM, and TTS models without internet dependency. Designed for affordability and ethical usage, it showcases how easy it is to deploy an AI assistant on standard hardware while retaining privacy and control.
Outrageous Voice Assistant
The Outrageous Voice Assistant is a fully local voice assistant demonstration built on a FastAPI backend and a straightforward HTML front-end. The project uses open-weight models for Automatic Speech Recognition (ASR), a Large Language Model (LLM), and Text-to-Speech (TTS), all running locally. No data is transmitted over the Internet, so audio and transcripts stay on the machine.
Key Features
- Fully Local Operation: Enjoy the benefits of a voice assistant without compromising data privacy. The entire setup operates locally on affordable hardware.
- Speedy Performance: On a system with an RTX 5070 GPU and 12 GiB of VRAM, round-trip audio processing takes about one second when using the Kokoro TTS model.
- Voice Cloning: The project includes support for voice cloning using simple audio clips with corresponding transcriptions, allowing for personalized interaction.
How It Works
- The frontend captures audio input from the user and sends the data to the backend via the /chat endpoint.
- The backend processes the audio, extracting the sample rate and channel information.
- The audio is transcribed to text using the ASR model.
- The transcribed text is processed by the LLM, generating a response.
- The response is converted to speech using a TTS model, normalized, and the audio is sent back to the frontend for playback.
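The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: it assumes the frontend sends WAV data, it assumes "normalized" means peak normalization, and the names read_wav_info, peak_normalize, and handle_chat are hypothetical. The ASR, LLM, and TTS stages are passed in as plain callables so the sketch stays independent of any particular model library.

```python
import io
import wave
from typing import Callable, List, Tuple


def read_wav_info(data: bytes) -> Tuple[int, int, bytes]:
    """Return (sample_rate, channels, raw PCM frames) from WAV bytes.

    Assumption: the uploaded audio is a WAV container.
    """
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.readframes(wav.getnframes())
        return wav.getframerate(), wav.getnchannels(), frames


def peak_normalize(samples: List[float], target_peak: float = 0.95) -> List[float]:
    """Scale samples so the loudest one reaches target_peak.

    Silent or empty input is returned unchanged to avoid dividing by zero.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


def handle_chat(
    audio: bytes,
    transcribe: Callable[[bytes, int, int], str],  # hypothetical ASR stage
    generate: Callable[[str], str],                # hypothetical LLM stage
    synthesize: Callable[[str], List[float]],      # hypothetical TTS stage
) -> List[float]:
    """Run one request through ASR, LLM, and TTS, then normalize the output."""
    sample_rate, channels, frames = read_wav_info(audio)
    text = transcribe(frames, sample_rate, channels)
    reply = generate(text)
    return peak_normalize(synthesize(reply))
```

In a real deployment, handle_chat would be called from the FastAPI handler behind /chat, with the three callables wrapping the loaded models. Keeping the stages as parameters also makes it easy to swap one TTS model for another without touching the rest of the flow.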
Models Utilized
- ASR: NVIDIA parakeet-tdt-0.6b-v3 600M
- LLM: Mistral ministral-3 3b 4-bit quantized
- TTS (Simple): Hexgrad Kokoro 82M
- TTS (With Voice Cloning): Qwen3-TTS
Demos
- Voice assistant using Dua Lipa's voice clone: Demo Video
- Voice assistant using the default voice: Demo Video
Future Improvements
- Add support for Apple Silicon (MLX).
- Implement client-side Voice Activity Detection (VAD).
- Enhance voice cloning capabilities to eliminate the need for prepared transcripts.
- Introduce orchestration for more complex task detection and tool calls.
Disclaimer & Ethical Considerations
This project is a proof-of-concept and is provided "as is" without warranty. It is intended for educational and experimental use only. Voice cloning is included for learning purposes, and users should obtain permission from the person whose voice is cloned before any real-life use. The project also raises important ethical questions: voice cloning can be done easily and quickly from minimal audio input, which makes misuse a real risk and responsible use essential.