# Chatterbox Turbo TTS on vLLM
Chatterbox-turbo-vllm is a port of the Chatterbox Turbo text-to-speech (TTS) model to vLLM, improving performance and memory efficiency. Built on the chatterbox-vllm foundation, this project extends it to the newer Turbo model, which incorporates a significantly faster S3Gen waveform decoder.
## Key Features
- Enhanced Performance: Experience improved processing speeds and efficient GPU memory utilization.
- Integration Ready: Seamlessly integrates with state-of-the-art inference infrastructures for TTS applications.
## Performance Benchmarking

Chatterbox Turbo TTS was benchmarked on an RTX 4090 GPU. The table below shows the speedups achieved:
| Metric | Regular | Turbo | Speedup |
|---|---|---|---|
| Audio duration | 39.9 min | 38.5 min | — |
| Model load | 27.3s | 21.4s | 1.3x |
| Generation time | 103.1s | 61.3s | 1.7x |
| — T3 speech token generation | 31.6s | 39.9s | 0.8x |
| — S3Gen waveform generation | 70.4s | 20.2s | 3.5x |
| End-to-end total | 131.1s | 83.3s | 1.6x |
| Generation RTF | 23.2x real-time | 37.6x real-time | 1.6x |
| End-to-End RTF | 18.3x real-time | 27.7x real-time | 1.5x |
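As a sanity check, the RTF (real-time factor) rows follow directly from the other rows of the table: RTF is seconds of audio produced divided by seconds of wall-clock time spent. The sketch below reproduces the table's figures to within rounding:

```python
# RTF = seconds of audio produced / seconds of wall-clock time spent.
# Inputs below are taken from the benchmark table above.

def rtf(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: seconds of audio generated per second of compute."""
    return audio_seconds / wall_seconds

regular_audio_s = 39.9 * 60  # Regular run produced 39.9 min of audio
turbo_audio_s = 38.5 * 60    # Turbo run produced 38.5 min of audio

print(f"Regular generation RTF:  {rtf(regular_audio_s, 103.1):.1f}x")  # ~23.2x
print(f"Turbo generation RTF:    {rtf(turbo_audio_s, 61.3):.1f}x")     # ~37.7x
print(f"Regular end-to-end RTF:  {rtf(regular_audio_s, 131.1):.1f}x")  # ~18.3x
print(f"Turbo end-to-end RTF:    {rtf(turbo_audio_s, 83.3):.1f}x")     # ~27.7x
```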
## Current Project Status

- Usable features:
  - Basic speech cloning with audio and text conditioning.
  - Output quality comparable to the original Chatterbox implementation.
  - Classifier-free guidance (CFG) and exaggeration control.
  - Optimized vLLM batching, yielding substantial speed improvements.
- Ongoing development:
  - Moving toward more idiomatic use of vLLM, minimizing the hacky workarounds currently employed.
  - Refactoring to streamline the code and improve stability; APIs are still subject to change.
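For readers unfamiliar with CFG: the model is run twice per decoding step, once with the text/audio conditioning and once without, and the two logit vectors are blended. A minimal sketch of that blend (the function name and signature here are illustrative, not the project's actual API):

```python
# Minimal classifier-free guidance (CFG) sketch for next-token logits.
# cond_logits come from the conditioned forward pass, uncond_logits from
# the unconditioned pass; cfg_scale controls how strongly the output is
# pushed toward the conditioning signal.

def apply_cfg(cond_logits, uncond_logits, cfg_scale):
    """Blend conditional and unconditional logits.

    cfg_scale = 0 ignores the conditioning entirely; cfg_scale = 1
    reproduces the conditional logits; larger values push the
    distribution further toward the conditioning signal.
    """
    return [u + cfg_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

cond = [1.0, 2.0, 0.5]
uncond = [0.5, 1.0, 0.5]
print(apply_cfg(cond, uncond, 2.0))  # [1.5, 3.0, 0.5]
```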
## Get Started

To run Chatterbox Turbo TTS and generate audio samples, use the provided example script:
```python
import torchaudio as ta

from chatterbox_vllm.tts import ChatterboxTTS

if __name__ == "__main__":
    model = ChatterboxTTS.from_pretrained(
        gpu_memory_utilization=0.4,
        max_model_len=1000,
        enforce_eager=True,
    )

    for i, audio_prompt_path in enumerate([None, "docs/audio-sample-01.mp3", "docs/audio-sample-03.mp3"]):
        prompts = [
            "You are listening to a demo of the Chatterbox TTS model running on VLLM.",
            "This is a separate prompt to test the batching implementation.",
            "And here is a third prompt. It's a bit longer than the first one, but not by much.",
        ]
        audios = model.generate(prompts, audio_prompt_path=audio_prompt_path, exaggeration=0.8)
        for audio_idx, audio in enumerate(audios):
            ta.save(f"test-{i}-{audio_idx}.mp3", audio, model.sr)
```
## Architecture Insights

Chatterbox's architecture draws inspiration from the CosyVoice framework, using multimodal conditioning to improve audio quality within an efficient model structure. See the project documentation for detailed explanations and architecture diagrams.
Chatterbox-turbo-vllm is a robust option for developers looking to integrate high-quality TTS capabilities into their applications, offering both flexibility and performance.