# Chatterbox Turbo TTS on vLLM
Chatterbox-turbo-vllm is a port of the Chatterbox Turbo text-to-speech (TTS) model to vLLM, improving performance and memory efficiency. Built on the chatterbox-vllm foundation, this project extends it to the newer Turbo model, which incorporates a significantly faster S3Gen waveform decoder.
## Key Features
- Enhanced Performance: Experience improved processing speeds and efficient GPU memory utilization.
- Integration Ready: Seamlessly integrates with state-of-the-art inference infrastructures for TTS applications.
## Performance Benchmarking

Chatterbox Turbo TTS was benchmarked on an RTX 4090 GPU. The table below shows the speedups achieved:
| Metric | Regular | Turbo | Speedup |
|---|---|---|---|
| Audio duration | 39.9 min | 38.5 min | — |
| Model load | 27.3s | 21.4s | 1.3x |
| Generation time | 103.1s | 61.3s | 1.7x |
| — T3 speech token generation | 31.6s | 39.9s | 0.8x |
| — S3Gen waveform generation | 70.4s | 20.2s | 3.5x |
| End-to-end total | 131.1s | 83.3s | 1.6x |
| Generation RTF | 23.2x real-time | 37.6x real-time | 1.6x |
| End-to-End RTF | 18.3x real-time | 27.7x real-time | 1.5x |
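As a sanity check, the RTF (real-time factor) rows follow directly from the other rows of the table: RTF is seconds of audio produced divided by seconds of wall-clock time spent. The sketch below reproduces the table's figures to within rounding:

```python
# RTF = seconds of audio produced / seconds of wall-clock time spent.
# Inputs below are taken from the benchmark table above.

def rtf(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: seconds of audio generated per second of compute."""
    return audio_seconds / wall_seconds

regular_audio_s = 39.9 * 60  # Regular run produced 39.9 min of audio
turbo_audio_s = 38.5 * 60    # Turbo run produced 38.5 min of audio

print(f"Regular generation RTF:  {rtf(regular_audio_s, 103.1):.1f}x")  # ~23.2x
print(f"Turbo generation RTF:    {rtf(turbo_audio_s, 61.3):.1f}x")     # ~37.7x
print(f"Regular end-to-end RTF:  {rtf(regular_audio_s, 131.1):.1f}x")  # ~18.3x
print(f"Turbo end-to-end RTF:    {rtf(turbo_audio_s, 83.3):.1f}x")     # ~27.7x
```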
## Current Project Status

- Usable features:
  - Basic speech cloning with audio and text conditioning.
  - Output quality comparable to the original Chatterbox implementation.
  - Classifier-free guidance (CFG) and exaggeration control.
  - Optimized vLLM batching, yielding substantial speed improvements.
- Ongoing development:
  - Moving toward more idiomatic use of vLLM, minimizing the hacky workarounds currently employed.
  - Refactoring to streamline the code and improve stability; APIs are still subject to change.
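For readers unfamiliar with CFG: the model is run twice per decoding step, once with the text/audio conditioning and once without, and the two logit vectors are blended. A minimal sketch of that blend (the function name and signature here are illustrative, not the project's actual API):

```python
# Minimal classifier-free guidance (CFG) sketch for next-token logits.
# cond_logits come from the conditioned forward pass, uncond_logits from
# the unconditioned pass; cfg_scale controls how strongly the output is
# pushed toward the conditioning signal.

def apply_cfg(cond_logits, uncond_logits, cfg_scale):
    """Blend conditional and unconditional logits.

    cfg_scale = 0 ignores the conditioning entirely; cfg_scale = 1
    reproduces the conditional logits; larger values push the
    distribution further toward the conditioning signal.
    """
    return [u + cfg_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

cond = [1.0, 2.0, 0.5]
uncond = [0.5, 1.0, 0.5]
print(apply_cfg(cond, uncond, 2.0))  # [1.5, 3.0, 0.5]
```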
## Get Started

To run Chatterbox Turbo TTS and generate audio samples, use the provided example script:
```python
import torchaudio as ta

from chatterbox_vllm.tts import ChatterboxTTS

if __name__ == "__main__":
    model = ChatterboxTTS.from_pretrained(
        gpu_memory_utilization=0.4,
        max_model_len=1000,
        enforce_eager=True,
    )

    for i, audio_prompt_path in enumerate([None, "docs/audio-sample-01.mp3", "docs/audio-sample-03.mp3"]):
        prompts = [
            "You are listening to a demo of the Chatterbox TTS model running on VLLM.",
            "This is a separate prompt to test the batching implementation.",
            "And here is a third prompt. It's a bit longer than the first one, but not by much.",
        ]
        audios = model.generate(prompts, audio_prompt_path=audio_prompt_path, exaggeration=0.8)
        for audio_idx, audio in enumerate(audios):
            ta.save(f"test-{i}-{audio_idx}.mp3", audio, model.sr)
```
## Architecture Insights

Chatterbox's architecture draws inspiration from the CosyVoice framework, using multimodal conditioning to improve audio quality within an efficient model structure. See the project documentation for detailed explanations and architecture diagrams.
Chatterbox-turbo-vllm is a robust option for developers looking to integrate high-quality TTS capabilities into their applications, offering both flexibility and performance.