IndexTTS-2

Bilibili's latest open-source TTS model
2025-09-15
Audio-Video Processing
Pricing: $15/1M tokens
For bulk orders, contact your account manager for exclusive deals.

API Overview

IndexTTS2 is an autoregressive zero-shot text-to-speech model that adds emotional expression and duration control, and it introduces a novel duration-control method that is friendly to autoregressive models. The method supports two generation modes: specifying a fixed number of tokens to control speech length precisely, or generating freely in autoregressive fashion while faithfully reproducing the prosody of the input prompt. The model decouples emotional expression from speaker identity, so it can reconstruct the target voice and reproduce a desired emotional tone independently of each other, even in zero-shot scenarios. To improve the clarity and stability of highly emotional speech, it integrates GPT latent representations and uses a three-stage training paradigm. A soft instruction mechanism based on text descriptions, built by fine-tuning Qwen3, further lowers the barrier to emotional control. Experiments across multiple datasets show that IndexTTS2 outperforms existing zero-shot TTS models in word error rate, speaker similarity, and emotional fidelity.

From the open-source project: https://github.com/index-tts/index-tts

API Reference (2)

API Description     Request Method     Stability
Create TTS Task     POST               Stable
Query Task          GET                Stable
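The two calls listed above (create a task with POST, then query it with GET) suggest an asynchronous create-then-poll workflow. The sketch below shows only that control flow. The endpoint URLs, request fields, and response fields are not given on this page, so the HTTP calls are passed in as callables, and the "status" key with its "success" and "failed" values is a hypothetical placeholder rather than the documented response format.

```python
import time
from typing import Any, Callable, Dict


def run_tts_task(
    create_task: Callable[[str], str],
    query_task: Callable[[str], Dict[str, Any]],
    text: str,
    poll_interval: float = 1.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Create a TTS task, then poll until it reaches a terminal state.

    create_task: wraps the POST call and returns a task id (assumed shape).
    query_task:  wraps the GET call and returns the task record as a dict.
    """
    task_id = create_task(text)
    for _ in range(max_polls):
        record = query_task(task_id)
        # Hypothetical terminal states; replace with the real API's values.
        if record.get("status") in ("success", "failed"):
            return record
        sleep(poll_interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")
```

In real use, create_task and query_task would wrap authenticated HTTP requests to the actual endpoints. Injecting them keeps the polling logic testable without network access.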

API Pricing

Model               Description                                             302.AI Price
Create TTS Task     Create a TTS task, with a minimum charge of $0.001      $15/1M tokens
Query Task          -                                                       Free
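As a rough illustration of the pricing above, the following sketch estimates the charge for one Create TTS Task call. It assumes the $0.001 minimum applies per task and that the metered price scales linearly with the token count. How tokens are counted is not stated on this page, so the token count is taken as an input.

```python
PRICE_PER_MILLION_TOKENS_USD = 15.0  # from the pricing table
MINIMUM_CHARGE_USD = 0.001           # assumed to apply per task


def estimate_cost_usd(tokens: int) -> float:
    """Estimate the charge for a single Create TTS Task call."""
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    metered = tokens * PRICE_PER_MILLION_TOKENS_USD / 1_000_000
    # The larger of the metered price and the minimum charge is billed.
    return max(metered, MINIMUM_CHARGE_USD)
```

Under these assumptions, a task of 2,000 tokens would cost about $0.03, and any task below roughly 67 tokens would be billed the $0.001 minimum. Query Task calls are free and add nothing to the total.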