
speech-2.6-turbo
API Overview
The latest speech generation model from MiniMax, **Speech-2.6-Turbo**, is built for Voice Agent scenarios and offers ultra-low latency, full compatibility with professional audio formats, and enhanced voice naturalness.
Official documentation: Synchronous Speech Synthesis Guide - MiniMax API Docs
API Reference (4)
| API Description | API Endpoint | Request Method | Stability |
|---|---|---|---|
| T2A (Speech Generation - Synchronous) | | POST | Stable |

T2A (Speech Generation - Synchronous): text-to-audio from MiniMax.

Header parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| Authorization | string | Optional | Example value: Bearer {{YOUR_API_KEY}} |
| Content-Type | string | Optional | Example value: application/json |

Request body (application/json)

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | enum (string) | Required | Requested model version. Values: speech-2.8-turbo, speech-2.8-hd, speech-2.6-hd, speech-2.6-turbo, speech-02-hd, speech-02-turbo, speech-01-hd, speech-01-turbo. |
| text | string | Required | The text to synthesize. Must be less than 10,000 characters. For text longer than 3,000 characters, streaming output is recommended. |
| stream | boolean | Optional | Whether to enable streaming output. Default: false (streaming disabled). |
| stream_options | object | Optional | |
| stream_options.exclude_aggregated_audio | boolean | Optional | Whether the last chunk contains the concatenated speech hex data. Default: false. |
| voice_setting | object | Optional | |
| voice_setting.voice_id | string | Optional | The timbre ID for the synthesized audio. Supports system timbres, cloned timbres, and timbres generated from text. |
| voice_setting.speed | number | Optional | Speech rate of the synthesized audio. Range [0.5, 2], default 1.0. |
| voice_setting.vol | number | Optional | Volume of the synthesized audio. Range (0, 10], default 1.0. |
| voice_setting.pitch | integer | Optional | Intonation of the synthesized audio. Range [-12, 12], default 0. |
| voice_setting.emotion | enum (string) | Optional | Controls the emotion of the synthesized speech. The model automatically matches the emotion to the input text. Values: happy, sad, angry, fearful, disgusted, surprised, calm, fluent. |
| voice_setting.text_normalization | boolean | Optional | Whether to enable Chinese and English text normalization. Enabling it can improve results when reading numbers aloud. |
| voice_setting.latex_read | boolean | Optional | Whether to read LaTeX formulas aloud. Default: false. |
| audio_setting | object | Optional | |
| audio_setting.sample_rate | enum (integer) | Optional | Sampling rate of the generated audio. Default: 32000. Values: 8000, 16000, 22050, 24000, 32000, 44100. |
| audio_setting.bitrate | enum (integer) | Optional | Bitrate of the generated audio. Default: 128000. Applies only to the mp3 format. Values: 32000, 64000, 128000, 256000. |
| audio_setting.format | enum (string) | Optional | Format of the generated audio. Default: mp3. Values: mp3, pcm, flac, wav. |
| audio_setting.channel | enum (integer) | Optional | Number of audio channels: 1 for mono, 2 for stereo. Values: 1, 2. |
| audio_setting.force_cbr | boolean | Optional | Whether to use constant bitrate encoding. Only effective for streaming output in mp3 format. |
| pronunciation_dict | object | Optional | |
| pronunciation_dict.tone | array[string] | Optional | Replacement rules that map specially annotated text or symbols to the corresponding phonetic notation or pronunciation. |
| timber_weights | array[object] | Optional | |
| timber_weights[].voice_id | string | Optional | Timbre ID of the synthesized audio. |
| timber_weights[].weight | integer | Optional | Weight of the timbre. Range [1, 100]. |
| language_boost | enum (string) | Optional | Whether to enhance recognition of the specified minority languages and dialects. Default: null. Values (separated by semicolons): Chinese; Chinese,Yue; English; Arabic; Russian; Spanish; French; Portuguese; German; Turkish; Dutch; Ukrainian; Vietnamese; Indonesian; Japanese; Italian; Korean; Thai; Polish; Romanian; Greek; Czech; Finnish; Hindi; Bulgarian; Danish; Hebrew; Malay; Persian; Slovak; Swedish; Croatian; Filipino; Hungarian; Norwegian; Slovenian; Catalan; Nynorsk; Tamil; Afrikaans; auto. |
| voice_modify | object | Optional | |
| voice_modify.pitch | enum (integer) | Optional | Pitch adjustment (deep/bright). Range [-100, 100]. |
| voice_modify.intensity | enum (integer) | Optional | Intensity adjustment (strength/softness). Range [-100, 100]. |
| voice_modify.timbre | enum (integer) | Optional | Timbre adjustment (magnetic/crisp). Range [-100, 100]. |
| voice_modify.sound_effects | enum (string) | Optional | Sound effect setting. Values: spacious_echo, auditorium_echo, lofi_telephone, robotic. |
| subtitle_enable | boolean | Optional | Whether to enable the subtitle service. Only effective for non-streaming output. |
| output_format | enum (string) | Optional | Format of the output result. In streaming scenarios, only hex is supported. Values: url, hex. |
| aigc_watermark | boolean | Optional | Whether to add an audio rhythm marker at the end of the synthesized audio. Default: false. |

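
To illustrate how the synchronous request parameters fit together, here is a minimal Python sketch that assembles a request body and checks the documented ranges before sending it. Several things in it are assumptions rather than part of this listing: the endpoint URL is a placeholder (the listing does not show it), the helper names `build_t2a_payload`, `streaming_recommended`, `send_t2a_request`, and `hex_to_audio_bytes` are hypothetical, and so is the voice ID in the usage note. The response layout is not documented here, so the sketch simply returns the parsed JSON.

```python
import json
import urllib.request

# Placeholder (assumption): the listing does not show the endpoint URL.
T2A_URL = "https://example.invalid/t2a"


def streaming_recommended(text):
    """Streaming output is recommended once the text exceeds 3,000 characters."""
    return len(text) > 3_000


def build_t2a_payload(text, voice_id, model="speech-2.6-turbo",
                      speed=1.0, vol=1.0, pitch=0, audio_format="mp3"):
    """Assemble a non-streaming T2A request body, enforcing the documented limits."""
    if not 0 < len(text) < 10_000:
        raise ValueError("text must be non-empty and shorter than 10,000 characters")
    if not 0.5 <= speed <= 2:
        raise ValueError("speed must be in [0.5, 2]")
    if not 0 < vol <= 10:
        raise ValueError("vol must be in (0, 10]")
    if not -12 <= pitch <= 12:
        raise ValueError("pitch must be in [-12, 12]")
    if audio_format not in ("mp3", "pcm", "flac", "wav"):
        raise ValueError("format must be one of mp3, pcm, flac, wav")
    return {
        "model": model,
        "text": text,
        "stream": False,
        "voice_setting": {"voice_id": voice_id, "speed": speed,
                          "vol": vol, "pitch": pitch},
        # Documented defaults, written out explicitly for readability.
        "audio_setting": {"sample_rate": 32000, "bitrate": 128000,
                          "format": audio_format, "channel": 1},
        "output_format": "hex",
    }


def send_t2a_request(payload, api_key, url=T2A_URL):
    """POST the payload as JSON and return the parsed JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def hex_to_audio_bytes(hex_audio):
    """Decode hex-encoded audio (output_format "hex") into raw bytes."""
    return bytes.fromhex(hex_audio)
```

A call such as `build_t2a_payload("Hello, world.", "example-voice-id")` yields a body that can be passed to `send_t2a_request` once the real endpoint URL and API key are filled in. Where the hex string sits inside the response is not shown in this listing, so locating it is left to the caller.
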

| API Description | API Endpoint | Request Method | Stability |
|---|---|---|---|
| T2A (Asynchronous Long-Form Speech Generation) | | POST | Stable |

T2A (Asynchronous Long-Form Speech Generation): text-to-audio from MiniMax.

Header parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| Authorization | string | Optional | Example value: Bearer {{YOUR_API_KEY}} |
| Content-Type | string | Optional | Example value: application/json |

Request body (application/json)

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | enum (string) | Required | Requested model version. Values: speech-2.8-turbo, speech-2.8-hd, speech-2.6-hd, speech-2.6-turbo, speech-02-hd, speech-02-turbo, speech-01-hd, speech-01-turbo. |
| text | string | Required | The text to synthesize. Must be less than 50,000 characters; suitable for long-text synthesis. |
| voice_setting | object | Optional | |
| voice_setting.voice_id | string | Optional | The timbre ID for the synthesized audio. Supports system timbres, cloned timbres, and timbres generated from text. |
| voice_setting.speed | number | Optional | Speech rate of the synthesized audio. Range [0.5, 2], default 1.0. |
| voice_setting.vol | number | Optional | Volume of the synthesized audio. Range (0, 10], default 1.0. |
| voice_setting.pitch | integer | Optional | Intonation of the synthesized audio. Range [-12, 12], default 0. |
| voice_setting.emotion | enum (string) | Optional | Controls the emotion of the synthesized speech. The model automatically matches the emotion to the input text. Values: happy, sad, angry, fearful, disgusted, surprised, calm, fluent. |
| voice_setting.text_normalization | boolean | Optional | Whether to enable Chinese and English text normalization. Enabling it can improve results when reading numbers aloud. |
| voice_setting.latex_read | boolean | Optional | Whether to read LaTeX formulas aloud. Default: false. |
| audio_setting | object | Optional | |
| audio_setting.sample_rate | enum (integer) | Optional | Sampling rate of the generated audio. Default: 32000. Values: 8000, 16000, 22050, 24000, 32000, 44100. |
| audio_setting.bitrate | enum (integer) | Optional | Bitrate of the generated audio. Default: 128000. Applies only to the mp3 format. Values: 32000, 64000, 128000, 256000. |
| audio_setting.format | enum (string) | Optional | Format of the generated audio. Default: mp3. Values: mp3, pcm, flac, wav. |
| audio_setting.channel | enum (integer) | Optional | Number of audio channels: 1 for mono, 2 for stereo. Values: 1, 2. |
| audio_setting.force_cbr | boolean | Optional | Whether to use constant bitrate encoding. Only effective for streaming output in mp3 format. |
| pronunciation_dict | object | Optional | |
| pronunciation_dict.tone | array[string] | Optional | Replacement rules that map specially annotated text or symbols to the corresponding phonetic notation or pronunciation. |
| timber_weights | array[object] | Optional | |
| timber_weights[].voice_id | string | Optional | Timbre ID of the synthesized audio. |
| timber_weights[].weight | integer | Optional | Weight of the timbre. Range [1, 100]. |
| language_boost | enum (string) | Optional | Whether to enhance recognition of the specified minority languages and dialects. Default: null. Values (separated by semicolons): Chinese; Chinese,Yue; English; Arabic; Russian; Spanish; French; Portuguese; German; Turkish; Dutch; Ukrainian; Vietnamese; Indonesian; Japanese; Italian; Korean; Thai; Polish; Romanian; Greek; Czech; Finnish; Hindi; Bulgarian; Danish; Hebrew; Malay; Persian; Slovak; Swedish; Croatian; Filipino; Hungarian; Norwegian; Slovenian; Catalan; Nynorsk; Tamil; Afrikaans; auto. |
| voice_modify | object | Optional | |
| voice_modify.pitch | enum (integer) | Optional | Pitch adjustment (deep/bright). Range [-100, 100]. |
| voice_modify.intensity | enum (integer) | Optional | Intensity adjustment (strength/softness). Range [-100, 100]. |
| voice_modify.timbre | enum (integer) | Optional | Timbre adjustment (magnetic/crisp). Range [-100, 100]. |
| voice_modify.sound_effects | enum (string) | Optional | Sound effect setting. Values: spacious_echo, auditorium_echo, lofi_telephone, robotic. |
| output_format | enum (string) | Optional | Format of the output result. In streaming scenarios, only hex is supported. Values: url, hex. |

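
The asynchronous endpoint accepts much longer text and, like the synchronous one, allows several timbres to be blended through `timber_weights`. The sketch below shows one way such a request body might be assembled in Python. The function name `build_async_t2a_payload` and the voice IDs in the usage note are hypothetical, and whether `timber_weights` may be sent without `voice_setting.voice_id` is an assumption that this listing does not confirm.

```python
import json


def build_async_t2a_payload(text, voice_weights, model="speech-2.6-turbo",
                            language_boost=None):
    """Assemble an asynchronous T2A request body that blends several timbres.

    voice_weights maps a timbre ID to an integer weight in [1, 100].
    """
    if not 0 < len(text) < 50_000:
        raise ValueError("text must be non-empty and shorter than 50,000 characters")
    if not voice_weights:
        raise ValueError("at least one timbre is required")
    for voice_id, weight in voice_weights.items():
        if not isinstance(weight, int) or not 1 <= weight <= 100:
            raise ValueError(f"weight for {voice_id!r} must be an integer in [1, 100]")

    payload = {
        "model": model,
        "text": text,
        "timber_weights": [
            {"voice_id": voice_id, "weight": weight}
            for voice_id, weight in voice_weights.items()
        ],
        "audio_setting": {"sample_rate": 32000, "format": "mp3", "channel": 1},
    }
    # language_boost defaults to null, so it is only sent when requested.
    if language_boost is not None:
        payload["language_boost"] = language_boost
    return payload


if __name__ == "__main__":
    body = build_async_t2a_payload(
        "A long chapter of text...",
        {"example-voice-a": 70, "example-voice-b": 30},
        language_boost="auto",
    )
    print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized with `json.dumps` and posted with the same headers as the synchronous call. Keeping validation in the builder means an out-of-range weight or an over-long text fails locally instead of consuming a request.
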
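
| API Description | API Endpoint | Request Method | Stability |
|---|---|---|---|
| T2A (Status Inquiry) | | GET | Stable |

T2A (Status Inquiry): text-to-audio from MiniMax. Price: 0 PTC / call.

Header parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| Authorization | string | Optional | Example value: Bearer {{YOUR_API_KEY}} |
| Content-Type | string | Optional | Example value: application/json |

Query parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| task_id | string | Optional | |
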
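
Because the status inquiry is a GET call keyed by `task_id`, a client would typically check it repeatedly until the asynchronous task finishes. The following Python sketch shows one possible polling loop. The endpoint URL is a placeholder, the names `build_status_request` and `poll_task` are hypothetical, and the response fields are not documented in this listing, so the caller supplies the `is_done` check.

```python
import time
import urllib.parse
import urllib.request

# Placeholder (assumption): the listing does not show the status endpoint URL.
STATUS_URL = "https://example.invalid/t2a/status"


def build_status_request(task_id, api_key, url=STATUS_URL):
    """Build the GET request, passing task_id as a query parameter."""
    query = urllib.parse.urlencode({"task_id": task_id})
    return urllib.request.Request(
        f"{url}?{query}",
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="GET",
    )


def poll_task(fetch_status, task_id, is_done, interval_seconds=2.0,
              max_attempts=30, sleep=time.sleep):
    """Call fetch_status(task_id) until is_done(result) is true.

    Raises TimeoutError if the task is still unfinished after max_attempts checks.
    """
    for attempt in range(max_attempts):
        result = fetch_status(task_id)
        if is_done(result):
            return result
        # Wait between checks, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} checks")
```

In practice `fetch_status` could open `build_status_request(task_id, api_key)` with `urllib.request.urlopen` and parse the JSON body, while `is_done` inspects whichever field of the actual response signals completion. Passing both in as callables keeps the loop independent of the undocumented response schema.
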
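
| API Description | API Endpoint | Request Method | Stability |
|---|---|---|---|
| Files (Audio File Download) | | GET | Stable |

Files (Audio File Download): listed as a text-to-video model from MiniMax. Price: 0 PTC / call.

Header parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| Authorization | string | Optional | Example value: Bearer {{YOUR_API_KEY}} |

Query parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| file_id | string | Optional | |
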
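
The file download call takes a `file_id` query parameter and a bearer token. A minimal Python sketch of saving its response to disk follows. The endpoint URL is a placeholder, the function name `download_file` and its `opener` parameter are hypothetical, and the listing does not say whether the response body is the audio itself or metadata pointing to it, so the sketch stores whatever bytes come back.

```python
import shutil
import urllib.parse
import urllib.request

# Placeholder (assumption): the listing does not show the download endpoint URL.
FILES_URL = "https://example.invalid/files"


def download_file(file_id, api_key, destination, url=FILES_URL,
                  opener=urllib.request.urlopen):
    """Request a file by file_id and write the raw response body to destination.

    opener is injectable so the function can be exercised without network access.
    """
    query = urllib.parse.urlencode({"file_id": file_id})
    request = urllib.request.Request(
        f"{url}?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )
    # Stream the body to disk instead of holding it all in memory.
    with opener(request) as response, open(destination, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return destination
```

Streaming with `shutil.copyfileobj` avoids loading a long recording into memory at once. If the real response turns out to be JSON metadata, the saved bytes would need to be parsed and the referenced file fetched in a second step.
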
API Pricing
| Model | Description | 302.AI Price |
|---|---|---|
| speech-2.6-turbo | T2A (Speech Generation - Synchronous) | |
| speech-2.6-turbo | Asynchronous Long-Form Text-to-Speech Generation | |
| T2A | Status Inquiry | 0 PTC / call |
| Files (Audio File Download) | - | 0 PTC / call |