
Audio-Translation
API Overview
Feature Overview
Takes an audio clip as input and generates translated speech in another language while preserving the original speaker's voice tone.
Processing Workflow Explanation
🔄 Overall Processing Flow
Input Audio → Speech-to-Text (STT) → Text Translation → Voice Tone Cloning → Text-to-Speech (TTS) → Output Audio
The system automatically performs end-to-end translation from audio to audio, maintaining either the original or a specified voice tone.
Detailed Processing Steps
📋 Five Core Steps
1️⃣ Initialization (initialization)
Download and prepare the audio file
Automatically trim the audio used for voice cloning to at most 30 seconds
2️⃣ Speech Recognition (speech_to_text)
Use OpenAI’s gpt-4o-transcribe to convert audio into text
3️⃣ Translation (translation)
Translate the recognized text with an LLM; users can select the model (default: claude-haiku-4-5-20251001)
4️⃣ Voice Tone Cloning (voice_clone)
Analyze and extract audio features
5️⃣ Text-to-Speech (text_to_speech)
Generate target-language audio using the cloned voice tone
Output high-quality audio files
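The five steps above run strictly in order, which can be sketched as a small driver loop. This is only an illustration: the step identifiers come from the list above, while run_pipeline, make_stub_handlers, and the context dictionary are hypothetical names, and the stub handlers stand in for the real STT, translation, cloning, and TTS calls, which are not described here.

```python
from typing import Any, Callable, Dict, List, Tuple

# Step identifiers as listed above, in processing order.
STEPS = [
    "initialization",
    "speech_to_text",
    "translation",
    "voice_clone",
    "text_to_speech",
]

# Hypothetical handler shape: takes the shared context, returns an updated one.
Handler = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_pipeline(
    handlers: Dict[str, Handler], context: Dict[str, Any]
) -> Tuple[Dict[str, Any], List[str]]:
    """Run every step in order, passing the context from one step to the next."""
    completed: List[str] = []
    for step in STEPS:
        context = handlers[step](context)
        completed.append(step)
    return context, completed


def make_stub_handlers() -> Dict[str, Handler]:
    """Placeholder handlers that only record which step ran (no real processing)."""

    def make(step: str) -> Handler:
        def handler(ctx: Dict[str, Any]) -> Dict[str, Any]:
            return {**ctx, "trace": ctx.get("trace", []) + [step]}

        return handler

    return {step: make(step) for step in STEPS}
```

Calling run_pipeline(make_stub_handlers(), {"audio_url": "clip.wav"}) returns the final context together with the list of completed steps, which is one simple way to report progress per step.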
Vendor Selection Logic
🎯 Automatic Selection Rules
The system automatically chooses the best voice tone cloning vendor based on the target language:
Selection Priority
1. User-Specified: If a vendor is designated and supports the target language, it takes precedence
2. Language Match: Use index_tts2 for Chinese and English; otherwise, opt for Fish
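The two priority rules can be expressed as a short function. This is a minimal sketch under stated assumptions: the language codes "zh" and "en", the lowercase vendor identifier "fish", and the VENDOR_LANGUAGES support table are all hypothetical, since the actual identifiers and per-vendor language coverage are not given here.

```python
from typing import Dict, Optional, Set

# Hypothetical support table. None means "no language restriction assumed".
VENDOR_LANGUAGES: Dict[str, Optional[Set[str]]] = {
    "index_tts2": {"zh", "en"},
    "fish": None,
}


def supports(vendor: str, language: str) -> bool:
    """Return True if the vendor is known and covers the given language."""
    if vendor not in VENDOR_LANGUAGES:
        return False
    languages = VENDOR_LANGUAGES[vendor]
    return languages is None or language in languages


def select_vendor(target_language: str, requested: Optional[str] = None) -> str:
    """Apply the selection priority: user-specified first, then language match."""
    # Rule 1: a user-specified vendor wins if it supports the target language.
    if requested is not None and supports(requested, target_language):
        return requested
    # Rule 2: index_tts2 for Chinese and English, Fish for everything else.
    if target_language in {"zh", "en"}:
        return "index_tts2"
    return "fish"
```

In this sketch a requested vendor that does not support the target language is silently ignored and the language rule applies instead, so select_vendor("ja", "index_tts2") falls back to "fish". Whether the real service falls back or returns an error in that case is not stated.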
⚠️ Important Notes
Recommended audio length for cloning is 10–30 seconds; longer clips are automatically trimmed to within 30 seconds
For best results, audio should be clear and free of background noise
Supported formats: MP3, WAV
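To avoid relying on server-side trimming, a clip can be shortened locally before upload. The sketch below handles WAV input only, using Python's standard wave module; MP3 would need a third-party decoder. The function names are hypothetical, and keeping the first 30 seconds is an assumption, since the notes above do not say which part of a long clip the service keeps.

```python
import io
import wave

MAX_CLONE_SECONDS = 30  # upper bound from the notes above


def wav_duration(data: bytes) -> float:
    """Return the duration of an in-memory WAV file in seconds."""
    with wave.open(io.BytesIO(data), "rb") as src:
        return src.getnframes() / src.getframerate()


def trim_wav(data: bytes, max_seconds: float = MAX_CLONE_SECONDS) -> bytes:
    """Keep at most the first max_seconds of a WAV clip (assumed trimming policy)."""
    with wave.open(io.BytesIO(data), "rb") as src:
        params = src.getparams()
        max_frames = int(max_seconds * src.getframerate())
        frames = src.readframes(min(src.getnframes(), max_frames))

    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)    # same channels, sample width, and frame rate
        dst.writeframes(frames)  # header frame count is corrected on close
    return out.getvalue()
```

Clips already shorter than the limit pass through with their audio unchanged, so the function is safe to apply to every WAV input.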