Meet MiniMax Speech-2.8

MiniMax Voice

Studio-grade text-to-speech, instant voice cloning,
and sub-200ms voice agents in 40+ languages.

Cloud or on-premise, at half the cost of ElevenLabs. Available today.

Get a Demo · View Pricing · Read the Docs

40+ languages · <200ms streaming TTFB · From $60 / 1M chars · On-prem available today

Why MiniMax Voice

Built for production,
priced for shipping.

Voice cloning across 40+ languages, including Mandarin, Arabic, and Hindi.
Per-sentence emotion & style prompts. No SSML soup.
On-premise deployment available today, not a 2026 roadmap promise.
Voice agent ready. Drops into LiveKit, Pipecat, Vapi, Retell, any SIP trunk.
About a third of ElevenLabs' price at HD ($100 vs ~$300 per 1M chars), roughly 40% less at Turbo ($60 vs ~$99).

QUALITY

Speech-2.8 ties with ElevenLabs Turbo v2.5 in blind MOS evaluations and outperforms it on emotional range. Speaker identity is preserved across long-form output without drift, making it usable for audiobook and podcast production end-to-end. No chunking, no manual stitching.

MOS 4.42 on internal eval set

Listen

Hear what production
voice sounds like.

Two unedited first-take samples from Speech-2.8. No mastering, no EQ, no post-processing. What you hear is what the API returns.

Natural

Golden Voice (human-like)

English · Speech-2.8

Hey, it's me. How are ya? (chuckle) I hope you're having an awesome day! We actually had a bit of a crazy launch day yesterday, but I'm just recovered and ready to roll. You're listening to this and probably thinking I'm just chatting into a microphone, but here's the twist: I'm actually not human. I am the new Speech 2.8 model from MiniMax.

Listen for

Breaths, chuckles, throat-clears. Every disfluency you'd expect from a human.

Bilingual

Japanese × English, mid-sentence

JP × EN · Speech-2.8

Oh my gosh, you won't believe it, 今日は本当にすごかったの! I was running late for work, それから電車が止まっちゃって, and I'm like, 'Seriously?!' でも大丈夫, because guess what, 道で昔の友達にばったり会ったの!

Listen for

Native prosody on both sides. Same voice through the switch, no model swap.

Need a specific language or persona? Mention it on the demo form and we'll generate one before the call.

TTS API

Two models. One API.
Real pricing.

Pick HD for content that ships to humans, Turbo for high-volume conversational workloads. Same SDK, same voices, same auth.

Most Expressive

Speech-2.8-HD

Cinematic delivery for content that ships to humans.

$100 per 1M characters

Studio-grade 32kHz audio
Fine-grained emotion & style control
40+ languages, native-quality prosody
Compatible with PVC custom voices
Long-form generation up to 30 minutes

Speech-2.8-Turbo

Built for high-volume, latency-sensitive workloads.

$60 per 1M characters

Sub-200ms time-to-first-byte
Low-latency PCM streaming
40+ languages, conversational tone
Compatible with IVC custom voices
Optimized for voice agent loops
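
For budgeting, the list prices above translate directly into a cost estimate. The sketch below assumes billing is strictly linear in character count, with no tiers, minimums, or free allowance (this page does not say either way). `PRICE_PER_MILLION_USD` and `estimate_cost` are illustrative names, not part of any MiniMax SDK.

```python
# Illustrative cost estimate from the list prices quoted on this page.
# Assumption: billing is linear in characters, with no tiers or minimums.
PRICE_PER_MILLION_USD = {
    "Speech-2.8-HD": 100.0,
    "Speech-2.8-Turbo": 60.0,
}


def estimate_cost(model: str, characters: int) -> float:
    """Return the estimated USD cost of synthesizing `characters` characters."""
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return PRICE_PER_MILLION_USD[model] * characters / 1_000_000


if __name__ == "__main__":
    # Example: a 500,000-character manuscript priced on each model.
    for model in PRICE_PER_MILLION_USD:
        print(f"{model}: ${estimate_cost(model, 500_000):.2f}")
```

Under these assumptions, a 500,000-character manuscript comes to $50.00 on HD and $30.00 on Turbo.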

Voice Cloning

Clone any voice in seconds,
or studio-grade in days.

Two cloning paths so you can match the quality bar to the use case. Both work across all 40+ supported languages.

IVC

Instant Voice Clone

Upload 10 seconds of audio and start generating in under a minute. Built for prototypes, character voices, and personalized assistants.

10s reference audio
Ready in <30 seconds
Pay-per-use, no setup fee

PVC

Professional Voice Clone

We fine-tune a dedicated model on a curated studio recording. Indistinguishable from the source speaker, even on long-form narration.

30 min curated recordings
Trained in 3 to 5 business days
Available cloud or on-premise

Languages

Cloned voices speak every language we support.

40+ supported

English, Mandarin, Spanish, Portuguese, French, German, Italian, Japanese, Korean, Arabic, Hindi, Turkish, Russian, Polish, Dutch, Indonesian, Vietnamese, Thai, Czech, Swedish, Danish, Finnish, Norwegian, Greek, Hebrew, Romanian, Hungarian, Ukrainian, Bulgarian, Croatian, Slovak, Tagalog, Malay, Bengali, Tamil, Telugu, Marathi, Urdu, Persian, Swahili

Voice Agent

Built for sub-200ms
voice loops.

Drop into LiveKit, Pipecat, or your own stack. The same TTS that powers our cloud API, tuned for real-time conversation.

Sub-200ms TTFB

Time-to-first-byte under 200ms on Turbo. Fast enough for natural turn-taking inside a voice loop.
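
To check time-to-first-byte in your own environment, you can time how long a streaming response takes to yield its first audio chunk. This is a minimal sketch, not the MiniMax SDK: `measure_ttfb` is a hypothetical helper that accepts any lazy iterable of byte chunks, and `fake_stream` stands in for a real streaming call. The timer starts when the helper is called, so pass it a stream that has not yet begun sending.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttfb(chunks: Iterable[bytes]) -> Tuple[float, Iterator[bytes]]:
    """Time the wait for the first chunk, then hand back the full stream.

    Returns (seconds until first chunk, iterator replaying every chunk).
    """
    start = time.perf_counter()
    iterator = iter(chunks)
    first = next(iterator, b"")  # blocks until the first chunk arrives
    ttfb = time.perf_counter() - start

    def replay() -> Iterator[bytes]:
        # Re-emit the chunk consumed for timing so no audio is lost.
        if first:
            yield first
        yield from iterator

    return ttfb, replay()


def fake_stream() -> Iterator[bytes]:
    """Stand-in for a streaming TTS response (hypothetical)."""
    time.sleep(0.05)  # simulated network and model latency
    yield b"\x00\x01" * 240
    yield b"\x00\x01" * 240


if __name__ == "__main__":
    ttfb, audio = measure_ttfb(fake_stream())
    total = sum(len(chunk) for chunk in audio)
    print(f"TTFB: {ttfb * 1000:.0f} ms, received {total} bytes")
```

Measured this way, the figure includes network round-trip time from wherever the client runs, so results will vary by region and connection.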

PCM streaming

Stream raw 24kHz PCM directly into LiveKit, Pipecat, or your custom WebRTC pipeline.
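
When wiring raw PCM into a real-time pipeline, it helps to know how many bytes correspond to a given slice of audio. The sketch below uses the 24kHz rate stated above and assumes 16-bit mono samples, which this page does not specify; adjust the constants if the actual format differs. All names are illustrative.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 24_000  # stated on this page
BYTES_PER_SAMPLE = 2     # assumption: 16-bit samples
CHANNELS = 1             # assumption: mono

BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


def bytes_for_duration(milliseconds: int) -> int:
    """Number of PCM bytes that hold `milliseconds` of audio."""
    return BYTES_PER_SECOND * milliseconds // 1000


def duration_ms(num_bytes: int) -> float:
    """Playback duration in milliseconds of `num_bytes` of PCM."""
    return num_bytes * 1000 / BYTES_PER_SECOND


def frames(pcm: bytes, frame_ms: int = 20) -> Iterator[bytes]:
    """Split a PCM buffer into fixed-duration frames (the last may be short)."""
    size = bytes_for_duration(frame_ms)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


if __name__ == "__main__":
    one_second = bytes(BYTES_PER_SECOND)  # one second of silence
    print(bytes_for_duration(20))         # bytes in a 20 ms frame
    print(len(list(frames(one_second))))  # number of 20 ms frames
```

The 20 ms default is a common frame size for real-time audio; under the assumptions above it works out to 960 bytes per frame.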

Emotion & style control

Per-sentence prompts for tone, pace, energy, and emphasis. No prompt engineering tricks required.

Mid-sentence interruption

Cut audio cleanly when the user barges in, then resume with state intact.

40+ languages, mid-call

Switch language inside a conversation without swapping models or reloading voices.

SIP-ready

Drop into any SIP trunk. Tested for telephony codecs and 8kHz fallback.

Compare

MiniMax vs ElevenLabs
vs Cartesia.

Where each platform actually wins. Numbers reflect public pricing and documented capabilities at the time of writing.

Capability	MiniMax Voice	ElevenLabs	Cartesia
HD model, per 1M chars	$100	~$300	$99
Turbo model, per 1M chars	$60	~$99	$25
Languages supported	40+	70+	15+
Streaming TTFB	<200 ms	~250 ms	<90 ms
Voice cloning	IVC + PVC	IVC + PVC	IVC only
On-premise deployment	Available now	Early access, Apr 2026	Self-hosted (Enterprise)
Emotion / style control	Native, per-sentence	Native	Limited
Long-form audio (>10 min)	Single call up to 30 min	Chunking required	Chunking required

Sources: vendor pricing pages and product changelogs as of April 2026. ElevenLabs on-premise was announced as an early-access offering for April 2026 and is not yet production-proven.

Ready when you are

Ship voice this quarter.

Book a 30-minute walkthrough with our solutions team. We'll generate samples in your target language and scope a deployment that fits your constraints, whether cloud, hybrid, or fully on-premise.

Book a Demo · Read the Docs