SSML Generator & TTS Demo - Online Speech Synthesis

Generate Speech Synthesis Markup Language (SSML) and preview with browser text-to-speech

SSML generated for use with Amazon Polly, Google Cloud TTS, Microsoft Azure TTS & more
Quick Examples
Warm Greeting: Emphasis & breaks for a friendly hello
Phone Number: Say-as for telephone number reading
News Brief: Prosody & paragraphs for news reading
Language Learning: Slow prosody with phoneme example
Frequently Asked Questions

SSML (Speech Synthesis Markup Language) is an XML-based markup language that gives you fine-grained control over how text-to-speech engines pronounce your content. With SSML, you can add pauses, adjust speaking rate and pitch, emphasize words, spell out characters, pronounce phonetically, and much more — resulting in more natural, expressive audio output compared to plain text TTS.

The major cloud TTS providers support SSML, including Amazon Polly, Google Cloud Text-to-Speech, Microsoft Azure Cognitive Services Speech, and IBM Watson Text to Speech. Each service implements a subset of the W3C SSML specification plus some vendor-specific extensions, and support can differ between voices. The SSML generated by this tool follows the W3C standard, so the core tags should work on most platforms; check your provider's documentation before relying on less common tags or attributes.

The most popular SSML tags include: <break> for inserting pauses, <emphasis> for stressing words, <prosody> for controlling rate/pitch/volume, <say-as> for interpreting numbers/dates/phone numbers, <p> and <s> for structuring paragraphs and sentences, <phoneme> for precise pronunciation, and <sub> for substituting spoken text.
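To show how several of these tags fit together, here is a minimal sketch that assembles a small SSML document with Python's standard xml.etree.ElementTree. The function name and the sample sentences are made up for illustration; building the tree with a library rather than by string concatenation keeps the output well-formed.

```python
import xml.etree.ElementTree as ET


def build_sample_ssml() -> str:
    """Assemble a small SSML document using several common elements."""
    speak = ET.Element("speak")
    p = ET.SubElement(speak, "p")

    # First sentence: stress one word with <emphasis>.
    s1 = ET.SubElement(p, "s")
    s1.text = "Welcome to "
    emphasis = ET.SubElement(s1, "emphasis", level="strong")
    emphasis.text = "SSML"
    emphasis.tail = "."

    # A half-second pause between the two sentences.
    ET.SubElement(p, "break", time="500ms")

    # Second sentence: <sub> replaces the written text with a spoken alias.
    s2 = ET.SubElement(p, "s")
    s2.text = "This page follows the "
    sub = ET.SubElement(s2, "sub", alias="World Wide Web Consortium")
    sub.text = "W3C"
    sub.tail = " specification."

    return ET.tostring(speak, encoding="unicode")


ssml = build_sample_ssml()
print(ssml)
```

The printed string can be pasted into any service that accepts SSML input, subject to that service's tag support.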

Amazon Polly accepts SSML directly in its SynthesizeSpeech API call (synthesize-speech in the AWS CLI). Set the TextType parameter to "ssml" and provide your SSML content wrapped in <speak> tags. Polly also supports additional tags such as <amazon:effect name="whispered"> for whispering and <amazon:auto-breaths> for automatic breath sounds; these Amazon-specific tags are not available for every voice type, so check the Polly documentation for the voice you use.
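As a rough sketch of that call from Python, the helper below builds the request parameters and a second function sends them with boto3. The helper names, the "Joanna" voice, and the MP3 output format are illustrative choices, and the network call assumes boto3 is installed and AWS credentials are configured.

```python
def build_polly_request(ssml: str, voice_id: str = "Joanna") -> dict:
    """Return keyword arguments for Polly's SynthesizeSpeech operation."""
    if not ssml.lstrip().startswith("<speak"):
        raise ValueError("SSML must be wrapped in <speak> tags")
    return {
        "Text": ssml,
        "TextType": "ssml",      # tells Polly to parse the text as SSML
        "OutputFormat": "mp3",
        "VoiceId": voice_id,     # illustrative default voice
    }


def synthesize_to_file(ssml: str, path: str) -> None:
    """Send the SSML to Polly and write the returned audio to a file.

    Requires boto3 and valid AWS credentials; not exercised here.
    """
    import boto3  # imported lazily so build_polly_request works without it

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**build_polly_request(ssml))
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
```

A call such as synthesize_to_file('<speak>Hello<break time="500ms"/>world</speak>', "hello.mp3") would then write the synthesized audio to disk.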

This tool includes a built-in TTS demo that uses your browser's Web Speech API. Browser speech synthesis does not parse SSML in practice, so the tool strips the SSML tags and speaks the plain text with your selected rate, pitch, and volume settings for a quick preview. For full SSML fidelity, send the generated SSML to a cloud TTS service such as Amazon Polly or Google Cloud TTS.
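The tool's own stripping logic is not shown on this page. As a minimal sketch of the idea, the Python function below reduces SSML to plain text by walking the parsed tree and keeping only the text nodes. It assumes the SSML has no XML namespace declaration, and the choice of which tags become a space is an illustrative one.

```python
import xml.etree.ElementTree as ET

# Elements that mark a boundary; a space is inserted where they end so
# adjacent sentences do not run together once the tags are removed.
BOUNDARY_TAGS = {"p", "s", "break"}


def ssml_to_plain_text(ssml: str) -> str:
    """Drop all SSML markup and return whitespace-normalized plain text."""
    root = ET.fromstring(ssml)
    parts = []

    def walk(element: ET.Element) -> None:
        if element.text:
            parts.append(element.text)
        for child in element:
            walk(child)
            if child.tag in BOUNDARY_TAGS:
                parts.append(" ")
            if child.tail:
                parts.append(child.tail)

    walk(root)
    # Collapse runs of whitespace left behind by removed tags.
    return " ".join("".join(parts).split())


sample = (
    '<speak><p><s>Hello <emphasis level="strong">world</emphasis>!</s>'
    '<s>Call <say-as interpret-as="telephone">555-0100</say-as> now.</s></p></speak>'
)
print(ssml_to_plain_text(sample))
```

Information carried only by attributes, such as a <sub> alias or a <break> duration, is lost in this reduction, which is why the preview is only an approximation.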

To insert a pause, use the <break> tag. It is a self-closing tag that accepts a time attribute. For example, <break time="500ms"/> inserts a half-second pause. You can also use <break time="2s"/> for longer pauses, or <break strength="strong"/> for a context-appropriate pause without specifying an exact duration.
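If you generate SSML programmatically, a small helper can keep the break syntax consistent. The sketch below is one possible shape for such a helper; the function name and its millisecond parameter are illustrative, and the strength values are those listed in the W3C SSML specification.

```python
from typing import Optional

# Strength values defined for <break> in the W3C SSML specification.
STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}


def break_tag(time_ms: Optional[int] = None, strength: Optional[str] = None) -> str:
    """Return a self-closing <break> tag with the requested attributes."""
    attrs = []
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError(f"unknown strength: {strength!r}")
        attrs.append(f'strength="{strength}"')
    if time_ms is not None:
        if time_ms < 0:
            raise ValueError("time_ms must be non-negative")
        attrs.append(f'time="{time_ms}ms"')
    if not attrs:
        return "<break/>"
    return "<break " + " ".join(attrs) + "/>"


print(break_tag(500))
print(break_tag(strength="strong"))
```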

The <prosody> tag supports three key attributes: rate (speaking speed: x-slow, slow, medium, fast, x-fast, or a percentage like "80%"), pitch (voice pitch: x-low, low, medium, high, x-high, or a semitone value like "+2st"), and volume (loudness: silent, x-soft, soft, medium, loud, x-loud, or dB values like "+3dB"). These can be combined for nuanced control.
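A minimal Python sketch of combining these attributes follows. The helper name is illustrative, and it does not validate attribute values, since the accepted ranges vary by TTS service; it only escapes the text and attribute values so the result stays well-formed XML.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def prosody(
    text: str,
    rate: Optional[str] = None,
    pitch: Optional[str] = None,
    volume: Optional[str] = None,
) -> str:
    """Wrap text in a <prosody> element, emitting only the attributes given."""
    attrs = {"rate": rate, "pitch": pitch, "volume": volume}
    rendered = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in attrs.items()
        if value is not None
    )
    # escape() converts &, <, and > so the text cannot break the markup.
    return f"<prosody{rendered}>{escape(text)}</prosody>"


print(prosody("Slow & steady", rate="80%", pitch="+2st"))
```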

SSML is XML, so element and attribute names are case-sensitive: write <break>, not <BREAK> or <Break>. Some engines may tolerate other casing, but a standards-compliant processor is not required to recognize it. Attribute values such as "strong", "slow", or "ipa" should also be lowercase for compatibility across TTS services. This tool always generates properly cased, standards-compliant SSML.
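As a rough check of how a strict XML parser treats tag casing, the sketch below uses Python's xml.etree.ElementTree. The KNOWN_TAGS set is an illustrative subset, not a complete list of SSML elements. An uppercase name parses as a different element from its lowercase form, and mismatched casing between an opening and a closing tag is a parse error.

```python
import xml.etree.ElementTree as ET
from typing import List

# Illustrative subset of SSML element names, all lowercase.
KNOWN_TAGS = {
    "speak", "break", "emphasis", "prosody", "say-as", "p", "s", "phoneme", "sub",
}


def unknown_tags(ssml: str) -> List[str]:
    """Return element names that are not in KNOWN_TAGS.

    Raises xml.etree.ElementTree.ParseError if the input is not well-formed.
    """
    root = ET.fromstring(ssml)
    return sorted({el.tag for el in root.iter() if el.tag not in KNOWN_TAGS})


print(unknown_tags('<speak>Hi<break time="1s"/></speak>'))
print(unknown_tags('<speak>Hi<BREAK time="1s"/></speak>'))

try:
    ET.fromstring('<speak><Break time="1s"></break></speak>')
except ET.ParseError as error:
    print("not well-formed:", error)
```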

Wrap numbers with the <say-as> tag to control how they're spoken. Use interpret-as="cardinal" for regular numbers ("one hundred twenty-three"), "ordinal" for rankings ("first"), "telephone" for phone-style digit-by-digit reading, "date" with a format attribute for dates, "time" for time values, and "characters" to spell out each character individually (useful for codes and IDs). The set of supported interpret-as values varies by provider, so confirm the ones you need in your TTS service's documentation.
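A short Python sketch of generating these elements is shown here. The helper name and its fmt parameter are illustrative, and the "ymd" date format in the example is an assumption about what your provider accepts.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Wrap text in <say-as>, adding a format attribute only when given."""
    attrs = f" interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as{attrs}>{escape(text)}</say-as>"


print(say_as("123", "cardinal"))
print(say_as("555-0100", "telephone"))
print(say_as("2024-03-15", "date", fmt="ymd"))  # "ymd" is provider-dependent
```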

The <p> (paragraph) tag represents a paragraph-level structure and typically inserts a longer pause before and after the content — ideal for separating distinct topics or sections. The <s> (sentence) tag marks individual sentences within a paragraph, creating shorter pauses. Using both tags helps TTS engines deliver more natural-sounding speech with appropriate prosodic boundaries.
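As an illustration of adding this structure automatically, the sketch below splits plain text into paragraphs on blank lines and into sentences on end punctuation, then wraps each in <p> and <s>. The function name is illustrative and the sentence splitter is deliberately naive: abbreviations such as "Dr." will be split incorrectly.

```python
import re
from xml.sax.saxutils import escape


def wrap_paragraphs(text: str) -> str:
    """Wrap blank-line-separated paragraphs in <p> and their sentences in <s>."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    out = ["<speak>"]
    for paragraph in paragraphs:
        # Split after ., ! or ? when followed by whitespace (naive heuristic).
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        body = "".join(f"<s>{escape(sentence)}</s>" for sentence in sentences)
        out.append(f"<p>{body}</p>")
    out.append("</speak>")
    return "".join(out)


print(wrap_paragraphs("Hello there. How are you?\n\nFine & well."))
```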