
SSML

Speech Synthesis Markup Language (SSML) is an XML-based markup language that allows you to precisely control the synthesized output via additional XML tags. During the beta, the AI API only supports the selected tags listed below, but we're looking to expand the list of supported tags in the future. Every SSML document begins and ends with a <speak> tag, with the content to be synthesized contained within.

xml
<speak>Your content to be synthesized here</speak>

Some characters must be escaped when transforming text to SSML.

& -> &amp;

> -> &gt;

< -> &lt;

ts
const escapeSSMLChars = (text: string) =>
  text
    .replaceAll('&', '&amp;')
    .replaceAll('<', '&lt;')
    .replaceAll('>', '&gt;')

For example, Some text with 5 < 6 & 4 > 8 in it -> <speak>Some text with 5 &lt; 6 &amp; 4 &gt; 8 in it</speak>

The speech marks returned by the AI API use indices into the escaped text. This means that you must map those indices back to the original, unescaped text. You may consider using our string tracker library to assist in the mapping. This may look something like the following:

ts
import { createStringTracker, StringTracker } from "@speechifyinc/string-tracker";

type SpeechMarksChunk = {
  startTime: number,
  endTime: number,
  start: number,
  end: number,
  value: string,
}

type SpeechMarks = SpeechMarksChunk & {
  chunks: SpeechMarksChunk[],
}

const applyTrackerToSpeechMarkChunk =
  (tracker: StringTracker) => (chunk: SpeechMarksChunk) => ({
    ...chunk,
    value: tracker.slice(chunk.start, chunk.end).getOriginal(),
    start: tracker.getIndexOnOriginal(chunk.start),
    end: tracker.getIndexOnOriginal(chunk.end),
  })

const text = "Example sentence with & and < and > to make sure it works";
const textTracker = createStringTracker(text)
  .replaceAll("&", "&amp;")
  .replaceAll("<", "&lt;")
  .replaceAll(">", "&gt;")

const ssml = `<speak>${textTracker.get()}</speak>`;
const synthesisResponse = await fetch(`${API_URL}/tts/v0/get`, { ... })

const { audioData, audioFormat, speechMarks: ssmlSpeechMarks } = await synthesisResponse.json()
const speechMarks: SpeechMarks = {
  ...applyTrackerToSpeechMarkChunk(textTracker)(ssmlSpeechMarks),
  chunks: ssmlSpeechMarks.chunks.map(applyTrackerToSpeechMarkChunk(textTracker))
}

<voice>

The voice tag allows you to make a request containing multiple voices. The AI API will automatically split and concatenate these voices together. The tag must contain all the attributes of a voice object. e.g. <voice engine="speechify-1" name="mrbeast" language="en-us">. These voices can be deeply nested inside of each other. Text that is not wrapped in a voice tag will use the voice provided in the request.

xml
<speak>
  This is a sentence
  <voice engine="speechify-1" name="mrbeast">
    that suddenly switches to a different voice that is
    <voice engine="speechify-1" name="gwyneth">
      deeply nested
    </voice>
    inside of another voice
  </voice>
  all the way back up
</speak>
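When building multi-voice requests programmatically, the nesting above can be generated with a small helper. The following is a minimal sketch; voiceTag is a hypothetical helper, not part of the API, and its attribute names mirror the voice object fields shown above.

```ts
type VoiceAttrs = { engine: string; name: string; language?: string };

// Wraps already-escaped SSML content in a <voice> tag, emitting the
// language attribute only when it is provided.
const voiceTag = (attrs: VoiceAttrs, content: string) => {
  const language = attrs.language ? ` language="${attrs.language}"` : "";
  return `<voice engine="${attrs.engine}" name="${attrs.name}"${language}>${content}</voice>`;
};

const nested = voiceTag(
  { engine: "speechify-1", name: "mrbeast" },
  `switches voices ${voiceTag({ engine: "speechify-1", name: "gwyneth" }, "deeply nested")} and back`
);
const ssml = `<speak>This is a sentence ${nested} all the way back up</speak>`;
```

Remember that the content passed to such a helper must already have its special characters escaped, as described at the top of this page.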

<prosody>

The prosody element in Speech Synthesis Markup Language (SSML) is a powerful tool used to control and enhance the expressiveness of synthesized speech. It allows you to manipulate three primary attributes of spoken text: pitch, rate, and volume. For example:

xml
<speak>
    This is a normal speech pattern.
    <prosody pitch="high" rate="fast" volume="+20%">
        I'm speaking with a higher pitch, faster than usual, and louder!
    </prosody>
    Back to normal speech pattern.
</speak>

Attributes

pitch

Adjusts the pitch at which the speech is delivered. Valid values include:

  • x-low
  • low
  • medium (default)
  • high
  • x-high
  • Percentage expressed as a number preceded by + (optionally) or - and followed by % (e.g., +20%, -30%). Valid range is -83% to +100%, but the effective range may extend further when combined with rate.
xml
<speak>
    <prosody pitch="high">Hello! I am a cheerful character.</prosody>
    <prosody pitch="-50%">And I am a more serious character.</prosody>
</speak>

rate

Alters the speed at which the speech is spoken. It allows the following values:

  • x-slow
  • slow
  • medium (default)
  • fast
  • x-fast
  • Percentage expressed as a number preceded by + (optionally) or - and followed by % (e.g., +20%, -30%). Valid range is -50% to +9900%.
xml
<speak>
    This is spoken at a <prosody rate="slow">slower rate</prosody>, while this is <prosody rate="fast">much faster</prosody> or <prosody rate="500%">insanely fast.</prosody>
</speak>

volume

Controls the loudness of the speech. In addition to the standard levels (silent, x-soft, soft, medium, loud, x-loud), it supports percentage adjustments (e.g., +10%, -20%).

  • silent
  • x-soft
  • soft
  • medium (default)
  • loud
  • x-loud
  • A number preceded by + or - and immediately followed by dB (e.g., -6dB)
  • Percentage expressed as a number preceded by + (optionally) or - and followed by % (e.g., +20%, -30%)
xml
<speak>
    <prosody volume="-6dB">Sometimes</prosody> it can be useful to
    <prosody volume="loud">increase the volume for a specific speech.</prosody>
</speak>
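All three attributes can be combined on a single prosody tag. As a minimal sketch, a hypothetical prosodyTag helper (not part of the API) might emit only the attributes that were provided:

```ts
type ProsodyAttrs = { pitch?: string; rate?: string; volume?: string };

// Builds a <prosody> tag from the provided attributes, skipping any
// that are undefined.
const prosodyTag = (attrs: ProsodyAttrs, content: string) => {
  const attrString = Object.entries(attrs)
    .filter(([, value]) => value !== undefined)
    .map(([key, value]) => ` ${key}="${value}"`)
    .join("");
  return `<prosody${attrString}>${content}</prosody>`;
};
```

For instance, prosodyTag({ pitch: "high", rate: "fast", volume: "+20%" }, "...") produces the tag used in the first prosody example above.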

<break>

The <break> tag controls pausing or other prosodic boundaries between words. It follows the W3 specifications. The strength attribute can take values such as none, x-weak, weak, medium, strong, or x-strong. Additionally, the time attribute can be specified in seconds or milliseconds. Example usage:

xml
<speak>
    Sometimes it can be useful to add a longer pause at the end of the sentence.
    <break strength="medium" /> 
    Or <break time="100ms" /> sometimes in the middle.
</speak>

Attributes

strength

Specifies the strength of the pause, influencing its duration. Supported values include:

  • none: 0ms
  • x-weak: 250ms
  • weak: 500ms
  • medium: 750ms
  • strong: 1000ms
  • x-strong: 1250ms
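If you need to reason about pause lengths in code, the durations above can be captured in a lookup table. This is a hypothetical mapping for illustration, not part of the API; the millisecond values are taken directly from the list above.

```ts
// Pause duration in milliseconds for each supported strength value.
const breakStrengthMs: Record<string, number> = {
  none: 0,
  "x-weak": 250,
  weak: 500,
  medium: 750,
  strong: 1000,
  "x-strong": 1250,
};
```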

time

Allows for the specification of pause duration in either milliseconds (ms) or seconds (s). This attribute offers precise control over the length of the pause.
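As a sketch, a value such as 100ms or 1.5s can be checked with a simple pattern; isValidBreakTime is a hypothetical validator for illustration, not part of the API.

```ts
// Accepts a non-negative number (optionally with a decimal part)
// immediately followed by "ms" or "s", e.g. "100ms" or "1.5s".
const isValidBreakTime = (value: string) => /^\d+(\.\d+)?(ms|s)$/.test(value);
```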

<sub>

The <sub> tag is used to substitute a different pronunciation for the contained text. It follows the W3 specifications. The required alias attribute accepts any text as its value.

xml
<speak>
    For detailed information, please read the <sub alias="Frequently Asked Questions">FAQ</sub> section.
</speak>

Attributes

alias

Specifies a string to be spoken instead of the enclosed text.

<speechify:clone>

To clone a voice using an audio prompt, you must provide a source attribute. The source attribute is required and can be in the form of either a Data URI or a direct link.

The following audio formats are supported for voice cloning:

  • .mp3
  • .ogg
  • .wav
  • .flac
  • .aac
  • .m4a
  • .webm

Data URI Format

The Data URI should follow the format: data:audio/<type>;base64,<data>

Example

xml
<speak>
    <speechify:clone source="data:audio/mp3;base64,Yasd=">
        Hello, this is a cloned voice.
    </speechify:clone>
</speak>
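As a sketch, assuming a Node.js environment, the Data URI can be built from raw audio bytes like so; toAudioDataUri is a hypothetical helper, not part of the API.

```ts
// Encodes raw audio bytes as a Data URI in the
// data:audio/<type>;base64,<data> format described above.
const toAudioDataUri = (audio: Buffer, type: string) =>
  `data:audio/${type};base64,${audio.toString("base64")}`;

// e.g. toAudioDataUri(fs.readFileSync("voice_sample.mp3"), "mp3")
```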

Direct Link Format

The direct link should follow the format: <protocol>://<path>.<audio format>

Example

xml
<speak>
    <speechify:clone source="gs://project-name.appspot.com/voices/voice_sample.wav">
        Hello, this is a cloned voice.
    </speechify:clone>
</speak>

Note: when calling the TTS API with <speechify:clone> you must hardcode the voice:

json
"voice": {
  "engine": "speechify-1",
  "language": "en-US",
  "name": "clone"
}