TTS - Beta

The AI API's synthesis interface is identical across all supported models. Every request and response has three components: content, voice, and format.

Authorization

The AI API is protected with API key authentication. You can obtain an API key by signing up at our home page.

Each request must be accompanied by an API key, which can be provided in the Authorization header. Requests without a valid API key will be rejected with a 401 Unauthorized response.

Authorization: ApiKey YOUR_API_KEY_HERE
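
A client might build this header as follows. This is a minimal sketch; buildAuthHeaders is an illustrative helper, not part of the API.

```ts
// Sketch: build the Authorization header for an AI API request.
// The scheme is the literal word "ApiKey" followed by your key.
function buildAuthHeaders(apiKey: string): Record<string, string> {
  return {
    Authorization: `ApiKey ${apiKey}`,
  };
}
```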

Content

The AI API uses the standard SSML format for content. Speech Synthesis Markup Language (SSML) is an XML-based markup language that allows you to precisely control the output via additional tags such as <prosody>, <emotion>, <emphasis>, and more. Please note that during the beta these additional SSML tags are not yet implemented, but we plan to support them in the future. Every SSML document begins and ends with a <speak> tag, with the content to be synthesized contained within it.

xml
<speak>Your content to be synthesized here</speak>

For more information, see the SSML section.
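
Since SSML is XML, plain text should be escaped before being wrapped in the <speak> tag. The helper below is a sketch, not part of the API; it assumes the input is plain text with no existing markup.

```ts
// Sketch: wrap plain text in a <speak> tag, escaping XML special
// characters so the payload remains well-formed SSML.
function toSsml(text: string): string {
  const escaped = text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  return `<speak>${escaped}</speak>`;
}
```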

Voices

The voice defines which engine to use and which voice within that engine to use for synthesis, with all values being case-insensitive. A full list of voices can be retrieved from the voices endpoint.

ts
type Voice = {
  name: string
  engine: string
  language: string
}

For more information, see the voices section.
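
Because voice values are case-insensitive, a client-side lookup should compare them accordingly. A sketch, with findVoice as an illustrative helper:

```ts
type Voice = {
  name: string;
  engine: string;
  language: string;
};

// Sketch: case-insensitive lookup of a voice by name, mirroring the
// API's case-insensitive matching of voice values.
function findVoice(voices: Voice[], name: string): Voice | undefined {
  return voices.find((v) => v.name.toLowerCase() === name.toLowerCase());
}
```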

Formats

The AI API will provide ogg or mp3, whichever offers the lowest latency and highest quality; this is almost always ogg. You can force the AI API to provide one or the other, but it is highly preferable that you do not. In some cases, where the end device does not support ogg (such as clients on Safari or iOS), you will unfortunately have no choice.

ts
type AudioFormat = 'ogg' | 'mp3'
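
One way to follow this guidance is to force a format only when the client is known not to support ogg. The user-agent check below is an illustrative simplification, not an exhaustive detection method:

```ts
type AudioFormat = "ogg" | "mp3";

// Sketch: return a forced format only for clients that cannot play
// ogg (Safari and iOS browsers); otherwise return undefined and let
// the API pick the best format.
function forcedFormatFor(userAgent: string): AudioFormat | undefined {
  const ua = userAgent.toLowerCase();
  // Chrome's user agent also contains "safari", so exclude it.
  const isSafari = ua.includes("safari") && !ua.includes("chrome");
  const isIos = /iphone|ipad|ipod/.test(ua);
  return isSafari || isIos ? "mp3" : undefined;
}
```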

Endpoints

Development (Live): https://api.dev.speechify.ai

Production (In-progress): https://api.speechify.ai

Examples

For examples of how to call the following endpoints, see the examples section.

POST /tts/v0/get

The /get endpoint allows clients to request a piece of SSML to be synthesized. The ideal request length is around 2-4 sentences. Single sentences are discouraged, since some models benefit from additional context for their prosody. Very large requests are internally parallelized into many smaller requests, so latency should remain relatively consistent.

Required Headers

Content-Type: application/json

WARNING

ssml must be a maximum of 5000 characters in total and a maximum of 2000 characters excluding the tags (just the text). For example:

ssml: "<speak>Hello world</speak>"

IncludingTags = "<speak>Hello world</speak>".length = 26

ExcludingTags = "Hello world".length = 11
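
A client can verify both limits before sending a request. This is a sketch: stripping tags with a regex is a simplification that assumes well-formed SSML without CDATA sections.

```ts
// Sketch: compute both character counts from the warning above.
function ssmlLengths(ssml: string): { includingTags: number; excludingTags: number } {
  return {
    includingTags: ssml.length,
    // Remove every <...> tag, leaving just the text.
    excludingTags: ssml.replace(/<[^>]*>/g, "").length,
  };
}

// Sketch: true when the ssml fits within both documented limits.
function withinLimits(ssml: string): boolean {
  const { includingTags, excludingTags } = ssmlLengths(ssml);
  return includingTags <= 5000 && excludingTags <= 2000;
}
```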

Request

ts
type Request = {
  // Max 5000 characters including tags. Max 2000 characters excluding tags. See above for details
  ssml: string
  voice: {
    name: string
    engine: string
    language: string
  }
  // Avoid providing unless absolutely necessary
  forcedAudioFormat?: 'ogg' | 'mp3'
}
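
Putting the pieces together, a request to this endpoint might be assembled as below. This is a sketch: buildGetRequest is an illustrative helper, and it targets the development base URL listed above.

```ts
type Voice = { name: string; engine: string; language: string };

type GetRequest = {
  ssml: string;
  voice: Voice;
  forcedAudioFormat?: "ogg" | "mp3";
};

// Sketch: assemble the URL, headers, and JSON body for POST /tts/v0/get.
// The result can be passed to any HTTP client (e.g. fetch).
function buildGetRequest(apiKey: string, req: GetRequest) {
  return {
    url: "https://api.dev.speechify.ai/tts/v0/get",
    method: "POST" as const,
    headers: {
      Authorization: `ApiKey ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(req),
  };
}
```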

Response

Returned to the client is the audioData in base64, with the format indicated by the audioFormat field. The speechMarks indicate when each word in the original text is spoken in the audio, which allows the client to highlight the text on screen and seek accurately. More information on speech marks can be found in the speech marks section.

ts
type Response = {
  audioData: string
  audioFormat: 'mp3' | 'ogg'
  speechMarks: NestedChunk
}

type Chunk = {
  startTime: number
  endTime: number
  start: number
  end: number
  value: string
}

type NestedChunk = Chunk & {
  chunks: Chunk[]
}
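
For example, a client highlighting words during playback can look up the chunk being spoken at a given playback position. A sketch, assuming startTime and endTime are in the same time unit the client's playback clock uses:

```ts
type Chunk = {
  startTime: number;
  endTime: number;
  start: number;
  end: number;
  value: string;
};

type NestedChunk = Chunk & {
  chunks: Chunk[];
};

// Sketch: find the word chunk whose time range contains the given
// playback position, for on-screen highlighting.
function wordAt(marks: NestedChunk, time: number): Chunk | undefined {
  return marks.chunks.find((c) => time >= c.startTime && time < c.endTime);
}
```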

GET /tts/v0/voices

This endpoint returns an array of every voice supported by the AI API. This list is generated dynamically based on the data provided by each engine.

Response

Returned to the client is a list of all voices supported by the AI API.

ts
type Voice = {
  name: string
  engine: string
  language: string
}

type Response = Voice[]
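
A client presenting this list in a voice picker might group it by language. A sketch; the grouping key is lowercased to match the API's case-insensitive voice values:

```ts
type Voice = { name: string; engine: string; language: string };

// Sketch: group the /tts/v0/voices response by language code.
function groupByLanguage(voices: Voice[]): Map<string, Voice[]> {
  const groups = new Map<string, Voice[]>();
  for (const v of voices) {
    const key = v.language.toLowerCase();
    const list = groups.get(key) ?? [];
    list.push(v);
    groups.set(key, list);
  }
  return groups;
}
```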