TTS - Beta
The AI API's synthesis interface is identical for all of the supported models. There are three components to every request/response: content, voice, and format.
Authorization
The AI API is protected with API key authentication. You can obtain an API key by signing up at our home page.
Each request must be accompanied by an API key, which can be provided in the `Authorization` header. Requests without a valid API key will be rejected with a `401 Unauthorized` response.

```
Authorization: ApiKey YOUR_API_KEY_HERE
```
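As a sketch, a TypeScript client could build the authenticated headers like this (`authHeaders` is a hypothetical helper, not part of the API):

```ts
// Hypothetical helper: build the headers every AI API request needs.
// The Authorization scheme ("ApiKey <key>") follows the docs above.
function authHeaders(apiKey: string): Record<string, string> {
  return {
    Authorization: `ApiKey ${apiKey}`,
    'Content-Type': 'application/json',
  }
}
```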
Content
The AI API uses the standard SSML format for content. Speech Synthesis Markup Language (SSML) is an XML-based markup language that allows you to precisely control the output via additional SSML tags such as `<prosody>`, `<emotion>`, `<emphasis>`, and more. Please note that during the beta we have not implemented additional SSML tags, but plan to do so in the future. Every SSML text begins and ends with a `<speak>` tag, with the content to be synthesized contained within.

```xml
<speak>Your content to be synthesized here</speak>
```

For more information, see the SSML section.
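Since the content is XML, plain text should be escaped before being wrapped in `<speak>` tags. A minimal sketch (`toSsml` is a hypothetical helper; the escaping rules are standard XML, not something the API mandates):

```ts
// Hypothetical helper: wrap plain text in a minimal SSML document.
// Unescaped &, <, > would otherwise produce invalid XML inside <speak>.
function toSsml(text: string): string {
  const escaped = text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
  return `<speak>${escaped}</speak>`
}
```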
Voices
The voice defines which engine, and which voice within that engine, to use for synthesis; all values are case-insensitive. A full list of voices can be acquired from the voices endpoint.

```ts
type Voice = {
  name: string
  engine: string
  language: string
}
```

For more information, see the voices section.
Formats
The AI API will provide `ogg` or `mp3`, whichever offers the lowest latency and highest quality; this is almost always `ogg`. You can force the AI API to provide one or the other, but it is highly preferable that you do not. In some cases, where the end device doesn't support `ogg` (such as clients on Safari or iOS), you will unfortunately have no choice.

```ts
type AudioFormat = 'ogg' | 'mp3'
```
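One way to honor this guidance is to leave the format unset unless the client is known to lack `ogg` support. A rough sketch based on user-agent sniffing (the detection heuristic is an assumption of this example, not part of the API):

```ts
type AudioFormat = 'ogg' | 'mp3'

// Hypothetical helper: only force a format when the client likely cannot
// play ogg (Safari / iOS, per the note above). Returning undefined lets
// the API choose the best format itself.
function forcedFormatFor(userAgent: string): AudioFormat | undefined {
  const isSafariOrIOS =
    /iPad|iPhone|iPod/.test(userAgent) ||
    (/Safari/.test(userAgent) && !/Chrome|Chromium/.test(userAgent))
  return isSafariOrIOS ? 'mp3' : undefined
}
```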
Endpoints
Development (Live): https://api.dev.speechify.ai
Production (In-progress): https://api.speechify.ai
Examples
For examples of how to call the following endpoints, see the examples section.
POST /tts/v0/get
The `/get` endpoint allows clients to request a piece of SSML to be synthesized. The ideal request length is around 2-4 sentences. Single sentences are discouraged, since some models benefit from additional context for their prosody. Very large requests will be parallelized into many smaller requests internally, so latency should remain relatively consistent.
Required Headers
```
Content-Type: application/json
```
WARNING
`ssml` must be a max of 5000 characters total and a max of 2000 characters excluding the tags (just the text), for example:

```ts
ssml: "<speak>Hello world</speak>"
// Including tags: "<speak>Hello world</speak>".length = 26
// Excluding tags: "Hello world".length = 11
```
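These limits can be checked client-side before sending a request. A sketch (`validateSsmlLength` is a hypothetical helper; the tag-stripping regex assumes well-formed tags with no `>` inside attribute values):

```ts
// Hypothetical client-side check against the limits in the warning above.
function validateSsmlLength(ssml: string): {
  includingTags: number
  excludingTags: number
  ok: boolean
} {
  const includingTags = ssml.length
  // Simplified tag stripping: assumes no '>' appears inside attribute values.
  const excludingTags = ssml.replace(/<[^>]*>/g, '').length
  return {
    includingTags,
    excludingTags,
    ok: includingTags <= 5000 && excludingTags <= 2000,
  }
}
```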
Request
```ts
type Request = {
  // Max 5000 characters including tags. Max 2000 characters excluding tags. See above for details.
  ssml: string
  voice: {
    name: string
    engine: string
    language: string
  }
  // Avoid providing unless absolutely necessary
  forcedAudioFormat?: 'ogg' | 'mp3'
}
```
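Putting the pieces together, a call to `POST /tts/v0/get` might be assembled as in the sketch below, which assumes the development endpoint and a JSON body matching the `Request` type above (`buildGetRequest` and the voice values are hypothetical; fetch real voice names from the voices endpoint):

```ts
type Voice = { name: string; engine: string; language: string }

type TtsRequest = {
  // Max 5000 chars including tags, 2000 excluding (see the warning above)
  ssml: string
  voice: Voice
  // Avoid providing unless absolutely necessary
  forcedAudioFormat?: 'ogg' | 'mp3'
}

// Hypothetical helper: assemble the POST /tts/v0/get call against the
// development endpoint. Pass the result to fetch() or any HTTP client.
function buildGetRequest(
  apiKey: string,
  request: TtsRequest
): { url: string; method: string; headers: Record<string, string>; body: string } {
  return {
    url: 'https://api.dev.speechify.ai/tts/v0/get',
    method: 'POST',
    headers: {
      Authorization: `ApiKey ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(request),
  }
}
```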
Response
Returned to the client is the `audioData` in base64, with the format indicated by the `audioFormat` field. The `speechMarks` indicate when each word in the original text is spoken in the audio. This information allows the client to highlight the text on screen and seek accurately. More information on speech marks can be found in the speech marks section.
```ts
type Response = {
  audioData: string
  audioFormat: 'mp3' | 'ogg'
  speechMarks: NestedChunk
}

type Chunk = {
  startTime: number
  endTime: number
  start: number
  end: number
  value: string
}

type NestedChunk = Chunk & {
  chunks: Chunk[]
}
```
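As an illustration of how `speechMarks` supports highlighting, the sketch below finds the word being spoken at a given playback time (`wordAt` is a hypothetical helper; it assumes the chunks are ordered, non-overlapping, and use the same time units as `startTime`/`endTime`):

```ts
type Chunk = {
  startTime: number
  endTime: number
  start: number
  end: number
  value: string
}

type NestedChunk = Chunk & { chunks: Chunk[] }

// Hypothetical helper: find the word chunk active at `time`, e.g. to
// highlight that word on screen during playback.
function wordAt(marks: NestedChunk, time: number): Chunk | undefined {
  return marks.chunks.find((c) => time >= c.startTime && time < c.endTime)
}
```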
GET /tts/v0/voices
This endpoint returns an array of every voice supported by the AI API. This list is generated dynamically based on the data provided by each engine.
Response
Returned to the client is a list of all voices supported by the AI API.
```ts
type Voice = {
  name: string
  engine: string
  language: string
}

type Response = Voice[]
```
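A sketch of consuming this endpoint and narrowing the list by language, assuming the development endpoint and the `Voice` shape above (`listVoices` and `filterByLanguage` are hypothetical helpers):

```ts
type Voice = { name: string; engine: string; language: string }

// Voice values are case-insensitive per the Voices section, so compare
// languages case-insensitively.
function filterByLanguage(voices: Voice[], language: string): Voice[] {
  return voices.filter((v) => v.language.toLowerCase() === language.toLowerCase())
}

// Hypothetical helper: fetch the full voice list from the development endpoint.
async function listVoices(apiKey: string): Promise<Voice[]> {
  const res = await fetch('https://api.dev.speechify.ai/tts/v0/voices', {
    headers: { Authorization: `ApiKey ${apiKey}` },
  })
  if (!res.ok) throw new Error(`voices request failed: ${res.status}`)
  return res.json()
}
```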