Speech Marks

The speech marks returned with every synthesis request are a mapping between time and text. They tell the client when each word is spoken in the audio, for purposes such as highlighting, seeking and tracking usage.

ts
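type Chunk = {
  startTime: number // time in the audio at which the value starts
  endTime: number // time in the audio at which the value ends
  start: number // character offset of the value in the SSML
  end: number // character offset just past the value (exclusive in the example output)
  value: string // the text spoken, as it appears in the SSML
}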

type NestedChunk = Chunk & {
  chunks: Chunk[]
}

Typical Gotchas

  • The values are based on the SSML, so any escaping of &, < and > is reflected in the value, start and end fields. Consider using a string tracker library to help map these back to the original text.
  • The start and end values of consecutive words may have gaps between them. To find the word at an index, look for the word whose start is >= yourIndex, rather than checking whether the index falls within the bounds of both start and end.
  • The startTime and endTime of consecutive words may also have gaps. Follow the same advice as above.
  • The startTime of the first word is not necessarily 0, unlike that of the NestedChunk. Silence at the beginning of the sentence can cause the first word to start partway through.
  • The endTime of the last word does not necessarily match the endTime of the NestedChunk. Silence at the end of the NestedChunk can make it longer.
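
The lookup and escaping advice above can be sketched roughly as follows. This is a minimal illustration, not part of the API: `findChunkByIndex`, `findChunkByTime` and `escapeSsml` are hypothetical helper names, the lookups assume the words are sorted by position and return the first match for the `>=` rule, and `escapeSsml` only shows how escaping shifts character offsets.

```ts
type Chunk = {
  startTime: number
  endTime: number
  start: number
  end: number
  value: string
}

// Hypothetical helper: returns the first word whose `start` is at or after
// the given character index. Assumes `chunks` is sorted by `start`.
// Returns undefined when the index lies past the last word.
function findChunkByIndex(chunks: Chunk[], index: number): Chunk | undefined {
  return chunks.find((chunk) => chunk.start >= index)
}

// Hypothetical helper: the same rule applied to playback time, using
// `startTime` instead of `start`. Assumes `chunks` is sorted by `startTime`.
function findChunkByTime(chunks: Chunk[], time: number): Chunk | undefined {
  return chunks.find((chunk) => chunk.startTime >= time)
}

// Hypothetical helper: escapes &, < and > the way SSML text requires.
// `&` is replaced first so the entities added afterwards are not re-escaped.
function escapeSsml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
}
```

With the first two words of a sentence such as "This is", index 4 falls in the gap between `end: 4` and `start: 5`, so a strict bounds check finds nothing while `findChunkByIndex(words, 4)` returns the chunk for 'is'. Escaping moves offsets as well: in 'Tom & Jerry' the word 'Jerry' begins at index 6, but in `escapeSsml('Tom & Jerry')` it begins at index 10, which is the kind of shift a string tracker library helps to undo.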

Example output

ts
const chunk: NestedChunk = {
  start: 0,
  end: 79,
  startTime: 0,
  endTime: 4292.58,
  value: 'This is a sentence used for testing with some text on the end to make it longer',
  chunks: [
    {
      start: 0,
      end: 4,
      startTime: 125,
      endTime: 250,
      value: 'This',
    },
    {
      start: 5,
      end: 7,
      startTime: 250,
      endTime: 375,
      value: 'is',
    },
    {
      start: 8,
      end: 9,
      startTime: 375,
      endTime: 500,
      value: 'a',
    },
    {
      start: 10,
      end: 18,
      startTime: 500,
      endTime: 937,
      value: 'sentence',
    },
    {
      start: 19,
      end: 23,
      startTime: 937,
      endTime: 1200,
      value: 'used',
    },
    {
      start: 24,
      end: 27,
      startTime: 1200,
      endTime: 1375,
      value: 'for',
    },
    {
      start: 28,
      end: 35,
      startTime: 1375,
      endTime: 1775,
      value: 'testing',
    },
    {
      start: 36,
      end: 40,
      startTime: 1775,
      endTime: 1937,
      value: 'with',
    },
    {
      start: 41,
      end: 45,
      startTime: 1937,
      endTime: 2125,
      value: 'some',
    },
    {
      start: 46,
      end: 50,
      startTime: 2125,
      endTime: 2500,
      value: 'text',
    },
    {
      start: 51,
      end: 53,
      startTime: 2500,
      endTime: 2625,
      value: 'on',
    },
    {
      start: 54,
      end: 57,
      startTime: 2625,
      endTime: 2850,
      value: 'the',
    },
    {
      start: 58,
      end: 61,
      startTime: 2850,
      endTime: 3000,
      value: 'end',
    },
    {
      start: 62,
      end: 64,
      startTime: 3000,
      endTime: 3125,
      value: 'to',
    },
    {
      start: 65,
      end: 69,
      startTime: 3125,
      endTime: 3312,
      value: 'make',
    },
    {
      start: 70,
      end: 72,
      startTime: 3312,
      endTime: 3437,
      value: 'it',
    },
    {
      start: 73,
      end: 79,
      startTime: 3437,
      endTime: 4292.58,
      value: 'longer',
    },
  ],
}
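
To make the silence at either end of a NestedChunk concrete, the sketch below measures how far the first and last words sit from the NestedChunk's own bounds. `silenceAround` is a hypothetical helper written for illustration only, and its results are in whatever units `startTime` and `endTime` use.

```ts
type Chunk = {
  startTime: number
  endTime: number
  start: number
  end: number
  value: string
}

type NestedChunk = Chunk & {
  chunks: Chunk[]
}

// Hypothetical helper: measures the audio before the first word and after
// the last word of a NestedChunk.
function silenceAround(nested: NestedChunk): { leading: number; trailing: number } {
  const first = nested.chunks[0]
  const last = nested.chunks[nested.chunks.length - 1]
  // With no words there is nothing to measure against.
  if (!first || !last) return { leading: 0, trailing: 0 }
  return {
    leading: first.startTime - nested.startTime,
    trailing: nested.endTime - last.endTime,
  }
}
```

For the example output above, this gives a leading value of 125 and a trailing value of 0: the first word starts at 125 while the NestedChunk starts at 0, and the last word ends at 4292.58, the same as the NestedChunk.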