Overview
Both AUTOPILOT and EXTERNAL applications use speech APIs. The sections below show how to configure speech in Fonoster through the speechToText and textToSpeech objects.
Configuring speech-to-text
The speechToText object defines the speech-to-text engine, which converts the caller’s speech into text. It has two properties: productRef, which identifies the speech-to-text vendor you want to use, and config, an object that contains the engine’s configuration settings. The configuration settings vary depending on the vendor. Currently, only Deepgram is supported as a speech-to-text vendor, but we are working on adding more vendors.
Deepgram configuration
Deepgram is a speech-to-text vendor that provides high-quality transcription services. Deepgram supports the languageCode and model properties. The languageCode property is the language code of the speech you want to transcribe. The model property is the model to use for transcription and defaults to nova-2-phonecall.
The Autopilot supports the models nova-2, nova-2-phonecall, nova-2-conversationalai, and nova-3.
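As a rough illustration of the default and the supported models described above, the following sketch resolves the model for a Deepgram config. The names DeepgramSttConfig and resolveDeepgramModel are hypothetical helpers written for this page and are not part of the Fonoster SDK; only the model names and the default come from the text above.

```typescript
// Hypothetical local helper, not a Fonoster SDK API.
// Model names and the default are taken from the description above.
const SUPPORTED_DEEPGRAM_MODELS: readonly string[] = [
  "nova-2",
  "nova-2-phonecall",
  "nova-2-conversationalai",
  "nova-3"
];

const DEFAULT_DEEPGRAM_MODEL = "nova-2-phonecall";

interface DeepgramSttConfig {
  languageCode: string; // language of the speech to transcribe
  model?: string;       // optional; falls back to the default when omitted
}

// Returns the model that would be used for the given config,
// rejecting any model outside the supported list.
function resolveDeepgramModel(config: DeepgramSttConfig): string {
  const model = config.model ?? DEFAULT_DEEPGRAM_MODEL;
  if (SUPPORTED_DEEPGRAM_MODELS.indexOf(model) === -1) {
    throw new Error(`Unsupported Deepgram model: ${model}`);
  }
  return model;
}
```

A check like this can catch a mistyped model name before the configuration is sent to the server.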
Example of a Deepgram configuration for Spanish:
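The sketch below assumes stt.deepgram as the productRef and es-ES as the Spanish language code. Both are illustrative values rather than confirmed identifiers, so check them against your deployment. The model is set to nova-2 because a languageCode other than en-US requires it.

```typescript
// Sketch of a speechToText object for Spanish transcription.
// Assumptions: the productRef "stt.deepgram" and the language code "es-ES"
// are illustrative values, not confirmed by this page.
const speechToText = {
  productRef: "stt.deepgram",
  config: {
    languageCode: "es-ES",
    model: "nova-2" // required for any languageCode other than en-US
  }
};
```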
For a languageCode other than en-US, you need to use the nova-2 model.
Please refer to the Deepgram documentation for more information.
Configuring text-to-speech
The textToSpeech object defines the text-to-speech engine, which converts the Autopilot’s responses into speech. It has two properties: productRef, which identifies the text-to-speech vendor you want to use, and config, an object that contains the engine’s configuration settings. The configuration settings vary depending on the vendor. We currently support Google, Azure, Deepgram, and ElevenLabs as text-to-speech vendors.
Most vendors support only the voice property, a string that names the voice to use. The available voices depend on the vendor. Please visit the vendor’s documentation for more information on the available voices.
In addition to the voice property, the ElevenLabs vendor supports the model property, which selects the model to use for text-to-speech and defaults to eleven_flash_v2_5.
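A minimal sketch of a textToSpeech object for ElevenLabs follows. It assumes tts.elevenlabs as the productRef and uses YOUR_VOICE_ID as a placeholder voice. Both are illustrative, and the withDefaultModel helper is hypothetical: it only mirrors the documented default locally and is not part of the Fonoster SDK.

```typescript
// Hypothetical shape and helper for illustration; not a Fonoster SDK API.
interface TextToSpeech {
  productRef: string;
  config: {
    voice: string;  // vendor-specific voice name or identifier
    model?: string; // ElevenLabs only; optional
  };
}

// Default ElevenLabs model as stated in the text above.
const DEFAULT_ELEVENLABS_MODEL = "eleven_flash_v2_5";

// Returns a copy of the config with the documented default model filled in.
function withDefaultModel(tts: TextToSpeech): TextToSpeech {
  return {
    productRef: tts.productRef,
    config: {
      voice: tts.config.voice,
      model: tts.config.model ?? DEFAULT_ELEVENLABS_MODEL
    }
  };
}

// Assumptions: "tts.elevenlabs" and "YOUR_VOICE_ID" are placeholder values.
const textToSpeech: TextToSpeech = {
  productRef: "tts.elevenlabs",
  config: { voice: "YOUR_VOICE_ID" }
};

const resolvedTextToSpeech = withDefaultModel(textToSpeech);
```

For the other vendors, the config would carry only the voice property.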
Please refer to the ElevenLabs documentation for additional information about the available models.
Available voices by vendor
The following links provide information on the available voices for each vendor:
If you need a non-default ElevenLabs voice, please let us know, and we will add it for you.