This documentation is actively being improved. You may encounter gaps or incomplete sections as we refine and expand the content. We appreciate your understanding and welcome any feedback to help us make this resource even better!

The Autopilot is currently in preview mode and only available via the SDK, and the KnowledgeBase features have been disabled.

Fonoster’s Autopilot is a component within the platform that allows you to create powerful conversational experiences. It is built on top of Fonoster Programmable Voice and uses the latest advances in large language models (LLMs) to provide a natural and engaging experience.

Overview

The following is an example of how to create an Autopilot application using the SDK:

const SDK = require("@fonoster/sdk");

const client = new SDK.Client({ accessKeyId: "WO000000-0000-0000-0000-000000000000" });

const appConfig = {
  name: "Dr. Green's AI Assistant",
  type: "AUTOPILOT",
  speechToText: {
    productRef: "stt.deepgram",
    config: {
      languageCode: "en-US"
    }
  },
  textToSpeech: {
    productRef: "tts.deepgram",
    config: {
      voice: "aura-asteria-en"
    }
  },
  intelligence: {
    productRef: "llm.groq",
    config: {
      conversationSettings: {
        firstMessage: "Hello, this is Olivia from Dr. Green's Family Medicine. How can I assist you today?",
        systemTemplate: "You are a Customer Service Representative. You are here to help the caller with their needs.",
        goodbyeMessage: "Goodbye, have a great day!",
        systemErrorMessage: "I'm sorry, I didn't understand that. Can you please repeat it?",
        idleOptions: {
          "message": "Are you still there?"
        }
      },
      languageModel: {
        provider: "groq",
        model: "llama-3.3-70b-specdec",
        maxTokens: 250,
        temperature: 0.7
      }
    }
  }
};

// Log in with an API key and secret, then create the application.
client.loginWithApiKey("AP0eerv2g7qow3e950k7twu4rvydcunq3k", "fNc...")
  .then(() => new SDK.Applications(client).createApplication(appConfig))
  .catch(console.error);

General configuration

The Autopilot configuration is divided into a general section and three sub-sections: speechToText, textToSpeech, and intelligence.

The general section contains name, type, and endpoint properties.

The name property is the name of the Autopilot application. The type property is the type of the application, which should always be set to AUTOPILOT. The endpoint is an optional property allowing you to specify the endpoint for self-hosted Autopilots.
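
As a rough sketch, the general section for a self-hosted Autopilot might look like the following. The endpoint value is a hypothetical placeholder, and the address format it expects is an assumption; the sub-sections reuse the values from the overview example.

```javascript
// General section of an Autopilot application.
// ASSUMPTION: the endpoint value below is a made-up placeholder; the exact
// address format expected for self-hosted Autopilots is not confirmed here.
const generalConfig = {
  name: "Dr. Green's AI Assistant",
  type: "AUTOPILOT", // always AUTOPILOT for Autopilot applications
  endpoint: "autopilot.example.com:50061" // optional; only for self-hosted Autopilots
};

// The sub-sections sit alongside the general properties.
const appConfig = {
  ...generalConfig,
  speechToText: { productRef: "stt.deepgram", config: { languageCode: "en-US" } },
  textToSpeech: { productRef: "tts.deepgram", config: { voice: "aura-asteria-en" } }
  // intelligence: { ... } as in the overview example
};
```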

Configuring speech-to-text

The speechToText object allows you to define the speech-to-text engine to use. The speech-to-text engine is responsible for converting the caller’s speech into text.

The speechToText object has the productRef and config properties. The productRef property identifies the speech-to-text vendor you want to use. The config property is an object that contains the configuration settings for the speech-to-text engine. The configuration settings vary depending on the vendor.

Currently, only Deepgram is supported as a speech-to-text vendor, but we are working on adding more vendors.

Deepgram configuration

Deepgram is a speech-to-text vendor that provides high-quality transcription services. Its configuration supports the languageCode and model properties. The languageCode property is the language code of the speech you want to transcribe. The model property is the model to use for transcription and defaults to nova-2-phonecall.

The Autopilot supports the models nova-2, nova-2-phonecall, and nova-2-conversationalai.

Example of a Deepgram configuration for Spanish:

const appConfig = {
  ...
  speechToText: {
    productRef: "stt.deepgram",
    config: {
      model: "nova-2",
      languageCode: "es"
    }
  },
  ...
}

For a languageCode other than en-US, you need to use the nova-2 model. Please refer to the Deepgram documentation for more information.
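
The model rule above can be captured in a small helper. This is only a sketch: deepgramSpeechToText is a hypothetical function name, not part of the Fonoster SDK, and it simply falls back to nova-2 for any language other than en-US.

```javascript
// Hypothetical helper (not part of the Fonoster SDK): builds a Deepgram
// speechToText section that follows the language/model rule above.
function deepgramSpeechToText(languageCode) {
  // nova-2-phonecall is the documented default; other languages need nova-2.
  const model = languageCode === "en-US" ? "nova-2-phonecall" : "nova-2";
  return {
    productRef: "stt.deepgram",
    config: { model, languageCode }
  };
}

const spanish = deepgramSpeechToText("es");
const english = deepgramSpeechToText("en-US");
```

The returned object can be assigned directly to the speechToText property of the application configuration.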

Configuring text-to-speech

The textToSpeech object allows you to define the text-to-speech engine. The text-to-speech engine is responsible for converting the Autopilot’s responses into speech.

The textToSpeech object has the productRef and config properties. The productRef property identifies the text-to-speech vendor you want to use. The config property is an object that contains the configuration settings for the text-to-speech engine. The configuration settings vary depending on the vendor.

We currently support Google, Azure, Deepgram, and ElevenLabs as text-to-speech vendors.

Most vendors only support the voice property, a string that identifies the voice to use for text-to-speech. The available voices depend on the vendor.

Please visit the vendor’s documentation for more information on the available voices.

In addition to the voice property, the ElevenLabs vendor supports the model property. The model property is the model to use for text-to-speech and defaults to eleven_flash_v2_5. Please refer to the ElevenLabs documentation for additional information about the available models.

Example of a text-to-speech configuration for ElevenLabs:

const appConfig = {
  ...
  textToSpeech: {
    productRef: "tts.elevenlabs",
    config: {
      voice: "CaJslL1xziwefCeTNzHv",
      model: "eleven_flash_v2_5"
    }
  },
  ...
}

Available voices by vendor

The following links provide information on the available voices for each vendor:

If you need a non-default ElevenLabs voice, please let us know, and we will add it for you.

Conversational settings

The conversationSettings object allows you to define the Autopilot’s conversational behavior. The conversation settings are independent of the language model used.

The following is a list of the supported settings:

| Setting | Description |
| --- | --- |
| firstMessage | The first message the Autopilot will say when the conversation starts |
| systemTemplate | A template that describes the role of the Autopilot. This is used to set the context of the conversation |
| systemErrorMessage | The message the Autopilot will say when an error occurs |
| maxSpeechWaitTimeout | The maximum time in milliseconds to wait for the caller before returning the speech-to-text result. Defaults to 10000 ms |
| initialDtmf | A DTMF tone to play when the conversation starts |
| transferOptions | The options to transfer the call to a live agent |
| transferOptions.phoneNumber | The phone number to transfer the call to |
| transferOptions.message | The message to play before transferring the call |
| transferOptions.timeout | The time in milliseconds to wait before hanging up the call if the transfer is incomplete. Defaults to 30000 ms |
| idleOptions | The options to handle idle time during the conversation |
| idleOptions.message | The message to play after the idle time is reached |
| idleOptions.timeout | The time in milliseconds to wait before playing the idle message. Defaults to 10000 ms |
| idleOptions.maxTimeoutCount | The maximum number of times the idle message will be played before hanging up the call |
| vad | The voice activity detection settings |
| vad.activationThreshold | The activation threshold for the voice activity detection. Defaults to 0.3 |
| vad.deactivationThreshold | The deactivation threshold for the voice activity detection. Defaults to 0.25 |
| vad.debounceFrames | The number of frames to debounce the voice activity detection. Defaults to 3 |
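
As an illustration, a conversationSettings object that combines several of these settings might look like the sketch below. The phone number, messages, and the maxTimeoutCount value are placeholders chosen for the example; the timeouts repeat the documented defaults.

```javascript
// Sketch of intelligence.config.conversationSettings.
// Phone number, messages, and maxTimeoutCount are illustrative placeholders.
const conversationSettings = {
  firstMessage: "Hello, this is Olivia from Dr. Green's Family Medicine. How can I assist you today?",
  systemTemplate: "You are a Customer Service Representative. You are here to help the caller with their needs.",
  systemErrorMessage: "I'm sorry, I didn't understand that. Can you please repeat it?",
  maxSpeechWaitTimeout: 10000, // ms; documented default
  transferOptions: {
    phoneNumber: "+15555550100", // placeholder number
    message: "Please hold while I transfer you to a live agent.",
    timeout: 30000 // ms; documented default
  },
  idleOptions: {
    message: "Are you still there?",
    timeout: 10000, // ms; documented default
    maxTimeoutCount: 2 // illustrative value
  }
};
```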

A few noteworthy settings include maxSpeechWaitTimeout, initialDtmf, idleOptions, and vad.

Max Speech Wait Timeout

The maxSpeechWaitTimeout property allows you to specify the maximum time in milliseconds to wait for the caller before returning the speech-to-text result. If the caller does not speak within the specified time, the speech-to-text engine will return the result.

A value that is too low may result in the speech-to-text engine returning the result before the caller finishes speaking. A value that is too high may result in the speech-to-text engine waiting too long for the caller to speak.

Initial DTMF

Sometimes, users will use call forwarding to reach the number in Fonoster. Some telephony service providers require a dual-tone multi-frequency (DTMF) tone to be played before the call is connected. The initialDtmf property allows you to specify a DTMF tone to play when the session starts.
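
For instance, if the forwarding provider expects a digit before connecting the call, the setting might look like the sketch below. Both the digit and the string format are assumptions made for illustration; check what your provider requires.

```javascript
// Fragment of intelligence.config.conversationSettings.
// ASSUMPTION: initialDtmf is written as a string of DTMF digits; the value "1"
// is purely illustrative.
const conversationSettings = {
  firstMessage: "Hello, how can I assist you today?",
  initialDtmf: "1"
};
```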

Voice Activity Detection (VAD)

The vad object allows you to configure the voice activity detection settings. Voice activity detection is used to detect when the caller is speaking and when they are not speaking.

The vad object has the activationThreshold, deactivationThreshold, and debounceFrames properties. The activationThreshold property is the activation threshold for voice activity detection. The deactivationThreshold property is the deactivation threshold for voice activity detection. The debounceFrames property is the number of frames to debounce the voice activity detection.

A lower activation threshold will make the detection more sensitive to the caller’s speech. A higher activation threshold will result in the voice activity detection being less sensitive to the caller’s speech.

A lower deactivation threshold will result in more aggressive voice activity detection deactivation. A higher deactivation threshold will result in less aggressive voice activity detection deactivation.

The debounceFrames parameter introduces a delay mechanism that ensures that transitions between “speech” and “non-speech” states are stable and not too sensitive to small fluctuations in the input audio signal. Here’s how it works:

By requiring multiple consecutive frames (debounceFrames) to confirm speech or non-speech, the system filters out short bursts of noise or brief gaps in speech that might otherwise cause erratic state changes.
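
The debouncing idea can be sketched as a small state machine. This is an illustration only, not Fonoster's actual implementation. It assumes each audio frame comes with a speech score between 0 and 1, that speech starts once the score reaches activationThreshold, and that it ends once the score falls below deactivationThreshold; detectSpeech and its input array are hypothetical.

```javascript
// Illustrative debounced VAD state machine; NOT Fonoster's implementation.
// `scores` is a hypothetical array of per-frame speech scores in [0, 1].
function detectSpeech(scores, {
  activationThreshold = 0.3,
  deactivationThreshold = 0.25,
  debounceFrames = 3
} = {}) {
  let speaking = false;
  let streak = 0; // consecutive frames that contradict the current state
  const states = [];

  for (const score of scores) {
    // A frame contradicts the current state when it crosses the relevant threshold.
    const contradicts = speaking
      ? score < deactivationThreshold
      : score >= activationThreshold;
    streak = contradicts ? streak + 1 : 0;

    // Flip the state only after debounceFrames consecutive contradicting frames.
    if (streak >= debounceFrames) {
      speaking = !speaking;
      streak = 0;
    }
    states.push(speaking);
  }
  return states;
}

// The isolated 0.9 at index 1 and the isolated 0.6 at index 7 do not change the state.
const states = detectSpeech([0.1, 0.9, 0.1, 0.8, 0.7, 0.9, 0.1, 0.6, 0.1, 0.1, 0.1]);
```

With the default values, the state switches to speech only at the third consecutive high-scoring frame and back to non-speech only at the third consecutive low-scoring frame.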

Language model configuration

The languageModel object allows you to define the language model the Autopilot uses. The language model is responsible for generating responses to the user’s input.

The following is a list of the supported settings:

| Setting | Description |
| --- | --- |
| provider | The provider of the language model. Supported providers are openai, groq, and ollama |
| model | The model to use. The available models depend on the provider |
| maxTokens | The maximum number of tokens the language model can generate in a single response |
| temperature | The randomness of the language model. A higher temperature will result in more random responses |
| knowledgeBase | A list of knowledge bases to use for the language model |
| tools | A list of tools to use for the language model |

LLM providers and models

The Autopilot supports multiple language model providers. The following is a list of the supported providers:

| Provider | Description | Supported models |
| --- | --- | --- |
| OpenAI | OpenAI provides various GPT models for conversational AI | gpt-4o, gpt-4o-mini, gpt-3.5-turbo, gpt-4-turbo |
| Groq | Groq offers high-performance AI models optimized for speed | llama-3.1-8b-instant, llama-3.3-70b-specdec, llama-3.3-70b-versatile |
| Ollama | Self-hosted Ollama models | llama3-groq-tool-use |

We are constantly updating the list of supported providers and models. Please let us know if you have a specific model you want to use.
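
Because the supported list changes over time, it can help to check a languageModel object before creating the application. The sketch below copies the models from the table above into a lookup; isSupportedModel is a hypothetical helper, not part of the Fonoster SDK, and the lookup will go stale as models are added.

```javascript
// Supported models per provider, copied from the table above.
// This lookup is a snapshot and may go out of date.
const SUPPORTED_MODELS = new Map([
  ["openai", ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo", "gpt-4-turbo"]],
  ["groq", ["llama-3.1-8b-instant", "llama-3.3-70b-specdec", "llama-3.3-70b-versatile"]],
  ["ollama", ["llama3-groq-tool-use"]]
]);

// Hypothetical helper (not part of the Fonoster SDK): returns true when the
// provider/model pair appears in the lookup.
function isSupportedModel({ provider, model }) {
  return (SUPPORTED_MODELS.get(provider) || []).includes(model);
}

const languageModel = {
  provider: "openai",
  model: "gpt-4o-mini",
  maxTokens: 250,
  temperature: 0.7
};

console.log(isSupportedModel(languageModel)); // true
console.log(isSupportedModel({ provider: "groq", model: "gpt-4o" })); // false
```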

We have noticed that Groq models, particularly llama-3.3-70b-versatile, often require greater prompting specificity for effective tool usage. We will share best practices to ensure more consistent behavior as we gain more insights.

Knowledge bases

Coming soon…

Tools

Fonoster’s Autopilot allows you to use tools to enhance the conversational experience. Tools are used to perform specific actions during the conversation.

Built-in tools

The following is a list of built-in tools available for an agent:

| Tool | Description |
| --- | --- |
| hangup | A tool to end the conversation |
| transfer | A tool to transfer the call to a live agent |
| hold | A tool to put the call on hold (Coming soon) |

Custom tools

You can add custom tools to the language model by adding an entry to the tools array. Custom tools are governed by the tool schema.

A custom tool to get available appointment times would look as follows:

{
  "name": "getAvailableTimes",
  "description": "Get available appointment times for a specific date.",
  "requestStartMessage": "I'm looking for available appointment times for the date you provided.",
  "parameters": {
    "type": "object",
    "properties": {
      "date": {
        "type": "string",
        "format": "date"
      }
    },
    "required": [
      "date"
    ]
  },
  "operation": {
    "type": "get",
    "url": "https://api.example.com/appointment-times",
    "headers": {
      "x-api-key": "your-api-key"
    }
  }
}

Use operation.type “post” for POST requests. If you want the Autopilot to wait for POST requests to complete, set operation.waitForResponse to true. For “get” requests, the Autopilot will wait for the response by default.
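
As a sketch of the POST case, the tool below sets operation.waitForResponse and is then added to the tools array of a languageModel object. The tool name, URL, parameters, and messages are hypothetical placeholders that mirror the appointment example above.

```javascript
// Hypothetical POST tool; name, URL, parameters, and messages are placeholders.
const createAppointmentTool = {
  name: "createAppointment",
  description: "Create an appointment for a specific date and time.",
  requestStartMessage: "One moment while I book your appointment.",
  parameters: {
    type: "object",
    properties: {
      date: { type: "string", format: "date" },
      time: { type: "string" }
    },
    required: ["date", "time"]
  },
  operation: {
    type: "post",
    url: "https://api.example.com/appointments",
    waitForResponse: true, // wait for the POST request to complete
    headers: { "x-api-key": "your-api-key" }
  }
};

// Custom tools are listed in the tools array of the languageModel object.
const languageModel = {
  provider: "groq",
  model: "llama-3.3-70b-versatile",
  maxTokens: 250,
  temperature: 0.7,
  tools: [createAppointmentTool]
};
```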