Autopilot
Voice applications powered by LLMs.
The Autopilot is currently in preview mode, and the KnowledgeBase features have been disabled.
Fonoster’s Autopilot is a component within the platform that allows you to create powerful conversational experiences. It is built on top of Fonoster Programmable Voice and uses the latest advances in Large Language Models (LLMs) to provide a natural and engaging experience.
Overview
The following is an example of creating an Autopilot application using the SDK.
First, add the following content to a file named `appConfig.yaml`:
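A minimal `appConfig.yaml` might look like the sketch below. The values are placeholders, and the `productRef` field names and the exact nesting of the `intelligence.config` section are assumptions that follow the property paths referenced later in this guide:

```yaml
# appConfig.yaml (sketch with placeholder values)
name: My Autopilot
type: AUTOPILOT
speechToText:
  productRef: stt.deepgram        # engine selection; see Speech settings below
textToSpeech:
  productRef: tts.google          # engine selection; see Speech settings below
intelligence:
  productRef: llm.openai          # assumed reference format
  config:
    conversationSettings:
      firstMessage: "Hi! Thanks for calling Acme Inc. How can I help you?"
      systemPrompt: "You are a friendly receptionist. Keep answers short."
    languageModel:
      provider: openai
      model: gpt-4o-mini
      maxTokens: 250
      temperature: 0.4
```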
Then, create the application as follows:
Similarly, to update the application, you can use the `applications:update` command with the `from-file` flag.
General configuration
The Autopilot configuration is divided into a general section and three sub-sections: speechToText, textToSpeech, and intelligence.
The general section contains name, type, and endpoint properties.
The name property is the name of the Autopilot application. The type property is the type of the application, which should always be set to `AUTOPILOT`. The endpoint is an optional property allowing you to specify the endpoint for self-hosted Autopilots.
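For example, a general section might look like this sketch (the `endpoint` value and its host:port format are illustrative assumptions):

```yaml
# General section (sketch); endpoint is only needed for self-hosted Autopilots
name: My Autopilot
type: AUTOPILOT
endpoint: autopilot.example.com:50061   # assumed host:port format
```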
Speech settings
Autopilot applications support a variety of speech-to-text and text-to-speech vendors. The speechToText and textToSpeech objects allow you to define the speech-to-text and text-to-speech engines to use.
You can mix and match vendors to suit your needs. For example, you can use Deepgram for speech-to-text and Google for text-to-speech. Please check the Speech Vendors section for more information on configuring speech-to-text and text-to-speech.
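As a sketch, mixing Deepgram for speech-to-text with Google for text-to-speech might look like the following; the `productRef` values and `config` fields are assumptions, so refer to the Speech Vendors section for the exact options supported by each vendor:

```yaml
# Speech section (sketch; field names are assumptions)
speechToText:
  productRef: stt.deepgram
  config:
    languageCode: en-US
textToSpeech:
  productRef: tts.google
  config:
    voice: en-US-Neural2-F
```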
Conversational settings
The conversationSettings object allows you to define the Autopilot’s conversational behavior. The conversation settings are independent of the language model used.
The following is a list of the supported settings:
Setting | Description | Default Value |
---|---|---|
firstMessage | The first message the Autopilot will say when the conversation starts | |
systemPrompt | A prompt that describes the behavior of the Autopilot and sets the context of the conversation | |
systemErrorMessage | The message the Autopilot will say when an error occurs | |
maxSessionDuration | Maximum length of the session (in milliseconds) before it is automatically terminated, regardless of activity | 1800000 (30 minutes) |
maxSpeechWaitTimeout | Specifies the maximum amount of time (in milliseconds) to wait for the user to begin speaking before sending the captured audio for processing | 0 |
initialDtmf | A DTMF to play prior to starting the conversation | |
allowUserBargeIn | Determines whether the user can interrupt the voice agent while it is speaking | true |
transferOptions | The options to transfer the call to a live agent | |
transferOptions.phoneNumber | The phone number to transfer the call to | |
transferOptions.message | The message to play before transferring the call | |
transferOptions.timeout | Time to wait (in milliseconds) for a transfer answer before the transfer attempt is considered failed | 30000 |
idleOptions | The options to handle idle time during the conversation | |
idleOptions.message | The message to play after the idle time is reached | |
idleOptions.timeout | Duration of user inactivity (in milliseconds) before the system triggers an idle event | 30000 |
idleOptions.maxTimeoutCount | The maximum number of times the idle message will be played before hanging up the call | 2 |
vad | The voice activity detection settings | |
vad.activationThreshold | See VAD section | 0.4 |
vad.deactivationThreshold | See VAD section | 0.25 |
vad.debounceFrames | See VAD section | 4 |
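For reference, a `conversationSettings` block combining several of these options might look like the following sketch; the messages and phone number are placeholders, and most numeric values shown are the defaults from the table above:

```yaml
conversationSettings:
  firstMessage: "Hi! Thanks for calling Acme Inc. How can I help you?"
  systemPrompt: "You are a friendly receptionist. Keep your answers short."
  systemErrorMessage: "Sorry, something went wrong. Please try again later."
  maxSessionDuration: 1800000     # 30 minutes
  maxSpeechWaitTimeout: 0
  initialDtmf: "1"
  allowUserBargeIn: true
  transferOptions:
    phoneNumber: "+19195551234"
    message: "Please hold while I transfer you to an agent."
    timeout: 30000
  idleOptions:
    message: "Are you still there?"
    timeout: 30000
    maxTimeoutCount: 2
  vad:
    activationThreshold: 0.4
    deactivationThreshold: 0.25
    debounceFrames: 4
```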
A few noteworthy settings include the maxSpeechWaitTimeout, initialDtmf, idleOptions, and vad.
Max Speech Wait Timeout
The maxSpeechWaitTimeout property allows you to specify the maximum time in milliseconds to wait for the caller before returning the speech-to-text result. If the caller does not speak within the specified time, the speech-to-text engine will return the result.
A value that is too low may result in the speech-to-text engine returning the result before the caller finishes speaking. A value that is too high may result in the speech-to-text engine waiting too long for the caller to speak.
Initial DTMF
Sometimes, users will use call forwarding to reach the number in Fonoster. Some telephony service providers require a Dual-tone multi-frequency (DTMF) to be played before connecting the call. The initialDtmf property allows you to specify a DTMF to play when the session starts.
Voice Activity Detection (VAD)
The vad object allows you to configure the voice activity detection settings. Voice activity detection is used to detect when the caller is speaking and when they are not speaking.
The vad object has activationThreshold, deactivationThreshold, and debounceFrames properties. The activationThreshold property is the activation threshold for voice activity detection. The deactivationThreshold property is the deactivation threshold for voice activity detection. The debounceFrames property is the number of frames used to debounce the voice activity detection.
A lower activation threshold will make the detection more sensitive to the caller’s speech. A higher activation threshold will make detecting voice activity less sensitive to the caller’s speech.
A lower deactivation threshold will result in more aggressive voice activity detection deactivation. A higher deactivation threshold will result in less aggressive voice activity detection deactivation.
The debounceFrames parameter introduces a delay mechanism that ensures that transitions between “speech” and “non-speech” states are stable and not too sensitive to small fluctuations in the input audio signal. Here’s how it works:
By requiring multiple consecutive frames (debounceFrames) to confirm speech or non-speech, the system filters out short bursts of noise or brief gaps in speech that might otherwise cause erratic state changes.
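For example, to make detection more sensitive to quiet speakers while smoothing out brief pauses, you might lower the activation threshold and raise the debounce frames relative to the defaults (illustrative values):

```yaml
vad:
  activationThreshold: 0.3     # lower than the 0.4 default: more sensitive to speech
  deactivationThreshold: 0.25  # default
  debounceFrames: 6            # higher than the 4 default: more stable state transitions
```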
Language model configuration
The languageModel object allows you to define the language model the Autopilot uses. The language model is responsible for generating responses to the user’s input.
The following is a list of the supported settings:
Setting | Description |
---|---|
provider | Model provider |
model | The model to use. The available models depend on the provider |
maxTokens | The maximum number of tokens the language model can generate in a single response |
temperature | The randomness of the language model. A higher temperature will result in more random responses |
knowledgeBase | A list of knowledge bases to use for the language model |
tools | A list of tools to use for the language model |
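Putting these settings together, a `languageModel` block might look like the following sketch (the model choice and values are illustrative):

```yaml
languageModel:
  provider: openai
  model: gpt-4o
  maxTokens: 250
  temperature: 0.4
  knowledgeBase: []   # KnowledgeBase features are currently disabled
  tools: []           # see the Tools section below
```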
LLM providers and models
The Autopilot supports multiple language model providers. The following is a list of the supported providers:
Provider | Description | Supported models |
---|---|---|
OpenAI | OpenAI provides various GPT models for conversational AI | gpt-4o, gpt-4o-mini, gpt-3.5-turbo, gpt-4-turbo |
Groq | Groq offers high-performance AI models optimized for speed | llama-3.3-70b-versatile |
Google | Google offers various LLM models for conversational AI | gemini-2.0-flash, gemini-2.0-flash-lite, gemini-2.0-pro-exp-02-05 |
Anthropic | Anthropic offers various LLM models for conversational AI | claude-3-5-haiku-latest, claude-3-7-sonnet-latest |
We have noticed that Groq models, particularly `llama-3.3-70b-versatile`, often require greater prompting specificity for effective tool usage. Also, Google’s `gemini-2.0-flash-lite` does not support tool calling. We will share best practices to ensure more consistent behavior as we gain more insights.
Knowledge bases
Coming soon…
Tools
Fonoster’s Autopilot allows you to use tools to enhance the conversational experience. Tools are used to perform specific actions during the conversation.
Built-in tools
The following is a list of built-in tools available for an agent:
Tool | Description |
---|---|
hangup | A tool to end the conversation |
transfer | A tool to transfer the call to a live agent |
hold | A tool to put the call on hold (Coming soon) |
Custom tools
You can add custom tools under `intelligence.config.languageModel.tools`, which is an array where each tool is defined as an object. These tools enable your assistant to interact with external services and APIs, or execute specific actions.
Each tool must follow the tool schema for consistency and compatibility.
The following example demonstrates how to add a custom tool that fetches available appointment times for a specific date:
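A sketch of such a tool definition is shown below; the tool name, URL, header, and message values are placeholders:

```yaml
intelligence:
  config:
    languageModel:
      tools:
        - name: getAvailableTimes
          description: Fetch available appointment times for a specific date
          requestStartMessage: "Let me check the calendar for you."
          parameters:
            type: object
            properties:
              date:
                type: string
                description: The date to check, in YYYY-MM-DD format
            required:
              - date
          operation:
            method: get
            url: https://api.example.com/appointments/available   # placeholder endpoint
            headers:
              x-api-key: your-api-key                              # placeholder header
```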
The response from your endpoint must be a JSON object containing a `result` property. For example: `{ "result": "We have open slots for Thursday and Friday." }`
Key Components of a Tool Definition:

- `name`: A unique identifier for the tool
- `description`: A brief explanation of what the tool does
- `requestStartMessage`: The message sent when the tool is triggered
- `parameters`: Defines the expected input structure in accordance with the JSON Schema standard, which is also required for OpenAI-compatible tool calling
  - `type`: Defines the structure of the input (typically `object`)
  - `properties`: Specifies the fields expected in the input
  - `required`: Lists the fields that must be provided
- `operation`:
  - `method`: The HTTP method (`get` and `post` are supported)
  - `url`: The endpoint to send the request to
  - `headers`: Any necessary headers, such as authentication keys
For additional details, refer to the tool schema documentation.
Use `operation.method: post` for POST requests. If you want the Autopilot to wait for POST requests to complete, set `operation.waitForResponse` to `true`. For `get` requests, the Autopilot will wait for the response by default.
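For instance, a POST-based tool that should block until the request completes might declare its operation like this sketch (the URL is a placeholder):

```yaml
operation:
  method: post
  url: https://api.example.com/appointments/book   # placeholder endpoint
  waitForResponse: true                            # Autopilot waits for the POST to complete
```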
Autopilot’s Test Cases
Test cases are an experimental feature and the behavior might change in the future.
The Autopilot supports automated testing through test cases defined in the configuration. Test cases allow you to verify the behavior of your Autopilot before deploying it to production.
The following is an example of creating a test case for Fonoster Autopilot:
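A sketch of such a test case is shown below; the top-level nesting (`testCases`, `scenarios`, `conversation`) is an assumption based on the terms used in this section, and the numbers, dates, and responses are placeholders:

```yaml
# Sketch of a test-case definition (nesting is an assumption)
testCases:
  evalsLanguageModel:
    provider: openai
    model: gpt-4o-mini
    apiKey: your-openai-api-key            # placeholder
  scenarios:
    - ref: book-appointment
      description: Verifies that the caller can ask for available appointment times
      telephonyContext:
        callDirection: FROM_PSTN
        ingressNumber: "+19195550100"
        callerNumber: "+19195550123"
      conversation:
        - userInput: "Hi, do you have anything available this Friday?"
          expected:
            text:
              type: similar
              response: "Let me check. We have open slots on Thursday and Friday."
            tools:
              - tool: getAvailableTimes
                parameters:
                  date: "2025-05-09"
```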
Evaluation Language Model
The evalsLanguageModel
section defines the model used to evaluate test cases:
Setting | Description |
---|---|
provider | Evaluation provider |
model | The OpenAI model to use for evaluations |
apiKey | The API key for the evaluation model |
The evaluation model is separate from the model used in actual conversations. This separation allows for consistent evaluation results regardless of the production model being used.
Test Scenarios
Each test scenario represents a complete conversation flow. The scenario includes:
- `ref`: A unique identifier for the test case
- `description`: A brief description of what the test case verifies
- `telephonyContext`: Emulates the context of a real phone call with the following properties:
  - `callDirection`: The direction of the call (e.g., "FROM_PSTN")
  - `ingressNumber`: The number being called
  - `callerNumber`: The number making the call
This information is available to the AI model to help it understand the context of the call and can be used in your prompts.
Conversation Turns
Each scenario contains a series of conversation turns. A turn represents a single interaction between the user and the Autopilot, consisting of:
Component | Description |
---|---|
userInput | The text representing what the user says |
expected | The expected response from the Autopilot |
expected.text | The expected text response from the Autopilot |
expected.text.type | The type of text comparison to apply (e.g., similar) |
expected.text.response | The text the Autopilot is expected to respond with |
expected.tools | The expected tools to be used in the response |
expected.tools.tool | The name of the tool to be used |
expected.tools.parameters | The parameters the tool is expected to be called with |
Use `type: "similar"` for text responses to allow for natural language variations in the Autopilot’s responses while maintaining the same semantic meaning.
The `expected` object can validate:

- Text responses via the `text` property
- Tool usage via the `tools` property
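For example, a pair of turns validating each form might look like this sketch (values are illustrative):

```yaml
conversation:
  - userInput: "Thanks, that's all I needed."
    expected:
      text:                      # validate the text response
        type: similar
        response: "You're welcome! Is there anything else I can help you with?"
  - userInput: "No, goodbye."
    expected:
      tools:                     # validate tool usage
        - tool: hangup
```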