Real-Time Voice Interactions with the WebSocket Audio Adapter


Jan 14, 2025 - 18:36

Authors: Mark Sze, Tvrtko Sternak, Davor Runje, Davorin Rusevljan

TL;DR:

  • Demo implementation: Build a website that uses WebSockets to talk to the RealtimeAgent by voice.

  • Introducing WebSocketAudioAdapter: Stream audio directly from your browser using WebSockets.

  • Simplified Development: Connect to real-time agents quickly and effortlessly with minimal setup.

Realtime over WebSockets

In our previous blog post, we introduced a way to interact with the RealtimeAgent using TwilioAudioAdapter. While effective, this approach required a setup-intensive process involving Twilio integration, account configuration, number forwarding, and other complexities. Today, we’re excited to introduce the WebSocketAudioAdapter, a streamlined approach to real-time audio streaming directly via a web browser.

This post explores the features, benefits, and implementation of the WebSocketAudioAdapter, showing how it transforms the way we connect with real-time agents.

Why We Built the WebSocketAudioAdapter

Challenges with Existing Solutions

Although the previously introduced TwilioAudioAdapter provides a robust way to connect to your RealtimeAgent, it comes with challenges:

  • Browser Limitations: For teams building web-first applications, integrating with a telephony platform can feel redundant.

  • Complex Setup: Configuring Twilio accounts, verifying numbers, and setting up forwarding can be time-consuming.

  • Platform Dependency: This solution requires developers to rely on an external API, which adds latency and cost.

Our Solution

The WebSocketAudioAdapter eliminates these challenges by allowing direct audio streaming over WebSockets. It integrates seamlessly with modern web technologies, enabling real-time voice interactions without external telephony platforms.

How It Works

At its core, the WebSocketAudioAdapter leverages WebSockets to handle real-time audio streaming. This means your browser becomes the communication bridge, sending audio packets to a server where a RealtimeAgent processes them.

Here’s a quick overview of its components and how they fit together:

  1. WebSocket Connection:
     * The adapter establishes a [**WebSockets**](https://fastapi.tiangolo.com/advanced/websockets/) connection between the client (browser) and the server.
     * Audio packets are streamed in real time through this connection.
  2. Integration with FastAPI:
     * Using Python’s [**FastAPI**](https://fastapi.tiangolo.com/) framework, developers can easily set up endpoints for handling [**WebSockets**](https://fastapi.tiangolo.com/advanced/websockets/) traffic.
  3. Powered by Realtime Agents:
     * The audio adapter integrates with an AI-powered [`RealtimeAgent`](https://docs.ag2.ai/docs/reference/agentchat/realtime_agent/realtime_agent), allowing the agent to process audio inputs and respond intelligently.
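To give a feel for what an "audio packet" travelling over the socket might look like, here is a minimal sketch using only the Python standard library. It assumes 16-bit PCM samples that are base64-encoded and wrapped in a JSON text frame. The message shape (`event`, `media`, `payload`) and both function names are hypothetical illustrations, not the adapter's documented wire format.

```python
import base64
import json
import struct


def encode_audio_message(samples: list[int]) -> str:
    """Pack int16 PCM samples into a JSON text frame (hypothetical message shape)."""
    # "<h" = little-endian signed 16-bit; one "h" per sample.
    pcm_bytes = struct.pack(f"<{len(samples)}h", *samples)
    payload = base64.b64encode(pcm_bytes).decode("ascii")
    return json.dumps({"event": "media", "media": {"payload": payload}})


def decode_audio_message(message: str) -> list[int]:
    """Reverse of encode_audio_message: JSON text frame back to int16 samples."""
    data = json.loads(message)
    if data.get("event") != "media":
        raise ValueError(f"unexpected event: {data.get('event')!r}")
    pcm_bytes = base64.b64decode(data["media"]["payload"])
    return list(struct.unpack(f"<{len(pcm_bytes) // 2}h", pcm_bytes))


if __name__ == "__main__":
    frame = encode_audio_message([0, 1000, -1000, 32767, -32768])
    print(decode_audio_message(frame))  # prints [0, 1000, -1000, 32767, -32768]
```

Text frames with base64 payloads are easy to log and debug. Binary WebSocket frames would avoid the roughly one-third size overhead of base64, at the cost of needing a separate channel for control messages.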

Key Features

1. Simplified Setup

Unlike TwilioAudioAdapter, the WebSocketAudioAdapter requires no phone numbers, no telephony configuration, and no telephony provider account. It’s a plug-and-play solution.

2. Real-Time Performance

By streaming audio over WebSockets, the adapter ensures low latency, making conversations feel natural and seamless.
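One concrete contributor to latency is chunk size: a chunk cannot be sent until it has been fully recorded, so its duration is a floor on the delay added by buffering. The sketch below works through that arithmetic. The 24 kHz, 16-bit mono format and the function name are illustrative assumptions, not values taken from the adapter.

```python
def chunk_duration_ms(
    chunk_bytes: int,
    sample_rate_hz: int = 24_000,  # assumed sample rate, for illustration only
    bytes_per_sample: int = 2,     # assumed 16-bit mono PCM
) -> float:
    """Return how many milliseconds of audio a chunk of raw PCM bytes holds."""
    samples = chunk_bytes / bytes_per_sample
    return 1000.0 * samples / sample_rate_hz


if __name__ == "__main__":
    for size in (960, 4800, 48_000):
        print(f"{size} bytes -> {chunk_duration_ms(size):.0f} ms of audio")
```

Under these assumptions a 4800-byte chunk holds 100 ms of audio and a 960-byte chunk holds 20 ms. Smaller chunks lower the buffering delay but increase the number of messages per second.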

3. Browser-Based

Everything happens within the user’s browser, meaning no additional software is required. This makes it ideal for web applications.

4. Flexible Integration

Whether you’re building a chatbot, a voice assistant, or an interactive application, the adapter can integrate easily with existing frameworks and AI systems.

Example: Build a Voice-Enabled Weather Bot

Let’s walk through a practical example where we use the WebSocketAudioAdapter to create a voice-enabled weather bot. You can find the full example here.

To run the demo example, follow these steps:

1. Clone the Repository

git clone https://github.com/ag2ai/realtime-agent-over-websockets.git
cd realtime-agent-over-websockets

2. Set Up Environment Variables

Create an OAI_CONFIG_LIST file based on the provided OAI_CONFIG_LIST_sample:

cp OAI_CONFIG_LIST_sample OAI_CONFIG_LIST

In the OAI_CONFIG_LIST file, update the api_key to your OpenAI API key.
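A forgotten or placeholder key is an easy mistake at this step, so a quick sanity check before starting the server can save a confusing failure later. The sketch below assumes the file is a JSON list of objects that each carry an `api_key` field. The sample file's exact contents are not shown in this post, so treat the placeholder heuristic (a value starting with `<`) and the function name as hypothetical.

```python
import json


def find_config_problems(text: str) -> list[str]:
    """Return human-readable problems found in OAI_CONFIG_LIST contents."""
    try:
        entries = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(entries, list) or not entries:
        return ["expected a non-empty JSON list of config entries"]

    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: expected a JSON object")
            continue
        key = entry.get("api_key")
        if not isinstance(key, str) or not key.strip():
            problems.append(f"entry {i}: api_key is missing or empty")
        elif key.startswith("<"):
            # Assumed convention: sample files often use "<...>" placeholders.
            problems.append(f"entry {i}: api_key still looks like a placeholder")
    return problems
```

To use it, read the file and print whatever comes back, for example `find_config_problems(open("OAI_CONFIG_LIST").read())`. An empty list means the basic structure looks sane; it does not verify that the key itself is valid.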

(Optional) Create and use a virtual environment

To avoid cluttering your global Python environment, you can create a virtual environment. On your command line, enter:

python3 -m venv env
source env/bin/activate

3. Install Dependencies

Install the required Python packages using pip:

pip install -r requirements.txt

4. Start the Server

Run the application with Uvicorn:

uvicorn realtime_over_websockets.main:app --port 5050

After you start the server, you should see log output like the following:

INFO:     Started server process [64425]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:5050 (Press CTRL+C to quit)
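If you would rather confirm from a script that something is accepting connections on port 5050, a plain TCP connect is enough. This is a minimal sketch with a hypothetical helper name. It only checks that the port is open, not that the application behind it is healthy.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 5050, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("server reachable:", is_listening())
```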

Ready to Chat?