Real-Time Voice Interactions with the WebSocket Audio Adapter


Jan 14, 2025 - 18:36

Authors: Mark Sze, Tvrtko Sternak, Davor Runje, Davorin Rusevljan

TL;DR:

Demo implementation: Build a website that uses WebSockets to communicate by voice with the RealtimeAgent.
Introducing WebSocketAudioAdapter: Stream audio directly from your browser using WebSockets.
Simplified Development: Connect to real-time agents quickly and effortlessly with minimal setup.

Realtime over WebSockets

In our previous blog post, we introduced a way to interact with the RealtimeAgent using TwilioAudioAdapter. While effective, this approach required a setup-intensive process involving Twilio integration, account configuration, number forwarding, and other complexities. Today, we’re excited to introduce the WebSocketAudioAdapter, a streamlined approach to real-time audio streaming directly via a web browser.

This post explores the features, benefits, and implementation of the WebSocketAudioAdapter, showing how it transforms the way we connect with real-time agents.

Why We Built the `WebSocketAudioAdapter`

Challenges with Existing Solutions

Although the previously introduced TwilioAudioAdapter provides a robust way to connect to your RealtimeAgent, it comes with challenges:

Browser Limitations: For teams building web-first applications, integrating with a telephony platform can feel redundant.
Complex Setup: Configuring Twilio accounts, verifying numbers, and setting up forwarding can be time-consuming.
Platform Dependency: This solution requires developers to rely on an external API, which adds latency and cost.

Our Solution

The WebSocketAudioAdapter eliminates these challenges by allowing direct audio streaming over WebSockets. It integrates seamlessly with modern web technologies, enabling real-time voice interactions without external telephony platforms.

How It Works

At its core, the WebSocketAudioAdapter leverages WebSockets to handle real-time audio streaming. This means your browser becomes the communication bridge, sending audio packets to a server where a RealtimeAgent processes them.

Here’s a quick overview of its components and how they fit together:

WebSocket Connection:

* The adapter establishes a [**WebSockets**](https://fastapi.tiangolo.com/advanced/websockets/) connection between the client (browser) and the server.

* Audio packets are streamed in real time through this connection.

Integration with FastAPI:

* Using Python’s [**FastAPI**](https://fastapi.tiangolo.com/) framework, developers can easily set up endpoints for handling [**WebSockets**](https://fastapi.tiangolo.com/advanced/websockets/) traffic.

Powered by Realtime Agents:

* The audio adapter integrates with an AI-powered [`RealtimeAgent`](https://docs.ag2.ai/docs/reference/agentchat/realtime_agent/realtime_agent), allowing the agent to process audio inputs and respond intelligently.
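The flow described above (packets arrive from the browser, the agent processes them, replies go back) can be sketched with the standard library alone. This is a minimal illustration, not the adapter's implementation: two `asyncio` queues stand in for the two directions of the WebSocket connection, and `fake_agent`, `relay`, and `END_OF_STREAM` are hypothetical names introduced only for this example.

```python
import asyncio

END_OF_STREAM = None  # hypothetical sentinel marking the end of the audio stream


async def fake_agent(packet: bytes) -> bytes:
    """Hypothetical stand-in for the RealtimeAgent: returns a tagged copy of the packet."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return b"reply:" + packet


async def relay(inbound: asyncio.Queue, outbound: asyncio.Queue) -> int:
    """Forward each inbound audio packet to the agent and queue its reply.

    Returns the number of packets handled.
    """
    handled = 0
    while True:
        packet = await inbound.get()
        if packet is END_OF_STREAM:
            break
        await outbound.put(await fake_agent(packet))
        handled += 1
    return handled


async def demo() -> list[bytes]:
    # The inbound queue plays the role of browser-to-server traffic,
    # the outbound queue the role of server-to-browser traffic.
    inbound: asyncio.Queue = asyncio.Queue()
    outbound: asyncio.Queue = asyncio.Queue()
    for chunk in (b"chunk-1", b"chunk-2", b"chunk-3"):
        inbound.put_nowait(chunk)
    inbound.put_nowait(END_OF_STREAM)
    await relay(inbound, outbound)
    return [outbound.get_nowait() for _ in range(outbound.qsize())]


replies = asyncio.run(demo())
print(replies)
```

In a real server the queues would be replaced by the receive and send calls of the WebSocket connection, and the agent call by the RealtimeAgent itself.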

Key Features

1. Simplified Setup

Unlike TwilioAudioAdapter, the WebSocketAudioAdapter requires no phone numbers, no telephony configuration, and no external accounts. It’s a plug-and-play solution.

2. Real-Time Performance

By streaming audio over WebSockets, the adapter ensures low latency, making conversations feel natural and seamless.
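One reason streaming in small packets can keep latency low is that each packet carries only a few milliseconds of audio, so the server never waits long before it has something to process. The arithmetic below is a rough sketch; the sample rate, sample width, and chunk duration are illustrative assumptions, not values taken from the adapter.

```python
def chunk_size_bytes(
    sample_rate_hz: int,
    chunk_ms: int,
    bytes_per_sample: int = 2,  # assumption: 16-bit PCM
    channels: int = 1,          # assumption: mono audio
) -> int:
    """Return the size in bytes of one uncompressed audio chunk."""
    return sample_rate_hz * chunk_ms * bytes_per_sample * channels // 1000


# Assumed example values: 24 kHz mono 16-bit audio sent in 20 ms chunks.
size = chunk_size_bytes(24_000, 20)
packets_per_second = 1000 // 20

print(f"{size} bytes per packet, {packets_per_second} packets per second")
```

Under these assumptions each packet is under a kilobyte, and the buffering delay added by chunking is bounded by the chunk duration.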

3. Browser-Based

Everything happens within the user’s browser, meaning no additional software is required. This makes it ideal for web applications.

4. Flexible Integration

Whether you’re building a chatbot, a voice assistant, or an interactive application, the adapter can integrate easily with existing frameworks and AI systems.

Example: Build a Voice-Enabled Weather Bot

Let’s walk through a practical example where we use the WebSocketAudioAdapter to create a voice-enabled weather bot. The full example is in the repository cloned in step 1.

To run the demo example, follow these steps:

1. Clone the Repository

git clone https://github.com/ag2ai/realtime-agent-over-websockets.git
cd realtime-agent-over-websockets

2. Set Up Environment Variables

Create an OAI_CONFIG_LIST file based on the provided OAI_CONFIG_LIST_sample:

cp OAI_CONFIG_LIST_sample OAI_CONFIG_LIST

In the OAI_CONFIG_LIST file, update the api_key to your OpenAI API key.
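Before starting the server, it can help to confirm that the file parses and that every entry has a key set. The check below is a sketch that assumes the file is a JSON list of objects, each with an `api_key` field; `check_config_list` is a hypothetical helper, and the model name and key in the demo are placeholders.

```python
import json
import tempfile
from pathlib import Path


def check_config_list(path: Path) -> list[dict]:
    """Load a config list and fail early if any entry lacks a usable api_key."""
    entries = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(entries, list):
        raise ValueError("expected a JSON list of model configurations")
    for index, entry in enumerate(entries):
        api_key = entry.get("api_key") if isinstance(entry, dict) else None
        if not isinstance(api_key, str) or not api_key.strip():
            raise ValueError(f"entry {index} has no api_key set")
    return entries


# Demo against a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "OAI_CONFIG_LIST"
    sample.write_text(
        json.dumps([{"model": "example-model", "api_key": "sk-example"}]),
        encoding="utf-8",
    )
    print(len(check_config_list(sample)), "entry loaded")
```

To check the real file, call `check_config_list(Path("OAI_CONFIG_LIST"))` from the repository root.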

(Optional) Create and use a virtual environment

To avoid cluttering your global Python environment, you can create a virtual environment. On your command line, enter:

python3 -m venv env
source env/bin/activate
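If you are unsure whether the environment is active, a quick check from Python is to compare `sys.prefix` with `sys.base_prefix`; they differ inside an environment created by `venv`. The helper name `in_virtualenv` is introduced here for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv-style environment."""
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```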

3. Install Dependencies

Install the required Python packages using pip:

pip install -r requirements.txt

4. Start the Server

Run the application with Uvicorn:

uvicorn realtime_over_websockets.main:app --port 5050

After you start the server, you should see startup messages like these in the logs:

INFO:     Started server process [64425]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:5050 (Press CTRL+C to quit)
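To confirm from a second terminal that the server is accepting connections, a plain TCP check is enough. This is a sketch using only the standard library; `is_port_open` is a hypothetical helper, and the host `127.0.0.1` assumes the server runs on the same machine on port 5050, as in the command above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


print("server reachable:", is_port_open("127.0.0.1", 5050))
```

This only shows that something is listening on the port; it does not verify that the WebSocket endpoint itself works.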

Ready to Chat?
