At the recent OpenAI Dev Day on October 1st, 2024, OpenAI's biggest launch was the reveal of their Realtime API:
"Today, we're introducing a public beta of the Realtime API, enabling all paid developers to build low-latency, multimodal experiences in their apps.
Similar to ChatGPT's Advanced Voice Mode, the Realtime API supports natural speech-to-speech conversations using the six preset voices already supported in the API."
(source: OpenAI website)
As per their announcement, some of its key benefits include low latency and its speech-to-speech capabilities. Let's see how that plays out in practice when it comes to building out voice AI agents.
It also has an interruption handling feature, so that the realtime stream will stop sending audio if it detects you are trying to speak over it, a very useful feature when building voice agents.
In this article we will:
- Compare what a phone voice agent flow might have looked like before the Realtime API, and what it looks like now,
- Review a GitHub project from Twilio that sets up a voice agent using the new Realtime API, so we can see what the implementation looks like in practice, and get an idea of how the websockets and connections are set up for such an application,
- Quickly review the React demo project from OpenAI that uses the Realtime API,
- Compare the pricing of these various options.
Before the OpenAI Realtime API
To get a phone voice agent service working, there are some key services we require:
- Speech to Text (e.g. Deepgram),
- LLM/Agent (e.g. OpenAI),
- Text to Speech (e.g. ElevenLabs).
These services are illustrated in the diagram below.
That of course means integration with a number of services, and separate API requests for each part.
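Put as code, that old flow is a chain of three sequential network calls per conversational turn. The sketch below is purely illustrative; `speechToText`, `llmReply`, and `textToSpeech` are hypothetical placeholders standing in for real Deepgram, OpenAI, and ElevenLabs SDK calls, not actual APIs:

```javascript
// Hypothetical sketch of the pre-Realtime-API pipeline: three separate
// services, three separate API round trips per turn. The function bodies
// below are stubs; in a real app each would be a network call.
async function speechToText(callerAudio) {   // e.g. Deepgram STT
  return `transcript-of(${callerAudio})`;
}

async function llmReply(transcript) {        // e.g. OpenAI GPT-4o
  return `reply-to(${transcript})`;
}

async function textToSpeech(replyText) {     // e.g. ElevenLabs TTS
  return `audio-of(${replyText})`;
}

// One caller turn: STT -> LLM -> TTS, each step waiting on the last,
// so the latencies of all three services add up
async function handleTurn(callerAudio) {
  const transcript = await speechToText(callerAudio);
  const reply = await llmReply(transcript);
  return textToSpeech(reply);
}
```

Each `await` here is a full network round trip in the real version, which is exactly where both the latency and the integration overhead come from.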
The new OpenAI Realtime API allows us to bundle all of those together into a single request, hence the term speech to speech.
After the OpenAI Realtime API
This is what the flow diagram looks like for a similar flow using the new OpenAI Realtime API.
Clearly this is a much simpler flow. What is happening is we're just passing the speech/audio from the phone call directly to the OpenAI Realtime API. No need for a speech-to-text intermediary service.
And on the response side, the Realtime API is again providing an audio stream as the response, which we can send right back to Twilio (i.e. to the phone call response). So again, no need for an extra text-to-speech service, as it is all taken care of by the OpenAI Realtime API.
Let's look at some code samples for this. Twilio has provided a great GitHub repository example for setting up this Twilio and OpenAI Realtime API flow. You can find it here:
Here are some excerpts from key parts of the code related to setting up
- the websockets connection from Twilio to our application, so that we can receive audio from the caller, and send audio back,
- and the websockets connection from our application to the OpenAI Realtime API.
I've added some comments in the source code below to try to explain what is going on, especially regarding the websocket connection between Twilio and our application, and the websocket connection from our application to OpenAI. The triple dots (…) refer to sections of the source code that have been removed for brevity, since they are not critical to understanding the core features of how the flow works.
// On receiving a phone call, Twilio forwards the incoming call request to
// a webhook we specify, which is this endpoint here. This allows us to
// create programmatic voice applications, for example using an AI agent
// to handle the phone call
//
// So, here we are providing an initial response to the call, and creating
// a websocket (called a MediaStream in Twilio, more on that below) to receive
// any future audio that comes into the call
fastify.all('/incoming', async (request, reply) => {
    const twimlResponse = `<?xml version="1.0" encoding="UTF-8"?>
                          <Response>
                              <Say>Please wait while we connect your call to the A. I. voice assistant, powered by Twilio and the Open-A.I. Realtime API</Say>
                              <Pause length="1"/>
                              <Say>O.K. you can start talking!</Say>
                              <Connect>
                                  <Stream url="wss://${request.headers.host}/media-stream" />
                              </Connect>
                          </Response>`;

    reply.type('text/xml').send(twimlResponse);
});
fastify.register(async (fastify) => {
    // Here we are connecting our application to the websocket media stream we
    // set up above. That means all audio that comes through the phone will come
    // to this websocket connection we have set up here
    fastify.get('/media-stream', { websocket: true }, (connection, req) => {
        console.log('Client connected');

        // Now, we are creating a websocket connection to the OpenAI Realtime API.
        // This is the second leg of the flow diagram above
        const openAiWs = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01', {
            headers: {
                Authorization: `Bearer ${OPENAI_API_KEY}`,
                "OpenAI-Beta": "realtime=v1"
            }
        });
...
        // Here we are setting up the listener on the OpenAI Realtime API
        // websockets connection. We are specifying how we want it to
        // handle any incoming audio streams that come back from the
        // Realtime API.
        openAiWs.on('message', (data) => {
            try {
                const response = JSON.parse(data);
                ...
                // This response type indicates an LLM response from the Realtime API.
                // So we want to forward this response back to the Twilio MediaStream
                // websockets connection, which the caller will hear as a response
                // on the phone
                if (response.type === 'response.audio.delta' && response.delta) {
                    const audioDelta = {
                        event: 'media',
                        streamSid: streamSid,
                        media: { payload: Buffer.from(response.delta, 'base64').toString('base64') }
                    };
                    // This is the actual part where we are sending it back to the Twilio
                    // MediaStream websockets connection. Notice how we are sending the
                    // response back directly. No need for text-to-speech conversion of
                    // the OpenAI response. The OpenAI Realtime API already provides the
                    // response as an audio stream (i.e. speech to speech)
                    connection.send(JSON.stringify(audioDelta));
                }
            } catch (error) {
                console.error('Error processing OpenAI message:', error, 'Raw message:', data);
            }
        });
        // This part specifies how we handle incoming messages on the Twilio
        // MediaStream websockets connection, i.e. how we handle audio that comes
        // into the phone from the caller
        connection.on('message', (message) => {
            try {
                const data = JSON.parse(message);
                switch (data.event) {
                    // This case ('media') is the state for when there is audio data
                    // available on the Twilio MediaStream from the caller
                    case 'media':
                        // we first check that the OpenAI Realtime API websockets
                        // connection is open
                        if (openAiWs.readyState === WebSocket.OPEN) {
                            const audioAppend = {
                                type: 'input_audio_buffer.append',
                                audio: data.media.payload
                            };
                            // and then forward the audio stream data to the
                            // Realtime API. Again, notice how we are sending the
                            // audio stream directly, with no speech-to-text conversion
                            // as would have been required previously
                            openAiWs.send(JSON.stringify(audioAppend));
                        }
                        break;
                    ...
                }
            } catch (error) {
                console.error('Error parsing message:', error, 'Message:', message);
            }
        });
...
fastify.listen({ port: PORT }, (err) => {
    if (err) {
        console.error(err);
        process.exit(1);
    }
    console.log(`Server is listening on port ${PORT}`);
});
So, that's how the new OpenAI Realtime API flow plays out in practice.
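One of the sections elided with (…) above is the session configuration. Once the websocket to OpenAI opens, the demo sends a `session.update` event to configure the conversation; the sketch below shows roughly what that looks like. The field values are assumptions based on my reading of the demo repo, so treat them as illustrative rather than as a spec:

```javascript
// Rough sketch of the session.update event sent once the OpenAI
// websocket opens (one of the sections elided above). Field values
// are assumptions based on the Twilio demo repo.
const sessionUpdate = {
  type: 'session.update',
  session: {
    turn_detection: { type: 'server_vad' },  // server-side voice activity detection,
                                             // which powers the interruption handling
    input_audio_format: 'g711_ulaw',         // the audio format Twilio MediaStreams use
    output_audio_format: 'g711_ulaw',
    voice: 'alloy',                          // one of the preset voices
    instructions: 'You are a helpful and friendly AI assistant.',
    modalities: ['text', 'audio'],
    temperature: 0.8
  }
};

// In the demo this is sent on the 'open' event of the OpenAI websocket:
// openAiWs.on('open', () => openAiWs.send(JSON.stringify(sessionUpdate)));
```

Note the `g711_ulaw` audio formats, which match what Twilio streams over the phone line, so no transcoding step is needed on either leg.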
Regarding the Twilio MediaStreams, you can read more about them here. They are a way to set up a websockets connection between a call to a Twilio phone number and your application. This allows streaming of audio from the call to and from your application, letting you build programmable voice applications over the phone.
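Relatedly, the code excerpts above reference a `streamSid` variable without showing where it gets set. It comes from the MediaStream's 'start' event, which Twilio sends once when the stream opens. A simplified sketch (not the verbatim source) of that handling:

```javascript
// Simplified sketch of handling Twilio MediaStream events. Twilio sends a
// 'start' event once when the stream opens, carrying the stream's unique
// ID (streamSid), which we need in order to address the audio we send back.
let streamSid = null;

function handleTwilioEvent(data) {
  switch (data.event) {
    case 'start':
      streamSid = data.start.streamSid;
      console.log('Incoming stream has started:', streamSid);
      break;
    case 'media':
      // base64-encoded caller audio arrives here in data.media.payload
      break;
  }
  return streamSid;
}
```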
To get the code above running, you will need to set up a Twilio number and ngrok as well. You can check out my other article over here for help setting those up.
Since access to the OpenAI Realtime API has only just been rolled out, not everyone may have access just yet. I initially was not able to access it. Running the application worked, but as soon as it tried to connect to the OpenAI Realtime API I got a 403 error. So if you see the same issue, it could be related to not having access yet.
OpenAI have also provided a great demo for testing out their Realtime API in the browser using a React app. I tested this out myself, and was very impressed with the speed of response from the voice agent coming from the Realtime API. The response is near-instantaneous, with no noticeable latency, which makes for a great user experience.
Sharing a link to the source code here. It has instructions in the README.md for how to get set up.
This is a picture of what the application looks like once you get it running locally.
Let's compare the cost of using the OpenAI Realtime API versus a more conventional approach using Deepgram for speech to text (STT) and text to speech (TTS) and using OpenAI GPT-4o for the LLM part.
Comparison using the prices from their websites shows that for a 1-minute conversation, with the caller speaking half the time and the AI agent speaking the other half, the cost per minute using Deepgram and GPT-4o would be $0.0117/minute, while using the OpenAI Realtime API would be $0.15/minute.
That means using the OpenAI Realtime API would be just over 10x the price per minute.
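As a quick sanity check on the $0.15/minute figure, here is the arithmetic behind it. The audio input/output rates are my reading of OpenAI's pricing page at the time of writing ($0.06/min of audio in, $0.24/min of audio out), so treat them as assumptions that may well have changed:

```javascript
// Back-of-the-envelope cost for a 1-minute call where the caller and the
// agent each speak half the time. Rates are assumptions taken from the
// providers' pricing pages at the time of writing.
const realtimeAudioInPerMin = 0.06;   // OpenAI Realtime API, audio input
const realtimeAudioOutPerMin = 0.24;  // OpenAI Realtime API, audio output

// 30s of caller audio (input) + 30s of agent audio (output)
const realtimeCostPerMin = 0.5 * realtimeAudioInPerMin + 0.5 * realtimeAudioOutPerMin;
console.log(realtimeCostPerMin.toFixed(2));   // "0.15"

const conventionalCostPerMin = 0.0117;        // Deepgram STT/TTS + GPT-4o
// ratio is roughly 12.8, i.e. "just over 10x"
console.log((realtimeCostPerMin / conventionalCostPerMin).toFixed(1) + 'x');
```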
It does sound like a fair amount more expensive, though we should balance that against some of the benefits the OpenAI Realtime API can provide, including
- reduced latencies, critical for a good voice experience,
- ease of setup due to fewer moving parts,
- conversation interruption handling provided out of the box.
Also, please do keep in mind that prices can change over time, so the prices you find at the time of reading this article may not be the same as those reflected above.
Hope that was helpful! What do you think of the new OpenAI Realtime API? Think you'll be using it in any upcoming projects?
While we're here, are there any other tutorials or articles around voice agents and voice AI you would be interested in? I'm deep diving into that area a bit just now, so I'd be happy to look into anything people find interesting.
Happy hacking!