OpenAI’s new voice AI can listen, think, and talk back in 70+ languages

OpenAI has launched three new audio models in its Realtime API, and they are a big deal for anyone building voice-powered apps. The three models are GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. 

Together, they move voice AI beyond simple back-and-forth responses toward something that can understand you, take action, and keep up with a real conversation.


If their demo is anything to go by, we have just seen the next evolution in how voice AI models work. 

So what can these models actually do?

GPT-Realtime-2 is the headline act. It brings GPT-5-class reasoning to live voice interactions, meaning it can handle harder requests without dropping the thread of the conversation. 

It can call multiple tools simultaneously and even narrate what it’s doing with phrases like “checking your calendar” or “let me look into that.” It also has a larger context window of 128K tokens, which means longer, more coherent sessions. Developers can even adjust the reasoning effort based on the complexity of the request.
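The announcement doesn't show the wire format, but loosely modeled on OpenAI's existing Realtime API (which configures a session by sending a `session.update` event over a WebSocket), a session with tools and an adjustable reasoning effort might look something like the sketch below. The model name, the `reasoning_effort` field, and the `check_calendar` tool are all assumptions for illustration, not confirmed parameters:

```python
import json

def build_session_update(reasoning_effort: str = "medium") -> dict:
    """Build a hypothetical session.update event for a realtime voice session."""
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",             # assumed model identifier
            "modalities": ["audio", "text"],
            "reasoning_effort": reasoning_effort,  # assumed dial per the article
            "tools": [
                {
                    "type": "function",
                    "name": "check_calendar",      # hypothetical tool
                    "description": "Look up events on the user's calendar",
                    "parameters": {
                        "type": "object",
                        "properties": {"date": {"type": "string"}},
                        "required": ["date"],
                    },
                }
            ],
        },
    }

# In a real client, this JSON would be sent over the Realtime WebSocket.
print(json.dumps(build_session_update("high"), indent=2))
```

The point of the "narration" behavior is that the model can speak ("checking your calendar") while a tool call like the one defined above is still in flight, instead of going silent.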

GPT-Realtime-Translate is probably my favorite. It’s the closest we have come to having Star Trek’s Universal Translator in real life. It supports live speech translation across 70+ input languages and 13 output languages. 

The best part of the demo: even when a new person joined and spoke a different language, GPT-Realtime-Translate had no trouble translating both speakers into English in real time.


Finally, there’s GPT-Realtime-Whisper. Most speech-to-text models wait for the speaker to finish before producing the full transcription. This one is a streaming transcription model that converts speech to text as the speaker talks. It is useful for live captions, meeting notes, and any voice-powered workflow where waiting for a transcription is not an option.
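In practice, a streaming transcription client consumes a stream of partial "delta" events and stitches them into a running caption, rather than waiting for one final blob. Here's a minimal sketch of that pattern; the event names (`transcript.delta`, `transcript.done`) are illustrative assumptions modeled on OpenAI's existing Realtime event style, not the documented schema for this model:

```python
def apply_transcript_events(events: list[dict]) -> str:
    """Stitch streaming transcription deltas into a running caption."""
    caption = []
    for ev in events:
        if ev.get("type") == "transcript.delta":   # partial text as the user speaks
            caption.append(ev["delta"])
        elif ev.get("type") == "transcript.done":  # speaker finished; text is final
            break
    return "".join(caption)

# Simulated event stream from a live session:
events = [
    {"type": "transcript.delta", "delta": "Hello, "},
    {"type": "transcript.delta", "delta": "world"},
    {"type": "transcript.done"},
]
print(apply_transcript_events(events))  # → Hello, world
```

A live-captioning UI would render the caption after every delta instead of waiting for the `done` event; that's the whole advantage over batch transcription.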

Can anyone use these new voice AI models?

For now, OpenAI has released these models only to developers, but the apps they build will reach everyone. A developer could, for example, build a real-time translator app that lets users converse with people in other languages.

Several companies are already testing the new models. Zillow is building a voice assistant that can search homes and schedule tours from a single spoken request. Priceline's assistant can check your flight and hotel bookings, cancel them, and book new ones. Vimeo is using the models for real-time transcription, and so on.


Pricing starts at $0.017 per minute for Whisper, $0.034 per minute for Translate, and $32 per 1M audio input tokens for GPT-Realtime-2.
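For a rough sense of scale, the per-minute rates work out as follows (GPT-Realtime-2 is priced per token rather than per minute, and the audio-to-token conversion rate isn't given, so this sketch covers only the two per-minute models):

```python
# Listed per-minute rates from the announcement, in USD.
WHISPER_PER_MIN = 0.017     # GPT-Realtime-Whisper
TRANSLATE_PER_MIN = 0.034   # GPT-Realtime-Translate

def session_cost(minutes: float, rate_per_min: float) -> float:
    """Estimated cost of an audio session at a flat per-minute rate."""
    return round(minutes * rate_per_min, 4)

# A one-hour translated conversation:
print(session_cost(60, TRANSLATE_PER_MIN))  # → 2.04
# Live-captioning the same hour with Whisper:
print(session_cost(60, WHISPER_PER_MIN))    # → 1.02
```

So an hour of live translation runs about two dollars, which is what makes always-on translator apps plausible at consumer scale.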
