Technology

OpenAI’s new voice model thinks inside the audio loop, and the silence that used to give AI away disappears

The pause is the tell. Until now, voice AI worked by transcribing speech, sending the text to a language model, getting an answer back, and then synthesizing it into audio. Each step takes time. The user hears silence, knows something is being processed, and feels the seam. OpenAI's new GPT-Realtime-2 collapses that pipeline into a single model where reasoning happens inside the audio loop itself, and the seam goes away.
Susan Hill

OpenAI launched three new audio models in its Realtime API this week — GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. The headliner is the first one. The company describes it as the first voice model with “GPT-5-class reasoning,” built so that the same model handles audio in and audio out, with thinking woven into the conversation rather than wedged between transcription and synthesis steps. The numerical claims are concrete. Big Bench Audio scores jumped from 81.4 percent to 96.6 percent against the previous reference model. Audio MultiChallenge climbed from 34.7 percent to 48.5 percent. The context window grew from 32,000 tokens to 128,000 — enough room to hold a full customer history during a call.

The structural shift is harder to see in benchmarks. For three years, anyone building a voice agent for production stitched the stack together themselves — Whisper or Deepgram for transcription, an LLM for the reasoning, ElevenLabs or Cartesia for the voice, and prompt engineering to mask the latency. Each handoff cost milliseconds and clarity. The user heard “let me check that for you” inserted by a script, then heard nothing while the model thought, then heard the answer. GPT-Realtime-2 ships those scaffolds as native behaviors. Preambles let the agent say “let me check that” while it calls tools, so users do not sit through silence. Parallel tool calls let the model fire multiple back-end requests simultaneously and narrate which one is in flight. Recovery behavior catches failures and surfaces them rather than freezing the conversation.
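For readers who never had to build one, the old stack looked roughly like the sketch below: three sequential network calls, each a place for silence to creep in. The snippet uses OpenAI's own Whisper, chat, and speech endpoints as stand-ins for the mixed-vendor stacks named above; the model names and glue code are illustrative, not a recommended implementation.

```python
# Sketch of the pre-Realtime voice-agent pipeline: three sequential API calls,
# each one a seam the caller can hear. whisper-1, gpt-4o, and tts-1 stand in
# for the mixed-vendor stacks described above.
from openai import OpenAI

client = OpenAI()

def answer_turn(audio_path: str, reply_path: str = "reply.mp3") -> str:
    # 1. Speech-to-text: the caller has stopped talking; the clock starts.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. Reasoning: a separate text model with no awareness of the audio channel.
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a phone support agent."},
            {"role": "user", "content": transcript.text},
        ],
    )

    # 3. Text-to-speech: only now does the caller hear anything at all.
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply.choices[0].message.content,
    )
    speech.stream_to_file(reply_path)
    return reply_path
```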

The control surface for developers is the most interesting part. Reasoning effort is configurable — minimal, low, medium, high, and xhigh — with low as the default to keep latency down for simple requests. An agent answering “what time do you close?” does not need GPT-5-class reasoning. An agent walking a customer through a refund dispute does. The same model can be told how hard to think on a per-turn basis, which is a meaningful change from the previous model where reasoning depth was fixed and developers chose between fast and smart at deployment time.
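OpenAI's announcement does not spell out the exact parameter surface, so the sketch below is a guess at shape rather than documented API: a session.update event (a real event type in the current Realtime API) carrying a hypothetical reasoning-effort field, raised per turn only when the conversation gets hard.

```python
import json

# Hypothetical sketch only: "session.update" is a real Realtime API event type,
# but the reasoning-effort field name and its placement are assumptions drawn
# from the article, not documented parameters.
EFFORT_LEVELS = {"minimal", "low", "medium", "high", "xhigh"}

def session_update(effort: str) -> str:
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort}")
    event = {
        "type": "session.update",
        "session": {
            "instructions": "You are a refund-dispute support agent.",
            "reasoning_effort": effort,  # assumed field name
        },
    }
    return json.dumps(event)

# A trivial per-turn routing rule: keep the cheap default for simple questions,
# escalate only when the turn looks hard. The JSON returned by session_update()
# would be sent over the Realtime WebSocket before the next response.
def effort_for(user_text: str) -> str:
    return "high" if "refund" in user_text.lower() else "low"
```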

Skepticism belongs in the room. “GPT-5-class reasoning” is a marketing line, not a verifiable claim — without independent benchmarks across realistic dialogue, the comparison stays internal. Voice agents have a separate failure mode that benchmarks struggle to capture, which is the moment when an agent confidently says something wrong in a calm, natural-sounding voice. Better reasoning helps but does not eliminate that. The pricing matters too. GPT-Realtime-2 costs $32 per million audio input tokens and $64 per million output tokens. GPT-Realtime-Translate runs at $0.034 per minute, and GPT-Realtime-Whisper at $0.017 per minute. Cheap enough for high-volume customer service. Not cheap enough to use for chatty consumer products without thinking carefully about session length.
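A back-of-the-envelope calculation makes the pricing concrete. Audio is billed by tokens, and the tokens-per-minute rate used below is purely an assumption for illustration; only the per-million-token prices come from the announcement.

```python
# Back-of-the-envelope cost per call from the listed prices. The tokens-per-
# minute figures are illustrative assumptions, not published conversion rates.
INPUT_PRICE_PER_M = 32.0       # $ per million audio input tokens (GPT-Realtime-2)
OUTPUT_PRICE_PER_M = 64.0      # $ per million audio output tokens

ASSUMED_INPUT_TOK_PER_MIN = 800    # assumption, for illustration only
ASSUMED_OUTPUT_TOK_PER_MIN = 800   # assumption, for illustration only

def call_cost(caller_minutes: float, agent_minutes: float) -> float:
    """Rough dollar cost of one call, given minutes of caller and agent speech."""
    input_cost = caller_minutes * ASSUMED_INPUT_TOK_PER_MIN * INPUT_PRICE_PER_M / 1e6
    output_cost = agent_minutes * ASSUMED_OUTPUT_TOK_PER_MIN * OUTPUT_PRICE_PER_M / 1e6
    return input_cost + output_cost

# Under these assumptions, a ten-minute support call split evenly between
# caller and agent lands around $0.38.
print(f"${call_cost(5, 5):.2f}")
```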

The deployment context tells the rest of the story. Zillow went live with voice home search the same day. Deutsche Telekom deployed live-translated voice support across 14 European markets. Both fit exactly the use case OpenAI is pricing for — long, transactional, context-heavy conversations where the user benefits from the agent actually reasoning rather than retrieving. Priceline is building systems that let travelers manage hotel reservations and track flight delays entirely by voice. The pattern across these launches is that the customers OpenAI is naming first are the ones whose existing voice systems were the worst — call centers, support lines, transactional travel. The places where users currently scream “agent” into their phone.

The models are available in the Realtime API now. ChatGPT voice upgrades are still pending — “Stay tuned, we’re cooking,” OpenAI said. Sam Altman framed the launch around a behavioral shift, suggesting that users increasingly turn to voice with AI when they need to “dump” lots of context. If that pattern holds, the gap between voice AI and text AI starts to close — and the seam that gave AI away on the phone gets harder to hear.
