Why latency is everything in voice bots (and what OpenAI’s latest move teaches us)

There’s one quiet detail that decides whether a voice bot feels like a helpful assistant or like an awkward support call from the 2000s: latency. Not model intelligence. Not pretty voices. Latency.

And this week, OpenAI explained how it redesigned the infrastructure behind its real-time voice products to solve exactly that problem. It’s a dense technical piece aimed at network engineers, but it has direct implications for any business thinking about automating calls, customer support, or voice agents.

Let’s translate it.

The problem: natural conversation requires milliseconds

When you speak with another person, conversational turns happen with pauses of around 200 to 500 milliseconds. If the other person takes longer, you notice. If they take much longer, you assume they didn’t understand you — or that they’re thinking something strange. That instinct doesn’t disappear when the other side is an AI.

In a production environment, “low latency” has to be quantified. With the Realtime API, you typically see roughly 300 to 500 ms from the end of the user’s audio to the start of the model’s audio. That’s the “magic number” for a natural conversation.

Below that threshold, the bot feels like a person. Above it, the trick becomes obvious. And here’s the interesting part: that number doesn’t depend only on the AI model. It depends, above all, on the network infrastructure carrying the audio.

What OpenAI changed (and why)

OpenAI serves voice at a scale that’s hard to imagine, and that scale translates into three concrete requirements: global reach for more than 900 million weekly active users, fast connection setup so the user can start speaking as soon as the session begins, and low, stable media round-trip time with little jitter and packet loss so turns feel natural.

To achieve that, they redesigned their WebRTC stack (WebRTC is the set of protocols that moves real-time audio over the internet). The key change: they separated the part that manages the voice session from the part that runs the AI model.

This architecture lets them run WebRTC media in Kubernetes without exposing thousands of UDP ports. That matters because a smaller, fixed UDP surface is easier to secure and balance, and it lets the infrastructure scale without reserving large public port ranges. The design also preserves standard WebRTC behavior for clients, so nothing changes on the caller’s side.

The takeaway, in one sentence: real-time voice only works when the infrastructure makes latency invisible. For them, that meant changing how they deployed WebRTC without changing what clients expect from WebRTC.

Why you should care if you’re not an engineer

Because the underlying message is: a voice bot is not plug-and-play. The difference between one that converts and one that drives customers away lives in layers you don’t see in a demo.

Here’s what that means for a business considering call automation:

1. Model quality is necessary, but not enough

GPT-Realtime, Gemini Live, Anthropic’s voice models — they all sound amazing in a YouTube demo. Unlike traditional pipelines that chain multiple speech-to-text and text-to-speech models, the Realtime API processes and generates audio directly through a single model and API. This reduces latency, preserves speech nuance, and produces more natural, expressive responses.

But that demo was recorded with a good connection, a nearby microphone, and probably a server sitting close by too. In your business, the call comes in over a phone line, passes through a SIP provider, reaches a server that invokes the model, and then comes back. Every hop adds milliseconds.
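
To make “every hop adds milliseconds” concrete, here is a minimal latency-budget sketch. The hop names and every per-hop figure are hypothetical placeholders chosen for illustration, not measurements. Only the 500 ms ceiling comes from the range quoted earlier.

```python
# Hypothetical per-hop latencies in milliseconds for one turn of a phone call.
# These figures are illustrative assumptions, not measurements.
HOPS_MS = {
    "phone network to SIP provider": 60,
    "SIP provider to your server": 40,
    "your server to the model": 50,
    "model: audio in to first audio out": 320,
    "return path to the caller": 100,
}

BUDGET_MS = 500  # upper end of the 300 to 500 ms range quoted above


def total_latency_ms(hops: dict[str, int]) -> int:
    """Sum the contribution of every hop in the chain."""
    return sum(hops.values())


def over_budget_ms(hops: dict[str, int], budget_ms: int = BUDGET_MS) -> int:
    """Return how far the chain exceeds the budget (0 if it fits)."""
    return max(0, total_latency_ms(hops) - budget_ms)


if __name__ == "__main__":
    for name, ms in HOPS_MS.items():
        print(f"{name:38s} {ms:4d} ms")
    print(f"{'total':38s} {total_latency_ms(HOPS_MS):4d} ms")
    print(f"over budget by {over_budget_ms(HOPS_MS)} ms")
```

With these made-up numbers the model alone fits comfortably inside the budget, yet the full chain does not. That is the pattern to look for when you replace the placeholders with your own measurements.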

2. Architecture matters more than it seems

A production voice bot has three layers, and each one can ruin the experience:

Layer	What it does	Where it breaks
Edge / Client	Captures the user’s audio and plays responses	Bad microphones, unstable networks, poorly chosen codecs
Control / Backend	Runs your business logic (CRM lookup, scheduling, etc.)	Slow database queries, slow external APIs
Media / AI Model	Processes the incoming audio and generates the spoken response	Network latency, costs, rate limits

The tool-execution round-trip is the variable you control. If your backend takes 2 seconds to query a database, the AI will sit in silence for 2 seconds. One optimization is aggressive caching for read-only calls. Another is instructing the model to emit a filler phrase (like “let me check that”) before calling the function.
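
Both optimizations can be sketched in a few lines. The example below is a minimal illustration, not a production implementation: `TTLCache`, `lookup_order_status`, and the wording of `FILLER_INSTRUCTION` are hypothetical names invented here, and how the instruction text gets attached to a session depends on the voice API you use.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Tiny in-memory cache for read-only tool calls, with per-entry expiry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[tuple, tuple[float, Any]] = {}

    def get_or_call(self, fn: Callable[..., Any], *args: Any) -> Any:
        key = (fn.__name__, args)
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]          # fresh entry: skip the slow backend
        value = fn(*args)          # miss or expired: pay the round-trip once
        self._store[key] = (now, value)
        return value


def lookup_order_status(order_id: str) -> str:
    """Hypothetical slow read; stands in for a CRM or database query."""
    time.sleep(0.2)
    return f"order {order_id}: shipped"


# Prompt wording only (an assumption, adapt it to your own system prompt).
FILLER_INSTRUCTION = (
    'Before calling any tool, say a short phrase such as "let me check that" '
    "so the caller never hears dead air."
)

if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=30)
    for attempt in ("cold", "warm"):
        start = time.perf_counter()
        result = cache.get_or_call(lookup_order_status, "A-1001")
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{attempt}: {result} ({elapsed_ms:.0f} ms)")
```

Caching like this only suits reads whose answers can safely be a few seconds old. Writes such as booking a meeting should always go to the backend, which is where the filler phrase earns its keep.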

This is the kind of detail that separates a bot that works from one that’s embarrassing.

3. Reliability on real networks is not optional

Although OpenAI provides STUN/TURN capabilities for WebRTC connections, corporate firewalls can block them. For enterprise-grade reliability, it may be necessary to provision your own TURN credentials and pass them to the client during session initialization to ensure connectivity in restrictive network environments.
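
As a rough sketch of what “passing TURN credentials to the client” can look like, the helper below builds a dictionary in the shape of the standard WebRTC `RTCConfiguration` object, ready to send to a browser as JSON. The function name, the `turn.example.com` host, and the credentials are placeholders. Whether a given client SDK accepts a custom ICE configuration is an assumption you need to check against its documentation.

```python
import json


def build_rtc_configuration(turn_host: str, username: str, credential: str,
                            relay_only: bool = False) -> dict:
    """Build a dict shaped like the WebRTC RTCConfiguration dictionary."""
    config = {
        "iceServers": [
            {
                # UDP first, then TCP and TLS on 443 for networks that block UDP.
                "urls": [
                    f"turn:{turn_host}:3478?transport=udp",
                    f"turn:{turn_host}:3478?transport=tcp",
                    f"turns:{turn_host}:443?transport=tcp",
                ],
                "username": username,
                "credential": credential,
            }
        ]
    }
    if relay_only:
        # Force all media through TURN; useful for testing the relay path.
        config["iceTransportPolicy"] = "relay"
    return config


if __name__ == "__main__":
    # Placeholder host and credentials, for illustration only.
    cfg = build_rtc_configuration("turn.example.com", "demo-user", "demo-secret")
    print(json.dumps(cfg, indent=2))
```

Offering TCP and TLS fallbacks on port 443 is a common way to get through firewalls that drop UDP, at the cost of somewhat higher latency on those connections.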

In plain English: if your customer calls from hotel Wi‑Fi, from an office with a corporate firewall, or from a mobile phone with mediocre coverage, the bot still has to work. That doesn’t come for free.

What happens when latency is handled well

A well-built voice bot changes the economics of customer support and sales. Some concrete examples:

24/7 inbound calls: the bot answers on the first ring, identifies the customer, resolves common questions, and books meetings when needed. Zero missed calls, zero forgotten voicemails.
Cold lead qualification: instead of a salesperson wasting time on leads with no buying intent, the bot handles the first conversation, qualifies them, and only escalates the worthwhile ones.
First-level support: repetitive questions (hours, order status, returns) get solved in seconds without involving a human.

The success metric is not “the bot talks.” It’s “the customer hung up satisfied and didn’t notice — or didn’t care — that it was AI.”

What a well-built voice bot looks like

A practical checklist for evaluating any AI voice solution before signing:

Time to first response: how long after the user finishes speaking does the bot start? If it’s over a second, it sounds artificial.
Interruption handling: can the user cut the bot off mid-sentence and steer the conversation? If not, it’s not a modern voice bot.
Noise and accents: does it work when someone calls from a car, a place with echo, or with a strong accent?
Real integration with your stack: can the bot query your CRM, your calendar, your order system — or does it just “take a message”?
Hand-off to a human: what happens when the bot doesn’t know? Does it loop endlessly or transfer cleanly with context?
Metrics and recordings: can you audit conversations, measure resolution rate, and see where it gets stuck?

If someone is selling you a voice bot and can’t answer these six questions concretely, don’t buy.
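
The first and last checklist items can be turned into numbers. The sketch below assumes you can log two timestamps per turn (when the caller stopped speaking and when the bot’s audio started). The `Turn` type, the field names, and the sample values are hypothetical. The one-second threshold is the one from the checklist.

```python
import math
from dataclasses import dataclass
from statistics import median


@dataclass
class Turn:
    user_stopped_ms: int   # when the caller finished speaking
    bot_started_ms: int    # when the first bot audio was played


SLOW_TURN_MS = 1000  # "over a second sounds artificial"


def response_times_ms(turns: list[Turn]) -> list[int]:
    """Time to first response for every turn, in milliseconds."""
    return [t.bot_started_ms - t.user_stopped_ms for t in turns]


def percentile(values: list[int], pct: float) -> int:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slow_turns(values: list[int], threshold_ms: int = SLOW_TURN_MS) -> int:
    """Count the turns whose response time exceeds the threshold."""
    return sum(1 for v in values if v > threshold_ms)


if __name__ == "__main__":
    # Made-up sample log for illustration.
    log = [Turn(1000, 1350), Turn(5000, 5420), Turn(9000, 9380),
           Turn(14000, 15250), Turn(20000, 20460)]
    gaps = response_times_ms(log)
    print("median:", median(gaps), "ms")
    print("p95:", percentile(gaps, 95), "ms")
    print("turns over 1 s:", slow_turns(gaps))
```

Looking at a high percentile rather than the average matters here, because a handful of slow turns is what callers remember.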

Where we fit in

At Studio SmartWork, we build exactly this kind of system — bots that answer the phone, qualify leads, and resolve common questions — trained on each business’s specific information. We don’t sell the OpenAI API wrapped in a logo: we design the full architecture (edge, control, media), integrate it with the client’s CRM and calendar, and run it in production.

The reason we’re highlighting a technical OpenAI article is simple: the difference between a bot that works in a demo and one that works in your business lives in the layers almost nobody teaches. Stable latency, interruption handling, human fallback, integrations that don’t break when you change CRMs. That’s what we build.

The conclusion

The battle over the next two years in AI voice won’t be about who has the “smartest” model. It will be about who has the infrastructure to serve that model at human speed, on any network, without the trick showing. OpenAI just underlined that by re-architecting how it deploys WebRTC.

For a business owner, the lesson is simple: when you evaluate a voice solution, don’t just ask “what model does it use?” Ask how it handles latency, how it behaves on real networks, and what happens when something fails. Those three answers predict project success better than any benchmark.