Voice and Realtime Agents

Guide · updated 2026-06-16 · Markdown variant

Architectures, vendor APIs, and open frameworks for real-time speech-to-speech AI agents — cascaded pipeline vs. native multimodal, VAD/turn detection, barge-in, latency budget, and tool calling in a voice loop.

Real-time voice agents are one of the fastest-growing deployment patterns in 2026. Two architectures dominate. Understanding the tradeoffs between them is the prerequisite for every vendor and framework choice downstream.

Architecture 1: Cascaded pipeline (STT → LLM → TTS)

The classic pipeline chains three separate models:

STT — streaming speech-to-text converts the user's audio to a text transcript.
LLM — the transcript is fed to a language model, which produces a text reply (and may call tools).
TTS — the reply is synthesized back to audio.
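
As a rough illustration, the three stages can be modeled as chained async generators, so each stage consumes the previous stage's output as it arrives. The `microphone`, `stt`, `llm`, and `tts` functions below are hypothetical stand-ins with toy behavior (no real models or vendor SDKs are called); only the wiring is the point. Assumes Python 3.9 or later.

```python
import asyncio
from typing import AsyncIterator


async def microphone(words: list[str]) -> AsyncIterator[bytes]:
    """Hypothetical audio source: one 'audio chunk' per spoken word."""
    for word in words:
        yield word.encode("utf-8")


async def stt(audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Stand-in STT: pretends each chunk decodes to one transcript word."""
    async for chunk in audio:
        yield chunk.decode("utf-8")


async def llm(transcript: AsyncIterator[str]) -> AsyncIterator[str]:
    """Toy LLM: emits one reply token per transcript word as it arrives."""
    async for word in transcript:
        yield word.upper()


async def tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stand-in TTS: 'synthesizes' each token into a fake audio frame."""
    async for token in tokens:
        yield f"<audio:{token}>".encode("utf-8")


async def run_turn(words: list[str]) -> list[bytes]:
    """Wire the stages together and collect the audio frames for one turn."""
    frames = tts(llm(stt(microphone(words))))
    return [frame async for frame in frames]


if __name__ == "__main__":
    for frame in asyncio.run(run_turn(["book", "a", "table"])):
        print(frame)
```

Because every stage is a generator, the first audio frame can be produced before the last input chunk has been read, which is the property a real pipeline needs from its streaming STT, LLM, and TTS clients.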

A VAD (Voice Activity Detection) module sits upstream to detect when the user is speaking and to drive end-of-turn detection, the decision that the user has finished and the agent should respond. Barge-in (interruption) handling cuts across the stages: when the user speaks over the agent, it flushes the TTS buffer and restarts the STT stage.
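
A minimal sketch of the simplest form of this logic: an energy threshold with a minimum-duration guard for speech onset, and a silence timeout for end of turn. The `Endpointer` class, the 20 ms frame size, and all threshold values are assumptions chosen for illustration; a production VAD would add a trained voice classifier on top of raw energy.

```python
from dataclasses import dataclass
from typing import Optional

FRAME_MS = 20  # assumed duration of one audio frame


@dataclass
class Endpointer:
    energy_threshold: float = 0.1  # assumed normalized frame energy in [0, 1]
    min_speech_ms: int = 60        # minimum-duration guard against clicks
    end_silence_ms: int = 400      # silence required to declare end of turn
    _speech_ms: int = 0
    _silence_ms: int = 0
    _in_turn: bool = False

    def push(self, energy: float) -> Optional[str]:
        """Feed one frame's energy; return 'start', 'end', or None."""
        if energy >= self.energy_threshold:
            self._speech_ms += FRAME_MS
            self._silence_ms = 0
            if not self._in_turn and self._speech_ms >= self.min_speech_ms:
                self._in_turn = True
                return "start"
        else:
            self._silence_ms += FRAME_MS
            if not self._in_turn:
                self._speech_ms = 0  # burst was too short: treat as noise
            elif self._silence_ms >= self.end_silence_ms:
                self._in_turn = False
                self._speech_ms = 0
                return "end"
        return None


def detect(energies: list) -> list:
    """Return (frame_index, event) pairs for a sequence of frame energies."""
    endpointer = Endpointer()
    events = []
    for index, energy in enumerate(energies):
        event = endpointer.push(energy)
        if event is not None:
            events.append((index, event))
    return events


if __name__ == "__main__":
    # Two silent frames, five voiced frames, then 25 silent frames.
    print(detect([0.0] * 2 + [0.5] * 5 + [0.0] * 25))
```

With these settings the turn starts on the third voiced frame (60 ms of speech) and ends after 20 consecutive silent frames (400 ms). That fixed silence wait is the delay that STT-integrated and model-based endpointing try to shorten.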

Tradeoffs:

Dimension	Cascaded pipeline
Latency	Higher (three sequential models); a sub-second target requires fast STT, a cached LLM prefix, and streaming TTS
Interruptibility	Requires explicit barge-in logic at each stage boundary
Emotion / prosody	TTS adds prosody; quality varies by provider
Cost	Pay for three separate model calls per turn
Control	High: swap any component independently; use any LLM
Open-weight path	Yes — each stage can run on open-weight models
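
The explicit barge-in logic mentioned in the Interruptibility row can be sketched with asyncio task cancellation at the output stage. `Speaker` is a hypothetical class, and `asyncio.sleep` stands in for real audio playback; in a real agent the VAD would trigger `barge_in`, and in-flight LLM and TTS requests would be cancelled as well.

```python
import asyncio
from collections import deque


class Speaker:
    """Toy audio output: plays buffered TTS frames and supports barge-in."""

    def __init__(self, frame_s: float = 0.02):
        self.frame_s = frame_s  # assumed playback time per frame
        self.played = []        # frames the user actually heard
        self._buffer = deque()  # synthesized frames waiting to be played
        self._task = None       # current playback task, if any

    def say(self, frames) -> None:
        """Queue frames and start playback if it is not already running."""
        self._buffer.extend(frames)
        if self._task is None or self._task.done():
            self._task = asyncio.create_task(self._playback())

    async def _playback(self) -> None:
        while self._buffer:
            await asyncio.sleep(self.frame_s)  # stands in for playing one frame
            self.played.append(self._buffer.popleft())

    async def barge_in(self) -> int:
        """Stop playback, flush the buffer, and return the dropped frame count."""
        if self._task is not None and not self._task.done():
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
        dropped = len(self._buffer)
        self._buffer.clear()
        return dropped


async def demo():
    speaker = Speaker(frame_s=0.02)
    speaker.say([f"frame-{i}" for i in range(10)])
    await asyncio.sleep(0.05)           # the user starts talking mid-reply
    dropped = await speaker.barge_in()  # then return to the listening state
    return speaker.played, dropped


if __name__ == "__main__":
    played, dropped = asyncio.run(demo())
    print(f"played {len(played)} frames, dropped {dropped}")
```

Tracking which frames were actually played matters in practice: the conversation history given to the LLM should reflect what the user heard, not the full reply that was generated.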

Architecture 2: Native speech-to-speech (realtime multimodal models)

A single model ingests raw audio and outputs raw audio directly, without a text intermediate at the core inference step. Turn detection, interruption handling, and prosody are handled inside the model.

Tradeoffs:

Dimension	Native speech-to-speech
Latency	Lower end-to-end (one model, streaming output)
Interruptibility	Built into the model; lower barge-in latency
Emotion / prosody	Richer; the model controls vocal tone end-to-end
Cost	Single model call, but audio tokens are expensive
Control	Lower: you cannot swap the underlying LLM independently
Open-weight path	Limited — open-weight native speech-to-speech models are still emerging as of mid-2026

Vendor realtime APIs

All entries below are web-verified as of 2026-06-16.

OpenAI Realtime API

A native speech-to-speech API. The production model is gpt-realtime (GA August 28, 2025; snapshot gpt-realtime-2025-08-28). A smaller variant, gpt-realtime-mini, is also available. The earlier gpt-4o-realtime-preview series is deprecated and being removed.

Transports: WebRTC (recommended for browsers and mobile — lower jitter, handles NAT traversal) and WebSocket (recommended for server-to-server). A SIP integration path is also available for telephony.

Supports streaming audio input and output, tool/function calling mid-conversation, VAD and server-side turn detection, and barge-in. Approximate glass-to-glass latency: 300–600 ms on subsequent turns.

Docs: platform.openai.com/docs/guides/realtime-webrtc and developers.openai.com/api/docs/guides/realtime-websocket

Google Gemini Live API

A native speech-to-speech API with bidirectional streaming over WebSocket. The model processes audio input and returns audio output directly, without a text intermediate. Supported models include Gemini 2.5 Flash (native audio).

Supports multimodal input (audio + video/screen), turn detection, barge-in, and function calling. Available via the Google AI for Developers and Vertex AI APIs.

Docs: ai.google.dev/gemini-api/docs/live-api

Amazon Nova Sonic

A native speech-to-speech model on Amazon Bedrock, announced April 2025. The current generation is Amazon Nova 2 Sonic (December 2025). Accessed via Bedrock's bidirectional streaming API (InvokeModelWithBidirectionalStream), which runs over HTTP/2 rather than WebSocket. WebRTC is also supported through a reference implementation published on the AWS blog.

Supports tool use, voice selection, interruption handling, and background-noise robustness. Integrates with Amazon Connect and telephony providers (Vonage, Twilio) and open frameworks including LiveKit and Pipecat.

Docs: docs.aws.amazon.com/nova/latest/userguide/speech-bidirection.html

xAI Grok Voice Agent API

A realtime speech-to-speech API launched December 17, 2025. Uses bidirectional WebSocket streaming. Compatible with the OpenAI Realtime API specification, so clients built for OpenAI Realtime can point at xAI with minimal changes.

Features: custom VAD, Smart Turn end-of-turn detection, sub-1-second time-to-first-audio, 100+ language support with automatic detection. Also available via a native LiveKit plugin.

Docs: docs.x.ai/docs/guides/voice

Open frameworks and orchestrators

Pipecat (pipecat-ai)

Open-source Python framework (BSD-2-Clause) for building real-time voice and multimodal conversational agents, developed by Daily. Organizes processing as frames flowing through a pipeline of transport, STT, LLM, and TTS stages. Supports 20+ STT providers and 30+ TTS providers, plus direct integrations with native speech-to-speech services (OpenAI Realtime, Amazon Nova Sonic, Gemini Live).

Transports: WebRTC (Daily, LiveKit, SmallWebRTC), WebSocket, telephony. Handles VAD, turn detection, barge-in, and multi-agent coordination.

GitHub: github.com/pipecat-ai/pipecat

LiveKit Agents

Open-source Python and TypeScript framework (Apache 2.0) for building realtime voice, video, and physical AI agents on top of the LiveKit WebRTC infrastructure. SDK v1.0 GA April 2025.

The pre-1.0 SDK provided two voice agent types: VoicePipelineAgent (cascaded STT → LLM → TTS with configurable components) and MultimodalAgent (wraps native speech-to-speech APIs such as OpenAI Realtime and Gemini Live). SDK v1.0 replaces both with a single AgentSession that can run either a cascaded pipeline or a realtime model. Both modes include built-in turn detection, barge-in, and function calling. Bring your own STT, LLM, and TTS with no lock-in.

Docs: docs.livekit.io/agents

STT and TTS component vendors

STT:

Deepgram — streaming STT via WebSocket. Nova-3 is the flagship model (low WER). Flux is a conversational STT model with model-integrated end-of-turn detection and configurable turn-taking dynamics, designed specifically for voice agent pipelines. Docs: developers.deepgram.com
whisper.cpp (ggml-org) — C/C++ port of OpenAI's Whisper ASR models; runs locally with no external dependencies. Supports GPU acceleration and VAD. Use for on-device or self-hosted STT when latency from network round-trips to a cloud STT API is a constraint. GitHub: github.com/ggml-org/whisper.cpp

TTS:

ElevenLabs — streaming TTS via WebSocket. Eleven v3 model. Broadest language coverage (70+ languages), strong voice cloning. Streaming endpoint: /v1/text-to-speech/{voice_id}/stream-input. Docs: elevenlabs.io/docs/api-reference/text-to-speech/v-1-text-to-speech-voice-id-stream-input
Cartesia — streaming TTS via WebSocket (Sonic 3 model, sub-100 ms time-to-first-byte). Optimized for streaming latency at scale. Docs: docs.cartesia.ai
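
A streaming TTS connection is usually fed text incrementally as the LLM produces it. One common approach, sketched here as an assumption rather than as either vendor's recommended method, is to buffer LLM tokens until a sentence boundary and send each complete sentence to the TTS socket. `sentence_chunks` is a hypothetical helper; no vendor SDK is called.

```python
import re
from typing import Iterable, Iterator

# Sentence-ending punctuation followed by whitespace. This naive rule will
# also split after abbreviations such as "Dr. Smith".
_SENTENCE_END = re.compile(r"([.!?])\s")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group a stream of LLM tokens into sentence-sized chunks for TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        match = _SENTENCE_END.search(buffer)
        while match:
            end = match.end(1)  # position just after the punctuation mark
            yield buffer[:end].strip()
            buffer = buffer[end:].lstrip()
            match = _SENTENCE_END.search(buffer)
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


if __name__ == "__main__":
    stream = ["Sure", ", I", " can", " help", ". ", "What", " time", "?",
              " Let", " me", " check", "."]
    for chunk in sentence_chunks(stream):
        print(repr(chunk))
```

Sending whole sentences gives the synthesizer enough context for natural prosody while still letting audio start before the LLM has finished its reply.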

Key concepts

VAD (Voice Activity Detection) — a classifier that scores incoming audio for voice presence, typically using an energy threshold plus a voice classifier plus a minimum-duration guard. Determines when the user is speaking.
End-of-turn detection / endpointing — deciding that the user has finished their utterance and the agent should respond. Can be energy-based (silence duration), STT-integrated (model signals completion), or model-based (an ML classifier predicts semantic completion rather than just silence). Model-based detection has lower latency than waiting for silence timeouts.
Barge-in / interruption handling — detecting when the user speaks while the agent is responding, flushing the TTS output buffer, and restarting the listen/respond cycle. Failure to handle barge-in correctly is the most common CSAT failure in voice agents — distinguish genuine interruptions from backchannels ("uh-huh", "right") to avoid cutting off mid-sentence unnecessarily.
Latency budget — target sub-600 ms glass-to-glass for natural-feeling turn-taking. In a cascaded pipeline, the budget is roughly: STT (50–150 ms) + LLM time-to-first-token (100–300 ms) + TTS time-to-first-audio (50–150 ms). Each component must be streaming — do not wait for full STT transcript before starting LLM inference. See /resources/agent-cost-latency-optimization.
Tool calling in a voice loop — voice agents can call tools mid-conversation, but tool latency adds directly to voice latency. Keep tool calls under 200 ms; use speculative execution for predictable tool calls; return partial results to the model via streaming where possible. See /resources/reliable-tool-calling.

Practical guidance

Measure latency per component (STT, LLM, TTS), not just the end-to-end total. Instrument each stage separately; bottlenecks are rarely where intuition points.
Handle interruptions before tuning latency — a fast agent that cannot be interrupted is worse than a slightly slower one that can.
Keep tool calls fast — tool latency is voice latency. Parallelize independent tool calls; cache results of stable lookups.
Plan for transcription errors. Cascaded pipelines inherit STT errors, so design prompts and tool schemas to tolerate common transcription noise (homophones, dropped words).
Budget for always-on audio cost — native speech-to-speech APIs price audio tokens at rates 3–10× higher than text tokens. Profile real traffic before committing to a pricing model.
For latency optimization across the full agent stack, see /resources/agent-cost-latency-optimization.
For tool-calling reliability in a voice loop, see /resources/reliable-tool-calling.
For framework choices when adding voice to a multi-agent system, see /resources/agent-frameworks-compared.
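
Per-stage measurement can be as simple as a timing context manager wrapped around each stage. In this sketch, `StageTimer` and `BUDGET_MS` are hypothetical names, the budget values are the upper bounds of the per-stage ranges given under Key concepts, and `time.sleep` stands in for real STT, LLM, and TTS calls.

```python
import time
from contextlib import contextmanager

# Upper bounds of the per-stage ranges quoted under Key concepts, in ms.
# They sum to 600 ms, matching the sub-600 ms glass-to-glass target.
BUDGET_MS = {"stt": 150, "llm_ttft": 300, "tts_ttfa": 150}


class StageTimer:
    """Records wall-clock time per pipeline stage for one turn."""

    def __init__(self):
        self.ms = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.ms[name] = (time.perf_counter() - start) * 1000.0

    def total_ms(self) -> float:
        return sum(self.ms.values())

    def over_budget(self, budget=BUDGET_MS) -> dict:
        """Return only the stages whose measured time exceeds their budget."""
        return {name: round(ms, 1) for name, ms in self.ms.items()
                if ms > budget.get(name, float("inf"))}


def demo() -> StageTimer:
    timer = StageTimer()
    with timer.stage("stt"):
        time.sleep(0.02)   # stand-in for streaming STT finalization
    with timer.stage("llm_ttft"):
        time.sleep(0.32)   # deliberately over the 300 ms bound
    with timer.stage("tts_ttfa"):
        time.sleep(0.03)   # stand-in for TTS time-to-first-audio
    return timer


if __name__ == "__main__":
    timer = demo()
    print(f"total: {timer.total_ms():.0f} ms, over budget: {timer.over_budget()}")
```

In a real agent the same timer would wrap the first-token and first-audio events of each streaming client, and the per-turn results would be exported to whatever metrics system is already in place.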

Verified sources

OpenAI Realtime API (WebRTC): https://platform.openai.com/docs/guides/realtime-webrtc
OpenAI Realtime API (WebSocket): https://developers.openai.com/api/docs/guides/realtime-websocket
OpenAI gpt-realtime GA announcement: https://openai.com/index/introducing-gpt-realtime/
Google Gemini Live API overview: https://ai.google.dev/gemini-api/docs/live-api
Google Gemini Live API (WebSocket get started): https://ai.google.dev/gemini-api/docs/live-api/get-started-websocket
Amazon Nova Sonic announcement (April 2025): https://aws.amazon.com/about-aws/whats-new/2025/04/amazon-nova-sonic-speech-to-speech-conversations-bedrock/
Amazon Nova 2 Sonic announcement (December 2025): https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-nova-2-sonic-real-time-conversational-ai
Amazon Nova Sonic bidirectional streaming docs: https://docs.aws.amazon.com/nova/latest/userguide/speech-bidirection.html
xAI Grok Voice Agent API announcement: https://x.ai/news/grok-voice-agent-api
xAI Voice Agent API docs: https://docs.x.ai/docs/guides/voice
Pipecat (pipecat-ai, BSD-2-Clause): https://github.com/pipecat-ai/pipecat
LiveKit Agents docs: https://docs.livekit.io/agents/voice-agent/
Deepgram Nova-3 and Flux STT docs: https://developers.deepgram.com/docs/models-languages-overview
whisper.cpp (ggml-org, C/C++ Whisper port): https://github.com/ggml-org/whisper.cpp
ElevenLabs TTS WebSocket API: https://elevenlabs.io/docs/api-reference/text-to-speech/v-1-text-to-speech-voice-id-stream-input
Cartesia TTS WebSocket API: https://docs.cartesia.ai/api-reference/tts/websocket
