Voice and Realtime Agents

Guide · updated 2026-06-16 · Markdown variant

Architectures, vendor APIs, and open frameworks for real-time speech-to-speech AI agents — cascaded pipeline vs. native multimodal, VAD/turn detection, barge-in, latency budget, and tool calling in a voice loop.


Real-time voice agents are one of the fastest-growing deployment patterns in 2026. Two architectures dominate. Understanding the tradeoffs between them is the prerequisite for every vendor and framework choice downstream.

Architecture 1: Cascaded pipeline (STT → LLM → TTS)

The classic pipeline chains three separate models:

  1. STT — streaming speech-to-text converts the user's audio to a text transcript.
  2. LLM — the transcript is fed to a language model, which produces a text reply (and may call tools).
  3. TTS — the reply is synthesized back to audio.
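
The three stages above can be sketched as a chain of async functions. This is a minimal illustration, not a real integration: `fake_stt`, `fake_llm`, `fake_tts`, and `mic` are hypothetical stand-ins that pass text through as bytes, so the only thing shown is how the stages connect and where streaming happens.

```python
import asyncio
from typing import AsyncIterator, List


async def mic(words: List[str]) -> AsyncIterator[bytes]:
    """Hypothetical audio source: one 'audio chunk' per word."""
    for word in words:
        yield word.encode("utf-8")


async def fake_stt(audio_chunks: AsyncIterator[bytes]) -> str:
    """Stand-in STT: pretends each audio chunk decodes to one word."""
    words = []
    async for chunk in audio_chunks:
        words.append(chunk.decode("utf-8"))
    return " ".join(words)


async def fake_llm(transcript: str) -> AsyncIterator[str]:
    """Stand-in LLM: streams a reply token by token."""
    for token in f"You said: {transcript}".split():
        yield token


async def fake_tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stand-in TTS: emits one audio chunk per token as it arrives."""
    async for token in tokens:
        yield token.encode("utf-8")


async def run_turn(words: List[str]) -> List[bytes]:
    # STT must finish the transcript before the LLM starts...
    transcript = await fake_stt(mic(words))
    # ...but LLM tokens flow into TTS as they are produced.
    return [chunk async for chunk in fake_tts(fake_llm(transcript))]


if __name__ == "__main__":
    print(asyncio.run(run_turn(["hello", "there"])))
```

Because the LLM and TTS stages are connected as async iterators, the first audio chunk can be produced before the LLM has finished its reply, which is the property a real cascaded pipeline relies on to keep latency down.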

A voice activity detection (VAD) module sits upstream of the STT stage. It detects when the user is speaking and drives end-of-turn detection: the decision that the user has finished and the agent should respond. Barge-in (interruption) handling cuts across the stages: when the user speaks over the agent, it flushes the TTS buffer and restarts the STT stage.
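
A rough sketch of how VAD, end-of-turn detection, and barge-in fit together, assuming a simple energy threshold plus a silence timeout. Production systems typically use a trained VAD or turn-detection model instead; the class name, thresholds, and event names here are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TurnDetector:
    """Energy-threshold VAD with a silence timeout for end-of-turn."""
    energy_threshold: float = 0.02  # hypothetical RMS level that counts as speech
    frame_ms: int = 20              # duration of each audio frame
    silence_ms: int = 500           # trailing silence that ends the turn
    _speaking: bool = False
    _silent_for_ms: int = 0

    def push(self, frame: List[float]) -> Optional[str]:
        """Feed one frame of samples; return an event name or None."""
        rms = (sum(s * s for s in frame) / len(frame)) ** 0.5 if frame else 0.0
        if rms >= self.energy_threshold:
            self._silent_for_ms = 0
            if not self._speaking:
                self._speaking = True
                return "speech_start"
            return None
        if self._speaking:
            self._silent_for_ms += self.frame_ms
            if self._silent_for_ms >= self.silence_ms:
                self._speaking = False
                self._silent_for_ms = 0
                return "end_of_turn"
        return None


def handle_event(event: Optional[str], agent_speaking: bool,
                 tts_queue: List[bytes]) -> Optional[str]:
    """Turn a speech_start during agent playback into a barge-in."""
    if event == "speech_start" and agent_speaking:
        tts_queue.clear()  # flush audio that has not been played yet
        return "barge_in"
    return event
```

The `silence_ms` value is the main tuning knob in this sketch: a shorter timeout makes the agent respond sooner but risks cutting the user off mid-pause.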

Tradeoffs:

| Dimension | Cascaded pipeline |
| --- | --- |
| Latency | Higher (three sequential models); a sub-second target requires fast STT, a cached LLM prefix, and streaming TTS |
| Interruptibility | Requires explicit barge-in logic at each stage boundary |
| Emotion / prosody | TTS adds prosody; quality varies by provider |
| Cost | Pay for three separate model calls per turn |
| Control | High: swap any component independently; use any LLM |
| Open-weight path | Yes: each stage can run on open-weight models |
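
To make the latency row concrete, a per-turn budget can be added up stage by stage. The numbers below are illustrative assumptions, not measurements from any provider, and the stage names are hypothetical; the point is only that the stages add, so every one of them has to be fast for the total to land under one second.

```python
from typing import Dict, Tuple

# Illustrative per-stage delays in milliseconds (assumed, not measured).
baseline_ms: Dict[str, int] = {
    "end_of_turn_silence": 300,  # waiting to be sure the user has stopped
    "stt_finalization": 100,     # final transcript after the last audio frame
    "llm_first_token": 250,      # prompt processing before the first token
    "llm_first_sentence": 150,   # enough tokens for TTS to start speaking
    "tts_first_audio": 120,      # streaming TTS time to first audio chunk
    "network_round_trip": 80,    # client <-> server transport
}


def time_to_first_audio(stage_ms: Dict[str, int],
                        target_ms: int = 1000) -> Tuple[int, bool]:
    """Return the summed delay and whether it beats the target."""
    total = sum(stage_ms.values())
    return total, total < target_ms


# Same budget, assuming a cached LLM prefix cuts time to first token.
tuned_ms = {**baseline_ms, "llm_first_token": 120}

if __name__ == "__main__":
    print(time_to_first_audio(baseline_ms))  # (1000, False)
    print(time_to_first_audio(tuned_ms))     # (870, True)
```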

Architecture 2: Native speech-to-speech (realtime multimodal models)

A single model ingests raw audio and outputs raw audio directly, without a text intermediate at the core inference step. Turn detection, interruption handling, and prosody are handled inside the model.

Tradeoffs:

| Dimension | Native speech-to-speech |
| --- | --- |
| Latency | Lower end-to-end (one model, streaming output) |
| Interruptibility | Built into the model; lower barge-in latency |
| Emotion / prosody | Richer; the model controls vocal tone end-to-end |
| Cost | Single model call, but audio tokens are expensive |
| Control | Lower: you cannot swap the underlying LLM independently |
| Open-weight path | Limited: open-weight native speech-to-speech models are still emerging as of mid-2026 |

Vendor realtime APIs

All entries below are web-verified as of 2026-06-16.

OpenAI Realtime API

A native speech-to-speech API. The production model is gpt-realtime (GA August 28, 2025; snapshot gpt-realtime-2025-08-28). A smaller variant, gpt-realtime-mini, is also available. The earlier gpt-4o-realtime-preview series is deprecated and being removed.

Transports: WebRTC (recommended for browsers and mobile — lower jitter, handles NAT traversal) and WebSocket (recommended for server-to-server). A SIP integration path is also available for telephony.

Supports streaming audio input and output, tool/function calling mid-conversation, VAD and server-side turn detection, and barge-in. Approximate glass-to-glass latency: 300–600 ms on subsequent turns.
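
Tool calling in a voice loop generally follows the same shape regardless of vendor: the model emits a tool-call event, the application runs the function, and the result is sent back so the model can continue speaking. The sketch below shows that dispatch step only. The event fields (`type`, `call_id`, `name`, `arguments`, `output`) and the `get_order_status` tool are hypothetical and are not the Realtime API's actual event schema; consult the vendor docs for the real event names.

```python
import json
from typing import Any, Callable, Dict

# Registry of callable tools, keyed by function name.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so the voice loop can dispatch to it."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_order_status(order_id: str) -> dict:
    """Hypothetical tool; a real one would query a backend."""
    return {"order_id": order_id, "status": "shipped"}


def handle_tool_call(event: dict) -> dict:
    """Run the requested tool and build the result event to send back."""
    try:
        args = json.loads(event["arguments"])
        output = TOOLS[event["name"]](**args)
    except Exception as exc:
        # Report failures to the model instead of crashing the audio loop.
        output = {"error": str(exc)}
    return {
        "type": "tool_result",
        "call_id": event["call_id"],
        "output": json.dumps(output),
    }
```

Returning errors as ordinary tool output lets the model apologize or retry in speech rather than leaving the caller in silence.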

Docs: platform.openai.com/docs/guides/realtime-webrtc and developers.openai.com/api/docs/guides/realtime-websocket

Google Gemini Live API

A native speech-to-speech API with bidirectional streaming over WebSocket. The model processes audio input and returns audio output natively, without a text intermediate. Supported models include Gemini 2.5 Flash (native audio). Also available via Vertex AI.

Supports multimodal input (audio + video/screen), turn detection, barge-in, and function calling.

Docs: ai.google.dev/gemini-api/docs/live-api

Amazon Nova Sonic

A native speech-to-speech model on Amazon Bedrock, announced April 2025. The current generation is Amazon Nova 2 Sonic (December 2025). Accessed via Bedrock's bidirectional streaming API, which runs over HTTP/2. An AWS blog reference implementation also shows a WebRTC front end.

Supports tool use, voice selection, interruption handling, and background-noise robustness. Integrates with Amazon Connect and telephony providers (Vonage, Twilio) and open frameworks including LiveKit and Pipecat.

Docs: docs.aws.amazon.com/nova/latest/userguide/speech-bidirection.html

xAI Grok Voice Agent API

A realtime speech-to-speech API launched December 17, 2025. Uses bidirectional WebSocket streaming. Compatible with the OpenAI Realtime API specification, so clients built for OpenAI Realtime can point at xAI with minimal changes.

Features: custom VAD, Smart Turn end-of-turn detection, sub-1-second time-to-first-audio, 100+ language support with automatic detection. Also available via a native LiveKit plugin.

Docs: docs.x.ai/docs/guides/voice

Open frameworks and orchestrators

Pipecat (pipecat-ai)

Open-source Python framework (BSD-2-Clause) for building real-time voice and multimodal conversational agents, developed by Daily. Organizes processing as pipeline frames flowing through transport, STT, LLM, and TTS stages. Supports 20+ STT providers and 30+ TTS providers, plus direct integrations with native speech-to-speech services (OpenAI Realtime, Amazon Nova Sonic, Gemini Live).
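
The frame-based design can be illustrated in plain Python. This sketch does not use Pipecat's classes or API; `Frame`, `Processor`, `UppercaseLLM`, `BufferedTTS`, and `run_pipeline` are hypothetical names that only mimic the idea of typed frames flowing through an ordered list of stages, including an interrupt frame that clears buffered speech.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Frame:
    kind: str          # "audio", "text", or "interrupt"
    payload: str = ""


class Processor:
    """Base stage: passes every frame through unchanged."""
    def process(self, frame: Frame) -> Iterable[Frame]:
        yield frame


class UppercaseLLM(Processor):
    """Stand-in for an LLM stage: transforms text frames only."""
    def process(self, frame: Frame) -> Iterable[Frame]:
        if frame.kind == "text":
            yield Frame("text", frame.payload.upper())
        else:
            yield frame


class BufferedTTS(Processor):
    """Stand-in for TTS: buffers text and drops the buffer on interrupt."""
    def __init__(self) -> None:
        self.buffer: List[str] = []

    def process(self, frame: Frame) -> Iterable[Frame]:
        if frame.kind == "interrupt":
            self.buffer.clear()
            yield frame  # let downstream stages see the interrupt too
        elif frame.kind == "text":
            self.buffer.append(frame.payload)
            yield Frame("audio", f"<speech:{frame.payload}>")
        else:
            yield frame


def run_pipeline(stages: List[Processor], frames: List[Frame]) -> List[Frame]:
    """Push each input frame through every stage in order."""
    out: List[Frame] = []
    for frame in frames:
        current = [frame]
        for stage in stages:
            current = [o for f in current for o in stage.process(f)]
        out.extend(current)
    return out
```

Modelling interruptions as frames that travel the same path as data is what lets each stage react to barge-in locally, without a separate control channel.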

Transports: WebRTC (Daily, LiveKit, SmallWebRTC), WebSocket, telephony. Handles VAD, turn detection, barge-in, and multi-agent coordination.

GitHub: github.com/pipecat-ai/pipecat

LiveKit Agents

Open-source Python and TypeScript framework (Apache 2.0) for building realtime voice, video, and physical AI agents on top of the LiveKit WebRTC infrastructure. SDK v1.0 GA April 2025.

Releases before v1.0 exposed two voice agent types: VoicePipelineAgent (cascaded STT → LLM → TTS with configurable components) and MultimodalAgent (wraps native speech-to-speech APIs such as OpenAI Realtime and Gemini Live). Since v1.0, both patterns are configured through a single AgentSession. Either way, turn detection, barge-in, and function calling are built in, and you bring your own STT, LLM, and TTS with no lock-in.

Docs: docs.livekit.io/agents

STT and TTS component vendors

STT:

TTS:

Key concepts

Practical guidance

Verified sources
