Streaming Responses for Agents

Guide · updated 2026-06-21 · Markdown variant

Transport formats, provider event schemas, and practical concerns for consuming streamed LLM responses in production agents: SSE mechanics, OpenAI and Anthropic chunk formats, partial-JSON tool-call parsing, backpressure, cancellation, and gateway proxying.

Streaming lets an agent start processing model output before the full response is complete. For agent builders the three payoffs are: (1) time-to-first-token (TTFT) — perceived latency drops because the pipeline can act on early output; (2) early cancellation — if the first few tokens reveal a hallucination or wrong tool, the request can be aborted before paying for the full generation; (3) incremental parsing — tool-call arguments and structured outputs arrive as partial JSON that can be validated and acted on progressively. See /resources/agent-cost-latency-optimization for the latency framing.

Transport: Server-Sent Events (SSE)

All three major providers (OpenAI, Anthropic, Google Gemini) stream over HTTP using Server-Sent Events (SSE), the W3C/WHATWG standard for unidirectional server-to-client push over a plain HTTP connection. The wire format is Content-Type: text/event-stream; each event is one or more data: lines terminated by a blank line. Named events use an event: field before the data: field.

SSE works over HTTP/1.1 (chunked transfer encoding) and HTTP/2 (a single stream). WebSockets are used for bidirectional real-time protocols (e.g., OpenAI Realtime API for voice); pure generation streaming uses SSE, not WebSockets.

OpenAI Chat Completions streaming

Set "stream": true in the request body. The response is a sequence of data: SSE lines, each carrying a JSON object of type chat.completion.chunk. Each chunk has:

choices[].delta — incremental content fragment. On the first chunk, delta.role is "assistant". Subsequent chunks carry delta.content (text fragment) or delta.tool_calls (partial tool-call data).
choices[].finish_reason — null during the stream; "stop", "tool_calls", or another terminal value on the final content chunk.

The stream ends with data: [DONE] — a sentinel that is not valid JSON and signals the consumer to close the connection.

Tool calls in OpenAI streaming: delta.tool_calls is a list indexed by position. The first delta for a call includes id, type: "function", and function.name. Subsequent deltas carry only function.arguments as a partial JSON string fragment. The consumer must concatenate all function.arguments fragments across deltas, then parse the complete string as JSON after finish_reason: "tool_calls" is received. See /resources/reliable-tool-calling for schema-validation strategies on the parsed result.

Anthropic Messages streaming

Set "stream": true in the request body to /v1/messages. Events use both SSE event: name fields and a type field inside the JSON data: payload. The ordered event flow is:

message_start — contains a Message object with empty content.
For each content block: content_block_start → one or more content_block_delta events → content_block_stop. Each block has an index matching its position in the final message.
One or more message_delta events — top-level message metadata updates (e.g., cumulative usage token counts).
message_stop — stream is complete.

Additional ping events may appear anywhere. Error events can arrive mid-stream (e.g., overloaded_error); consumers must handle unknown event types gracefully.

Delta types inside content_block_delta:

text_delta — delta.text carries a text fragment.
input_json_delta — delta.partial_json carries a partial JSON string fragment for a tool_use block's input field. Accumulate fragments across deltas and parse the complete string at content_block_stop. Current models emit one complete key-value pair per emission, so gaps between events are normal.
thinking_delta — reasoning tokens when extended thinking is enabled.

Google Gemini streaming

Use streamGenerateContent instead of generateContent. With the REST API add ?alt=sse to receive SSE-formatted output. Each SSE data: event carries a complete GenerateContentResponse JSON object; incremental text arrives in candidates[0].content.parts[0].text. There is no separate [DONE] sentinel — the stream ends when the HTTP response body closes. Function-call arguments in streaming follow the same accumulate-then-parse pattern as other providers.

Streaming tool calls and structured outputs

Regardless of provider, function-call arguments arrive as partial JSON string fragments. Two handling strategies:

Accumulate-then-parse (simplest): collect all fragments into a buffer; parse the complete JSON string once the block or stream terminates. Safe for all schema shapes.
Streaming/partial JSON parser: libraries such as partial-json (npm) or Pydantic's partial JSON parsing mode can deserialize incomplete JSON incrementally, enabling early field access before the stream ends. Useful for long structured outputs where upstream steps can act on early fields.

For validation and schema-enforcement concerns once the full arguments are available, see /resources/reliable-tool-calling.

Practical concerns for agent builders

Backpressure and buffering — if your consumer processes chunks slower than the provider emits them, buffers grow. Size-bound your buffer and apply flow control; for gateway deployments see /resources/ai-gateways-llm-routing.

Cancellation / abort — send an HTTP request abort (e.g., AbortController in browser or Node.js, httpx cancel in Python) to stop generation early. The provider stops decoding; you pay only for tokens generated up to the abort. Ensure your agent loop handles a partial-response state cleanly.

Error handling mid-stream — an error event or a dropped TCP connection mid-stream leaves your state machine with a partially assembled response. Track which content blocks received content_block_stop (Anthropic) or whether finish_reason was set (OpenAI) before treating the response as complete.

Token accounting — usage fields in streaming responses (OpenAI stream_options: {"include_usage": true}; Anthropic message_delta.usage) are cumulative, not per-chunk. Read the final value, not a running sum of chunk values.

Proxying through a gateway — if you proxy streamed responses through an AI gateway or middleware, ensure the proxy flushes data: lines immediately rather than buffering the full response body. A buffering proxy negates all TTFT benefits. See /resources/ai-gateways-llm-routing for gateway selection criteria.

Verified sources

WHATWG HTML Living Standard — Server-sent events: https://html.spec.whatwg.org/multipage/server-sent-events.html
MDN Web Docs — Using server-sent events: https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events
Anthropic Messages streaming (event types, input_json_delta, tool use): https://platform.claude.com/docs/en/build-with-claude/streaming
OpenAI ChatCompletionChunk type (delta fields, tool_calls.function.arguments): https://github.com/openai/openai-python/blob/main/src/openai/types/chat/chat_completion_chunk.py
Google Gemini streaming (streamGenerateContent, GenerateContentResponse, alt=sse): https://ai.google.dev/api/generate-content
Google Gemini cookbook — Streaming REST quickstart: https://github.com/google-gemini/cookbook/blob/main/quickstarts/rest/Streaming_REST.ipynb

#streaming #sse #server-sent-events #openai #anthropic #gemini #tool-calling #latency #agents

Category: Guide