Streaming Responses for Agents

Guide · updated 2026-06-21 · Markdown variant

Transport formats, provider event schemas, and practical concerns for consuming streamed LLM responses in production agents: SSE mechanics, OpenAI and Anthropic chunk formats, partial-JSON tool-call parsing, backpressure, cancellation, and gateway proxying.


Streaming lets an agent start processing model output before the full response is complete. For agent builders the three payoffs are: (1) time-to-first-token (TTFT) — perceived latency drops because the pipeline can act on early output; (2) early cancellation — if the first few tokens reveal a hallucination or wrong tool, the request can be aborted before paying for the full generation; (3) incremental parsing — tool-call arguments and structured outputs arrive as partial JSON that can be validated and acted on progressively. See /resources/agent-cost-latency-optimization for the latency framing.

Transport: Server-Sent Events (SSE)

All three major providers (OpenAI, Anthropic, Google Gemini) stream over HTTP using Server-Sent Events (SSE), the W3C/WHATWG standard for unidirectional server-to-client push over a plain HTTP connection. The wire format is Content-Type: text/event-stream; each event is one or more data: lines terminated by a blank line. Named events use an event: field before the data: field.
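The framing rules above are small enough to sketch directly. The following is a minimal illustration, not a full implementation of the specification: it handles event: and data: fields, comment lines, and blank-line dispatch, and it ignores id: and retry:. The names SSEEvent and parse_sse are hypothetical helpers introduced for this example.

```python
from typing import Iterable, Iterator, NamedTuple


class SSEEvent(NamedTuple):
    event: str  # value of the event: field, "message" when absent
    data: str   # all data: lines of the event joined with "\n"


def parse_sse(lines: Iterable[str]) -> Iterator[SSEEvent]:
    """Yield one SSEEvent per blank-line-terminated group of fields."""
    event_name, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line: dispatch the pending event if it has data.
            if data_lines:
                yield SSEEvent(event_name, "\n".join(data_lines))
            event_name, data_lines = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # one leading space after the colon is dropped
        if field == "event":
            event_name = value
        elif field == "data":
            data_lines.append(value)
        # id: and retry: are ignored in this sketch.
    # An event not terminated by a blank line is discarded.
```

In practice the lines would come from whatever line iterator your HTTP client exposes; most provider SDKs perform this step for you.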

SSE works over HTTP/1.1 (chunked transfer encoding) and HTTP/2 (a single stream). WebSockets are used for bidirectional real-time protocols (e.g., OpenAI Realtime API for voice); pure generation streaming uses SSE, not WebSockets.

OpenAI Chat Completions streaming

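Set "stream": true in the request body. The response is a sequence of data: SSE lines, each carrying a JSON object of type chat.completion.chunk. Each chunk has id, object, created, and model fields plus a choices list; each choice carries an index, a delta holding the incremental role, content, or tool_calls, and a finish_reason that stays null until the final chunk for that choice.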

The stream ends with data: [DONE] — a sentinel that is not valid JSON and signals the consumer to close the connection.
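A minimal sketch of a consumer for this format follows. It assumes the data: payload strings have already been split out of the SSE stream, and it reads text from choices[].delta.content as in OpenAI's documented chunk schema. The function name collect_openai_text is hypothetical, and only the first choice is followed.

```python
import json
from typing import Iterable, List, Optional, Tuple


def collect_openai_text(payloads: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Return (text, finish_reason) from chat.completion.chunk payloads."""
    parts: List[str] = []
    finish_reason: Optional[str] = None
    for payload in payloads:
        if payload.strip() == "[DONE]":
            break  # sentinel is not JSON, so test for it before json.loads
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            if choice.get("index", 0) != 0:
                continue  # this sketch only follows the first choice
            delta = choice.get("delta") or {}
            if delta.get("content"):
                parts.append(delta["content"])
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason
```

A finish_reason of None on return means the stream ended without a final chunk, which is worth treating as an incomplete response.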

Tool calls in OpenAI streaming: delta.tool_calls is a list whose entries each carry an index identifying the call they belong to, so fragments of parallel calls can be told apart. The first delta for a call includes id, type: "function", and function.name. Subsequent deltas carry only function.arguments as a partial JSON string fragment. The consumer must concatenate the function.arguments fragments for each index in arrival order, then parse each complete string as JSON after finish_reason: "tool_calls" is received. See /resources/reliable-tool-calling for schema-validation strategies on the parsed result.
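The accumulation step can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: chunks are already decoded into dicts, fragments are grouped by the index field on each tool_calls entry, and accumulate_tool_calls is a hypothetical helper name.

```python
import json
from typing import Any, Dict, Iterable, List


def accumulate_tool_calls(chunks: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Concatenate argument fragments per call, then parse once finished."""
    calls: Dict[int, Dict[str, Any]] = {}
    finished = False
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            for tc in delta.get("tool_calls") or []:
                slot = calls.setdefault(
                    tc["index"], {"id": None, "name": None, "arguments": ""}
                )
                if tc.get("id"):
                    slot["id"] = tc["id"]
                fn = tc.get("function") or {}
                if fn.get("name"):
                    slot["name"] = fn["name"]
                slot["arguments"] += fn.get("arguments") or ""
            if choice.get("finish_reason") == "tool_calls":
                finished = True
    if not finished:
        # Parsing a truncated argument string would fail or mislead.
        raise ValueError("stream ended before finish_reason == 'tool_calls'")
    return [
        {"id": c["id"], "name": c["name"],
         "arguments": json.loads(c["arguments"] or "{}")}
        for _, c in sorted(calls.items())
    ]
```

Raising when the finish reason never arrives keeps a dropped connection from being mistaken for a complete call.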

Anthropic Messages streaming

Set "stream": true in the request body to /v1/messages. Events use both SSE event: name fields and a type field inside the JSON data: payload. The ordered event flow is:

  1. message_start — contains a Message object with empty content.
  2. For each content block: content_block_start → one or more content_block_delta events → content_block_stop. Each block has an index matching its position in the final message.
  3. One or more message_delta events — top-level message metadata updates (e.g., cumulative usage token counts).
  4. message_stop — stream is complete.

Additional ping events may appear anywhere. Error events can arrive mid-stream (e.g., overloaded_error); consumers must handle unknown event types gracefully.
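The event flow above maps naturally onto a small state machine. The sketch below assumes events are already decoded into dicts and uses the text_delta (field text) and input_json_delta (field partial_json) delta types from Anthropic's documented schema; other delta types are skipped. AnthropicStreamState is a hypothetical class name.

```python
import json
from typing import Any, Dict, List, Optional, Set


class AnthropicStreamState:
    """Assemble a message from decoded Anthropic streaming events."""

    def __init__(self) -> None:
        self.blocks: Dict[int, Dict[str, Any]] = {}
        self.closed: Set[int] = set()
        self.usage: Dict[str, int] = {}
        self.stop_reason: Optional[str] = None
        self.done = False

    def feed(self, event: Dict[str, Any]) -> None:
        kind = event.get("type")
        if kind == "message_start":
            self.usage.update(event["message"].get("usage") or {})
        elif kind == "content_block_start":
            block = dict(event["content_block"])
            block["_buffer"] = ""  # accumulates text or partial JSON
            self.blocks[event["index"]] = block
        elif kind == "content_block_delta":
            delta, block = event["delta"], self.blocks[event["index"]]
            if delta["type"] == "text_delta":
                block["_buffer"] += delta["text"]
            elif delta["type"] == "input_json_delta":
                block["_buffer"] += delta["partial_json"]
        elif kind == "content_block_stop":
            self.closed.add(event["index"])
        elif kind == "message_delta":
            delta = event.get("delta") or {}
            self.stop_reason = delta.get("stop_reason", self.stop_reason)
            self.usage.update(event.get("usage") or {})  # later values win
        elif kind == "message_stop":
            self.done = True
        elif kind == "error":
            raise RuntimeError((event.get("error") or {}).get("type", "unknown_error"))
        # ping and unknown event types are deliberately ignored.

    def is_complete(self) -> bool:
        return self.done and set(self.blocks) == self.closed

    def content(self) -> List[Dict[str, Any]]:
        out: List[Dict[str, Any]] = []
        for _, block in sorted(self.blocks.items()):
            if block["type"] == "text":
                out.append({"type": "text", "text": block["_buffer"]})
            elif block["type"] == "tool_use":
                out.append({"type": "tool_use", "id": block["id"],
                            "name": block["name"],
                            "input": json.loads(block["_buffer"] or "{}")})
        return out
```

Checking is_complete() before calling content() avoids parsing tool input from a block that never received content_block_stop.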

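Delta types inside content_block_delta: text_delta (a text fragment in text) and input_json_delta (a fragment of tool-call input in partial_json); with extended thinking enabled, thinking_delta and signature_delta also appear.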

Google Gemini streaming

Use streamGenerateContent instead of generateContent. With the REST API add ?alt=sse to receive SSE-formatted output. Each SSE data: event carries a complete GenerateContentResponse JSON object; incremental text arrives in candidates[0].content.parts[0].text. There is no separate [DONE] sentinel — the stream ends when the HTTP response body closes. Function-call arguments in streaming follow the same accumulate-then-parse pattern as other providers.
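As a rough sketch of consuming this format, the function below joins the text parts of the first candidate across payloads. It assumes the data: payload strings are already split out of the SSE stream, and it walks every part rather than only parts[0] in case a chunk carries more than one. collect_gemini_text is a hypothetical helper name.

```python
import json
from typing import Iterable, List


def collect_gemini_text(payloads: Iterable[str]) -> str:
    """Join text parts of the first candidate across streamed responses."""
    parts: List[str] = []
    for payload in payloads:
        response = json.loads(payload)
        for candidate in response.get("candidates", [])[:1]:
            content = candidate.get("content") or {}
            for part in content.get("parts", []):
                if "text" in part:
                    parts.append(part["text"])
    # No sentinel to look for: the loop ends when the payload iterator does.
    return "".join(parts)
```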

Streaming tool calls and structured outputs

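Regardless of provider, function-call arguments arrive as partial JSON string fragments. Two handling strategies: (1) accumulate then parse: buffer every fragment and parse once the call is complete, which is the simplest option and always operates on complete JSON; (2) incremental parsing: feed the growing buffer to a partial-JSON parser so the agent can validate or act on fields as they arrive, at the cost of handling values that may still be incomplete.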

For validation and schema-enforcement concerns once the full arguments are available, see /resources/reliable-tool-calling.
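Acting on arguments before a call is complete means parsing a JSON prefix. The sketch below is one best-effort approach, not a substitute for a dedicated partial-JSON library: it closes any open string and open brackets, drops a trailing comma, and returns None when the result is still not valid JSON. parse_partial_json is a hypothetical helper name. A truncated string or number parses as a shorter value, so treat intermediate values as provisional.

```python
import json
from typing import Any, List, Optional


def parse_partial_json(fragment: str) -> Optional[Any]:
    """Best-effort parse of a JSON prefix; None if no simple completion works."""
    closers: List[str] = []  # closing brackets still owed, innermost last
    in_string = False
    escaped = False
    for ch in fragment:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]":
            if not closers or closers.pop() != ch:
                return None  # mismatched bracket, not a JSON prefix
    candidate = fragment
    if in_string:
        if escaped:
            candidate = candidate[:-1]  # drop a dangling backslash
        candidate += '"'
    candidate = candidate.rstrip()
    if candidate.endswith(","):
        candidate = candidate[:-1]
    candidate += "".join(reversed(closers))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # e.g. cut off after a key or in the middle of a literal
```

Calling this on the accumulated buffer after each fragment gives a progressively fuller view of the arguments; the final parse should still use the complete string.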

Practical concerns for agent builders

Backpressure and buffering — if your consumer processes chunks slower than the provider emits them, buffers grow. Size-bound your buffer and apply flow control; for gateway deployments see /resources/ai-gateways-llm-routing.
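One common way to bound the buffer is a fixed-size queue between the reader and the processor. The sketch below uses asyncio.Queue with maxsize, so the producer suspends when the consumer falls behind. The names pump, source, and handle are hypothetical, and error propagation between the two tasks is left out for brevity.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

_END = object()  # marks the end of the stream inside the queue


async def pump(
    source: AsyncIterator[str],
    handle: Callable[[str], Awaitable[None]],
    max_buffered: int = 32,
) -> int:
    """Relay chunks through a bounded queue; return the peak queue depth."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_buffered)
    peak = 0

    async def producer() -> None:
        nonlocal peak
        async for chunk in source:
            await queue.put(chunk)  # suspends while the queue is full
            peak = max(peak, queue.qsize())
        await queue.put(_END)

    async def consumer() -> None:
        while True:
            chunk = await queue.get()
            if chunk is _END:
                return
            await handle(chunk)

    await asyncio.gather(producer(), consumer())
    return peak
```

Because the producer stops reading while the queue is full, pressure propagates back to the socket instead of accumulating in process memory.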

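Cancellation / abort — abort the HTTP request (e.g., AbortController in the browser or Node.js; in Python, close the httpx streaming response or cancel the asyncio task that reads it) to stop generation early. Providers generally stop decoding once they see the disconnect, and billing typically covers only the tokens generated up to that point, though the exact cutoff is provider-specific. Ensure your agent loop handles a partial-response state cleanly.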
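The abort pattern can be sketched without a network call. Below, fake_stream is a hypothetical stand-in for a provider stream whose finally block represents closing the HTTP response, and consume_until is a hypothetical helper that stops as soon as a caller-supplied check rejects the text so far.

```python
import asyncio
from typing import AsyncGenerator, Callable, List, Tuple


async def fake_stream(tokens: List[str], closed: List[bool]) -> AsyncGenerator[str, None]:
    """Stand-in for a provider stream; records when it is closed."""
    try:
        for token in tokens:
            await asyncio.sleep(0)  # yield control like real network I/O
            yield token
    finally:
        closed.append(True)  # a real client would close the response here


async def consume_until(
    stream: AsyncGenerator[str, None],
    should_abort: Callable[[str], bool],
) -> Tuple[str, bool]:
    """Return (text_so_far, aborted), closing the stream in every case."""
    parts: List[str] = []
    aborted = False
    try:
        async for token in stream:
            parts.append(token)
            if should_abort("".join(parts)):
                aborted = True
                break
    finally:
        # Runs the generator's cleanup even when we stop early.
        await stream.aclose()
    return "".join(parts), aborted
```

The returned text is the partial-response state the agent loop has to handle: it may end mid-sentence or mid-tool-call.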

Error handling mid-stream — an error event or a dropped TCP connection mid-stream leaves your state machine with a partially assembled response. Track which content blocks received content_block_stop (Anthropic) or whether finish_reason was set (OpenAI) before treating the response as complete.

Token accounting — usage fields in streaming responses (OpenAI stream_options: {"include_usage": true}; Anthropic message_delta.usage) are cumulative, not per-chunk. Read the final value, not a running sum of chunk values.
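A small sketch of the difference, assuming usage snapshots have been pulled out of the stream as dicts (or None for chunks that carry no usage). final_usage is a hypothetical helper, and the token counts in the usage note are made up for illustration.

```python
from typing import Dict, Iterable, Optional


def final_usage(snapshots: Iterable[Optional[Dict[str, int]]]) -> Dict[str, int]:
    """Keep the latest value of each usage field instead of summing."""
    latest: Dict[str, int] = {}
    for usage in snapshots:
        if usage:
            latest.update(usage)  # cumulative counts: the last value wins
    return latest
```

For snapshots {"input_tokens": 12, "output_tokens": 1}, then {"output_tokens": 8}, then {"output_tokens": 21}, this returns 21 output tokens, whereas summing the snapshots would report 30.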

Proxying through a gateway — if you proxy streamed responses through an AI gateway or middleware, ensure the proxy flushes data: lines immediately rather than buffering the full response body. A buffering proxy negates all TTFT benefits. See /resources/ai-gateways-llm-routing for gateway selection criteria.
