# Streaming Responses for Agents

> Transport formats, provider event schemas, and practical concerns for consuming streamed LLM responses in production agents: SSE mechanics, OpenAI and Anthropic chunk formats, partial-JSON tool-call parsing, backpressure, cancellation, and gateway proxying.

Category: Guide · Updated: 2026-06-21 · Tags: streaming, sse, server-sent-events, openai, anthropic, gemini, tool-calling, latency, agents
Canonical: https://changegamer.ai/resources/streaming-for-agents

Streaming lets an agent start processing model output before the full response is complete. For agent builders the three payoffs are: (1) **time-to-first-token (TTFT)** — perceived latency drops because the pipeline can act on early output; (2) **early cancellation** — if the first few tokens reveal a hallucination or wrong tool, the request can be aborted before paying for the full generation; (3) **incremental parsing** — tool-call arguments and structured outputs arrive as partial JSON that can be validated and acted on progressively. See /resources/agent-cost-latency-optimization for the latency framing.

## Transport: Server-Sent Events (SSE)

All three major providers (OpenAI, Anthropic, Google Gemini) stream over HTTP using **Server-Sent Events** (SSE), the W3C/WHATWG standard for unidirectional server-to-client push over a plain HTTP connection. The wire format is `Content-Type: text/event-stream`; each event is one or more `data:` lines terminated by a blank line. Named events use an `event:` field before the `data:` field.
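The framing rules above can be sketched as a small parser. This is a minimal illustration, assuming the response body has already been decoded into text lines; `parse_sse` and `SSEEvent` are hypothetical names rather than part of any provider SDK, and the sketch ignores the `id:` and `retry:` fields.

```python
from typing import Iterable, Iterator, List, NamedTuple


class SSEEvent(NamedTuple):
    event: str  # "message" when the server sent no event: field
    data: str   # all data: lines of the event, joined with "\n"


def parse_sse(lines: Iterable[str]) -> Iterator[SSEEvent]:
    """Yield one SSEEvent per blank-line-terminated group of fields."""
    event_name = "message"
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if it carried data.
            if data_lines:
                yield SSEEvent(event_name, "\n".join(data_lines))
            event_name, data_lines = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, commonly used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # exactly one leading space is stripped
        if field == "event":
            event_name = value
        elif field == "data":
            data_lines.append(value)
```

Feeding the resulting `data` strings to a JSON decoder is left to the provider-specific layer, since the sentinel and event shapes differ per provider.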

SSE works over HTTP/1.1 (chunked transfer encoding) and HTTP/2 (a single stream). WebSockets are used for **bidirectional** real-time protocols (e.g., OpenAI Realtime API for voice); pure generation streaming uses SSE, not WebSockets.

## OpenAI Chat Completions streaming

Set `"stream": true` in the request body. The response is a sequence of `data:` SSE lines, each carrying a JSON object of type `chat.completion.chunk`. Each chunk has:

- `choices[].delta` — incremental content fragment. On the first chunk, `delta.role` is `"assistant"`. Subsequent chunks carry `delta.content` (text fragment) or `delta.tool_calls` (partial tool-call data).
- `choices[].finish_reason` — `null` during the stream; `"stop"`, `"tool_calls"`, or another terminal value on the final content chunk.

The stream ends with `data: [DONE]` — a sentinel that is not valid JSON and signals the consumer to close the connection.
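A minimal sketch of consuming these chunks, assuming the SSE layer has already extracted each `data:` payload as a string. `collect_openai_text` is a hypothetical helper, and it reads only the choice at index 0.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_openai_text(data_payloads: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Concatenate delta.content fragments and report the finish_reason seen."""
    text_parts = []
    finish_reason = None
    for payload in data_payloads:
        if payload.strip() == "[DONE]":
            break  # sentinel is not JSON, so check it before json.loads
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            if choice.get("index", 0) != 0:
                continue  # this sketch follows a single choice
            delta = choice.get("delta") or {}
            if delta.get("content"):
                text_parts.append(delta["content"])
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return "".join(text_parts), finish_reason
```

A `finish_reason` of `None` in the return value means the stream stopped before a terminal chunk arrived, so the text should be treated as partial.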

**Tool calls in OpenAI streaming**: `delta.tool_calls` is a list in which each entry carries an `index` identifying the tool call it belongs to, so fragments of parallel calls can be told apart. The first delta for a call includes `id`, `type: "function"`, and `function.name`. Subsequent deltas carry only `function.arguments` as a *partial JSON string fragment*. The consumer must concatenate the `function.arguments` fragments for each `index`, then parse each complete string as JSON after `finish_reason: "tool_calls"` is received. See /resources/reliable-tool-calling for schema-validation strategies on the parsed result.
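The accumulation step can be sketched as follows, assuming chunks have already been decoded into dictionaries and that each `tool_calls` entry carries an `index` field, as in the SDK's chunk type. `collect_tool_calls` is a hypothetical helper.

```python
import json
from typing import Any, Dict, Iterable, List


def collect_tool_calls(chunks: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge streamed tool-call deltas by index, then parse the arguments."""
    calls: Dict[int, Dict[str, Any]] = {}
    finished = False
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            for tc in delta.get("tool_calls") or []:
                slot = calls.setdefault(
                    tc["index"], {"id": None, "name": None, "arguments": ""}
                )
                if tc.get("id"):
                    slot["id"] = tc["id"]
                fn = tc.get("function") or {}
                if fn.get("name"):
                    slot["name"] = fn["name"]
                slot["arguments"] += fn.get("arguments") or ""
            if choice.get("finish_reason") == "tool_calls":
                finished = True
    if not finished:
        # Fragments may be incomplete JSON, so refuse to parse them.
        raise ValueError("stream ended before finish_reason 'tool_calls'")
    return [
        {"id": c["id"], "name": c["name"],
         "arguments": json.loads(c["arguments"] or "{}")}
        for _, c in sorted(calls.items())
    ]
```

Raising on a missing `finish_reason` is a design choice for this sketch: it keeps a truncated stream from being mistaken for a complete call.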

## Anthropic Messages streaming

Set `"stream": true` in the request body to `/v1/messages`. Events use both SSE `event:` name fields and a `type` field inside the JSON `data:` payload. The ordered event flow is:

1. `message_start` — contains a `Message` object with empty `content`.
2. For each content block: `content_block_start` → one or more `content_block_delta` events → `content_block_stop`. Each block has an `index` matching its position in the final message.
3. One or more `message_delta` events — top-level message metadata updates (e.g., cumulative `usage` token counts).
4. `message_stop` — stream is complete.

Additional `ping` events may appear anywhere. Error events can arrive mid-stream (e.g., `overloaded_error`); consumers must handle unknown event types gracefully.

**Delta types inside `content_block_delta`:**

- `text_delta` — `delta.text` carries a text fragment.
- `input_json_delta` — `delta.partial_json` carries a partial JSON string fragment for a `tool_use` block's `input` field. Accumulate fragments across deltas and parse the complete string at `content_block_stop`. Current models emit one complete key-value pair per emission, so gaps between events are normal.
- `thinking_delta` — reasoning tokens when extended thinking is enabled.
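The event flow and delta types above can be sketched as a small accumulator. This is a minimal illustration, assuming each event's `data:` payload has already been decoded into a dictionary; `collect_anthropic_blocks` is a hypothetical helper, and it deliberately skips `thinking_delta` content.

```python
import json
from typing import Any, Dict, Iterable, List


def collect_anthropic_blocks(events: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Rebuild content blocks from start/delta/stop events, keyed by index."""
    blocks: Dict[int, Dict[str, Any]] = {}
    json_buffers: Dict[int, str] = {}  # raw partial_json per block
    for ev in events:
        kind = ev.get("type")
        if kind == "content_block_start":
            blocks[ev["index"]] = dict(ev["content_block"])
            json_buffers[ev["index"]] = ""
        elif kind == "content_block_delta":
            idx, delta = ev["index"], ev["delta"]
            if delta["type"] == "text_delta":
                blocks[idx]["text"] = blocks[idx].get("text", "") + delta["text"]
            elif delta["type"] == "input_json_delta":
                json_buffers[idx] += delta["partial_json"]
            # other delta types (e.g. thinking_delta) are ignored here
        elif kind == "content_block_stop":
            idx = ev["index"]
            if blocks[idx].get("type") == "tool_use":
                # Parse only now, when the fragment buffer is complete.
                blocks[idx]["input"] = json.loads(json_buffers[idx] or "{}")
        elif kind == "error":
            raise RuntimeError(f"stream error: {ev.get('error')}")
        # ping, message_* and unknown event types fall through untouched
    return [blocks[i] for i in sorted(blocks)]
```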

## Google Gemini streaming

Use `streamGenerateContent` instead of `generateContent`. With the REST API add `?alt=sse` to receive SSE-formatted output. Each SSE `data:` event carries a complete `GenerateContentResponse` JSON object; incremental text arrives in `candidates[0].content.parts[0].text`. There is no separate `[DONE]` sentinel; the stream ends when the HTTP response body closes. Function calls arrive as `functionCall` parts that carry a `name` and a structured `args` object rather than a JSON string, so collect the parts as they arrive and dispatch once the stream has closed.
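A minimal sketch of collecting streamed text, assuming each `data:` payload is available as a string. `collect_gemini_text` is a hypothetical helper; it reads the first candidate only and, as a small generalization of the path above, concatenates every text part in a chunk rather than just `parts[0]`.

```python
import json
from typing import Iterable


def collect_gemini_text(data_payloads: Iterable[str]) -> str:
    """Concatenate text parts from a sequence of GenerateContentResponse chunks."""
    pieces = []
    for payload in data_payloads:
        response = json.loads(payload)
        # Follow the first candidate; a chunk may omit candidates entirely.
        for candidate in response.get("candidates", [])[:1]:
            for part in candidate.get("content", {}).get("parts", []):
                if "text" in part:
                    pieces.append(part["text"])
    # No sentinel to look for: the loop ends when the payloads run out.
    return "".join(pieces)
```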

## Streaming tool calls and structured outputs

With OpenAI and Anthropic, function-call arguments arrive as **partial JSON string fragments**. Two handling strategies:

- **Accumulate-then-parse** (simplest): collect all fragments into a buffer; parse the complete JSON string once the block or stream terminates. Safe for all schema shapes.
- **Streaming/partial JSON parser**: libraries such as `partial-json` (npm) or Pydantic's partial JSON parsing mode can deserialize incomplete JSON incrementally, enabling early field access before the stream ends. Useful for long structured outputs where downstream steps can act on early fields.
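To illustrate the second strategy without a third-party dependency, here is a best-effort sketch that closes any open string, array, or object in the buffer and then tries a normal parse. It is a simplified stand-in, not how the libraries above are implemented, and `try_parse_partial` is a hypothetical name. It returns `None` when the prefix cannot be repaired this way, for example when it ends on a dangling comma or a half-written key.

```python
import json
from typing import Any, List, Optional


def try_parse_partial(buffer: str) -> Optional[Any]:
    """Best-effort parse of a JSON prefix by appending the missing closers."""
    closers: List[str] = []  # stack of closers for currently open containers
    in_string = False
    escaped = False
    for ch in buffer:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()
    candidate = buffer + ('"' if in_string else "") + "".join(reversed(closers))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # not repairable yet; wait for more fragments
```

A string value that is still being streamed will appear truncated in the result, so early consumers should act only on fields they can treat as provisional, and re-validate once the full buffer has been parsed.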

For validation and schema-enforcement concerns once the full arguments are available, see /resources/reliable-tool-calling.

## Practical concerns for agent builders

**Backpressure and buffering** — if your consumer processes chunks slower than the provider emits them, buffers grow. Size-bound your buffer and apply flow control; for gateway deployments see /resources/ai-gateways-llm-routing.
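One way to bound the buffer is a fixed-size queue between the reader and the processor. The sketch below assumes an asyncio-based consumer; `pump`, `process`, and `max_buffered` are hypothetical names. When the queue is full the producer stops pulling from the source, which in a real client means it stops reading the socket and lets transport-level flow control push back.

```python
import asyncio
from typing import Any, AsyncIterator, Awaitable, Callable


async def pump(
    chunks: AsyncIterator[Any],
    process: Callable[[Any], Awaitable[None]],
    max_buffered: int = 8,
) -> int:
    """Relay chunks through a bounded queue; return the peak queue depth."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_buffered)
    done = object()  # end-of-stream marker
    peak = 0

    async def producer() -> None:
        nonlocal peak
        async for chunk in chunks:
            await queue.put(chunk)  # suspends while the queue is full
            peak = max(peak, queue.qsize())
        await queue.put(done)

    async def consumer() -> None:
        while True:
            item = await queue.get()
            if item is done:
                return
            await process(item)

    await asyncio.gather(producer(), consumer())
    return peak
```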

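**Cancellation / abort**: abort the HTTP request to stop generation early (e.g., `AbortController` in the browser or Node.js; in Python, close the `httpx` streaming response or cancel the task that is reading it). Providers generally stop decoding once the connection closes, and you typically pay only for tokens generated up to that point, though billing for aborted requests is worth confirming per provider. Ensure your agent loop handles a partial-response state cleanly.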
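A minimal sketch of the abort path, assuming the HTTP client exposes the stream as an iterator whose `close()` releases the underlying connection (a plain generator stands in for that here). `consume_until` and `should_abort` are hypothetical names.

```python
from typing import Callable, Iterator, Tuple


def consume_until(
    stream: Iterator[str], should_abort: Callable[[str], bool]
) -> Tuple[str, bool]:
    """Read fragments until the stream ends or should_abort(text) is true.

    Returns (text_so_far, aborted) so the caller can tell a partial
    response from a complete one.
    """
    text = ""
    try:
        for fragment in stream:
            text += fragment
            if should_abort(text):
                return text, True
        return text, False
    finally:
        # Always release the stream; for a generator this runs its cleanup.
        close = getattr(stream, "close", None)
        if close is not None:
            close()
```

Returning the `aborted` flag alongside the text keeps the partial-response state explicit, so the agent loop never has to infer it from the content.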

**Error handling mid-stream** — an error event or a dropped TCP connection mid-stream leaves your state machine with a partially assembled response. Track which content blocks received `content_block_stop` (Anthropic) or whether `finish_reason` was set (OpenAI) before treating the response as complete.
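The completeness checks described above can be sketched as two predicates over the events or chunks received so far, assuming they have been decoded into dictionaries. Both function names are hypothetical, and the Anthropic check additionally requires `message_stop`, the terminal event in that provider's flow.

```python
from typing import Any, Dict, Iterable


def anthropic_stream_complete(events: Iterable[Dict[str, Any]]) -> bool:
    """True only if every started block was stopped and message_stop arrived."""
    open_blocks = set()
    saw_message_stop = False
    for ev in events:
        kind = ev.get("type")
        if kind == "error":
            return False
        if kind == "content_block_start":
            open_blocks.add(ev["index"])
        elif kind == "content_block_stop":
            open_blocks.discard(ev["index"])
        elif kind == "message_stop":
            saw_message_stop = True
    return saw_message_stop and not open_blocks


def openai_stream_complete(chunks: Iterable[Dict[str, Any]]) -> bool:
    """True if any choice reported a non-null finish_reason."""
    return any(
        choice.get("finish_reason") is not None
        for chunk in chunks
        for choice in chunk.get("choices", [])
    )
```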

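**Token accounting**: streaming `usage` values are totals, not per-chunk increments. OpenAI sends `usage` only when `stream_options: {"include_usage": true}` is set, in a final chunk with an empty `choices` array just before `data: [DONE]`; Anthropic's `message_delta.usage` counts are cumulative. Read the final value, not a running sum of chunk values.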
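For the cumulative case, "last value wins" can be sketched as below, assuming Anthropic-style events decoded into dictionaries with an `output_tokens` count inside `usage`. `final_output_tokens` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, Optional


def final_output_tokens(events: Iterable[Dict[str, Any]]) -> Optional[int]:
    """Return the last cumulative output_tokens reported by message_delta."""
    latest = None
    for ev in events:
        if ev.get("type") == "message_delta" and "usage" in ev:
            # Overwrite rather than add: each value already includes
            # everything counted before it.
            latest = ev["usage"].get("output_tokens", latest)
    return latest
```

With two `message_delta` events reporting 5 and then 12 output tokens, the helper returns 12; summing them would overcount.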

**Proxying through a gateway** — if you proxy streamed responses through an AI gateway or middleware, ensure the proxy flushes `data:` lines immediately rather than buffering the full response body. A buffering proxy negates all TTFT benefits. See /resources/ai-gateways-llm-routing for gateway selection criteria.

## Verified sources

- WHATWG HTML Living Standard — Server-sent events: https://html.spec.whatwg.org/multipage/server-sent-events.html
- MDN Web Docs — Using server-sent events: https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events
- Anthropic Messages streaming (event types, input_json_delta, tool use): https://platform.claude.com/docs/en/build-with-claude/streaming
- OpenAI ChatCompletionChunk type (delta fields, tool_calls.function.arguments): https://github.com/openai/openai-python/blob/main/src/openai/types/chat/chat_completion_chunk.py
- Google Gemini streaming (streamGenerateContent, GenerateContentResponse, alt=sse): https://ai.google.dev/api/generate-content
- Google Gemini cookbook — Streaming REST quickstart: https://github.com/google-gemini/cookbook/blob/main/quickstarts/rest/Streaming_REST.ipynb