Durable Execution for Long-Running Agents

Guide · updated 2026-06-17 · Markdown variant

Vendor-neutral reference on durable execution: event logs, replay determinism, idempotency, retries, and human-in-the-loop pause/resume — plus a cross-vendor survey and tradeoffs guide for Temporal, Restate, DBOS, Inngest, Step Functions, Azure Durable Functions, Cloudflare Workflows, GCP Workflows, LangGraph, and OpenAI Agents SDK.


A long-running agent that cannot survive a crash, a restart, or a long wait for human approval is a liability in production. Durable execution is the programming model that solves this. This reference explains the core concepts vendor-neutrally, surveys the leading engines, and maps them to a "when to use which" decision guide.

Core concepts

Persisted event log / checkpoint

A durable execution engine records every meaningful step of a workflow — LLM call, tool invocation, timer, signal received — to a persistent log (or checkpoint store) before moving on. If the process crashes or is restarted, the engine replays the log to reconstruct in-memory state exactly where execution left off. No step that was recorded is re-executed against the real world; its stored result is injected instead.
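The record-then-replay behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not any engine's implementation: `Journal`, `Context`, `search`, and `summarize` are hypothetical names, and the in-memory list of JSON strings stands in for a real persistent store.

```python
import json
from typing import Any, Callable, Optional


class Journal:
    """Append-only log of step results; JSON strings mimic what would be on disk."""

    def __init__(self, entries: Optional[list] = None) -> None:
        self.entries = list(entries or [])

    def append(self, name: str, result: Any) -> None:
        self.entries.append(json.dumps({"step": name, "result": result}))

    def read(self, index: int) -> Optional[dict]:
        return json.loads(self.entries[index]) if index < len(self.entries) else None


class Context:
    """Runs steps in order, injecting recorded results before executing anything new."""

    def __init__(self, journal: Journal) -> None:
        self.journal = journal
        self.cursor = 0

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        recorded = self.journal.read(self.cursor)
        if recorded is not None:
            result = recorded["result"]        # replay: use the stored result
        else:
            result = fn()                      # first execution: touch the real world
            self.journal.append(name, result)  # record before moving on
        self.cursor += 1
        return result


calls = []  # tracks which steps really executed


def search() -> str:
    calls.append("search")
    return "3 results"


def summarize() -> str:
    calls.append("summarize")
    return "summary"


def workflow(ctx: Context) -> str:
    hits = ctx.step("search", search)
    return f"{hits} -> {ctx.step('summarize', summarize)}"


# First process records one step and then "crashes".
journal = Journal()
Context(journal).step("search", search)

# A new process reloads the log and runs the workflow from the top.
restored = Journal(journal.entries)
recovered = workflow(Context(restored))
```

After recovery, `calls` contains `search` once and `summarize` once: the recorded step was served from the journal rather than executed a second time.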

Determinism / replay constraints

For replay to be safe, workflow code (the orchestration logic) must be deterministic: given the same sequence of recorded inputs, it must produce the same sequence of scheduled commands. This means workflow code must not read wall-clock time, draw random numbers, or call external APIs or LLMs directly. Those operations are non-deterministic and belong in a separate unit (called an activity, step, or handler depending on the engine) that runs outside the replay path: its result is recorded once and served from the record on every later replay.
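To make the constraint concrete, the sketch below compares each scheduled step against the recorded history and raises when they diverge. All names (`ReplayContext`, `NonDeterminismError`, the two workflows) are hypothetical, and the random source is injected in the bad case only so the divergence can be shown reproducibly.

```python
import random
from typing import Any, Callable


class NonDeterminismError(RuntimeError):
    """Raised when replayed code schedules a step the history does not expect."""


class ReplayContext:
    def __init__(self, history: list) -> None:
        self.history = history  # recorded (step name, result) pairs
        self.cursor = 0

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if self.cursor < len(self.history):
            recorded_name, result = self.history[self.cursor]
            if recorded_name != name:
                raise NonDeterminismError(
                    f"history has {recorded_name!r}, code scheduled {name!r}"
                )
        else:
            result = fn()
            self.history.append((name, result))
        self.cursor += 1
        return result


def good_workflow(ctx: ReplayContext) -> str:
    # Randomness is drawn inside a recorded step, so replay sees the same value.
    coin = ctx.step("flip", lambda: random.random() < 0.5)
    branch = "path_a" if coin else "path_b"
    return ctx.step(branch, lambda: f"ran {branch}")


def bad_workflow(ctx: ReplayContext, rng: Callable[[], float]) -> str:
    # Randomness is drawn directly in orchestration code, so the branch may change.
    branch = "path_a" if rng() < 0.5 else "path_b"
    return ctx.step(branch, lambda: f"ran {branch}")


history = []
first = good_workflow(ReplayContext(history))
replayed = good_workflow(ReplayContext(history))  # same branch, nothing re-executed

bad_history = []
bad_workflow(ReplayContext(bad_history), rng=lambda: 0.1)  # records "path_a"
try:
    bad_workflow(ReplayContext(bad_history), rng=lambda: 0.9)  # schedules "path_b"
    diverged = False
except NonDeterminismError:
    diverged = True
```

Production engines perform a comparable check on replay; how they report the mismatch varies by engine, so treat the exception here as an illustration only.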

Idempotency

Because a crashed workflow may retry a step that was already in-flight, side-effecting tool calls (writes to a database, calls to a payment API, sends to a message bus) must be idempotent. The standard pattern is to derive a stable idempotency key from the workflow run ID and step index — not from wall-clock time or a random UUID — and pass it to the downstream service so that a duplicate call is a no-op. See also /resources/reliable-tool-calling for tool-call reliability patterns.
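A minimal sketch of the key-derivation pattern, assuming the downstream service deduplicates on a caller-supplied key. `PaymentService` is a hypothetical in-memory stand-in, and hashing the run ID and step index with SHA-256 is one reasonable choice rather than a required format.

```python
import hashlib


def idempotency_key(run_id: str, step_index: int) -> str:
    """Stable key: the same run and step always map to the same value."""
    return hashlib.sha256(f"{run_id}:{step_index}".encode()).hexdigest()


class PaymentService:
    """Hypothetical downstream service that deduplicates on the key."""

    def __init__(self) -> None:
        self.charges = {}

    def charge(self, key: str, amount_cents: int) -> dict:
        if key not in self.charges:  # first time this key is seen: perform the charge
            self.charges[key] = {
                "charge_id": f"ch_{len(self.charges) + 1}",
                "amount_cents": amount_cents,
            }
        return self.charges[key]  # duplicate call: return the original outcome


service = PaymentService()
key = idempotency_key("run-42", 3)

original = service.charge(key, 1999)
retried = service.charge(key, 1999)  # retry of the same in-flight step
next_step_key = idempotency_key("run-42", 4)
```

Because the key depends only on the run ID and step index, a retry after a crash produces the same key and the second call is a no-op. A key built from a fresh UUID or a timestamp would differ on retry and charge twice.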

Retries, timeouts, and backoff

All major engines provide per-step retry policies: maximum attempts, initial interval, backoff multiplier, and a non-retryable error list. The recommended distinction: transient failures (network timeout, rate limit, 5xx) are retried with backoff; terminal failures (a malformed request that will fail the same way every time, a business-logic error) escalate immediately. Unbounded retries mask bugs; always set a maximum. Check current docs for default retry parameters, which vary by engine and are subject to change.
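The four policy fields named above can be sketched as a small in-process helper. The class and function names are hypothetical and the default values are placeholders, not any engine's defaults. A real engine also persists the attempt count so it survives a crash, and many add jitter; this sketch does neither.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Worth retrying: timeout, rate limit, 5xx."""


class TerminalError(Exception):
    """Not worth retrying: the same input will fail the same way."""


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 5
    initial_interval: float = 1.0
    backoff_multiplier: float = 2.0
    non_retryable: Tuple[Type[BaseException], ...] = (TerminalError,)


def run_with_retry(
    fn: Callable[[], T],
    policy: RetryPolicy,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    delay = policy.initial_interval
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return fn()
        except policy.non_retryable:
            raise  # escalate immediately, no backoff
        except Exception:
            if attempt == policy.max_attempts:
                raise  # bounded: give up after the last attempt
            sleep(delay)
            delay *= policy.backoff_multiplier
    raise ValueError("max_attempts must be at least 1")


attempts = {"count": 0}
waits = []  # delays are captured instead of slept, to keep the example fast


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("rate limited")
    return "ok"


result = run_with_retry(flaky, RetryPolicy(), sleep=waits.append)
```

Here `flaky` fails twice and succeeds on the third attempt, after waits of 1.0 and 2.0 seconds. Passing `sleep` as a parameter is a design choice that keeps the helper testable without real delays.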

Durable human-in-the-loop (pause/resume)

A durable engine can suspend a workflow at any step and wait for an external signal — a human approval, a webhook, an event — for an arbitrarily long duration without holding a live process or consuming compute. When the signal arrives the workflow resumes from exactly where it paused, with full state intact. This is architecturally distinct from a polling loop or a sleeping thread: the workflow is durably parked, not actively running.
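The difference between parking and polling can be sketched as follows. `Store`, `start`, and `signal` are hypothetical names, and the dict of JSON strings stands in for a database. The point is that nothing runs between `start` returning and `signal` being called.

```python
import json


class Store:
    """Stand-in for a durable store; values are JSON strings, as they would be on disk."""

    def __init__(self) -> None:
        self.rows = {}

    def save(self, run_id: str, state: dict) -> None:
        self.rows[run_id] = json.dumps(state)

    def load(self, run_id: str) -> dict:
        return json.loads(self.rows[run_id])


def start(store: Store, run_id: str, draft: str) -> None:
    # Do the work up to the approval gate, persist, and return.
    # No thread sleeps and no loop polls while the workflow is parked.
    store.save(run_id, {"status": "waiting_for_approval", "draft": draft})


def signal(store: Store, run_id: str, approved: bool) -> dict:
    # Called whenever the human responds: minutes or months later.
    state = store.load(run_id)
    if state["status"] != "waiting_for_approval":
        return state  # duplicate or late signal: leave the outcome unchanged
    state["status"] = "sent" if approved else "rejected"
    store.save(run_id, state)
    return state


store = Store()
start(store, "run-7", draft="Refund of $40 approved?")
parked = store.load("run-7")

# Any process with access to the store can resume the run later.
resumed = signal(store, "run-7", approved=True)
duplicate = signal(store, "run-7", approved=False)
```

The second signal does not change the outcome, which mirrors the idempotency requirement for side-effecting steps.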

Cross-vendor survey

Architectural models

Four broad models exist across current engines:

| Model | Description | Examples |
| --- | --- | --- |
| Event-sourcing / full replay | Workflow function is re-executed from start on every resume; each step is compared against the event history and short-circuited | Temporal, Azure Durable Functions |
| Durable-state-store / code-first | Journal or database records each step's outcome; on recovery, completed steps return their stored results instead of running again | Restate, DBOS (Postgres-backed), Inngest (memoization) |
| Managed state machine | Workflow is a declarative YAML/JSON state machine; engine manages state externally | AWS Step Functions (ASL), GCP Workflows |
| Code-on-Durable-Objects | Worker code runs on a per-workflow Durable Object; steps are persisted as the object's durable state | Cloudflare Workflows |

Temporal

Event-sourcing model. The Temporal service maintains a durable event history per workflow execution. On resume, the SDK re-executes the workflow function from start, comparing each generated command against the history; recorded activity results are injected without re-running. Non-deterministic operations such as LLM calls and tool I/O must be wrapped as Activities; timers and randomness go through the SDK's replay-safe workflow APIs rather than the language's own clock or random-number generator. Human-in-the-loop is implemented with Signals: a workflow blocks on a signal condition while consuming no compute until the signal arrives. Open-source server; self-hostable or managed via Temporal Cloud.

Azure Durable Functions (Durable Task Framework)

Event-sourcing model, closely analogous to Temporal. Orchestrator functions checkpoint progress at every await/yield, saving history to a durable storage backend (Azure Storage by default; MSSQL and other providers also supported). On resume, the framework replays the orchestrator from start and injects previously recorded activity results. Orchestrator code must be deterministic; side effects go in Activity functions. External events (waitForExternalEvent) implement human-in-the-loop pause/resume. Available as a first-party Azure Functions extension.

Restate

Durable-state-store / code-first model. Restate tracks execution in a per-invocation journal on its server. If a handler crashes, Restate replays the journal — skipping completed steps by returning their stored results — and resumes from the failure point. Idempotency is built in via idempotency-key headers; duplicate requests are deduplicated automatically. Human-in-the-loop is supported via durable promises / suspension: a handler suspends and is resumed when an external call provides the result. Open-source; self-hostable or managed via Restate Cloud.

DBOS

Durable-state-store model backed by Postgres. DBOS Transact is a library (Python, TypeScript) that annotates workflows and steps; execution state is stored in the application's own Postgres database. If a workflow is interrupted, it automatically resumes from the last completed step on restart. Designed to be added to an existing application without a separate orchestration server. Open-source library; also available as a managed cloud service.

Inngest

Durable-state-store / step-memoization model. Functions are broken into step.run() units; each step's result is persisted after it completes. On retry the function re-executes, but completed steps return their memoized results immediately rather than re-running. step.waitForEvent() implements human-in-the-loop: the function suspends until a matching external event arrives, consuming no compute while waiting. Managed service with open-source SDK.

AWS Step Functions

Managed state machine model. Workflows are defined in Amazon States Language (ASL), a JSON-based state machine language. The Step Functions service manages all state externally; application code runs in Lambda or other compute only for task states. Standard Workflows can run for up to one year with exactly-once workflow execution. There is no replay of application code: state is always held by the service, not reconstructed by replaying code. The service integrates natively with many AWS services via optimized integrations. Human-in-the-loop uses the callback pattern with task tokens (.waitForTaskToken).
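As an illustration of the callback pattern, the sketch below builds a small ASL definition as a Python dict and serializes it to JSON. The state names, queue URL, account ID, input path, and timeout are all hypothetical placeholders. The `.waitForTaskToken` resource suffix and the `$$.Task.Token` context path are the parts that come from the ASL callback pattern itself.

```python
import json

approval_state_machine = {
    "Comment": "Sketch: pause for human approval via a task token",
    "StartAt": "RequestApproval",
    "States": {
        "RequestApproval": {
            "Type": "Task",
            # The .waitForTaskToken suffix pauses this state until a callback arrives.
            "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
            "Parameters": {
                # Hypothetical queue; replace with a real queue URL.
                "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/approvals",
                "MessageBody": {
                    # The token the approver must send back to resume the workflow.
                    "TaskToken.$": "$$.Task.Token",
                    "Request.$": "$.request",
                },
            },
            "TimeoutSeconds": 86400,  # hypothetical: fail if nobody answers in a day
            "Next": "Approved",
        },
        "Approved": {"Type": "Succeed"},
    },
}

definition = json.dumps(approval_state_machine, indent=2)
```

Whatever consumes the queue message resumes the execution by calling the Step Functions `SendTaskSuccess` or `SendTaskFailure` API with the token it received.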

GCP Workflows

Managed state machine model. Workflows are defined in YAML or JSON using Google's Workflows syntax. The service manages execution state; a workflow can hold state, retry, poll, or wait for up to one year as documented. Human-in-the-loop is supported via callback endpoints: the workflow pauses and waits for an external HTTP callback to resume it. Serverless; no charges while idle.

Cloudflare Workflows

Code-on-Durable-Objects model. Each workflow instance runs on a dedicated Cloudflare Durable Object; step state is persisted as the object's durable storage. Step primitives: step.do() (execute with automatic retry), step.sleep() / step.sleepUntil() (hibernates the object, so no compute is consumed during sleep), and step.waitForEvent() (suspends until an external event arrives). Tightly integrated with the Cloudflare Workers ecosystem. Cloudflare Workflows has reached general availability.

LangGraph checkpointers

Framework-level checkpoint mechanism within LangGraph (part of the LangChain ecosystem). A checkpointer persists graph state after every node execution to a configurable backend (in-memory, SQLite, Postgres, and others as documented). Execution is tracked by thread_id; resuming with the same thread_id reloads the last checkpoint. Human-in-the-loop is implemented via interrupt(): calling interrupt inside a node raises a GraphInterrupt, saves state, and surfaces the interrupt value to the caller; the graph resumes when the caller re-invokes with a Command containing the human's response. This is a lighter-weight mechanism than a full durable execution engine: it provides fault tolerance and HITL within a single agent graph, not cross-process workflow orchestration.

OpenAI Agents SDK sessions

Session-persistence layer within the Agents SDK. A Session stores conversation history across agent runs to a configurable backend (SQLite, Redis, MongoDB, SQLAlchemy-compatible stores, OpenAI Conversations API, and others as documented). Before each run, the runner prepends session history to the input; after each run, new items are persisted. This is memory continuity, not durable execution: sessions do not guarantee transactional recovery from infrastructure failures mid-run. Suitable when the SDK's built-in session management is sufficient and full crash-resume semantics are not required.

When to use which

| Situation | Recommended approach |
| --- | --- |
| Already on AWS; need durable multi-step agent workflows | AWS Step Functions (Standard Workflows): native integration, zero extra infra |
| Already on Azure | Azure Durable Functions: first-party, event-sourcing model, supports long-running orchestrations |
| Already on GCP | GCP Workflows: managed state machine, callback-based HITL |
| Already on Cloudflare Workers | Cloudflare Workflows: co-located with edge compute, Durable Objects-backed |
| Need portable, code-first durability; want to self-host | Temporal (mature, large ecosystem), Restate (lighter footprint, suspension-native), or DBOS (Postgres-only dependency) |
| Managed serverless, code-first, TypeScript/JavaScript-first | Inngest: step memoization, managed infra, waitForEvent HITL |
| Already using LangGraph; need HITL and checkpoint-based fault tolerance within a graph | LangGraph checkpointers + interrupt(): no extra service needed |
| Using OpenAI Agents SDK; need memory continuity across sessions but not crash-resume semantics | Agents SDK Sessions: simplest path; add a dedicated engine if mid-run durability is required |

Key tradeoffs to weigh:

For reliability patterns at the tool-call level, see /resources/reliable-tool-calling. For multi-agent orchestration patterns that interact with durable workflows, see /resources/multi-agent-orchestration-patterns. For observability inside long-running agent workflows, see /resources/agent-observability. For comparing the broader framework landscape, see /resources/agent-frameworks-compared. For cost and latency considerations in long agent loops, see /resources/agent-cost-latency-optimization. For guardrails on autonomous actions that durable workflows may take, see /resources/agent-guardrails.
