Durable Execution for Long-Running Agents

Guide · updated 2026-06-17

Vendor-neutral reference on durable execution: event logs, replay determinism, idempotency, retries, and human-in-the-loop pause/resume — plus a cross-vendor survey and tradeoffs guide for Temporal, Restate, DBOS, Inngest, Step Functions, Azure Durable Functions, Cloudflare Workflows, GCP Workflows, LangGraph, and OpenAI Agents SDK.

A long-running agent that cannot survive a crash, a restart, or a long wait for human approval is a liability in production. Durable execution is the programming model that solves this. This reference explains the core concepts vendor-neutrally, surveys the leading engines, and maps them to a "when to use which" decision guide.

Core concepts

Persisted event log / checkpoint

A durable execution engine records every meaningful step of a workflow — LLM call, tool invocation, timer, signal received — to a persistent log (or checkpoint store) before moving on. If the process crashes or is restarted, the engine replays the log to reconstruct in-memory state exactly where execution left off. No step that was recorded is re-executed against the real world; its stored result is injected instead.
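The record-then-replay loop described above can be sketched with a toy journal. This is a minimal illustration under stated assumptions, not any engine's actual implementation: the `Journal` class, the JSON-lines file format, and the step names are all hypothetical.

```python
import json
import os
import tempfile
from typing import Any, Callable


class Journal:
    """Append-only step log stored as JSON lines; a toy stand-in for an engine's event log."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.results: dict[str, Any] = {}
        # On startup, load every recorded step so replay can reuse its result.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    entry = json.loads(line)
                    self.results[entry["step"]] = entry["result"]

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        # Replay path: the step already ran, so inject the stored result.
        if name in self.results:
            return self.results[name]
        # Live path: run the step, then persist its result before moving on.
        result = fn()
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"step": name, "result": result}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.results[name] = result
        return result


def demo() -> tuple[str, list[str]]:
    """Crash after the first step, then recover in a 'new process' (a fresh Journal)."""
    executed: list[str] = []  # which steps actually touched the "real world"

    def fetch_order() -> dict:
        executed.append("fetch_order")
        return {"order_id": 42}

    def charge_card() -> str:
        executed.append("charge_card")
        return "charged"

    def workflow(journal: Journal, crash: bool = False) -> str:
        order = journal.step("fetch_order", fetch_order)
        if crash:
            raise RuntimeError("simulated crash")
        status = journal.step("charge_card", charge_card)
        return f"{status}:{order['order_id']}"

    path = os.path.join(tempfile.mkdtemp(), "journal.jsonl")
    try:
        workflow(Journal(path), crash=True)
    except RuntimeError:
        pass  # the first "process" died after recording fetch_order
    return workflow(Journal(path)), executed
```

In this sketch `fetch_order` runs only once even though the workflow function runs twice. Note the gap between running a step and persisting its result: a crash in that window causes the step to run again, which is why side-effecting steps also need idempotency.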

Determinism / replay constraints

For replay to be safe, workflow code (the orchestration logic) must be deterministic: given the same sequence of recorded inputs, it must produce the same sequence of scheduled commands. Workflow code therefore must not read the wall clock, draw random numbers, or call external APIs or LLMs directly. Those operations are non-deterministic and belong in a separate unit (called an activity, step, or handler depending on the engine) that runs outside the replay path: its result is recorded once, and on replay the engine returns the recorded result instead of running the unit again.
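A small sketch may make the constraint concrete. It assumes a hypothetical `ReplayContext` that records each scheduled command by position and, on replay, checks that the code schedules the same command the history holds. Real engines do something similar in spirit, but every name here is illustrative.

```python
import random
import time
from typing import Any, Callable, Optional


class NonDeterminismError(RuntimeError):
    """Raised when replayed code schedules a different command than the history recorded."""


class ReplayContext:
    def __init__(self, history: Optional[list[tuple[str, Any]]] = None) -> None:
        self.history = list(history or [])
        self.cursor = 0

    def activity(self, name: str, fn: Callable[[], Any]) -> Any:
        if self.cursor < len(self.history):
            # Replay: the command must match what was recorded at this position.
            recorded_name, recorded_result = self.history[self.cursor]
            if recorded_name != name:
                raise NonDeterminismError(
                    f"history has {recorded_name!r}, code scheduled {name!r}"
                )
            self.cursor += 1
            return recorded_result
        # First execution: run the non-deterministic unit and record its result.
        result = fn()
        self.history.append((name, result))
        self.cursor += 1
        return result


def good_workflow(ctx: ReplayContext) -> tuple[str, float]:
    # Clock and randomness are read inside activities, so replay sees recorded values.
    started = ctx.activity("now", time.time)
    roll = ctx.activity("roll", lambda: random.randint(1, 6))
    branch = "review" if roll > 3 else "auto_approve"
    ctx.activity(branch, lambda: f"{branch} done")
    return branch, started


def bad_workflow(ctx: ReplayContext) -> str:
    # Branches on an unrecorded random value: replay may schedule the other command.
    branch = "review" if random.random() > 0.5 else "auto_approve"
    return ctx.activity(branch, lambda: f"{branch} done")
```

Replaying `good_workflow` against its own history returns the same branch and timestamp every time. `bad_workflow` can diverge from its history, and the sketch surfaces that as a `NonDeterminismError` rather than silently running the wrong branch.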

Idempotency

Because a crashed workflow may retry a step that was already in-flight, side-effecting tool calls (writes to a database, calls to a payment API, sends to a message bus) must be idempotent. The standard pattern is to derive a stable idempotency key from the workflow run ID and step index — not from wall-clock time or a random UUID — and pass it to the downstream service so that a duplicate call is a no-op. See also /resources/reliable-tool-calling for tool-call reliability patterns.
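One way to derive such a key is to hash the run ID and step index. The sketch below assumes a downstream service that deduplicates on the key; `FakePaymentService` and its `charge` method are hypothetical stand-ins, not a real payment API.

```python
import hashlib


def idempotency_key(run_id: str, step_index: int) -> str:
    """Stable key: the same run and step always produce the same value."""
    material = f"{run_id}:{step_index}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()


class FakePaymentService:
    """Hypothetical downstream service that treats a repeated key as a no-op."""

    def __init__(self) -> None:
        self.charges: dict[str, dict] = {}

    def charge(self, key: str, amount_cents: int) -> dict:
        if key in self.charges:
            # Duplicate call: return the original receipt, charge nothing.
            return self.charges[key]
        receipt = {"charge_id": f"ch_{len(self.charges) + 1}", "amount_cents": amount_cents}
        self.charges[key] = receipt
        return receipt


def demo() -> tuple[dict, dict, int]:
    service = FakePaymentService()
    key = idempotency_key("run-7f3a", step_index=2)
    first = service.charge(key, 1999)
    retry = service.charge(key, 1999)  # the step is retried after a crash
    return first, retry, len(service.charges)
```

Because the key depends only on values that survive a restart, the retried call carries the same key as the original and the customer is charged once.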

Retries, timeouts, and backoff

The dedicated engines surveyed below provide per-step retry policies, typically covering maximum attempts, initial interval, backoff multiplier, and a list of non-retryable errors. The recommended distinction: transient failures (network timeout, HTTP 429 rate limit, 5xx) are retried with backoff; terminal failures (other 4xx responses such as 400 bad request, business-logic errors) escalate immediately. Unbounded retries mask bugs; always set a maximum. Default retry parameters vary by engine and change between releases, so check the current docs.
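The four policy fields map onto a short retry loop. The following is a sketch under assumptions: `RetryPolicy`, `TransientError`, `TerminalError`, and the default values are hypothetical and do not reflect any engine's defaults.

```python
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Worth retrying: timeouts, rate limits, 5xx responses."""


class TerminalError(Exception):
    """Not worth retrying: bad requests, business-rule violations."""


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 5
    initial_interval_s: float = 1.0
    backoff_multiplier: float = 2.0
    max_interval_s: float = 30.0
    non_retryable: tuple[type[BaseException], ...] = (TerminalError,)


def run_with_retry(
    fn: Callable[[], T],
    policy: RetryPolicy,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    interval = policy.initial_interval_s
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return fn()
        except policy.non_retryable:
            raise  # terminal failure: escalate immediately
        except Exception:
            if attempt == policy.max_attempts:
                raise  # bounded: give up after the last allowed attempt
            sleep(min(interval, policy.max_interval_s))
            interval *= policy.backoff_multiplier
    raise ValueError("max_attempts must be at least 1")


def demo() -> tuple[str, list[float]]:
    sleeps: list[float] = []
    attempts = {"n": 0}

    def flaky() -> str:
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise TransientError("upstream timeout")
        return "ok"

    # Inject a fake sleep so the demo records the delays instead of waiting.
    return run_with_retry(flaky, RetryPolicy(), sleep=sleeps.append), sleeps
```

The `sleep` parameter is injectable so the loop can be tested without real delays. Production retry loops usually add random jitter to each interval so that many failing callers do not retry in lockstep; that is omitted here for brevity.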

Durable human-in-the-loop (pause/resume)

A durable engine can suspend a workflow at any step and wait for an external signal (a human approval, a webhook, an event) for hours, days, or longer without holding a live process or consuming compute. Some managed services cap the wait: Step Functions Standard Workflows and GCP Workflows both document a one-year limit. When the signal arrives, the workflow resumes from exactly where it paused, with full state intact. This is architecturally distinct from a polling loop or a sleeping thread: the workflow is durably parked, not actively running.
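The difference between parking and sleeping can be shown with a toy store. This sketch assumes a single SQLite table as the durable state; the schema, function names, and status values are hypothetical, and a real engine would also handle timeouts and concurrent signals.

```python
import json
import os
import sqlite3
import tempfile

WAITING = "waiting_for_approval"


def open_store(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS runs ("
            "run_id TEXT PRIMARY KEY, status TEXT NOT NULL, state TEXT NOT NULL)"
        )
    return conn


def park_for_approval(conn: sqlite3.Connection, run_id: str, draft: str) -> None:
    """Persist everything needed to resume, then return; no thread keeps waiting."""
    with conn:
        conn.execute(
            "INSERT INTO runs (run_id, status, state) VALUES (?, ?, ?)",
            (run_id, WAITING, json.dumps({"draft": draft})),
        )


def deliver_signal(conn: sqlite3.Connection, run_id: str, approved: bool) -> dict:
    """Resume a parked run from its stored state; late duplicate signals change nothing."""
    row = conn.execute(
        "SELECT status, state FROM runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    if row is None:
        raise KeyError(run_id)
    status, state = row[0], json.loads(row[1])
    if status == WAITING:
        status = "published" if approved else "rejected"
        with conn:
            conn.execute("UPDATE runs SET status = ? WHERE run_id = ?", (status, run_id))
    return {"status": status, **state}


def demo() -> tuple[dict, dict]:
    path = os.path.join(tempfile.mkdtemp(), "runs.db")
    conn = open_store(path)
    park_for_approval(conn, "run-1", draft="Q3 report")
    conn.close()  # the process can exit here; nothing is running while the human decides

    conn = open_store(path)  # later, a different process receives the approval
    first = deliver_signal(conn, "run-1", approved=True)
    second = deliver_signal(conn, "run-1", approved=False)  # late duplicate
    conn.close()
    return first, second
```

Between the two halves of `demo` no process, thread, or timer exists for the run. The only thing that persists is a database row, which is what "durably parked" means in practice.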

Cross-vendor survey

Architectural models

Four broad models exist across current engines:

Model	Description	Examples
Event-sourcing / full replay	Workflow function is re-executed from start on every resume; each step is compared against the event history and short-circuited	Temporal, Azure Durable Functions
Durable-state-store / code-first	Journal or database records step outcomes; on recovery, completed steps return their stored results instead of re-running, and execution continues from the first unrecorded step	Restate, DBOS (Postgres-backed), Inngest (memoization)
Managed state machine	Workflow is a declarative YAML/JSON state machine; engine manages state externally	AWS Step Functions (ASL), GCP Workflows
Code-on-Durable-Objects	Worker code runs on a per-workflow Durable Object; steps are persisted as the object's durable state	Cloudflare Workflows

Temporal

Event-sourcing model. The Temporal service maintains a durable event history per workflow execution. On resume, the SDK re-executes the workflow function from the start, comparing each generated command against the history; recorded activity results are injected without re-running. Non-deterministic work (LLM calls, tool I/O) must be wrapped as Activities, while timers and the current time go through the SDK's workflow APIs so that they are recorded in history rather than read directly. Human-in-the-loop is implemented with Signals: a workflow blocks on a signal condition and consumes no compute until the signal arrives. Open-source server; self-hostable or managed via Temporal Cloud.

Azure Durable Functions (Durable Task Framework)

Event-sourcing model, closely analogous to Temporal. Orchestrator functions checkpoint progress at every await/yield, saving history to a durable storage backend (Azure Storage by default; MSSQL and other providers also supported). On resume, the framework replays the orchestrator from start and injects previously recorded activity results. Orchestrator code must be deterministic; side effects go in Activity functions. External events (waitForExternalEvent) implement human-in-the-loop pause/resume. Available as a first-party Azure Functions extension.

Restate

Durable-state-store / code-first model. Restate tracks execution in a per-invocation journal on its server. If a handler crashes, Restate replays the journal — skipping completed steps by returning their stored results — and resumes from the failure point. Idempotency is built in via idempotency-key headers; duplicate requests are deduplicated automatically. Human-in-the-loop is supported via durable promises / suspension: a handler suspends and is resumed when an external call provides the result. Open-source; self-hostable or managed via Restate Cloud.

DBOS

Durable-state-store model backed by Postgres. DBOS Transact is a library (Python, TypeScript) that annotates workflows and steps; execution state is stored in the application's own Postgres database. If a workflow is interrupted, it automatically resumes from the last completed step on restart. Designed to be added to an existing application without a separate orchestration server. Open-source library; also available as a managed cloud service.

Inngest

Durable-state-store / step-memoization model. Functions are broken into step.run() units; each step's result is persisted after it completes. On retry the function re-executes, but completed steps return their memoized results immediately rather than re-running. step.waitForEvent() implements human-in-the-loop: the function suspends until a matching external event arrives, consuming no compute while waiting. Managed service with open-source SDK.

AWS Step Functions

Managed state machine model. Workflows are defined in Amazon States Language (ASL), a JSON/YAML state machine definition. The Step Functions service manages all state externally; application code runs in Lambda or other compute only for task states. Standard Workflows are durable up to one year with an exactly-once execution model per state. No replay of application code: state is always held by the service, not reconstructed by replaying code. Integrates natively with most AWS services via optimized integrations. Human-in-the-loop via callback patterns with task tokens (.waitForTaskToken).
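The task-token callback pattern can be sketched as an ASL definition built from a Python dictionary. This assumes an SQS queue as the channel that carries the approval request to a reviewer; the queue URL, state names, and one-day timeout are hypothetical choices for illustration.

```python
import json


def approval_state_machine(queue_url: str) -> dict:
    """Build an ASL definition that pauses until a task token is returned."""
    return {
        "Comment": "Pause for human approval using the task-token callback pattern",
        "StartAt": "RequestApproval",
        "States": {
            "RequestApproval": {
                "Type": "Task",
                # The .waitForTaskToken suffix tells the service to pause this state
                # until the token is sent back by an external caller.
                "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
                "Parameters": {
                    "QueueUrl": queue_url,
                    "MessageBody": {
                        # $$.Task.Token reads the token from the context object.
                        "TaskToken.$": "$$.Task.Token",
                        "Request.$": "$",
                    },
                },
                "TimeoutSeconds": 86400,  # assumed limit: give the reviewer one day
                "Catch": [{"ErrorEquals": ["States.Timeout"], "Next": "Expired"}],
                "Next": "Approved",
            },
            "Approved": {"Type": "Succeed"},
            "Expired": {
                "Type": "Fail",
                "Error": "ApprovalTimeout",
                "Cause": "No decision arrived before the timeout",
            },
        },
    }


def demo() -> str:
    # Hypothetical queue URL; substitute a real one before deploying.
    url = "https://sqs.us-east-1.amazonaws.com/123456789012/approvals"
    return json.dumps(approval_state_machine(url), indent=2)
```

The reviewer's system reads the message, extracts the token, and completes the state by calling the Step Functions `SendTaskSuccess` or `SendTaskFailure` API with it. Until then the execution is held by the service and no application compute runs.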

GCP Workflows

Managed state machine model. Workflows are defined in YAML or JSON using Google's Workflows syntax. The service manages execution state; a workflow can hold state, retry, poll, or wait for up to one year as documented. Human-in-the-loop is supported via callback endpoints: the workflow pauses and waits for an external HTTP callback to resume it. Serverless; no charges while idle.

Cloudflare Workflows

Code-on-Durable-Objects model. Each workflow instance runs on a dedicated Cloudflare Durable Object; step state is persisted in the object's durable storage. Step primitives: step.do() (execute with automatic retry), step.sleep() / step.sleepUntil() (hibernates the object, so no compute is consumed during sleep), and step.waitForEvent() (suspends until an external event arrives). Tightly integrated with the Cloudflare Workers ecosystem. Generally available, per Cloudflare's GA announcement listed under Verified sources.

LangGraph checkpointers

Framework-level checkpoint mechanism within LangGraph (part of the LangChain ecosystem). A checkpointer persists graph state after every node execution to a configurable backend (in-memory, SQLite, Postgres, and others as documented). Execution is tracked by thread_id; resuming with the same thread_id reloads the last checkpoint. Human-in-the-loop is implemented via interrupt(): calling interrupt inside a node raises a GraphInterrupt, saves state, and surfaces the interrupt value to the caller. The graph resumes when the caller re-invokes it with a Command containing the human's response. This is a lighter-weight mechanism than a full durable execution engine: it provides fault tolerance and HITL within a single agent graph, not cross-process workflow orchestration.

OpenAI Agents SDK sessions

Session-persistence layer within the Agents SDK. A Session stores conversation history across agent runs to a configurable backend (SQLite, Redis, MongoDB, SQLAlchemy-compatible stores, OpenAI Conversations API, and others as documented). Before each run, the runner prepends session history to the input; after each run, new items are persisted. This is memory continuity, not durable execution: sessions do not guarantee transactional recovery from infrastructure failures mid-run. Suitable when the SDK's built-in session management is sufficient and full crash-resume semantics are not required.

When to use which

Situation	Recommended approach
Already on AWS; need durable multi-step agent workflows	AWS Step Functions (Standard Workflows) — native integration, zero extra infra
Already on Azure	Azure Durable Functions — first-party, event-sourcing model, supports long-running orchestrations
Already on GCP	GCP Workflows — managed state machine, callback-based HITL
Already on Cloudflare Workers	Cloudflare Workflows — co-located with edge compute, Durable Objects-backed
Need portable, code-first durability; want to self-host	Temporal (mature, large ecosystem), Restate (lighter footprint, suspension-native), or DBOS (Postgres-only dependency)
Managed serverless, code-first, TypeScript/JavaScript-first	Inngest — step memoization, managed infra, `waitForEvent` HITL
Already using LangGraph; need HITL and checkpoint-based fault tolerance within a graph	LangGraph checkpointers + `interrupt()` — no extra service needed
Using OpenAI Agents SDK; need memory continuity across sessions but not crash-resume semantics	Agents SDK Sessions — simplest path; add a dedicated engine if mid-run durability is required

Key tradeoffs to weigh:

Replay vs. state-store: event-sourcing/replay engines (Temporal, Azure DF) require strict workflow determinism; state-store engines (Restate, DBOS, Inngest) are often more forgiving but still disallow non-deterministic branching on replay.
Self-host vs. managed: Temporal, Restate, and DBOS are self-hostable, which adds operational burden unless you use their managed offerings; cloud-provider engines (Step Functions, Azure Durable Functions, GCP Workflows, Cloudflare Workflows) offload operations at the cost of platform lock-in.
Declarative vs. code-first: declarative state machines (Step Functions, GCP Workflows) offer fine-grained service integration but require workflow logic to fit a declarative model; code-first engines let you write ordinary functions with SDK annotations.
Ecosystem fit: LangGraph checkpointers and Agents SDK sessions add no external service dependency, making them the lowest friction entry point when you are already in those frameworks — but they are not substitutes for a full orchestration engine when cross-service durability or complex retry/compensation logic is needed.

For reliability patterns at the tool-call level, see /resources/reliable-tool-calling. For multi-agent orchestration patterns that interact with durable workflows, see /resources/multi-agent-orchestration-patterns. For observability inside long-running agent workflows, see /resources/agent-observability. For comparing the broader framework landscape, see /resources/agent-frameworks-compared. For cost and latency considerations in long agent loops, see /resources/agent-cost-latency-optimization. For guardrails on autonomous actions that durable workflows may take, see /resources/agent-guardrails.

Verified sources

Temporal — Workflow Execution overview: https://docs.temporal.io/workflow-execution
Temporal — Event History: https://docs.temporal.io/encyclopedia/event-history
Temporal — Workflow message passing (Signals): https://docs.temporal.io/encyclopedia/workflow-message-passing
Temporal — Workflow Definition (determinism): https://docs.temporal.io/workflow-definition
Temporal — Human-in-the-Loop tutorial: https://learn.temporal.io/tutorials/ai/building-durable-ai-applications/human-in-the-loop/
Restate — Durable Execution concepts: https://docs.restate.dev/concepts/durable_execution/
DBOS — DBOS Transact open-source library: https://www.dbos.dev/dbos-transact
Inngest — How functions are executed (step memoization): https://www.inngest.com/docs/learn/how-functions-are-executed
Inngest — step.waitForEvent reference: https://www.inngest.com/docs/reference/typescript/functions/step-wait-for-event
AWS Step Functions — What is Step Functions: https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html
AWS Step Functions — State machines: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-statemachines.html
Azure — Durable Orchestrations overview: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-orchestrations
Cloudflare — Workflows durable execution: https://developers.cloudflare.com/agents/api-reference/durable-execution/
Cloudflare — Workflows GA announcement: https://blog.cloudflare.com/workflows-ga-production-ready-durable-execution/
GCP Workflows — Overview: https://docs.cloud.google.com/workflows/docs/overview
LangGraph — Persistence and checkpointers: https://docs.langchain.com/oss/python/langgraph/persistence
LangGraph — Interrupts (human-in-the-loop): https://docs.langchain.com/oss/python/langgraph/interrupts
LangGraph — Durable execution: https://docs.langchain.com/oss/python/langgraph/durable-execution
OpenAI Agents SDK — Sessions: https://openai.github.io/openai-agents-python/sessions/
