# Durable Execution for Long-Running Agents

> Vendor-neutral reference on durable execution: event logs, replay determinism, idempotency, retries, and human-in-the-loop pause/resume — plus a cross-vendor survey and tradeoffs guide for Temporal, Restate, DBOS, Inngest, Step Functions, Azure Durable Functions, Cloudflare Workflows, GCP Workflows, LangGraph, and OpenAI Agents SDK.

Category: Guide · Updated: 2026-06-17 · Tags: agents, durable-execution, workflows, reliability, idempotency, human-in-the-loop
Canonical: https://changegamer.ai/resources/durable-execution-for-agents

A long-running agent that cannot survive a crash, a restart, or a long wait for human approval is a liability in production. Durable execution is the programming model that solves this. This reference explains the core concepts vendor-neutrally, surveys the leading engines, and maps them to a "when to use which" decision guide.

## Core concepts

### Persisted event log / checkpoint

A durable execution engine records every meaningful step of a workflow — LLM call, tool invocation, timer, signal received — to a persistent log (or checkpoint store) before moving on. If the process crashes or is restarted, the engine replays the log to reconstruct in-memory state exactly where execution left off. No step that was recorded is re-executed against the real world; its stored result is injected instead.

### Determinism / replay constraints

For replay to be safe, *workflow code* (the orchestration logic) must be deterministic: given the same sequence of recorded inputs, it must produce the same sequence of scheduled commands. This means workflow code must not call wall-clock time, random-number generators, external APIs, or LLMs directly. Those calls are non-deterministic and belong in a separate unit — called an *activity*, *step*, or *handler* depending on the engine — that is executed outside the replay path, recorded once, and replayed from the record on retry.

### Idempotency

Because a crashed workflow may retry a step that was already in-flight, side-effecting tool calls (writes to a database, calls to a payment API, sends to a message bus) must be idempotent. The standard pattern is to derive a stable idempotency key from the workflow run ID and step index — not from wall-clock time or a random UUID — and pass it to the downstream service so that a duplicate call is a no-op. See also /resources/reliable-tool-calling for tool-call reliability patterns.

### Retries, timeouts, and backoff

All major engines provide per-step retry policies: maximum attempts, initial interval, backoff multiplier, and a non-retryable error list. The recommended distinction: *transient* failures (network timeout, rate limit, 5xx) are retried with backoff; *terminal* failures (4xx bad-request, business logic error) escalate immediately. Unbounded retries mask bugs; always set a maximum. Check current docs for default retry parameters, which vary by engine and are subject to change.

### Durable human-in-the-loop (pause/resume)

A durable engine can suspend a workflow at any step and wait for an external signal — a human approval, a webhook, an event — for an arbitrarily long duration without holding a live process or consuming compute. When the signal arrives the workflow resumes from exactly where it paused, with full state intact. This is architecturally distinct from a polling loop or a sleeping thread: the workflow is durably parked, not actively running.

## Cross-vendor survey

### Architectural models

Four broad models exist across current engines:

| Model | Description | Examples |
|---|---|---|
| Event-sourcing / full replay | Workflow function is re-executed from start on every resume; each step is compared against the event history and short-circuited | Temporal, Azure Durable Functions |
| Durable-state-store / code-first | Journal records step outcomes; replay injects stored results without full re-execution of orchestration logic | Restate, DBOS (Postgres-backed), Inngest (memoization) |
| Managed state machine | Workflow is a declarative YAML/JSON state machine; engine manages state externally | AWS Step Functions (ASL), GCP Workflows |
| Code-on-Durable-Objects | Worker code runs on a per-workflow Durable Object; steps are persisted as the object's durable state | Cloudflare Workflows |

### Temporal

Event-sourcing model. The Temporal service maintains a durable event history per workflow execution. On resume, the SDK re-executes the workflow function from start, comparing each generated command against the history; recorded activity results are injected without re-running. Non-deterministic operations (LLM calls, tool I/O, timers, randomness) must be wrapped as Activities. Human-in-the-loop is implemented with Signals: a workflow blocks on a signal condition while consuming no compute until the signal arrives. Open-source server; self-hostable or managed via Temporal Cloud.

### Azure Durable Functions (Durable Task Framework)

Event-sourcing model, closely analogous to Temporal. Orchestrator functions checkpoint progress at every `await`/`yield`, saving history to a durable storage backend (Azure Storage by default; MSSQL and other providers also supported). On resume, the framework replays the orchestrator from start and injects previously recorded activity results. Orchestrator code must be deterministic; side effects go in Activity functions. External events (`waitForExternalEvent`) implement human-in-the-loop pause/resume. Available as a first-party Azure Functions extension.

### Restate

Durable-state-store / code-first model. Restate tracks execution in a per-invocation journal on its server. If a handler crashes, Restate replays the journal — skipping completed steps by returning their stored results — and resumes from the failure point. Idempotency is built in via idempotency-key headers; duplicate requests are deduplicated automatically. Human-in-the-loop is supported via durable promises / suspension: a handler suspends and is resumed when an external call provides the result. Open-source; self-hostable or managed via Restate Cloud.

### DBOS

Durable-state-store model backed by Postgres. DBOS Transact is a library (Python, TypeScript) that annotates workflows and steps; execution state is stored in the application's own Postgres database. If a workflow is interrupted, it automatically resumes from the last completed step on restart. Designed to be added to an existing application without a separate orchestration server. Open-source library; also available as a managed cloud service.

### Inngest

Durable-state-store / step-memoization model. Functions are broken into `step.run()` units; each step's result is persisted after it completes. On retry the function re-executes, but completed steps return their memoized results immediately rather than re-running. `step.waitForEvent()` implements human-in-the-loop: the function suspends until a matching external event arrives, consuming no compute while waiting. Managed service with open-source SDK.

### AWS Step Functions

Managed state machine model. Workflows are defined in Amazon States Language (ASL), a JSON/YAML state machine definition. The Step Functions service manages all state externally; application code runs in Lambda or other compute only for task states. Standard Workflows are durable up to one year with an exactly-once execution model per state. No replay of application code: state is always held by the service, not reconstructed by replaying code. Integrates natively with most AWS services via optimized integrations. Human-in-the-loop via callback patterns with task tokens (`.waitForTaskToken`).

### GCP Workflows

Managed state machine model. Workflows are defined in YAML or JSON using Google's Workflows syntax. The service manages execution state; a workflow can hold state, retry, poll, or wait for up to one year as documented. Human-in-the-loop is supported via callback endpoints: the workflow pauses and waits for an external HTTP callback to resume it. Serverless; no charges while idle.

### Cloudflare Workflows

Code-on-Durable-Objects model. Each workflow instance runs on a dedicated Cloudflare Durable Object; step state is persisted as the object's durable storage. Step primitives: `step.do()` (execute with automatic retry), `step.sleep()` / `step.sleepUntil()` (hibernates the object — no compute consumed during sleep), and `step.waitForEvent()` (suspends until an external event arrives). Tightly integrated with the Cloudflare Workers ecosystem. Reached general availability.

### LangGraph checkpointers

Framework-level checkpoint mechanism within LangGraph (part of LangChain). A checkpointer persists graph state after every node execution to a configurable backend (in-memory, SQLite, Postgres, and others as documented). Execution is tracked by `thread_id`; resuming with the same `thread_id` reloads the last checkpoint. Human-in-the-loop is implemented via `interrupt()`: calling interrupt inside a node raises a `GraphInterrupt`, saves state, and surfaces the interrupt value to the caller; the graph resumes when the caller re-invokes with a `Command` containing the human's response. This is a lighter-weight mechanism than a full durable execution engine — it provides fault tolerance and HITL within a single agent graph, not cross-process workflow orchestration.

### OpenAI Agents SDK sessions

Session-persistence layer within the Agents SDK. A Session stores conversation history across agent runs to a configurable backend (SQLite, Redis, MongoDB, SQLAlchemy-compatible stores, OpenAI Conversations API, and others as documented). Before each run, the runner prepends session history to the input; after each run, new items are persisted. This is *memory continuity*, not durable execution: sessions do not guarantee transactional recovery from infrastructure failures mid-run. Suitable when the SDK's built-in session management is sufficient and full crash-resume semantics are not required.

## When to use which

| Situation | Recommended approach |
|---|---|
| Already on AWS; need durable multi-step agent workflows | AWS Step Functions (Standard Workflows) — native integration, zero extra infra |
| Already on Azure | Azure Durable Functions — first-party, event-sourcing model, supports long-running orchestrations |
| Already on GCP | GCP Workflows — managed state machine, callback-based HITL |
| Already on Cloudflare Workers | Cloudflare Workflows — co-located with edge compute, Durable Objects-backed |
| Need portable, code-first durability; want to self-host | Temporal (mature, large ecosystem), Restate (lighter footprint, suspension-native), or DBOS (Postgres-only dependency) |
| Managed serverless, code-first, TypeScript/JavaScript-first | Inngest — step memoization, managed infra, `waitForEvent` HITL |
| Already using LangGraph; need HITL and checkpoint-based fault tolerance within a graph | LangGraph checkpointers + `interrupt()` — no extra service needed |
| Using OpenAI Agents SDK; need memory continuity across sessions but not crash-resume semantics | Agents SDK Sessions — simplest path; add a dedicated engine if mid-run durability is required |

Key tradeoffs to weigh:

- **Replay vs. state-store:** event-sourcing/replay engines (Temporal, Azure DF) require strict workflow determinism; state-store engines (Restate, DBOS, Inngest) are often more forgiving but still disallow non-deterministic branching on replay.
- **Self-host vs. managed:** Temporal, Restate, and DBOS are self-hostable but add operational burden; cloud-native engines offload ops at the cost of platform lock-in.
- **Granularity:** declarative state machines (Step Functions, GCP) offer fine-grained service integration but require workflow logic to fit a declarative model; code-first engines let you write ordinary functions with SDK annotations.
- **Ecosystem fit:** LangGraph checkpointers and Agents SDK sessions add no external service dependency, making them the lowest friction entry point when you are already in those frameworks — but they are not substitutes for a full orchestration engine when cross-service durability or complex retry/compensation logic is needed.

For reliability patterns at the tool-call level, see /resources/reliable-tool-calling. For multi-agent orchestration patterns that interact with durable workflows, see /resources/multi-agent-orchestration-patterns. For observability inside long-running agent workflows, see /resources/agent-observability. For comparing the broader framework landscape, see /resources/agent-frameworks-compared. For cost and latency considerations in long agent loops, see /resources/agent-cost-latency-optimization. For guardrails on autonomous actions that durable workflows may take, see /resources/agent-guardrails.

## Verified sources

- Temporal — Workflow Execution overview: https://docs.temporal.io/workflow-execution
- Temporal — Event History: https://docs.temporal.io/encyclopedia/event-history
- Temporal — Workflow message passing (Signals): https://docs.temporal.io/encyclopedia/workflow-message-passing
- Temporal — Workflow Definition (determinism): https://docs.temporal.io/workflow-definition
- Temporal — Human-in-the-Loop tutorial: https://learn.temporal.io/tutorials/ai/building-durable-ai-applications/human-in-the-loop/
- Restate — Durable Execution concepts: https://docs.restate.dev/concepts/durable_execution/
- DBOS — DBOS Transact open-source library: https://www.dbos.dev/dbos-transact
- Inngest — How functions are executed (step memoization): https://www.inngest.com/docs/learn/how-functions-are-executed
- Inngest — step.waitForEvent reference: https://www.inngest.com/docs/reference/typescript/functions/step-wait-for-event
- AWS Step Functions — What is Step Functions: https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html
- AWS Step Functions — State machines: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-statemachines.html
- Azure — Durable Orchestrations overview: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-orchestrations
- Cloudflare — Workflows durable execution: https://developers.cloudflare.com/agents/api-reference/durable-execution/
- Cloudflare — Workflows GA announcement: https://blog.cloudflare.com/workflows-ga-production-ready-durable-execution/
- GCP Workflows — Overview: https://docs.cloud.google.com/workflows/docs/overview
- LangGraph — Persistence and checkpointers: https://docs.langchain.com/oss/python/langgraph/persistence
- LangGraph — Interrupts (human-in-the-loop): https://docs.langchain.com/oss/python/langgraph/interrupts
- LangGraph — Durable execution: https://docs.langchain.com/oss/python/langgraph/durable-execution
- OpenAI Agents SDK — Sessions: https://openai.github.io/openai-agents-python/sessions/
