Durability & replay
How checkpoint-and-replay makes a workflow survive crashes — and the one rule it imposes. The workflow body must be deterministic; all side effects live in steps.
Durability comes from checkpoint + deterministic replay — the same model DBOS and Temporal use. Understanding it is the one thing worth reading before you build.
The mechanism
Each step records its result in the store at a deterministic position (seq). Recovery works by
re-running the workflow function from the top — but a step that already has a completed
checkpoint returns its saved result instead of executing again:
run abc
step[0] reserveStock → execute → checkpoint {seq:0, completed, output}
step[1] chargeCard → execute → checkpoint {seq:1, completed, output}
💥 crash
--- engine restarts, resumes run abc ---
step[0] reserveStock → completed checkpoint → return saved output (NOT re-run)
step[1] chargeCard → completed checkpoint → return saved output (NOT re-run)
step[2] ship → no checkpoint → execute for realOnly a completed checkpoint short-circuits replay. A non-terminal checkpoint —
a remote step's pending (dispatched, awaiting its worker) or a local step's in-flight running
(see trackStepStart) — does not: the step is re-awaited
(remote) or re-run (local). So a crash mid-step is safe — there's no half-finished result to honour.
You write ordinary linear code; the engine guarantees each step runs exactly once, logically, even across crashes and deploys.
The execution model: start enqueues, a worker runs the body
start does not run the workflow inline. engine.start(workflow, input, runId?, opts?) (and
WorkflowService.start(...)) creates the run as 'pending' and returns immediately with
{ runId, status: 'pending' }. The caller — an HTTP handler, say — never blocks on workflow logic;
the body is dispatched to a worker.
start('checkout', order) → run created, status: 'pending', returns { runId } now
(a worker picks the run up) → status: 'running', body executes
ctx.waitForSignal('approve') → status: 'suspended'
signal('approve') → resumes → completes-
'pending'is a new run status: created and enqueued, not yet picked up by a worker. -
Where a run executes is a
RunDispatcher. The default is in-process — the run executes on the same instance, asynchronously on a microtask — so a single-process app still runs workflows with zero setup; it just doesn't block the caller. The worker/API split below is opt-in for scaling. -
Await the outcome with
waitForRunwhen you need it inline:engine.waitForRun(runId)resolves once the run settles — a terminal state (completed/failed/cancelled/dead) orsuspended.const { runId } = await engine.start('checkout', order); const result = await engine.waitForRun(runId); // resolves when the run settles
Execution & workers: API pods enqueue, driving pods run
For scale you can split an app into API pods (handle HTTP, the dashboard) and driving pods (execute workflows) sharing only the database:
- An instance with
drive: false(e.g. an API or dashboard pod) gets a no-op dispatcher: itsstartonly enqueues, leaving the runpendingfor a driving instance to pick up. It offloads workflow execution to the driving pods over the database alone. - Driving instances poll
engine.runPending()each tick (the NestJSTimerPollerdoes this automatically) to lease and run pending runs. The driving-side primitives areengine.runOne(runId)(lease + resume one run) andengine.runPending()(lease + run all pending runs); a broker-backed dispatcher can enqueue run ids for a driving pool that callsrunOne.
The default (no explicit drive option) is drive: true and the in-process dispatcher — so single-
process apps need none of this. Runs simply execute on the same instance, just asynchronously.
The one rule: workflows are deterministic
Because the body re-runs on recovery, it must be deterministic. All non-determinism — network
calls, queries, Date.now(), randomness, IO — must live inside a step. The workflow body is
just orchestration: call a step, use its result, decide the next.
// ✗ wrong — non-determinism in the workflow body
async run(ctx, order) {
if (Math.random() < 0.1) { /* re-runs differently on replay */ }
const now = Date.now(); // different on every replay
}
// ✓ right — side effects inside steps
async run(ctx, order) {
const quote = await ctx.step('quote', () => this.pricing.fetch(order)); // checkpointed
await ctx.step('charge', () => this.billing.charge(quote));
}Consequences
- Idempotency. The engine guarantees logical exactly-once, but if the process dies after a
remote worker ran and before its checkpoint was written, the step may physically run twice.
Steps receive a stable
stepId— make handlers idempotent or dedupe on it. - Retries. Configure per step:
ctx.step(name, fn, { retries: 3, backoff: 'exp' }). Throw aFatalErrorto stop retrying a business failure outright. - Self-healing recovery. The engine resumes every run a previous process left
running(engine.recoverIncomplete(), wired automatically byDurableModule). This runs both on boot and periodically — the NestJSTimerPollercalls it every tick — so a run orphaned by a crashed worker is reclaimed within ~leaseMs, not only on the next deploy. While a run executes, its worker renews the recovery lease (everyleaseMs/2), so a live worker keeps a long-running run while a crashed worker's lease still expires and another instance takes over.
Deploys, versioning & multiple instances
Because the engine is embedded in your app, deploys and replicas need a word.
- The state survives the deploy. Checkpoints live in the database, so killing the old instance
loses nothing. The new instance recovers
runningruns on boot — and any other instance reclaims orphaned runs within ~leaseMsvia periodic recovery — while the timer poller resumes due sleeps. Suspended andpendingruns don't even notice a deploy; a worker picks them up afterward. - Version-pinned replay (skew protection). Replay is positional — changing a workflow's body
(reordering/inserting steps) while runs are in flight would corrupt them. So
@Workflowcarries aversion, and a run resumes on the version it started on. Register the old and new versions side by side during a breaking change: in-flight runs drain on the old version, new runs start on the newest. A run whose version is no longer registered fails loudly rather than corrupting. - One instance per run (recovery lease). With several replicas, each would try to recover the
same
runningruns. The engine takes an atomic, self-expiring lease on a run before resuming it (StateStore.tryLockRun), so a run is picked up by exactly one instance, and renews the lease while it runs (StateStore.renewRunLock) so a long-running run isn't reclaimed out from under a live worker. A crashed worker stops renewing, so its lease expires and another instance recovers the run. SetleaseMsabove how long a single resume step runs (no longer your whole synchronous run, since the lease is renewed). - Graceful shutdown. On
OnApplicationShutdownthe engine drains: it stops picking up new runs and waits for in-flight ones to settle, then releases their leases so the next instance takes over immediately. Enable it withapp.enableShutdownHooks().
Getting Started
Run your first durable workflow in an existing NestJS app — install the module, write a workflow, register it, and start a run. Zero infrastructure with the event-emitter transport.
Workflows & steps
Declaring workflows with @Workflow, local steps with ctx.step, typed remote steps with remoteStep + ctx.remote and @DurableStep handlers, retries, fan-out and fatal errors.