Durability & replay

How checkpoint-and-replay makes a workflow survive crashes — and the one rule it imposes. The workflow body must be deterministic; all side effects live in steps.

Durability comes from checkpoint + deterministic replay — the same model DBOS and Temporal use. Understanding it is the one thing worth reading before you build.

The mechanism

Each step records its result in the store at a deterministic position (seq). Recovery works by re-running the workflow function from the top — but a step that already has a completed checkpoint returns its saved result instead of executing again:

run abc
  step[0] reserveStock  → execute → checkpoint {seq:0, completed, output}
  step[1] chargeCard     → execute → checkpoint {seq:1, completed, output}
  💥 crash
  --- engine restarts, resumes run abc ---
  step[0] reserveStock  → completed checkpoint → return saved output (NOT re-run)
  step[1] chargeCard     → completed checkpoint → return saved output (NOT re-run)
  step[2] ship           → no checkpoint → execute for real

Only a completed checkpoint short-circuits replay. A non-terminal checkpoint — a remote step's pending (dispatched, awaiting its worker) or a local step's in-flight running (see trackStepStart) — does not: the step is re-awaited (remote) or re-run (local). So a crash mid-step is safe — there's no half-finished result to honour.

You write ordinary linear code; the engine guarantees each step runs exactly once, logically, even across crashes and deploys.

The execution model: start enqueues, a worker runs the body

start does not run the workflow inline. engine.start(workflow, input, runId?, opts?) (and WorkflowService.start(...)) creates the run as 'pending' and returns immediately with { runId, status: 'pending' }. The caller — an HTTP handler, say — never blocks on workflow logic; the body is dispatched to a worker.

start('checkout', order)        → run created, status: 'pending', returns { runId } now
  (a worker picks the run up)   → status: 'running', body executes
  ctx.waitForSignal('approve')  → status: 'suspended'
  signal('approve')             → resumes → completes

'pending' is a new run status: created and enqueued, not yet picked up by a worker.
Where a run executes is a RunDispatcher. The default is in-process — the run executes on the same instance, asynchronously on a microtask — so a single-process app still runs workflows with zero setup; it just doesn't block the caller. The worker/API split below is opt-in for scaling.
Await the outcome with waitForRun when you need it inline: engine.waitForRun(runId) resolves once the run settles — a terminal state (completed/failed/cancelled/dead) or suspended.
```
const { runId } = await engine.start('checkout', order);
const result = await engine.waitForRun(runId); // resolves when the run settles
```

Execution & workers: API pods enqueue, driving pods run

For scale you can split an app into API pods (handle HTTP, the dashboard) and driving pods (execute workflows) sharing only the database:

An instance with drive: false (e.g. an API or dashboard pod) gets a no-op dispatcher: its start only enqueues, leaving the run pending for a driving instance to pick up. It offloads workflow execution to the driving pods over the database alone.
Driving instances poll engine.runPending() each tick (the NestJS TimerPoller does this automatically) to lease and run pending runs. The driving-side primitives are engine.runOne(runId) (lease + resume one run) and engine.runPending() (lease + run all pending runs); a broker-backed dispatcher can enqueue run ids for a driving pool that calls runOne.

The default (no explicit drive option) is drive: true and the in-process dispatcher — so single- process apps need none of this. Runs simply execute on the same instance, just asynchronously.

The one rule: workflows are deterministic

Because the body re-runs on recovery, it must be deterministic. All non-determinism — network calls, queries, Date.now(), randomness, IO — must live inside a step. The workflow body is just orchestration: call a step, use its result, decide the next.

// ✗ wrong — non-determinism in the workflow body
async run(ctx, order) {
  if (Math.random() < 0.1) { /* re-runs differently on replay */ }
  const now = Date.now(); // different on every replay
}

// ✓ right — side effects inside steps
async run(ctx, order) {
  const quote = await ctx.step('quote', () => this.pricing.fetch(order)); // checkpointed
  await ctx.step('charge', () => this.billing.charge(quote));
}

Consequences

Idempotency. The engine guarantees logical exactly-once, but if the process dies after a remote worker ran and before its checkpoint was written, the step may physically run twice. Steps receive a stable stepId — make handlers idempotent or dedupe on it.
Retries. Configure per step: ctx.step(name, fn, { retries: 3, backoff: 'exp' }). Throw a FatalError to stop retrying a business failure outright.
Self-healing recovery. The engine resumes every run a previous process left running (engine.recoverIncomplete(), wired automatically by DurableModule). This runs both on boot and periodically — the NestJS TimerPoller calls it every tick — so a run orphaned by a crashed worker is reclaimed within ~leaseMs, not only on the next deploy. While a run executes, its worker renews the recovery lease (every leaseMs/2), so a live worker keeps a long-running run while a crashed worker's lease still expires and another instance takes over.

Deploys, versioning & multiple instances

Because the engine is embedded in your app, deploys and replicas need a word.

The state survives the deploy. Checkpoints live in the database, so killing the old instance loses nothing. The new instance recovers running runs on boot — and any other instance reclaims orphaned runs within ~leaseMs via periodic recovery — while the timer poller resumes due sleeps. Suspended and pending runs don't even notice a deploy; a worker picks them up afterward.
Version-pinned replay (skew protection). Replay is positional — changing a workflow's body (reordering/inserting steps) while runs are in flight would corrupt them. So @Workflow carries a version, and a run resumes on the version it started on. Register the old and new versions side by side during a breaking change: in-flight runs drain on the old version, new runs start on the newest. A run whose version is no longer registered fails loudly rather than corrupting.
One instance per run (recovery lease). With several replicas, each would try to recover the same running runs. The engine takes an atomic, self-expiring lease on a run before resuming it (StateStore.tryLockRun), so a run is picked up by exactly one instance, and renews the lease while it runs (StateStore.renewRunLock) so a long-running run isn't reclaimed out from under a live worker. A crashed worker stops renewing, so its lease expires and another instance recovers the run. Set leaseMs above how long a single resume step runs (no longer your whole synchronous run, since the lease is renewed).
Graceful shutdown. On OnApplicationShutdown the engine drains: it stops picking up new runs and waits for in-flight ones to settle, then releases their leases so the next instance takes over immediately. Enable it with app.enableShutdownHooks().