Aviary
Concepts

Durability & replay

How checkpoint-and-replay makes a workflow survive crashes — and the one rule it imposes. The workflow body must be deterministic; all side effects live in steps.

Durability comes from checkpoint + deterministic replay — the same model DBOS and Temporal use. Understanding it is the one thing worth reading before you build.

The mechanism

Each step records its result in the store at a deterministic position (seq). Recovery works by re-running the workflow function from the top — but a step that already has a completed checkpoint returns its saved result instead of executing again:

run abc
  step[0] reserveStock  → execute → checkpoint {seq:0, completed, output}
  step[1] chargeCard     → execute → checkpoint {seq:1, completed, output}
  💥 crash
  --- engine restarts, resumes run abc ---
  step[0] reserveStock  → completed checkpoint → return saved output (NOT re-run)
  step[1] chargeCard     → completed checkpoint → return saved output (NOT re-run)
  step[2] ship           → no checkpoint → execute for real

Only a completed checkpoint short-circuits replay. A non-terminal checkpoint — a remote step's pending (dispatched, awaiting its worker) or a local step's in-flight running (see trackStepStart) — does not: the step is re-awaited (remote) or re-run (local). So a crash mid-step is safe — there's no half-finished result to honour.

You write ordinary linear code; the engine guarantees each step runs exactly once, logically, even across crashes and deploys.

The execution model: start enqueues, a worker runs the body

start does not run the workflow inline. engine.start(workflow, input, runId?, opts?) (and WorkflowService.start(...)) creates the run as 'pending' and returns immediately with { runId, status: 'pending' }. The caller — an HTTP handler, say — never blocks on workflow logic; the body is dispatched to a worker.

start('checkout', order)        → run created, status: 'pending', returns { runId } now
  (a worker picks the run up)   → status: 'running', body executes
  ctx.waitForSignal('approve')  → status: 'suspended'
  signal('approve')             → resumes → completes
  • 'pending' is a new run status: created and enqueued, not yet picked up by a worker.

  • Where a run executes is a RunDispatcher. The default is in-process — the run executes on the same instance, asynchronously on a microtask — so a single-process app still runs workflows with zero setup; it just doesn't block the caller. The worker/API split below is opt-in for scaling.

  • Await the outcome with waitForRun when you need it inline: engine.waitForRun(runId) resolves once the run settles — a terminal state (completed/failed/cancelled/dead) or suspended.

    const { runId } = await engine.start('checkout', order);
    const result = await engine.waitForRun(runId); // resolves when the run settles

Execution & workers: API pods enqueue, driving pods run

For scale you can split an app into API pods (handle HTTP, the dashboard) and driving pods (execute workflows) sharing only the database:

  • An instance with drive: false (e.g. an API or dashboard pod) gets a no-op dispatcher: its start only enqueues, leaving the run pending for a driving instance to pick up. It offloads workflow execution to the driving pods over the database alone.
  • Driving instances poll engine.runPending() each tick (the NestJS TimerPoller does this automatically) to lease and run pending runs. The driving-side primitives are engine.runOne(runId) (lease + resume one run) and engine.runPending() (lease + run all pending runs); a broker-backed dispatcher can enqueue run ids for a driving pool that calls runOne.

The default (no explicit drive option) is drive: true and the in-process dispatcher — so single- process apps need none of this. Runs simply execute on the same instance, just asynchronously.

The one rule: workflows are deterministic

Because the body re-runs on recovery, it must be deterministic. All non-determinism — network calls, queries, Date.now(), randomness, IO — must live inside a step. The workflow body is just orchestration: call a step, use its result, decide the next.

// ✗ wrong — non-determinism in the workflow body
async run(ctx, order) {
  if (Math.random() < 0.1) { /* re-runs differently on replay */ }
  const now = Date.now(); // different on every replay
}

// ✓ right — side effects inside steps
async run(ctx, order) {
  const quote = await ctx.step('quote', () => this.pricing.fetch(order)); // checkpointed
  await ctx.step('charge', () => this.billing.charge(quote));
}

Consequences

  • Idempotency. The engine guarantees logical exactly-once, but if the process dies after a remote worker ran and before its checkpoint was written, the step may physically run twice. Steps receive a stable stepId — make handlers idempotent or dedupe on it.
  • Retries. Configure per step: ctx.step(name, fn, { retries: 3, backoff: 'exp' }). Throw a FatalError to stop retrying a business failure outright.
  • Self-healing recovery. The engine resumes every run a previous process left running (engine.recoverIncomplete(), wired automatically by DurableModule). This runs both on boot and periodically — the NestJS TimerPoller calls it every tick — so a run orphaned by a crashed worker is reclaimed within ~leaseMs, not only on the next deploy. While a run executes, its worker renews the recovery lease (every leaseMs/2), so a live worker keeps a long-running run while a crashed worker's lease still expires and another instance takes over.

Deploys, versioning & multiple instances

Because the engine is embedded in your app, deploys and replicas need a word.

  • The state survives the deploy. Checkpoints live in the database, so killing the old instance loses nothing. The new instance recovers running runs on boot — and any other instance reclaims orphaned runs within ~leaseMs via periodic recovery — while the timer poller resumes due sleeps. Suspended and pending runs don't even notice a deploy; a worker picks them up afterward.
  • Version-pinned replay (skew protection). Replay is positional — changing a workflow's body (reordering/inserting steps) while runs are in flight would corrupt them. So @Workflow carries a version, and a run resumes on the version it started on. Register the old and new versions side by side during a breaking change: in-flight runs drain on the old version, new runs start on the newest. A run whose version is no longer registered fails loudly rather than corrupting.
  • One instance per run (recovery lease). With several replicas, each would try to recover the same running runs. The engine takes an atomic, self-expiring lease on a run before resuming it (StateStore.tryLockRun), so a run is picked up by exactly one instance, and renews the lease while it runs (StateStore.renewRunLock) so a long-running run isn't reclaimed out from under a live worker. A crashed worker stops renewing, so its lease expires and another instance recovers the run. Set leaseMs above how long a single resume step runs (no longer your whole synchronous run, since the lease is renewed).
  • Graceful shutdown. On OnApplicationShutdown the engine drains: it stops picking up new runs and waits for in-flight ones to settle, then releases their leases so the next instance takes over immediately. Enable it with app.enableShutdownHooks().

On this page