Retries & backoff

This page covers local-step retries with fixed or exponential backoff and jitter, FatalError to opt out of retrying, the durable remote-step retry path (re-dispatch on a persisted wakeAt), retryable: false worker verdicts, and the in-memory timeoutMs + heartbeat liveness path.

Transient failures are the normal case for anything that crosses a network boundary, so every step is retryable. The mechanics differ between a local step — work that runs in-process — and a remote step dispatched to a worker, because the engine can re-run a local function in place but a remote step has to survive the worker (and even the orchestrator) disappearing mid-flight.

Local-step retries

ctx.step(name, fn, opts?) runs a unit of work locally and checkpoints its result. If fn throws, the engine runs it again, up to retries attempts in total (the first try counts as one), spacing the attempts according to the step's backoff configuration:

const quote = await ctx.step(
  'quote',
  () => this.pricing.fetch(order),
  { retries: 5, backoff: 'exp', backoffMs: 200, backoffMaxMs: 10_000, jitter: true },
);

The options are:

  • retries — maximum number of attempts before the step (and the run) fails. Defaults to 1 (a single try).
  • backoff — how the delay between attempts grows: 'fixed' keeps it constant, 'exp' doubles it each attempt.
  • backoffMs — the base delay in ms. Omit (or set to 0) to retry with no delay at all.
  • backoffMaxMs — an upper bound, so an exponential backoff doesn't grow without limit.
  • jitter — adds random jitter (50–100% of the computed delay) so a fleet of runs retrying the same downstream don't synchronize into a thundering herd.

With backoff: 'exp', backoffMs: 200 the delays before attempts 2, 3, 4… are 200ms, 400ms, 800ms… until they hit backoffMaxMs. With jitter: true each of those is then scaled to a random point in its top half, so two runs that failed at the same instant don't retry in lockstep. The same backoffDelay helper backs both the local loop and the durable remote-step retry below, so the two stay consistent.
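
The delay arithmetic described above can be sketched as a small pure function. This is a minimal illustration, not the library's actual backoffDelay: the name backoffDelaySketch, its parameter order, the injectable random source, and the choice to treat a missing backoff as 'fixed' are all assumptions made for the example.

```typescript
// Hypothetical stand-in for the backoffDelay helper; the real signature may differ.
type BackoffOpts = {
  backoff?: 'fixed' | 'exp';
  backoffMs?: number;
  backoffMaxMs?: number;
  jitter?: boolean;
};

// failedAttempt is 1 for the delay before attempt 2, 2 for the delay before attempt 3, ...
function backoffDelaySketch(
  failedAttempt: number,
  opts: BackoffOpts,
  random: () => number = Math.random,
): number {
  const base = opts.backoffMs ?? 0;
  if (base <= 0) return 0; // no backoffMs: retry immediately
  // Assumption: anything other than 'exp' behaves like 'fixed'.
  let delay = opts.backoff === 'exp' ? base * Math.pow(2, failedAttempt - 1) : base;
  if (opts.backoffMaxMs !== undefined) delay = Math.min(delay, opts.backoffMaxMs);
  // Jitter scales the delay to a random point between 50% and 100% of its value.
  if (opts.jitter) delay = delay * (0.5 + 0.5 * random());
  return Math.round(delay);
}

const opts: BackoffOpts = { backoff: 'exp', backoffMs: 200, backoffMaxMs: 10_000 };
const delays = [1, 2, 3, 4, 5, 6, 7].map((n) => backoffDelaySketch(n, opts));
// delays: 200, 400, 800, 1600, 3200, 6400, 10000 (the last one is capped by backoffMaxMs)
```

Passing the random source as a parameter keeps the sketch deterministic under test; production code would normally rely on the default.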

FatalError — never retried

Not every failure is worth retrying. A declined card or invalid input will fail the same way on every attempt, so burning the retry budget on it just delays the inevitable. Throw a FatalError to fail the run immediately, regardless of the step's retries:

import { FatalError } from '@dudousxd/nestjs-durable-core';

await ctx.step('charge', async () => {
  const res = await stripe.charge(order);
  if (res.declined) throw new FatalError('card declined', 'declined');
  return res;
});

The optional second argument is a machine-readable code (here 'declined') that ends up on the run's structured error. Use FatalError for deterministic business verdicts; let ordinary throws (a timeout, a 502, a dropped connection) flow through the retry path.
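
To make the difference between the two kinds of throw concrete, here is a deliberately simplified retry loop. It is a sketch under stated assumptions: FatalErrorSketch is a local stand-in for the library's FatalError, runWithRetries is a hypothetical name, and the loop is synchronous with no backoff and no checkpointing, unlike the real engine.

```typescript
// Local stand-in so the sketch runs on its own; real code imports FatalError from the library.
class FatalErrorSketch extends Error {
  readonly code?: string;
  constructor(message: string, code?: string) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype); // keep instanceof working on old targets
    this.name = 'FatalErrorSketch';
    this.code = code;
  }
}

// Hypothetical, simplified loop: retries counts total attempts, as in the options above.
function runWithRetries<T>(fn: () => T, retries: number): T {
  let lastError: unknown;
  for (let attempt = 1; attempt <= Math.max(1, retries); attempt++) {
    try {
      return fn();
    } catch (err) {
      if (err instanceof FatalErrorSketch) throw err; // deterministic verdict: stop immediately
      lastError = err; // ordinary throw: fall through to the next attempt
    }
  }
  throw lastError; // attempt budget exhausted
}

// An ordinary error is retried until the call succeeds (here on the third attempt).
let transientCalls = 0;
const quoteResult = runWithRetries(() => {
  transientCalls++;
  if (transientCalls < 3) throw new Error('502 from upstream');
  return 'ok';
}, 5);

// A fatal error ends the loop after a single attempt, whatever the budget.
let fatalCalls = 0;
let fatalCode: string | undefined;
try {
  runWithRetries(() => {
    fatalCalls++;
    throw new FatalErrorSketch('card declined', 'declined');
  }, 5);
} catch (err) {
  if (err instanceof FatalErrorSketch) fatalCode = err.code;
}
```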

Remote-step retries

A remote step declared with remoteStep and invoked with ctx.remote runs on a worker — possibly in another process or another language. There are two distinct paths for handling its failures, chosen by whether the step sets timeoutMs.

The durable path (no timeoutMs)

This is the default and the recommended one. When you call a remote step that has no timeoutMs, the engine dispatches the task, persists a pending checkpoint, and suspends the run durably — it is not held in memory awaiting the result. Whichever instance receives the worker's result resumes the run, so the worker pod (and even the dispatching orchestrator) can scale down or crash mid-step without losing the run or re-running completed work.

export const chargeCard = remoteStep({
  name: 'payments.charge-card',
  input: z.object({ orderId: z.string(), amountCents: z.number().int() }),
  output: z.object({ chargeId: z.string() }),
  retries: 4,
  backoff: 'exp',
  backoffMs: 500,
  backoffMaxMs: 30_000,
  jitter: true,
});

// in the workflow:
const charge = await ctx.remote(chargeCard, { orderId: order.id, amountCents: order.total });

When a worker reports a failed result for a durable step, the engine consults retries. If the attempt budget remains, it re-dispatches — but durably, not in a loop. The first time it sees the failed checkpoint it computes the next retry deadline as now + backoffDelay(attempt) and stamps it on the failed checkpoint's wakeAt, in clock-space, then suspends. Because that deadline is persisted on the checkpoint rather than living in a timer in memory, it is replay-stable (a resume recomputes the same decision) and crash-safe (a process that dies before the retry fires picks it back up when the timer poller sees the wakeAt come due). Once the deadline passes, the poller resumes the run, the call re-dispatches with an incremented attempt, and the cycle continues until the result lands or retries is exhausted.
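
The decision the engine makes each time it sees a failed checkpoint can be sketched as a function of the persisted state and the current time. Everything here except the wakeAt field name is an assumption for illustration: the checkpoint shape, the decideRetry name, and the in-memory mutation that stands in for a write to the durable store.

```typescript
// Hypothetical checkpoint shape; only wakeAt is named in the text above.
type FailedCheckpoint = { attempt: number; wakeAt?: number };

type RetryDecision =
  | { kind: 'fail' }
  | { kind: 'suspend'; wakeAt: number }
  | { kind: 'redispatch'; attempt: number };

function decideRetry(
  cp: FailedCheckpoint,
  retries: number,
  now: number,
  delayFor: (attempt: number) => number,
): RetryDecision {
  if (cp.attempt >= retries) return { kind: 'fail' }; // attempt budget exhausted
  // First sighting computes the deadline; every later replay reuses the stored one.
  const wakeAt = cp.wakeAt ?? now + delayFor(cp.attempt);
  cp.wakeAt = wakeAt; // stands in for persisting the deadline on the checkpoint
  if (now < wakeAt) return { kind: 'suspend', wakeAt };
  return { kind: 'redispatch', attempt: cp.attempt + 1 };
}

const checkpoint: FailedCheckpoint = { attempt: 1 };
const fixedDelay = () => 500;
const first = decideRetry(checkpoint, 4, 1_000, fixedDelay);  // suspend until 1500
const replay = decideRetry(checkpoint, 4, 1_200, fixedDelay); // resume before the deadline: same wakeAt
const due = decideRetry(checkpoint, 4, 1_500, fixedDelay);    // deadline reached: re-dispatch attempt 2
```

Because the deadline is read back from the checkpoint rather than recomputed from the current time, a replay at 1200 reaches the same decision as the original call at 1000, which is the replay-stability property described above.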

Opting out: retryable: false

A durable remote step retries on a failed worker result unless the worker marks the error as non-retryable. Some worker errors are deterministic verdicts (a declined card, a validation failure), and re-dispatching those just hammers the worker for the same answer. Set retryable: false on the StepError the worker returns and the engine surfaces it to the workflow immediately instead of retrying:

// inside the worker handler (TypeScript or, symmetrically, the Python SDK):
@DurableStep('payments.charge-card')
async charge(input: { orderId: string; amountCents: number }) {
  const res = await this.stripe.charge(input);
  if (res.declined) {
    // a deterministic verdict — don't make the engine retry it
    throw Object.assign(new Error('card declined'), { code: 'declined', retryable: false });
  }
  return { chargeId: res.id };
}

The default is retryable !== false, i.e. an ordinary error is retried; only an explicit retryable: false opts out. This is the worker-side counterpart of throwing FatalError in a local step.
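
The default is easy to get backwards, so here is the rule as a one-line predicate. The WorkerErrorSketch type and the shouldRetry name are hypothetical; only the code and retryable fields come from the text above.

```typescript
// Hypothetical shape of an error reported by a worker.
type WorkerErrorSketch = { message: string; code?: string; retryable?: boolean };

// Only an explicit `retryable: false` opts out; a missing flag counts as retryable.
function shouldRetry(err: WorkerErrorSketch): boolean {
  return err.retryable !== false;
}

const timeoutVerdict = shouldRetry({ message: 'upstream timeout' });
const declinedVerdict = shouldRetry({ message: 'card declined', code: 'declined', retryable: false });
```

Here timeoutVerdict is true because the flag is absent, and declinedVerdict is false because the worker opted out explicitly.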

The in-memory liveness path (timeoutMs)

Setting timeoutMs on a remote step opts it into a different path entirely. timeoutMs is a liveness window: if the worker produces neither a result nor a heartbeat within that many ms, the engine presumes it dead and fails the dispatch with a RemoteStepTimeout. Because that timeout is retryable, the engine then re-dispatches the step in place, up to retries. Each heartbeat the worker emits (via transport.onHeartbeat) rearms the window, so a long but healthy step that keeps beating stays alive well past timeoutMs.

export const renderVideo = remoteStep({
  name: 'media.render',
  input: z.object({ assetId: z.string() }),
  output: z.object({ url: z.string() }),
  timeoutMs: 60_000, // presume the worker dead after 60s of silence (no result, no heartbeat)
  retries: 3,
});
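
The way a heartbeat rearms the window can be sketched with a small class that tracks a deadline. This is an illustration only: LivenessWindow and its methods are hypothetical names, and time is passed in explicitly instead of being driven by real timers or by transport.onHeartbeat.

```typescript
// Hypothetical model of the liveness window; timestamps are in ms.
class LivenessWindow {
  private readonly timeoutMs: number;
  private deadline: number;

  constructor(timeoutMs: number, now: number) {
    this.timeoutMs = timeoutMs;
    this.deadline = now + timeoutMs; // the window opens at dispatch time
  }

  // Each heartbeat pushes the deadline a full timeoutMs into the future.
  heartbeat(now: number): void {
    this.deadline = now + this.timeoutMs;
  }

  // True once timeoutMs has elapsed with neither a result nor a heartbeat.
  isExpired(now: number): boolean {
    return now >= this.deadline;
  }
}

const renderWindow = new LivenessWindow(60_000, 0);
renderWindow.heartbeat(45_000); // the worker is still alive: the window now ends at 105s
const aliveAt90s = !renderWindow.isExpired(90_000); // still alive, well past the original 60s
const deadAt105s = renderWindow.isExpired(105_000); // 60s of silence since the last beat
```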

The crucial difference between the two paths: the durable path retries on a worker reporting a failure and suspends between attempts (crash-safe, not in memory); the liveness path retries on a worker going silent and awaits the result in memory between attempts. They also disagree on what a reported error means — the durable path retries it (subject to retryable), while a timeoutMs step only re-dispatches on a timeout, surfacing any reported error to the workflow on the first attempt. Reach for timeoutMs only when you genuinely need to detect and replace a stuck worker; otherwise leave it off and get the durable, crash-safe path.
