Dead-letter queue

Cap crash-recovery with maxRecoveryAttempts so a poison-pill run moves to the terminal dead status instead of crash-looping forever. Then route dead runs to alert, compensate, or queue for review, either with engine.onDead or with a DLQ workflow: an inline @DeadLetter() method, a per-workflow deadLetterWorkflow reference, or the module-level default.

Most failures are transient and recover cleanly. But occasionally a run is a poison pill: it crashes the process every time recovery picks it up — a deserialization bug, a non-deterministic change, an infinite loop in a step. Left alone, crash-recovery would resume it, crash, resume it again on the next boot, and loop forever, taking the whole instance down with it each time. The dead-letter queue breaks that loop.

Capping recovery: `maxRecoveryAttempts`

Crash-recovery is self-healing and periodic. While a run executes, its worker renews a recovery lease; if the worker crashes, the lease expires and recovery (engine.recoverIncomplete(), called on every tick by the TimerPoller) reclaims the orphaned run within roughly leaseMs, on any instance and not just on the next boot. That is what makes a genuine poison pill visible: it gets picked up again and again.

Every time crash-recovery picks up a still-running run, the engine increments that run's recoveryAttempts before resuming it, so a crash mid-resume still advances the counter. Set maxRecoveryAttempts (an engine/module option) and, once a run's recoveryAttempts exceeds it, the engine no longer resumes the run: it moves the run to the terminal dead status with a max_recovery_attempts error, releases its lease, and stops touching it:

DurableModule.forRoot({
  store,
  transport,
  maxRecoveryAttempts: 5, // after 5 crash-recoveries, dead-letter the run instead of looping
});

Omit maxRecoveryAttempts for unlimited recovery (the default). A dead run is terminal but not lost: it stays fully inspectable in the dashboard — its history, its checkpoints, its error — and it can be retried from there once you've shipped a fix. The point is that one poison pill no longer keeps the process crashing for every other run.
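The counting rule can be modeled in a few lines. This is a minimal sketch, not the library's implementation: `RunRecord` and `pickUpForRecovery` are hypothetical names, and the sketch assumes that "exceeds" means strictly greater than the cap. Only recoveryAttempts, maxRecoveryAttempts, the dead status, and the max_recovery_attempts error come from the text above.

```typescript
// Illustrative model only. RunRecord and pickUpForRecovery are hypothetical names.
type RunStatus = 'running' | 'dead';

interface RunRecord {
  id: string;
  status: RunStatus;
  recoveryAttempts: number;
  error?: { message: string };
}

// Called each time recovery reclaims an orphaned, still-running run.
// Returns true if the run should be resumed, false if it was dead-lettered.
function pickUpForRecovery(run: RunRecord, maxRecoveryAttempts?: number): boolean {
  // Increment first, so a crash during the resume still advances the counter.
  run.recoveryAttempts += 1;

  // No cap configured: unlimited recovery.
  if (maxRecoveryAttempts === undefined) return true;

  // Assumption: "exceeds" means strictly greater than the cap.
  if (run.recoveryAttempts > maxRecoveryAttempts) {
    run.status = 'dead';
    run.error = { message: 'max_recovery_attempts' };
    return false;
  }
  return true;
}

// A poison pill that crashes on every resume, with a cap of 5.
const poison: RunRecord = { id: 'run-1', status: 'running', recoveryAttempts: 0 };
let resumes = 0;
while (poison.status === 'running' && pickUpForRecovery(poison, 5)) {
  resumes += 1; // the resume "crashes" here, and the run is picked up again
}
```

Under these assumptions the run is resumed five times and dead-lettered on the sixth pickup, which matches the "after 5 crash-recoveries" comment in the configuration above.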

Handling dead runs: `engine.onDead`

Parking a dead run is the floor, not the ceiling. Subscribe with engine.onDead to be notified the moment a run is dead-lettered — the listener receives the dead WorkflowRun (status dead, with its error) — so you can do something active: page an on-call engineer, push to a real message queue, or kick off a workflow that handles it.

@Injectable()
export class DeadLetterAlerts implements OnModuleInit {
  private readonly logger = new Logger(DeadLetterAlerts.name);

  constructor(
    private readonly engine: WorkflowEngine,
    private readonly pager: PagerService,
  ) {}

  onModuleInit() {
    this.engine.onDead((run) => {
      this.logger.error(`run ${run.id} (${run.workflow}) dead-lettered: ${run.error?.message}`);
      void this.pager.alert('durable-poison-pill', { runId: run.id, workflow: run.workflow });
    });
  }
}

onDead returns an unsubscribe function. It fires on the instance that dead-letters the run.
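The subscribe/unsubscribe contract can be illustrated with a small stand-in for the engine. `DeadRunNotifier`, `DeadRun`, and `emitDead` are hypothetical names used only for this sketch; the one behavior taken from the text is that onDead returns a function that removes the listener.

```typescript
// Illustrative stand-in for the engine's listener registry; not the library's code.
interface DeadRun {
  id: string;
  workflow: string;
  error?: { message: string };
}

type DeadListener = (run: DeadRun) => void;

class DeadRunNotifier {
  private listeners: DeadListener[] = [];

  // Mirrors the documented contract: subscribing returns an unsubscribe function.
  onDead(listener: DeadListener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  // Hypothetical hook: invoked by the instance that dead-letters the run.
  emitDead(run: DeadRun): void {
    this.listeners.forEach((listener) => listener(run));
  }
}

const notifier = new DeadRunNotifier();
const seen: string[] = [];
const unsubscribe = notifier.onDead((run) => {
  seen.push(run.id);
});

notifier.emitDead({ id: 'run-1', workflow: 'pipeline' }); // delivered
unsubscribe();
notifier.emitDead({ id: 'run-2', workflow: 'pipeline' }); // no longer delivered
```

In a NestJS provider, one reasonable place to call the returned function is the onModuleDestroy lifecycle hook, so the listener does not outlive the provider that registered it.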

Routing to a DLQ workflow

The most powerful option is to route dead runs into a workflow of their own: a durable handler that gets all the reliability machinery (retries, steps, its own observability) to triage the failure. The handler is started with a typed DeadLetter<TInput> payload under the id dlq:<runId>, which makes the start idempotent, so a dead run is never handled twice:

import type { DeadLetter } from '@dudousxd/nestjs-durable';

// Shape of the payload, shown for reference; the type itself comes from the import above.
interface DeadLetter<TInput = unknown> {
  deadRunId: string; // the dead run — inspectable + retriable in the dashboard
  workflow: string; // the workflow whose run died
  input: TInput; // the original input it was started with
  error?: StepError; // the failure that moved it to `dead`
}
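To make the mapping concrete, here is a minimal sketch of how a dead run could be turned into that payload and its start id. `DeadRunRecord`, `StepErrorLike`, `DeadLetterPayload`, and `toDeadLetter` are hypothetical names for illustration; only the payload fields and the dlq:<runId> id come from the text above, and the library's actual construction may differ.

```typescript
// Hypothetical stand-in for the library's StepError type.
interface StepErrorLike {
  message: string;
}

// Same fields as the DeadLetter<TInput> shape, redeclared so the sketch is self-contained.
interface DeadLetterPayload<TInput = unknown> {
  deadRunId: string;
  workflow: string;
  input: TInput;
  error?: StepErrorLike;
}

// Hypothetical view of a run that has reached the `dead` status.
interface DeadRunRecord<TInput> {
  id: string;
  workflow: string;
  input: TInput;
  error?: StepErrorLike;
}

function toDeadLetter<TInput>(run: DeadRunRecord<TInput>): {
  startId: string;
  payload: DeadLetterPayload<TInput>;
} {
  const payload: DeadLetterPayload<TInput> = {
    deadRunId: run.id,
    workflow: run.workflow,
    input: run.input,
  };
  if (run.error !== undefined) payload.error = run.error;
  // Deterministic id: the same dead run always maps to the same DLQ start.
  return { startId: `dlq:${run.id}`, payload };
}

const start = toDeadLetter({
  id: 'run-7',
  workflow: 'pipeline',
  input: { source: 'orders.csv' },
  error: { message: 'max_recovery_attempts' },
});
```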

Dead-lettering is naturally per-workflow — a dead checkout run wants a different handler than a dead pipeline run — so the target is resolved per dead run, in this order:

1. an inline @DeadLetter() method on the workflow class, the co-located default;
2. the workflow's @Workflow({ deadLetterWorkflow }) reference to another registered workflow;
3. the module-level deadLetterWorkflow default, for everything that declares neither.
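The resolution order above can be sketched as a single function. The `WorkflowDlqConfig` shape and `resolveDlqTarget` are hypothetical names, not the library's internals. The sketch also assumes that an inline handler is addressed as <name>.dlq and that an unresolved target means no DLQ workflow is started.

```typescript
// Illustrative model of the resolution order; field names are hypothetical.
interface WorkflowDlqConfig {
  name: string; // workflow name, e.g. 'pipeline'
  hasInlineDeadLetter: boolean; // the class has a @DeadLetter() method
  deadLetterWorkflow?: string; // per-workflow reference, by name
}

function resolveDlqTarget(
  workflow: WorkflowDlqConfig,
  moduleDefault?: string,
): string | undefined {
  // 1. Inline handler, assumed to be registered as <name>.dlq.
  if (workflow.hasInlineDeadLetter) return `${workflow.name}.dlq`;
  // 2. Per-workflow reference.
  if (workflow.deadLetterWorkflow !== undefined) return workflow.deadLetterWorkflow;
  // 3. Module-level default. Assumption: undefined means no DLQ workflow is started.
  return moduleDefault;
}

const inlineTarget = resolveDlqTarget(
  { name: 'pipeline', hasInlineDeadLetter: true },
  'global-dlq',
);
const referencedTarget = resolveDlqTarget(
  { name: 'checkout', hasInlineDeadLetter: false, deadLetterWorkflow: 'checkout-dlq' },
  'global-dlq',
);
const fallbackTarget = resolveDlqTarget(
  { name: 'report', hasInlineDeadLetter: false },
  'global-dlq',
);
const noTarget = resolveDlqTarget({ name: 'report', hasInlineDeadLetter: false });
```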

Inline `@DeadLetter()` — one file

The simplest case: a method on the workflow class, marked @DeadLetter(). It runs as a durable workflow in its own right (auto-registered as <name>.dlq, with checkpointing and a dashboard run), and it shares the class's injected dependencies — so the handler lives right next to the workflow it protects:

@Workflow({ name: 'pipeline', version: '3' })
export class PipelineWorkflow {
  constructor(
    private readonly alerts: AlertsService,
    private readonly tickets: TicketService,
  ) {}

  async run(ctx: WorkflowCtx, input: PipelineInput) {
    const extracted = await ctx.step('extract', () => this.extract(input));
    const transformed = await ctx.step('transform', () => this.transform(extracted));
    return ctx.remote(loadStep, transformed);
  }

  // Runs when a `pipeline` run is dead-lettered. Typed to the workflow's own input, same deps.
  @DeadLetter()
  async onDead(ctx: WorkflowCtx, dl: DeadLetter<PipelineInput>) {
    await ctx.step('alert', () =>
      this.alerts.page('durable-dead-letter', {
        runId: dl.deadRunId,
        workflow: dl.workflow,
        error: dl.error?.message,
      }),
    );

    const ticket = await ctx.step('open-ticket', () =>
      this.tickets.create({
        title: `Dead-lettered run ${dl.deadRunId} (${dl.workflow})`,
        body: dl.error?.message ?? 'unknown error',
        payload: dl.input, // typed PipelineInput, ready to replay after a fix
      }),
    );

    return { ticketId: ticket.id };
  }
}

`@Workflow({ deadLetterWorkflow })` — a separate workflow

When the DLQ handler is large, reused across workflows, or just deserves its own file, point at another registered workflow — by class (refactor-safe) or by name (for a cross-runtime handler). Register it as a normal provider like any other workflow:

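// Declared (or imported from its own file) before PipelineWorkflow, so the class
// reference passed to @Workflow below is already initialized when the decorator runs.
@Workflow({ name: 'pipeline-dlq', version: '1' })
export class PipelineDlqWorkflow {
  async run(ctx: WorkflowCtx, dl: DeadLetter) {
    /* same handler body as above */
  }
}

@Workflow({ name: 'pipeline', version: '3', deadLetterWorkflow: PipelineDlqWorkflow })
export class PipelineWorkflow {
  /* ...run... */
}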

Module-level default

Set the module's deadLetterWorkflow to catch every workflow that declares neither of the above — a single backstop DLQ for the whole app:

DurableModule.forRoot({
  store,
  transport,
  maxRecoveryAttempts: 5,
  deadLetterWorkflow: 'global-dlq', // fallback for workflows without their own
});

A workflow that declares both an inline @DeadLetter() method and a deadLetterWorkflow reference is a configuration error: having two targets for one workflow is ambiguous, so the module fails fast at boot rather than silently picking one.
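A fail-fast check of this kind could look like the following. This is a sketch under stated assumptions rather than the module's actual validation: `DlqDeclaration`, `assertSingleDlqTarget`, and the error message are all hypothetical.

```typescript
// Illustrative boot-time check; names and message are hypothetical.
interface DlqDeclaration {
  name: string;
  hasInlineDeadLetter: boolean; // the class has a @DeadLetter() method
  deadLetterWorkflow?: string; // per-workflow reference, by name
}

function assertSingleDlqTarget(workflows: DlqDeclaration[]): void {
  // Collect every workflow that declares both kinds of target.
  const ambiguous = workflows
    .filter((w) => w.hasInlineDeadLetter && w.deadLetterWorkflow !== undefined)
    .map((w) => w.name);
  if (ambiguous.length > 0) {
    throw new Error(`Ambiguous dead-letter target for: ${ambiguous.join(', ')}`);
  }
}

let bootError: string | undefined;
try {
  assertSingleDlqTarget([
    { name: 'checkout', hasInlineDeadLetter: true },
    { name: 'pipeline', hasInlineDeadLetter: true, deadLetterWorkflow: 'pipeline-dlq' },
  ]);
} catch (err) {
  bootError = err instanceof Error ? err.message : String(err);
}
```

Validating all registered workflows in one pass reports every conflict at once instead of failing on the first one.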

Because every start is idempotent by dlq:<runId>, a run that gets dead-lettered (and re-detected) more than once still triggers exactly one DLQ run.
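The effect of keying the start by dlq:<runId> can be shown with an in-memory model. `InMemoryDlqStarter` and `startFor` are hypothetical names. A real store would enforce the same rule with a unique run id, which this sketch only imitates.

```typescript
// Illustrative model of an idempotent start keyed by dlq:<runId>; not the library's code.
interface DlqRun {
  id: string; // dlq:<runId>
  deadRunId: string; // the originating dead run
}

class InMemoryDlqStarter {
  private readonly runs: Record<string, DlqRun> = {};
  startCount = 0;

  // Starting twice for the same dead run returns the existing DLQ run.
  startFor(deadRunId: string): DlqRun {
    const id = `dlq:${deadRunId}`;
    const existing = this.runs[id];
    if (existing !== undefined) return existing;

    const run: DlqRun = { id, deadRunId };
    this.runs[id] = run;
    this.startCount += 1;
    return run;
  }
}

const starter = new InMemoryDlqStarter();
const first = starter.startFor('run-42');
const second = starter.startFor('run-42'); // the same dead run, detected again
```

Both calls return the same DLQ run and only one start is counted, which is the behavior described above.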

The dashboard links the two ends of this relationship: from a dead run you can see the DLQ run it spawned, and the DLQ run carries the originating deadRunId — so you can walk from a poison pill to its handler and back, retry the original once it's fixed, and confirm the DLQ run did its job.
