Dead-letter queue
Cap crash-recovery with maxRecoveryAttempts so a poison-pill run moves to the terminal dead status instead of crash-looping forever, then route dead runs with engine.onDead or a DLQ workflow — an inline @DeadLetter() method, a per-workflow deadLetterWorkflow reference, or the module-level default — to alert, compensate, or queue for review.
Most failures are transient and recover cleanly. But occasionally a run is a poison pill: it crashes the process every time recovery picks it up — a deserialization bug, a non-deterministic change, an infinite loop in a step. Left alone, crash-recovery would resume it, crash, resume it again on the next boot, and loop forever, taking the whole instance down with it each time. The dead-letter queue breaks that loop.
Capping recovery: maxRecoveryAttempts
Crash-recovery is self-healing and periodic: while a run executes its worker renews a recovery lease,
so a crashed worker's lease expires and recovery (engine.recoverIncomplete(), called every tick by the
TimerPoller) reclaims the orphaned run within ~leaseMs — on any instance, not just on the next boot.
That's what makes a genuine poison pill visible: it gets picked up again and again. Every time
crash-recovery picks up a still-running run, the engine increments that run's recoveryAttempts
before resuming it (so a crash mid-resume still advances the counter). Set
maxRecoveryAttempts — an engine/module option — and once a run exceeds it, instead of resuming yet again
the engine moves the run to the terminal dead status with a max_recovery_attempts error, releases its
lease, and stops touching it:
    DurableModule.forRoot({
      store,
      transport,
      maxRecoveryAttempts: 5, // after 5 crash-recoveries, dead-letter the run instead of looping
    });

Omit maxRecoveryAttempts for unlimited recovery (the default). A dead run is terminal but not lost:
it stays fully inspectable in the dashboard — its history, its checkpoints, its error — and it can be
retried from there once you've shipped a fix. The point is that one poison pill no longer keeps the process
crashing for every other run.
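The recovery pass described above can be sketched as a small in-memory model. Everything here is hypothetical and for illustration only: the RunRecord shape, the function signature, and the reading of "exceeds" as a strict greater-than check are assumptions, not the library's actual implementation.

```typescript
// Hypothetical, simplified model of one recovery tick; not the library's real types.
type RunStatus = 'running' | 'completed' | 'dead';

interface RunRecord {
  id: string;
  status: RunStatus;
  recoveryAttempts: number;
  leaseExpiresAt: number; // epoch ms; renewed by the worker while the run executes
  error?: { code: string; message: string };
}

interface RecoveryResult {
  resumed: string[];
  deadLettered: string[];
}

function recoverIncomplete(
  runs: RunRecord[],
  now: number,
  leaseMs: number,
  maxRecoveryAttempts?: number, // undefined models "unlimited recovery"
): RecoveryResult {
  const result: RecoveryResult = { resumed: [], deadLettered: [] };
  for (const run of runs) {
    if (run.status !== 'running') continue; // only still-running runs are candidates
    if (run.leaseExpiresAt > now) continue; // a live worker still holds the lease

    // Count the attempt before resuming, so a crash mid-resume still advances it.
    run.recoveryAttempts += 1;

    if (maxRecoveryAttempts !== undefined && run.recoveryAttempts > maxRecoveryAttempts) {
      run.status = 'dead'; // terminal: later ticks skip it
      run.error = {
        code: 'max_recovery_attempts',
        message: `exceeded ${maxRecoveryAttempts} recovery attempts`,
      };
      run.leaseExpiresAt = 0; // release the lease
      result.deadLettered.push(run.id);
      continue;
    }

    run.leaseExpiresAt = now + leaseMs; // the reclaiming worker takes a fresh lease
    result.resumed.push(run.id);
  }
  return result;
}
```

With a cap of 2 in this model, a run whose lease keeps expiring is resumed on its first and second recovery and moved to dead on the third, and a run whose lease is still live is never touched.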
Handling dead runs: engine.onDead
Parking a dead run is the floor, not the ceiling. Subscribe with engine.onDead to be notified the moment a
run is dead-lettered — the listener receives the dead WorkflowRun (status dead, with its error) — so you
can do something active: page an on-call engineer, push to a real message queue, or kick off a workflow that
handles it.
    import { Injectable, Logger, OnModuleInit } from '@nestjs/common';

    @Injectable()
    export class DeadLetterAlerts implements OnModuleInit {
      private readonly logger = new Logger(DeadLetterAlerts.name);

      constructor(
        private readonly engine: WorkflowEngine,
        private readonly pager: PagerService,
      ) {}

      onModuleInit() {
        // Subscribe once at startup; the callback receives the dead WorkflowRun.
        this.engine.onDead((run) => {
          this.logger.error(`run ${run.id} (${run.workflow}) dead-lettered: ${run.error?.message}`);
          // Fire-and-forget: paging must not block the listener.
          void this.pager.alert('durable-poison-pill', { runId: run.id, workflow: run.workflow });
        });
      }
    }

onDead returns an unsubscribe function. It fires on the instance that dead-letters the run.
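The subscribe/unsubscribe contract can be illustrated with a tiny stand-in. DeadRunNotifier and emitDead are hypothetical names invented for this sketch; only the fact that onDead hands back an unsubscribe function comes from the text above.

```typescript
// Hypothetical stand-in for the engine's dead-run notifications.
interface DeadRun {
  id: string;
  workflow: string;
  error?: { message: string };
}

type DeadListener = (run: DeadRun) => void;

class DeadRunNotifier {
  private listeners: DeadListener[] = [];

  // Registers a listener and returns a function that removes it again.
  onDead(listener: DeadListener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  // Called by the (modelled) engine when a run is moved to `dead`.
  emitDead(run: DeadRun): void {
    // Iterate over a snapshot so a listener may unsubscribe while being notified.
    for (const listener of this.listeners.slice()) listener(run);
  }
}

// Usage: keep the returned function and call it on shutdown.
const notifier = new DeadRunNotifier();
const seen: string[] = [];
const unsubscribe = notifier.onDead((run) => seen.push(run.id));

notifier.emitDead({ id: 'run-1', workflow: 'pipeline' }); // listener is notified
unsubscribe();
notifier.emitDead({ id: 'run-2', workflow: 'pipeline' }); // listener is gone
```

In a NestJS service, a natural place to call the stored unsubscribe function is the onModuleDestroy lifecycle hook, so the listener does not outlive the provider.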
Routing to a DLQ workflow
The most powerful option is to route dead runs into a workflow of their own — a durable handler that gets
all the reliability machinery (retries, steps, its own observability) to triage the failure. The handler is
started with a typed DeadLetter<TInput> payload, idempotent by a dlq:<runId> id so a run is never
double-handled:
    import type { DeadLetter } from '@dudousxd/nestjs-durable';

    // Shape of the imported type, shown for reference (not something to redeclare):
    interface DeadLetter<TInput = unknown> {
      deadRunId: string; // the dead run, inspectable and retriable in the dashboard
      workflow: string;  // the workflow whose run died
      input: TInput;     // the original input it was started with
      error?: StepError; // the failure that moved it to `dead`
    }

Dead-lettering is naturally per-workflow: a dead checkout run wants a different handler than a dead
pipeline run. The target is therefore resolved per dead run, in this order:
- an inline @DeadLetter() method on the workflow class, the co-located default;
- the workflow's @Workflow({ deadLetterWorkflow }) reference to another registered workflow;
- the module-level deadLetterWorkflow default, for everything that declares neither.
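The resolution order above can be written as a small pure function. This is a sketch under assumptions: WorkflowDlqConfig and resolveDlqTarget are hypothetical names, and targets are modelled as plain workflow names rather than class references.

```typescript
// Hypothetical description of what a workflow declares about dead-lettering.
interface WorkflowDlqConfig {
  name: string;
  hasInlineDeadLetter: boolean; // the class has an @DeadLetter() method
  deadLetterWorkflow?: string;  // @Workflow({ deadLetterWorkflow }) reference, by name
}

// Returns the name of the workflow that should handle a dead run,
// or undefined when no target is configured at any level.
function resolveDlqTarget(
  workflow: WorkflowDlqConfig,
  moduleDefault?: string,
): string | undefined {
  // 1. Inline handler, registered under "<name>.dlq".
  if (workflow.hasInlineDeadLetter) return `${workflow.name}.dlq`;
  // 2. Per-workflow reference to another registered workflow.
  if (workflow.deadLetterWorkflow !== undefined) return workflow.deadLetterWorkflow;
  // 3. Module-level default, if any.
  return moduleDefault;
}
```

Under this model a workflow with no handler of its own and no module default simply resolves to undefined, which corresponds to the run being parked as dead with no DLQ run started.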
Inline @DeadLetter(): one file
The simplest case: a method on the workflow class, marked @DeadLetter(). It runs as a durable workflow in
its own right (auto-registered as <name>.dlq, with checkpointing and a dashboard run), and it shares the
class's injected dependencies — so the handler lives right next to the workflow it protects:
    @Workflow({ name: 'pipeline', version: '3' })
    export class PipelineWorkflow {
      constructor(
        private readonly alerts: AlertsService,
        private readonly tickets: TicketService,
      ) {}

      async run(ctx: WorkflowCtx, input: PipelineInput) {
        const extracted = await ctx.step('extract', () => this.extract(input));
        const transformed = await ctx.step('transform', () => this.transform(extracted));
        return ctx.remote(loadStep, transformed);
      }

      // Runs when a `pipeline` run is dead-lettered. Typed to the workflow's own input, same deps.
      @DeadLetter()
      async onDead(ctx: WorkflowCtx, dl: DeadLetter<PipelineInput>) {
        await ctx.step('alert', () =>
          this.alerts.page('durable-dead-letter', {
            runId: dl.deadRunId,
            workflow: dl.workflow,
            error: dl.error?.message,
          }),
        );
        const ticket = await ctx.step('open-ticket', () =>
          this.tickets.create({
            title: `Dead-lettered run ${dl.deadRunId} (${dl.workflow})`,
            body: dl.error?.message ?? 'unknown error',
            payload: dl.input, // typed PipelineInput, ready to replay after a fix
          }),
        );
        return { ticketId: ticket.id };
      }
    }

@Workflow({ deadLetterWorkflow }): a separate workflow
When the DLQ handler is large, reused across workflows, or just deserves its own file, point at another registered workflow — by class (refactor-safe) or by name (for a cross-runtime handler). Register it as a normal provider like any other workflow:
    // Declared first (or imported from its own file) so the class reference
    // below is already defined when the @Workflow decorator is evaluated.
    @Workflow({ name: 'pipeline-dlq', version: '1' })
    export class PipelineDlqWorkflow {
      async run(ctx: WorkflowCtx, dl: DeadLetter) {
        /* same handler body as above */
      }
    }

    @Workflow({ name: 'pipeline', version: '3', deadLetterWorkflow: PipelineDlqWorkflow })
    export class PipelineWorkflow {
      /* ...run... */
    }

Module-level default
Set the module's deadLetterWorkflow to catch every workflow that declares neither of the above — a single
backstop DLQ for the whole app:
    DurableModule.forRoot({
      store,
      transport,
      maxRecoveryAttempts: 5,
      deadLetterWorkflow: 'global-dlq', // fallback for workflows without their own
    });

A workflow that declares both an inline @DeadLetter() method and a deadLetterWorkflow reference is a
configuration error. Two targets for one workflow is ambiguous, so the module fails fast at boot rather than
silently picking one. Declare one or the other.
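The fail-fast rule can be pictured as a check over the registered workflows at boot. The sketch below is a hypothetical stand-in: the DlqDeclaration shape, the function name, and the error message are assumptions for illustration, not the module's actual validation code.

```typescript
// Hypothetical summary of what each registered workflow declares.
interface DlqDeclaration {
  name: string;
  hasInlineDeadLetter: boolean; // an @DeadLetter() method exists on the class
  deadLetterWorkflow?: string;  // a @Workflow({ deadLetterWorkflow }) reference
}

// Throws on the first workflow that declares two dead-letter targets.
function assertSingleDlqTarget(workflows: DlqDeclaration[]): void {
  for (const wf of workflows) {
    if (wf.hasInlineDeadLetter && wf.deadLetterWorkflow !== undefined) {
      throw new Error(
        `workflow "${wf.name}" declares both an inline @DeadLetter() method ` +
          `and deadLetterWorkflow "${wf.deadLetterWorkflow}"; declare only one`,
      );
    }
  }
}
```

Running a check like this during module initialization surfaces the ambiguity at deploy time instead of at the moment a run is dead-lettered, when a silent choice would be hardest to notice.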
Because every start is idempotent by dlq:<runId>, a run that gets dead-lettered (and re-detected) more than
once still triggers exactly one DLQ run.
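The idempotency guarantee follows from deriving the DLQ run's id from the dead run's id. A minimal in-memory sketch, assuming a hypothetical InMemoryDlqStarter class (only the dlq:<runId> id scheme is taken from the text; a real store would need an atomic insert, for example a unique key, to hold across instances):

```typescript
// Hypothetical payload; mirrors the fields described for DeadLetter<TInput>.
interface DlqPayload {
  deadRunId: string;
  workflow: string;
  input: unknown;
}

class InMemoryDlqStarter {
  private readonly started = new Map<string, DlqPayload>();

  // Starts a DLQ run unless one with the same deterministic id already exists.
  // Returns the id and whether this call actually created the run.
  start(payload: DlqPayload): { id: string; created: boolean } {
    const id = `dlq:${payload.deadRunId}`;
    if (this.started.has(id)) return { id, created: false };
    this.started.set(id, payload);
    return { id, created: true };
  }

  // Number of distinct DLQ runs that have been started.
  count(): number {
    return this.started.size;
  }
}

// Usage: the same dead run detected twice still yields a single DLQ run.
const starter = new InMemoryDlqStarter();
const first = starter.start({ deadRunId: 'run-42', workflow: 'pipeline', input: {} });
const second = starter.start({ deadRunId: 'run-42', workflow: 'pipeline', input: {} });
```

Here first.created is true, second.created is false, and both calls return the id dlq:run-42.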
The dashboard links the two ends of this relationship: from a dead run you can see the DLQ run it spawned,
and the DLQ run carries the originating deadRunId — so you can walk from a poison pill to its handler and
back, retry the original once it's fixed, and confirm the DLQ run did its job.