Control plane
The embedded dashboard — a React SPA served by NestJS that lists runs and renders each one as a graph across local and remote steps, with retry and cancel.
@dudousxd/nestjs-durable-dashboard mounts a control plane at /durable: a bundled React SPA
plus its JSON API. It reads straight from the state store, so it works without an
OpenTelemetry collector.
import { Module } from '@nestjs/common';
import { DurableDashboardModule } from '@dudousxd/nestjs-durable-dashboard';
@Module({
imports: [DurableModule.forRoot({ store, transport }), DurableDashboardModule.forRoot()],
})
export class AppModule {}
DurableModule is global, so the dashboard resolves the engine and store automatically. Front the
base route with your own guard to protect it.
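One way to front the base route is an Express-style middleware registered with app.use(). The following is a minimal sketch, assuming the Express adapter and a shared bearer token; dashboardGate, isUnder and the token scheme are hypothetical names for illustration, not part of the package.

```typescript
// Minimal structural types so the sketch compiles without Express typings.
type Req = { path: string; headers: Record<string, string | undefined> };
type Res = { statusCode: number; end: (body?: string) => void };
type Next = () => void;

// True when `path` is the base route itself or anything beneath it.
function isUnder(basePath: string, path: string): boolean {
  const base = basePath.replace(/\/+$/, '');
  return path === base || path.startsWith(base + '/');
}

// Hypothetical bearer-token gate covering both the SPA and its JSON API.
function dashboardGate(basePath: string, token: string) {
  return (req: Req, res: Res, next: Next): void => {
    if (!isUnder(basePath, req.path)) return next(); // not a dashboard request
    if (req.headers['authorization'] === `Bearer ${token}`) return next();
    res.statusCode = 401;
    res.end('Unauthorized');
  };
}
```

Registered as app.use(dashboardGate('/durable', token)), this gate also covers POST webhooks/:token, which third parties call without your token; exempt that path if you rely on ctx.webhook().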
Where it mounts
forRoot() defaults to /durable (SPA at /durable, JSON API at /durable/api). Pass basePath
to move it — e.g. under your app's /api prefix so the same auth/proxy rules cover the dashboard API:
DurableDashboardModule.forRoot({ basePath: '/api/durable' });
// SPA at /api/durable, API at /api/durable/api
The SPA's asset URLs and its API base are derived from basePath at serve time, so the bundle works
at any mount point. If your app sets a global prefix (app.setGlobalPrefix('api')), exclude the base
route so the path isn't prefixed twice — or fold it in by setting basePath to include the prefix.
Serving it from an API pod (without processing there)
When you split your app into API and driving pods, you usually want the dashboard served from
the API (it handles HTTP) while only the driving pods process runs. Set drive: false on the API
instance's DurableModule:
DurableModule.forRoot({
store,
transport,
drive: process.env.APP_TYPE !== 'API', // driving pods process; the API only serves the dashboard
});
With drive: false the engine is still fully available — the dashboard can dispatch retries and
cancels, and reads work — but the instance does not drive: it won't register @DurableStep
handlers (so it never consumes the task queue), won't recover incomplete runs on boot, and won't poll
due timers. Leave that to the driving pods. So a retry clicked in the UI on an API pod still
works: the API dispatches the step, a driving pod picks it up and runs it. Only the processing is
gated, not the control plane. (Default is drive: true, so single-process apps need nothing.)
What it shows
- Runs — every run with its status (pending / running / suspended / completed / failed /
  cancelled / dead), filterable.
  pending is a run that's been created and enqueued but not yet picked up by a worker — you'll see it briefly on every run, and persistently on an API pod (drive: false) whose runs execute on separate driving pods. The status filter is a row of chips, including a dead chip for the dead-letter state; a dead run carries a distinct badge so a poison pill is easy to spot.
- Timeline as a graph — the selected run rendered with react-flow: a node pipeline from start through each step to the terminal state, tagged by kind (local / remote / sleep / signal), with worker group and duration. This is the "see the whole flow" view — a workflow whose steps span apps shown as one rail. In-flight steps show up here while they run, not only on completion: a remote step in flight reads as pending and a local step's executing body as running (both rendered as an in-flight node). Local in-flight visibility is on by default — it's the engine's trackStepStart option — so a long ctx.step is visible the moment it begins.
- Step timing — click a step to inspect it. A remote step shows its queued time (how long it sat in the queue before a worker picked it up) next to its processing duration, so you can tell a slow handler apart from a backed-up worker pool. Queue-wait is reported by every transport and every language SDK (including the Python worker).
- Live-tail — the run view streams new lifecycle events as they happen (see below) instead of polling, so a running workflow updates in place.
- Actions — retry a failed/suspended/dead run, cancel a running one, cancel + undo (cancel
  with saga compensation), and continue a run paused at a breakpoint.
  These are non-blocking: retry re-enqueues the run (engine.requeue(runId) → pending + dispatch) and cancel + undo runs the saga compensation in the background — neither replays the workflow inline in the HTTP request, so the dashboard action returns immediately and a worker does the work. (Previously an inline replay could hang the request.)
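The Step timing comparison in the list above comes down to two subtractions. A minimal sketch, assuming three millisecond timestamps per remote step; the field names are hypothetical and are not the dashboard's actual schema.

```typescript
// Hypothetical shape: field names are illustrative, not the dashboard's schema.
type StepTimestamps = {
  enqueuedAt: number; // ms epoch when the step was dispatched to the queue
  startedAt: number;  // ms epoch when a worker picked it up
  finishedAt: number; // ms epoch when the handler returned
};

// Queue wait and processing time, the two numbers shown side by side.
function stepTiming(t: StepTimestamps): { queuedMs: number; processingMs: number } {
  return {
    queuedMs: t.startedAt - t.enqueuedAt,
    processingMs: t.finishedAt - t.startedAt,
  };
}

// Rough triage: which phase dominates the step's wall-clock time?
function diagnose(t: StepTimestamps): 'backed-up workers' | 'slow handler' {
  const { queuedMs, processingMs } = stepTiming(t);
  return queuedMs > processingMs ? 'backed-up workers' : 'slow handler';
}
```

A step that waited 3 s in the queue and ran for 0.5 s points at the worker pool; the reverse points at the handler.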
Live-tail over SSE
The run view tails a run's lifecycle events over Server-Sent Events rather than re-fetching. The
server exposes an @Sse route, GET runs/:id/stream, and the client subscribes with
durableClient.streamRun(id, onEvent):
import { durableClient } from '@dudousxd/nestjs-durable-dashboard/client';
const close = durableClient.streamRun(runId, (event) => {
// event: { type: 'step.completed' | 'run.failed' | …, runId, seq, name, … }
console.log(event.type, event.name);
});
// later: close();
This is cross-pod when the engine has a control plane: an event produced by a worker pod is broadcast over the control plane, and a dashboard-only API pod re-delivers it to its SSE subscribers. So you can live-tail a run on the API pod even though the work runs elsewhere. Without a control plane, the stream still works for runs executing on the same instance (single-process apps).
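A script that tails a run usually wants to stop once the run ends. The sketch below wraps any function with the streamRun(id, onEvent) shape shown above and closes the stream on the first terminal event. Only 'run.failed' appears in the source; 'run.completed' and 'run.cancelled' are assumed event names, and tailUntilTerminal is a hypothetical helper.

```typescript
type RunEvent = { type: string; runId: string; seq?: number; name?: string };
type StreamRun = (runId: string, onEvent: (event: RunEvent) => void) => () => void;

// Assumption: only 'run.failed' is confirmed by the docs; the other two are guesses.
const TERMINAL_TYPES = new Set(['run.completed', 'run.failed', 'run.cancelled']);

function tailUntilTerminal(
  streamRun: StreamRun,
  runId: string,
  onDone: (last: RunEvent, seen: RunEvent[]) => void,
): () => void {
  const seen: RunEvent[] = [];
  let finished = false;
  let close: (() => void) | undefined;
  const stop = (): void => {
    if (close) {
      close();
      close = undefined; // make repeated stop() calls harmless
    }
  };
  close = streamRun(runId, (event) => {
    if (finished) return; // ignore anything delivered after the terminal event
    seen.push(event);
    if (TERMINAL_TYPES.has(event.type)) {
      finished = true;
      stop(); // no-op if streamRun has not returned its close function yet
      onDone(event, seen);
    }
  });
  if (finished) stop(); // the terminal event arrived synchronously during subscribe
  return stop;
}
```

With the real client this would be called as tailUntilTerminal(durableClient.streamRun, runId, onDone), assuming streamRun does not depend on being called as a method; bind it otherwise.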
Cancel + Undo (saga compensation)
The run actions include both a plain Cancel and a Cancel + Undo. Plain cancel is immediate —
the run is marked cancelled and a late worker result is dropped. Cancel + Undo instead asks the
engine to compensate the saga — in the background, not in the HTTP request: the run is resumed
on a worker so its completed steps' compensate callbacks run in reverse order (each visible as a
compensate:<step> step event in the timeline), then the run is marked cancelled. The HTTP action
returns immediately. The UI sends this as cancel with ?compensate=true:
await durableClient.cancel(runId); // immediate, no undo
await durableClient.cancel(runId, { compensate: true }); // ?compensate=true — undo first
Use Cancel + Undo when the run has already done outside-world work (a charge, a reservation) that must be reversed; use plain Cancel to just stop it.
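The ordering that Cancel + Undo relies on can be illustrated in isolation. This is a sketch of reverse-order compensation in general, not the engine's implementation: callbacks are synchronous for brevity, and skipping steps that have no compensate callback is an assumption.

```typescript
type CompletedStep = { name: string; compensate?: () => void };

// Runs compensations newest-first and returns the step-event names in emit order.
function compensateInReverse(completed: CompletedStep[]): string[] {
  const events: string[] = [];
  for (const step of [...completed].reverse()) {
    if (!step.compensate) continue; // assumption: nothing to undo, no event emitted
    step.compensate();
    events.push(`compensate:${step.name}`);
  }
  return events;
}
```

For a run that completed charge, then reserve, then an email step with nothing to undo, the timeline would show compensate:reserve followed by compensate:charge.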
Dead-letter runs and their links
A run that crash-recovery gives up on — after exceeding maxRecoveryAttempts — is moved to the
dead state, the dead-letter status (a poison pill that would otherwise crash the process on every boot). The
dashboard surfaces these as a first-class status: a dead filter chip and a badge on the run.
A dead run is terminal but inspectable, and still retriable from the UI once you've fixed the
cause.
When you configure a deadLetterWorkflow, dead-lettering also starts a handler run (idempotent by
a dlq:<runId> id) carrying { deadRunId, workflow, input, error }. The dashboard renders the
relationship both ways: the dead run links forward to its dlq:<id> handler, and the handler links
back to the dead run it was started for — so you can jump between the failure and whatever your DLQ
workflow did about it (alert, compensate, queue for review).
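The dead-letter rule and the two-way link both follow from the dlq:<runId> convention. A minimal sketch, assuming a hypothetical recoveryAttempts counter on the run and a string error; the helper names are illustrative and only the id format and payload keys come from the text above.

```typescript
// Hypothetical run shape; `recoveryAttempts` is an assumed field name.
type Run = { id: string; workflow: string; input: unknown; recoveryAttempts: number };
type DlqStart = {
  id: string;
  payload: { deadRunId: string; workflow: string; input: unknown; error: string };
};

// A run is dead-lettered once recovery has been attempted more than the limit allows.
function shouldDeadLetter(run: Run, maxRecoveryAttempts: number): boolean {
  return run.recoveryAttempts > maxRecoveryAttempts;
}

// Deterministic id: starting the handler twice for one dead run targets the same run id.
function dlqStartFor(run: Run, error: string): DlqStart {
  return {
    id: `dlq:${run.id}`,
    payload: { deadRunId: run.id, workflow: run.workflow, input: run.input, error },
  };
}

// Back-link: recover the dead run's id from a handler id, if it is one.
function deadRunIdOf(handlerId: string): string | undefined {
  return handlerId.startsWith('dlq:') ? handlerId.slice('dlq:'.length) : undefined;
}
```

Because the handler id is derived from the dead run's id, the forward and backward links need no extra bookkeeping.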
Webhook callbacks
A ctx.webhook() hands a third party a callback URL (built by your webhookUrl option). When they
POST it, the dashboard's POST webhooks/:token endpoint turns that body into the signal the waiting
run resumes on. The token embeds runId:seq, so treat it as a secret — this endpoint is reachable by
external systems; front it with signature verification in your own middleware.
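Signature verification in front of the webhook endpoint could look like the following. This is a sketch under stated assumptions: the third party signs the raw request body with HMAC-SHA256 under a shared secret and sends the hex digest in a header of your choosing. The scheme, sign and verifySignature are illustrative; use whatever your provider actually documents.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Hex HMAC-SHA256 of the raw request body under a shared secret.
function sign(rawBody: string, secret: string): string {
  return createHmac('sha256', secret).update(rawBody).digest('hex');
}

// Constant-time comparison. Lengths are checked first because
// timingSafeEqual throws when the two buffers differ in length.
function verifySignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = Buffer.from(sign(rawBody, secret), 'hex');
  const received = Buffer.from(signatureHex, 'hex');
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

Verify against the raw bytes as received, before any JSON parsing, and reject the request before it reaches POST webhooks/:token when the check fails.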
Live queries and updates
Two more endpoints expose a run's query/update surface over HTTP:
- GET runs/:id/events/:key — read the latest value a run published with ctx.setEvent(key, …), a side-effect-free live query of an in-flight (or finished) run's state (progress, a partial result). Returns undefined if the run never published that key.
- POST runs/:id/updates/:name — deliver a validated update to a run waiting at ctx.onUpdate(name); the request body is the update argument. Any validator registered via engine.registerUpdateValidator runs first, so a rejected update never reaches the run.
API
The SPA is backed by a small JSON API you can also call directly. A codegen-emitted typed client
ships at the ./client subpath as durableClient, so an external dashboard or script can call the
same routes with full types instead of hand-rolling fetch:
import { durableClient } from '@dudousxd/nestjs-durable-dashboard/client';
const runs = await durableClient.runs('failed');
const detail = await durableClient.run(runs[0].id);
await durableClient.retry(detail.run.id);

| Method | Route | Purpose |
|---|---|---|
| GET | /durable/api/runs | list runs (?status=, ?workflow=) |
| GET | /durable/api/runs/:id | run + step timeline |
| GET (SSE) | /durable/api/runs/:id/stream | live-tail the run's lifecycle events |
| POST | /durable/api/runs/:id/retry | re-enqueue the run (→ pending, a worker resumes it) |
| POST | /durable/api/runs/:id/cancel | cancel the run (?compensate=true to undo the saga first) |
| POST | /durable/api/runs/:id/continue | resume a run paused at a breakpoint |
| POST | /durable/api/webhooks/:token | deliver a ctx.webhook() callback (body resumes the run) |
| GET | /durable/api/runs/:id/events/:key | read a live query value (ctx.setEvent) |
| POST | /durable/api/runs/:id/updates/:name | deliver an update (ctx.onUpdate); body is the arg |
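Callers that cannot use durableClient can build the same URLs by hand. The sketch below derives a few of the routes in the table from basePath; dashboardApi is a hypothetical helper, and percent-encoding run ids (which matters for ids such as dlq:<runId>) assumes the server decodes path parameters in the usual way.

```typescript
// Hypothetical URL builder for a subset of the routes listed above.
function dashboardApi(basePath: string = '/durable') {
  const api = `${basePath.replace(/\/+$/, '')}/api`;
  const id = (runId: string): string => encodeURIComponent(runId);
  return {
    runs: (status?: string, workflow?: string): string => {
      const query = new URLSearchParams();
      if (status) query.set('status', status);
      if (workflow) query.set('workflow', workflow);
      const qs = query.toString();
      return `${api}/runs${qs ? `?${qs}` : ''}`;
    },
    run: (runId: string): string => `${api}/runs/${id(runId)}`,
    stream: (runId: string): string => `${api}/runs/${id(runId)}/stream`,
    retry: (runId: string): string => `${api}/runs/${id(runId)}/retry`,
    cancel: (runId: string, compensate: boolean = false): string =>
      `${api}/runs/${id(runId)}/cancel${compensate ? '?compensate=true' : ''}`,
  };
}
```

Passing the same basePath you gave to DurableDashboardModule.forRoot() keeps the URLs in step with the mount point.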