Control plane

The embedded dashboard — a React SPA served by NestJS that lists runs and renders each one as a graph across local and remote steps, with retry and cancel.

@dudousxd/nestjs-durable-dashboard mounts a control plane at /durable: a bundled React SPA plus its JSON API. It reads straight from the state store, so it works without an OpenTelemetry collector.

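import { Module } from '@nestjs/common';
import { DurableDashboardModule } from '@dudousxd/nestjs-durable-dashboard';
// DurableModule, store, and transport come from your existing durable setup.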

@Module({
  imports: [DurableModule.forRoot({ store, transport }), DurableDashboardModule.forRoot()],
})
export class AppModule {}

DurableModule is global, so the dashboard resolves the engine and store automatically. Front the base route with your own guard to protect it.

Where it mounts

forRoot() defaults to /durable (SPA at /durable, JSON API at /durable/api). Pass basePath to move it — e.g. under your app's /api prefix so the same auth/proxy rules cover the dashboard API:

DurableDashboardModule.forRoot({ basePath: '/api/durable' });
// SPA at /api/durable, API at /api/durable/api

The SPA's asset URLs and its API base are derived from basePath at serve time, so the bundle works at any mount point. If your app sets a global prefix (app.setGlobalPrefix('api')), exclude the base route so the path isn't prefixed twice — or fold it in by setting basePath to include the prefix.
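
As a rough illustration of the mount rules above, the sketch below derives the SPA and API paths from a basePath. dashboardPaths is a hypothetical helper written for this page, not something the package exports; the only facts it encodes are that the SPA sits at basePath and the JSON API at basePath + '/api'.

```typescript
// Hypothetical helper (not part of the package): where the SPA and the
// JSON API end up for a given basePath.
interface DashboardPaths {
  spa: string; // where the React SPA is served
  api: string; // where the JSON API is served
}

function dashboardPaths(basePath: string = '/durable'): DashboardPaths {
  // Normalize to exactly one leading slash and no trailing slash.
  const trimmed = basePath.replace(/^\/+|\/+$/g, '');
  const spa = `/${trimmed}`;
  return { spa, api: `${spa}/api` };
}

console.log(dashboardPaths());               // { spa: '/durable', api: '/durable/api' }
console.log(dashboardPaths('/api/durable')); // { spa: '/api/durable', api: '/api/durable/api' }
```

If a global prefix is also in play, the same arithmetic shows the double-prefix risk: a basePath of '/api/durable' under a global 'api' prefix would resolve under /api/api/durable unless the route is excluded.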

Serving it from an API pod (without processing there)

When you split your app into API and driving pods, you usually want the dashboard served from the API (it handles HTTP) while only the driving pods process runs. Set drive: false on the API instance's DurableModule:

DurableModule.forRoot({
  store,
  transport,
  drive: process.env.APP_TYPE !== 'API', // driving pods process runs; the API pod only serves the dashboard
});

With drive: false the engine is still fully available — the dashboard can dispatch retries and cancels, and reads work — but the instance does not drive: it won't register @DurableStep handlers (so it never consumes the task queue), won't recover incomplete runs on boot, and won't poll due timers. Leave that to the driving pods. So a retry clicked in the UI on an API pod still works: the API dispatches the step, a driving pod picks it up and runs it. Only the processing is gated, not the control plane. (Default is drive: true, so single-process apps need nothing.)
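
The split described above can be summarized as a small truth table. This is only a model of the paragraph, not engine code: InstanceDuties and dutiesFor are names made up for this sketch.

```typescript
// Illustrative model only: what an instance does for a given `drive` value.
// The field names are hypothetical and exist just for this sketch.
interface InstanceDuties {
  servesDashboardReads: boolean;
  dispatchesRetriesAndCancels: boolean;
  registersStepHandlers: boolean;  // i.e. consumes the task queue
  recoversIncompleteRuns: boolean; // on boot
  pollsDueTimers: boolean;
}

function dutiesFor(drive: boolean = true): InstanceDuties {
  return {
    // The control plane is available regardless of `drive`.
    servesDashboardReads: true,
    dispatchesRetriesAndCancels: true,
    // Only the processing side is gated.
    registersStepHandlers: drive,
    recoversIncompleteRuns: drive,
    pollsDueTimers: drive,
  };
}

const apiPod = dutiesFor(false);
console.log(apiPod.dispatchesRetriesAndCancels, apiPod.registersStepHandlers); // true false
```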

What it shows

  • Runs — every run with its status (pending / running / suspended / completed / failed / cancelled / dead), filterable. pending is a run that's been created and enqueued but not yet picked up by a worker — you'll see it briefly on every run, and persistently on an API pod (drive: false) whose runs execute on separate driving pods. The status filter is a row of chips, including a dead chip for the dead-letter state; a dead run carries a distinct badge so a poison pill is easy to spot.
  • Timeline as a graph — the selected run rendered with react-flow: a node pipeline from start through each step to the terminal state, tagged by kind (local / remote / sleep / signal), with worker group and duration. This is the "see the whole flow" view — a workflow whose steps span apps shown as one rail. In-flight steps show up here while they run, not only on completion: a remote step in flight reads as pending and a local step's executing body as running (both rendered as an in-flight node). Local in-flight visibility is on by default — it's the engine's trackStepStart option — so a long ctx.step is visible the moment it begins.
  • Step timing — click a step to inspect it. A remote step shows its queued time (how long it sat in the queue before a worker picked it up) next to its processing duration, so you can tell a slow handler apart from a backed-up worker pool. Queue-wait is reported by every transport and every language SDK (including the Python worker).
  • Live-tail — the run view streams new lifecycle events as they happen (see below) instead of polling, so a running workflow updates in place.
  • Actions — retry a failed/suspended/dead run, cancel a running one, cancel + undo (cancel with saga compensation), and continue a run paused at a breakpoint. These are non-blocking: retry re-enqueues the run (engine.requeue(runId) → pending + dispatch) and cancel + undo runs the saga compensation in the background. Neither replays the workflow inline in the HTTP request, so the dashboard action returns immediately and a worker does the work. (Previously an inline replay could hang the request.)
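
To make the step-timing split concrete, here is a minimal sketch of the arithmetic behind "queued time" versus "processing duration". The timestamp field names are assumptions for illustration (the real step record may be shaped differently), and the bottleneck label is a crude heuristic added here, not something the dashboard computes.

```typescript
// Hypothetical timestamps in epoch milliseconds; the real step record may
// name or shape these differently.
interface RemoteStepTimestamps {
  enqueuedAt: number; // step dispatched to the queue
  startedAt: number;  // a worker picked it up
  finishedAt: number; // the handler returned
}

interface StepTiming {
  queuedMs: number;
  processingMs: number;
  bottleneck: 'queue' | 'handler';
}

function stepTiming(t: RemoteStepTimestamps): StepTiming {
  const queuedMs = t.startedAt - t.enqueuedAt;     // time spent waiting for a worker
  const processingMs = t.finishedAt - t.startedAt; // time spent in the handler
  // Crude heuristic: whichever phase took longer is the likely bottleneck.
  return { queuedMs, processingMs, bottleneck: queuedMs > processingMs ? 'queue' : 'handler' };
}

// 4 s waiting for a worker, 0.5 s of actual work: the pool is backed up.
console.log(stepTiming({ enqueuedAt: 0, startedAt: 4000, finishedAt: 4500 }));
```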

Live-tail over SSE

The run view tails a run's lifecycle events over Server-Sent Events rather than re-fetching. The server exposes an @Sse route, GET runs/:id/stream, and the client subscribes with durableClient.streamRun(id, onEvent):

import { durableClient } from '@dudousxd/nestjs-durable-dashboard/client';

const close = durableClient.streamRun(runId, (event) => {
  // event: { type: 'step.completed' | 'run.failed' | …, runId, seq, name, … }
  console.log(event.type, event.name);
});
// later: close();

This is cross-pod when the engine has a control plane: an event produced by a worker pod is broadcast over the control plane, and a dashboard-only API pod re-delivers it to its SSE subscribers. So you can live-tail a run on the API pod even though the work runs elsewhere. Without a control plane, the stream still works for runs executing on the same instance (single-process apps).
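
For a consumer that cannot use the typed client, the stream is ordinary Server-Sent Events, so frames can be parsed by hand. The sketch below follows the standard SSE framing (data: lines, events separated by a blank line) and assumes each event's data field holds one JSON object shaped like the payload in the comment above. parseSseFrames and LifecycleEvent are hypothetical names, not exports of the package.

```typescript
// Hypothetical shape, taken from the event comment in the example above.
interface LifecycleEvent {
  type: string;
  runId: string;
  seq: number;
  name?: string;
}

// Parse complete SSE frames out of a text buffer. Returns the parsed events
// plus any trailing partial frame, to be prepended to the next chunk.
function parseSseFrames(buffer: string): { events: LifecycleEvent[]; rest: string } {
  const frames = buffer.replace(/\r\n/g, '\n').split('\n\n');
  const rest = frames.pop() ?? ''; // the last piece is incomplete (or empty)
  const events: LifecycleEvent[] = [];
  for (const frame of frames) {
    const data = frame
      .split('\n')
      .filter((line) => line.startsWith('data:'))
      .map((line) => line.slice(5).replace(/^ /, '')) // drop one leading space, per the SSE format
      .join('\n');
    if (data) events.push(JSON.parse(data) as LifecycleEvent);
  }
  return { events, rest };
}

const sample =
  'data: {"type":"step.completed","runId":"r1","seq":2,"name":"charge"}\n\n' +
  'data: {"type":"run.fai';
const tail = parseSseFrames(sample);
console.log(tail.events[0].type, tail.events[0].name); // step.completed charge
console.log(tail.rest);                                // data: {"type":"run.fai
```

Carrying rest forward matters because a network chunk can end in the middle of a frame; only frames terminated by a blank line are parsed.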

Cancel + Undo (saga compensation)

The run actions include both a plain Cancel and a Cancel + Undo. Plain cancel is immediate — the run is marked cancelled and a late worker result is dropped. Cancel + Undo instead asks the engine to compensate the saga — in the background, not in the HTTP request: the run is resumed on a worker so its completed steps' compensate callbacks run in reverse order (each visible as a compensate:<step> step event in the timeline), then the run is marked cancelled. The HTTP action returns immediately. The UI sends this as cancel with ?compensate=true:

await durableClient.cancel(runId);                      // immediate, no undo
await durableClient.cancel(runId, { compensate: true }); // ?compensate=true — undo first

Use Cancel + Undo when the run has already done outside-world work (a charge, a reservation) that must be reversed; use plain Cancel to just stop it.
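
The ordering that Cancel + Undo relies on can be shown with a toy model. This is an illustration of "compensate completed steps in reverse", not the engine's implementation: CompletedStep and compensateInReverse are hypothetical, the callbacks are synchronous for brevity, and skipping steps that have no compensate callback is an assumption.

```typescript
// Illustration only: a toy model of reverse-order saga compensation.
interface CompletedStep {
  name: string;
  compensate?: () => void; // assumption: steps without an undo are skipped
}

function compensateInReverse(completed: CompletedStep[]): string[] {
  const timeline: string[] = [];
  // Walk from the most recently completed step back to the first.
  for (const step of [...completed].reverse()) {
    if (!step.compensate) continue;
    step.compensate();
    timeline.push(`compensate:${step.name}`);
  }
  timeline.push('cancelled'); // the run is marked cancelled last
  return timeline;
}

const undoLog = compensateInReverse([
  { name: 'reserve', compensate: () => {} },
  { name: 'charge', compensate: () => {} },
  { name: 'notify' }, // nothing to undo
]);
console.log(undoLog); // [ 'compensate:charge', 'compensate:reserve', 'cancelled' ]
```

Reverse order matters because later steps usually depend on earlier ones: the charge is refunded before the reservation it paid for is released.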

A run that crash recovery gives up on, after exceeding maxRecoveryAttempts, is moved to the dead (dead-letter) state; it is a poison pill that would otherwise crash the process on every boot. The dashboard surfaces these as a first-class status: a dead filter chip and a badge on the run. A dead run is terminal but inspectable, and still retriable from the UI once you've fixed the cause.

When you configure a deadLetterWorkflow, dead-lettering also starts a handler run (idempotent by a dlq:<runId> id) carrying { deadRunId, workflow, input, error }. The dashboard renders the relationship both ways: the dead run links forward to its dlq:<id> handler, and the handler links back to the dead run it was started for — so you can jump between the failure and whatever your DLQ workflow did about it (alert, compensate, queue for review).
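
The two-way link falls out of the id convention. A minimal sketch, using only the dlq:<runId> format stated above; dlqHandlerId and deadRunIdOf are hypothetical helpers, not part of the package.

```typescript
// Hypothetical helpers: the only fact used is the `dlq:<runId>` id convention.
const DLQ_PREFIX = 'dlq:';

// Forward link: the handler run id for a given dead run.
function dlqHandlerId(deadRunId: string): string {
  return `${DLQ_PREFIX}${deadRunId}`;
}

// Back link: the dead run a handler was started for, or undefined if
// `runId` is not a DLQ handler id.
function deadRunIdOf(runId: string): string | undefined {
  return runId.startsWith(DLQ_PREFIX) ? runId.slice(DLQ_PREFIX.length) : undefined;
}

console.log(dlqHandlerId('order-42'));    // dlq:order-42
console.log(deadRunIdOf('dlq:order-42')); // order-42
console.log(deadRunIdOf('order-42'));     // undefined
```

Because the handler id is a pure function of the dead run's id, dead-lettering the same run twice targets the same handler run, which is what makes the start idempotent.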

Webhook callbacks

A ctx.webhook() hands a third party a callback URL (built by your webhookUrl option). When they POST to it, the dashboard's POST webhooks/:token endpoint turns the request body into the signal the waiting run resumes on. The token embeds runId:seq, so treat it as a secret. Because this endpoint is reachable by external systems, front it with signature verification in your own middleware.
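
One common way to do that verification is an HMAC over the raw request body. The sketch below is an assumption about your sender, not something the dashboard provides: it presumes the third party signs the body with a shared secret using HMAC-SHA256 and sends the hex digest in a header of your choosing. isValidSignature is a hypothetical helper; createHmac and timingSafeEqual are from Node's crypto module.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Hypothetical scheme: HMAC-SHA256 of the raw body, hex-encoded by the sender.
function isValidSignature(secret: string, rawBody: string, signatureHex: string): boolean {
  const expected = createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, 'hex');
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

const payload = '{"status":"paid"}';
const signature = createHmac('sha256', 's3cret').update(payload).digest('hex');
console.log(isValidSignature('s3cret', payload, signature));       // true
console.log(isValidSignature('s3cret', payload + ' ', signature)); // false
```

In middleware, run this check against the raw (unparsed) body before the request reaches the webhooks/:token route, and reject the request when it returns false.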

Live queries and updates

Two more endpoints expose a run's query/update surface over HTTP:

  • GET runs/:id/events/:key — read the latest value a run published with ctx.setEvent(key, …), a side-effect-free live query of an in-flight (or finished) run's state (progress, a partial result). Returns undefined if the run never published that key.
  • POST runs/:id/updates/:name — deliver a validated update to a run waiting at ctx.onUpdate(name); the request body is the update argument. Any validator registered via engine.registerUpdateValidator runs first, so a rejected update never reaches the run.
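
For a script that calls these two endpoints without the typed client, the requests can be assembled as below. This is a sketch assuming the default /durable mount; HttpRequest, liveQueryRequest, and updateRequest are hypothetical names, and sending the update argument as JSON is an assumption based on the API being a JSON API.

```typescript
// Hypothetical request description; hand it to fetch or any HTTP client.
interface HttpRequest {
  url: string;
  method: 'GET' | 'POST';
  body?: string;
}

function liveQueryRequest(runId: string, key: string, apiBase: string = '/durable/api'): HttpRequest {
  // Encode path segments: run ids such as `dlq:<runId>` contain a colon.
  return {
    url: `${apiBase}/runs/${encodeURIComponent(runId)}/events/${encodeURIComponent(key)}`,
    method: 'GET',
  };
}

function updateRequest(runId: string, updateName: string, arg: unknown, apiBase: string = '/durable/api'): HttpRequest {
  return {
    url: `${apiBase}/runs/${encodeURIComponent(runId)}/updates/${encodeURIComponent(updateName)}`,
    method: 'POST',
    body: JSON.stringify(arg), // the request body is the update argument
  };
}

console.log(liveQueryRequest('dlq:42', 'progress').url); // /durable/api/runs/dlq%3A42/events/progress
console.log(updateRequest('r1', 'approve', { by: 'ops' }).body); // {"by":"ops"}
```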

API

The SPA is backed by a small JSON API you can also call directly. A codegen-emitted typed client ships at the ./client subpath as durableClient, so an external dashboard or script can call the same routes with full types instead of hand-rolling fetch:

import { durableClient } from '@dudousxd/nestjs-durable-dashboard/client';

const runs = await durableClient.runs('failed');
const detail = await durableClient.run(runs[0].id);
await durableClient.retry(detail.run.id);

| Method | Route | Description |
| --- | --- | --- |
| GET | /durable/api/runs | list runs (?status=, ?workflow=) |
| GET | /durable/api/runs/:id | run + step timeline |
| GET (SSE) | /durable/api/runs/:id/stream | live-tail the run's lifecycle events |
| POST | /durable/api/runs/:id/retry | re-enqueue the run (→ pending, a worker resumes it) |
| POST | /durable/api/runs/:id/cancel | cancel the run (?compensate=true to undo the saga first) |
| POST | /durable/api/runs/:id/continue | resume a run paused at a breakpoint |
| POST | /durable/api/webhooks/:token | deliver a ctx.webhook() callback (body resumes the run) |
| GET | /durable/api/runs/:id/events/:key | read a live query value (ctx.setEvent) |
| POST | /durable/api/runs/:id/updates/:name | deliver an update (ctx.onUpdate); body is the arg |
