Building a durable execution engine from scratch using TypeScript, Node.js and Postgres.
Durable execution is a mechanism to incrementally checkpoint the state of a function as it makes progress, so that in the case of unexpected failure, the function can recover from where it left off. It's particularly relevant in newer stacks and projects implementing AI agents, which are long-running and stateful. A system which implements durable execution is often called a "workflow engine."
Inspired by the Go project Durable Execution, the Hard Way.
```mermaid
graph LR
    Producer -->|enqueue| Queue[(Postgres)]
    Queue -->|dequeue| Worker
    Worker -->|ack/nack| Queue
    Worker -->|heartbeat| Queue
    Sweeper -->|reclaim stuck tasks| Queue
    Sweeper -->|fail expired tasks| Queue
    Worker -->|checkpoint steps| EventLog[(durable_events)]
    Worker -->|replay steps| EventLog
```
```mermaid
stateDiagram-v2
    [*] --> pending
    pending --> running: dequeued by worker
    running --> completed: handler succeeds
    running --> pending: handler fails (retryable, attempts < max)
    running --> failed: handler fails (max attempts or non-retryable)
    running --> pending: worker crashes (reclaimed by sweeper)
    pending --> failed: scheduling timeout exceeded
```
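The lifecycle in the diagram can also be read as a small transition function. The sketch below is illustrative only: the statuses come from the diagram, but the `TaskEvent` shape and the `nextStatus` name are hypothetical and are not taken from the project's source.

```typescript
// Statuses as they appear in the state diagram.
type TaskStatus = "pending" | "running" | "completed" | "failed";

// Hypothetical event shape, invented for this sketch.
type TaskEvent =
  | { kind: "dequeued" }
  | { kind: "succeeded" }
  | { kind: "handlerFailed"; retryable: boolean; attempts: number; maxAttempts: number }
  | { kind: "workerCrashed" }
  | { kind: "schedulingTimeout" };

// Returns the next status, or throws if the diagram has no such edge.
function nextStatus(current: TaskStatus, event: TaskEvent): TaskStatus {
  if (current === "pending") {
    if (event.kind === "dequeued") return "running";
    if (event.kind === "schedulingTimeout") return "failed";
  }
  if (current === "running") {
    switch (event.kind) {
      case "succeeded":
        return "completed";
      case "workerCrashed":
        // The sweeper puts the task back in the queue.
        return "pending";
      case "handlerFailed":
        return event.retryable && event.attempts < event.maxAttempts ? "pending" : "failed";
    }
  }
  throw new Error(`invalid transition from ${current} on ${event.kind}`);
}
```

Note that `completed` and `failed` are terminal here: any event applied to them throws.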
| Component | File | Responsibility |
|---|---|---|
| Queue | src/queue.ts | Enqueue/dequeue tasks, ack/nack, worker registration, sweeper queries |
| Worker | src/worker.ts | Poll loop, concurrent task dispatch, heartbeat, timeout enforcement |
| DurableContext | src/durableContext.ts | Step checkpointing and replay for durable tasks |
| DB | src/db.ts | Postgres connection pool via Postgres.js |
Tasks are stored in Postgres and dequeued atomically using FOR UPDATE SKIP LOCKED, so multiple workers can poll in parallel without two of them claiming the same task at the same time. Each worker processes up to N tasks concurrently (configurable).
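A claim query of this kind usually selects eligible rows with `FOR UPDATE SKIP LOCKED` inside an `UPDATE`. The following is a minimal sketch, not the project's actual query: the `tasks` table and the `status`, `worker_id`, `attempts`, and `created_at` columns are assumptions, and the `Query` function type is a hypothetical stand-in for the real Postgres.js client so that the example stays self-contained.

```typescript
interface ClaimedTask {
  id: string;
  name: string;
  payload: unknown;
}

// Hypothetical query runner; the project itself uses Postgres.js.
type Query = (text: string, params: unknown[]) => Promise<ClaimedTask[]>;

// Assumed schema. Rows locked by another worker's transaction are skipped
// rather than waited on, so concurrent dequeues never block each other.
const DEQUEUE_SQL = `
  UPDATE tasks
  SET status = 'running', worker_id = $1, attempts = attempts + 1
  WHERE id IN (
    SELECT id FROM tasks
    WHERE status = 'pending' AND run_after <= now()
    ORDER BY priority DESC, created_at ASC
    LIMIT $2
    FOR UPDATE SKIP LOCKED
  )
  RETURNING id, name, payload
`;

// Claims up to `limit` tasks for one worker in a single statement.
async function dequeue(query: Query, workerId: string, limit: number): Promise<ClaimedTask[]> {
  if (limit <= 0) return []; // no free concurrency slots, skip the round trip
  return query(DEQUEUE_SQL, [workerId, limit]);
}
```

Passing the number of free concurrency slots as `limit` is one way to keep a worker from claiming more tasks than it can run at once.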
When a task fails, it's automatically retried up to max_attempts times. Each retry waits exponentially longer (1s, 2s, 4s, 8s...) via a run_after column that the dequeue query respects. Handlers can throw NonRetryableError to fail immediately regardless of remaining attempts.
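The retry rule above amounts to a small decision made when a task is nacked. This sketch assumes that `attempt` is the 1-based number of the attempt that just failed; the `backoffMs` and `decideNack` names are hypothetical, and the `NonRetryableError` class here is a local stand-in for the project's own.

```typescript
// Local stand-in for the project's NonRetryableError.
class NonRetryableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "NonRetryableError";
    Object.setPrototypeOf(this, NonRetryableError.prototype);
  }
}

// Delay before the next attempt: 1s, 2s, 4s, 8s, ...
function backoffMs(attempt: number): number {
  return 1000 * 2 ** (attempt - 1);
}

type NackDecision = { status: "pending"; runAfter: Date } | { status: "failed" };

// Decides whether a failed task goes back to pending (with a run_after
// timestamp in the future) or is failed permanently.
function decideNack(
  err: unknown,
  attempt: number,
  maxAttempts: number,
  now: Date = new Date(),
): NackDecision {
  if (err instanceof NonRetryableError || attempt >= maxAttempts) {
    return { status: "failed" };
  }
  return { status: "pending", runAfter: new Date(now.getTime() + backoffMs(attempt)) };
}
```

Because the dequeue query only considers rows whose `run_after` has passed, writing the computed timestamp back to the row is enough to delay the retry; no timer has to stay alive in the worker.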
Each task has a timeout_seconds value. The worker enforces this via Promise.race — even if the handler ignores the AbortSignal, it will be timed out and nacked. This prevents hung tasks from consuming concurrency slots indefinitely.
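One way to implement this kind of enforcement is to race the handler against a timer that also aborts the signal. The sketch below is an assumption about the general shape, not the project's code; `runWithTimeout` and `TaskTimeoutError` are hypothetical names.

```typescript
// Hypothetical error type used to tell timeouts apart from handler failures.
class TaskTimeoutError extends Error {
  constructor(timeoutMs: number) {
    super(`task timed out after ${timeoutMs}ms`);
    this.name = "TaskTimeoutError";
    Object.setPrototypeOf(this, TaskTimeoutError.prototype);
  }
}

// Runs the handler, rejecting with TaskTimeoutError if it takes too long.
async function runWithTimeout<T>(
  handler: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // cooperative handlers can stop early
      reject(new TaskTimeoutError(timeoutMs)); // uncooperative ones lose the race
    }, timeoutMs);
  });
  try {
    return await Promise.race([handler(controller.signal), timeout]);
  } finally {
    clearTimeout(timer); // avoid a stray timer when the handler wins
  }
}
```

Losing the race does not stop the handler's underlying work; it only frees the concurrency slot. A handler that honors the `AbortSignal` is still the only way to actually cancel in-flight I/O.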
Workers register themselves in a workers table and update last_heartbeat_at every few seconds. A sweeper periodically checks for workers that have gone silent (crashed, OOM-killed, network-partitioned) and requeues their running tasks so another worker can pick them up.
Tasks can specify a scheduling timeout — if they sit in pending for too long without being picked up by any worker, they're failed permanently. Useful for time-sensitive work where a stale result is worse than no result.
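The two sweeper passes described above (reclaiming tasks from silent workers and failing tasks that waited too long) can be illustrated with an in-memory model. The real engine presumably does this with SQL against Postgres; the row shapes, the `sweep` function, and the `staleAfterMs` and `schedulingTimeoutSeconds` names below are hypothetical and exist only to show the decisions involved.

```typescript
interface WorkerRow {
  id: string;
  lastHeartbeatAt: Date;
}

interface TaskRow {
  id: string;
  status: "pending" | "running" | "completed" | "failed";
  workerId: string | null;
  createdAt: Date;
  schedulingTimeoutSeconds: number | null; // null means "wait forever"
}

interface SweepPlan {
  requeue: string[]; // running tasks owned by silent workers
  fail: string[]; // pending tasks past their scheduling timeout
}

function sweep(workers: WorkerRow[], tasks: TaskRow[], now: Date, staleAfterMs: number): SweepPlan {
  // A worker is considered dead once its heartbeat is older than the threshold.
  const stale = new Set<string>();
  for (const w of workers) {
    if (now.getTime() - w.lastHeartbeatAt.getTime() > staleAfterMs) stale.add(w.id);
  }

  const plan: SweepPlan = { requeue: [], fail: [] };
  for (const t of tasks) {
    if (t.status === "running" && t.workerId !== null && stale.has(t.workerId)) {
      plan.requeue.push(t.id);
    } else if (
      t.status === "pending" &&
      t.schedulingTimeoutSeconds !== null &&
      now.getTime() - t.createdAt.getTime() > t.schedulingTimeoutSeconds * 1000
    ) {
      plan.fail.push(t.id);
    }
  }
  return plan;
}
```

The staleness threshold should be several heartbeat intervals long, so that one delayed heartbeat does not cause a healthy worker's tasks to be handed to someone else.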
Tasks have a priority field (default 0). Higher priority tasks skip ahead in the queue. When both are eligible to run, a payment-processing task at priority 10 is dequeued before a background report at priority 0, regardless of creation time.
Tasks registered with registerDurable receive a DurableContext instead of the raw payload. Handlers call ctx.run(label, fn) for each logical step:
```typescript
registerDurable("onboard-user", async (ctx) => {
  const userId = await ctx.run("create-account", async () => createAccount());
  await ctx.run("send-email", async () => sendWelcomeEmail(userId));
});
```

Each step's return value is persisted to durable_events. On retry, completed steps are skipped and their stored output is returned directly, so the handler resumes from where it left off. If new steps are inserted between retries, a NonDeterminismError is thrown.
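To make the replay behavior concrete, here is a minimal in-memory sketch of checkpoint-and-replay. It is not the project's DurableContext: an array stands in for the durable_events table, the class name `InMemoryDurableContext` is hypothetical, and detecting non-determinism by comparing step labels at each position is an assumption about how such a check could work.

```typescript
// Local stand-in for the project's NonDeterminismError.
class NonDeterminismError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "NonDeterminismError";
    Object.setPrototypeOf(this, NonDeterminismError.prototype);
  }
}

// One checkpointed step; in the real engine this would be a durable_events row.
interface StepEvent {
  label: string;
  output: unknown;
}

class InMemoryDurableContext {
  private cursor = 0;
  private events: StepEvent[];

  // `events` is the log left behind by earlier attempts of the same task.
  constructor(events: StepEvent[]) {
    this.events = events;
  }

  async run<T>(label: string, fn: () => Promise<T>): Promise<T> {
    const recorded = this.events[this.cursor];
    if (recorded !== undefined) {
      // Replay: the step already completed, so return its stored output.
      if (recorded.label !== label) {
        throw new NonDeterminismError(
          `step ${this.cursor}: expected "${recorded.label}", got "${label}"`,
        );
      }
      this.cursor++;
      return recorded.output as T;
    }
    // First execution: run the step, then checkpoint its result.
    const output = await fn();
    this.events.push({ label, output });
    this.cursor++;
    return output;
  }
}
```

A step that throws is never recorded, so it runs again on the next attempt, while every step before it is served from the log. This also means step outputs must be serializable and steps should be idempotent, since a crash between running a step and persisting its result causes that step to execute twice.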
Enqueue a durable task with isDurable: true:
```typescript
await enqueue("onboard-user", { userId: "abc" }, { isDurable: true });
```

Prerequisites:

- Node.js 22+
- Docker
```bash
# Install dependencies
npm install

# Start Postgres
npm run db:up

# Set the connection string
export DATABASE_URL="postgresql://durable:durable@localhost:5432/durable"

# Apply the schema
npm run db:migrate

# Run the demo
npm start
```

Press Ctrl+C to shut down the worker gracefully.

```bash
# Run the tests
npm test

# Drop and recreate all tables
npm run db:reset
```