Skip to content

TJPain/durable-execution-ts

Repository files navigation

Durable Execution in TypeScript

Building a durable execution engine from scratch using TypeScript, Node.js and Postgres.

Durable execution is a mechanism to incrementally checkpoint the state of a function as it makes progress, so that in the case of unexpected failure, the function can recover from where it left off. It's particularly relevant in newer stacks and projects implementing AI agents, which are long-running and stateful. A system which implements durable execution is often called a "workflow engine."

Inspired by the Go project Durable Execution, the Hard Way.

Architecture

graph LR
  Producer -->|enqueue| Queue[(Postgres)]
  Queue -->|dequeue| Worker
  Worker -->|ack/nack| Queue
  Worker -->|heartbeat| Queue
  Sweeper -->|reclaim stuck tasks| Queue
  Sweeper -->|fail expired tasks| Queue
  Worker -->|checkpoint steps| EventLog[(durable_events)]
  Worker -->|replay steps| EventLog
Loading

Task lifecycle

stateDiagram-v2
  [*] --> pending
  pending --> running: dequeued by worker
  running --> completed: handler succeeds
  running --> pending: handler fails (retryable, attempts < max)
  running --> failed: handler fails (max attempts or non-retryable)
  running --> pending: worker crashes (reclaimed by sweeper)
  pending --> failed: scheduling timeout exceeded
Loading

Components

Component File Responsibility
Queue src/queue.ts Enqueue/dequeue tasks, ack/nack, worker registration, sweeper queries
Worker src/worker.ts Poll loop, concurrent task dispatch, heartbeat, timeout enforcement
DurableContext src/durableContext.ts Step checkpointing and replay for durable tasks
DB src/db.ts Postgres connection pool via Postgres.js

Features

Task queue with concurrent processing

Tasks are stored in Postgres and dequeued atomically using FOR UPDATE SKIP LOCKED — multiple workers can run in parallel without ever processing the same task twice. Each worker processes up to N tasks concurrently (configurable).

Retries with exponential backoff

When a task fails, it's automatically retried up to max_attempts times. Each retry waits exponentially longer (1s, 2s, 4s, 8s...) via a run_after column that the dequeue query respects. Handlers can throw NonRetryableError to fail immediately regardless of remaining attempts.

Execution timeouts

Each task has a timeout_seconds value. The worker enforces this via Promise.race — even if the handler ignores the AbortSignal, it will be timed out and nacked. This prevents hung tasks from consuming concurrency slots indefinitely.

Worker heartbeats and stuck task recovery

Workers register themselves in a workers table and update last_heartbeat_at every few seconds. A sweeper periodically checks for workers that have gone silent (crashed, OOM-killed, network-partitioned) and requeues their running tasks so another worker can pick them up.

Scheduling timeouts

Tasks can specify a scheduling timeout — if they sit in pending for too long without being picked up by any worker, they're failed permanently. Useful for time-sensitive work where a stale result is worse than no result.

Priority ordering

Tasks have a priority field (default 0). Higher priority tasks skip ahead in the queue. Payment processing at priority 10 will always be dequeued before a background report at priority 0, regardless of creation time.

Durable execution

Tasks registered with registerDurable receive a DurableContext instead of raw payload. Handlers call ctx.run(label, fn) for each logical step:

registerDurable("onboard-user", async (ctx) => {
  const userId = await ctx.run("create-account", async () => createAccount());
  await ctx.run("send-email", async () => sendWelcomeEmail(userId));
});

Each step's return value is persisted to durable_events. On retry, completed steps are skipped and their stored output is returned directly — the handler resumes from where it left off. If new steps are inserted between retries, a NonDeterminismError is thrown.

Enqueue a durable task with isDurable: true:

await enqueue("onboard-user", { userId: "abc" }, { isDurable: true });

Prerequisites

  • Node.js 22+
  • Docker

Getting started

# Install dependencies
npm install

# Start Postgres
npm run db:up

# Set the connection string
export DATABASE_URL="postgresql://durable:durable@localhost:5432/durable"

# Apply the schema
npm run db:migrate

# Run the demo
npm start

Press Ctrl+C to shut down the worker gracefully.

Tests

npm test

Other commands

npm run db:reset   # Drop and recreate all tables

About

Durable execution engine from scratch in TypeScript/Node + Postgres (with LISTEN/NOTIFY)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors