diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md
index ae6aa85..2f389ac 100644
--- a/DEVELOPMENT.md
+++ b/DEVELOPMENT.md
@@ -1,82 +1,164 @@
-# cljs repl
+# Development Guide
-```shell
-npx shadow-cljs watch dev
-```
+This document describes how to set up your development environment and contribute to the project.
-Observe browser window open with a message like:
-> Code entered in a browser-repl prompt will be evaluated here.
+## Prerequisites
-The, connect and select the appropriate shadow repl.
+- **Java 21+** (uses virtual threads)
+- **Clojure CLI** 1.12+
+- **Node.js** (for ClojureScript tests)
+- **Docker** + **Docker Compose** (for integration tests with PostgreSQL and FoundationDB)
-```clojure
-(require '[shadow.cljs.devtools.api :as shadow])
-(shadow/browser-repl)
-```
+## Quick Start
-# doc
+```bash
+# Install dependencies and run tests
+bin/kaocha
+```
-```shell
-npx shadow-cljs watch doc
-...
-shadow-cljs - HTTP server available at http://localhost:8000
-#open the browser
+Run `ClojureMCP`
-# or
-npx shadow-cljs compile doc
-python -m http.server --directory public
+```bash
+clojure -Tmcp start
```
-# cljs repl
+## Project Structure
```
-clj -A:dev:doc:cljs
+intemporal/
+├── src/intemporal/ # Main source code
+│ ├── core.cljc # Public API
+│ ├── protocol.cljc # Core protocols (IStore, etc.)
+│ ├── store.cljc # In-memory store
+│ ├── store/ # JDBC and FDB stores
+│ └── internal/ # Internal implementation
+├── test/ # Tests
+├── dev/ # Development utilities
+└── resources/migrations/ # Database migrations
+```
+
+## Database Setup
+
+For integration and chaos tests, start the databases:
+
+```bash
+docker compose up -d postgresql foundation
```
-# Tests
+- **PostgreSQL** on port 5432 — `jdbc:postgresql://localhost:5432/root?user=root&password=root`
+- **FoundationDB** on port 4500 — cluster file at `docker/fdb.cluster`
-```shell
+Override the Postgres URL with `DATABASE_URL` (kaocha store/integration tests) or
+`POSTGRES_JDBC_URI` (the chaos harness) if your setup differs.
+
+## Running Tests
+
+```bash
+# Everything: JVM + ClojureScript
+bin/kaocha
+
+# Fast JVM tests, skips ^:integration (no DB needed)
+bin/kaocha :in-memory
+
+# JVM tests incl. ^:integration (needs PostgreSQL + FoundationDB)
bin/kaocha :test
+
+# ClojureScript tests (Node)
bin/kaocha :test-cljs
-# or run everything
-bin/run-coverage
+# Focus a single namespace (use hyphens, not underscores)
+bin/kaocha :test --focus intemporal.tests.signal-test
+```
+
+## Jepsen / Chaos Tests
-# focusing
-./bin/kaocha :test --focus intemporal.tests.signal-test
+There are **two** distinct things under the "jepsen" name.
-# cljs focus is a bit different
-./bin/kaocha :test-cljs --focus 'cljs:intemporal.tests.signal-test'
+### 1. Per-scenario bug guard tests — `test/intemporal/tests/jepsen/`
-```
+Deterministic single-JVM tests, one namespace per known failure mode, each exercising
+InMemory + JDBC + FDB. They double as regression guards: a *fixed* bug's test asserts the
+correct behaviour, an *unfixed* bug's test asserts the buggy behaviour it still exhibits.
-# CI runs
+| Namespace | Bug (see `improvements.md`) | State |
+|---|---|---|
+| `bug-1-1-test` | Lost wake on signal across pods | buggy (Phase C) |
+| `bug-1-2-test` | Concurrent same-seq write corruption | buggy (Phase C) |
+| `bug-1-3-test` | No recovery poller on restart | buggy (Phase C) |
+| `bug-2-1-test` | Register-then-consume signal race | **fixed** (Phase A) |
+| `bug-2-3-test` | Cancel can't reach a sleeper | **fixed** (Phase A) |
-Install earthly: https://earthly.dev
+```bash
+# in-memory variants only (no DB)
+bin/kaocha :in-memory --focus intemporal.tests.jepsen.bug-2-1-test \
+ --focus intemporal.tests.jepsen.bug-2-3-test
+# all three stores (start PG + FDB first)
+docker compose up -d postgresql foundation
+bin/kaocha :test --focus intemporal.tests.jepsen.bug-1-1-test \
+ --focus intemporal.tests.jepsen.bug-1-2-test \
+ --focus intemporal.tests.jepsen.bug-1-3-test \
+ --focus intemporal.tests.jepsen.bug-2-1-test \
+ --focus intemporal.tests.jepsen.bug-2-3-test
```
-earthly -P -i +test
-```
-# Check FDB is working for your architecture
+`racing_store.clj` is a shared `IStore` wrapper that pins the executing thread inside the
+signal consume/register window so `bug-2-1` reproduces its race 100% deterministically.
+
+### 2. Forked-JVM chaos harness — `test/intemporal/jepsen/`
-```shell
+Boots N worker JVMs against one Postgres, drives a submit/signal/cancel generator and a
+nemesis that SIGKILL/SIGTERMs and restarts workers, then checks invariants after a quiesce
+phase. This is the integration vehicle for the Phase C multi-pod work. Full design:
+[test/intemporal/jepsen/README.md](test/intemporal/jepsen/README.md).
-$ JAVA_OPTS="-DFDB_LIBRARY_PATH_FDB_C=/usr/local/lib/libfdb_c.dylib -DFDB_LIBRARY_PATH_FDB_JAVA=/usr/local/lib/libfdb_java.jnilib" clj -A:fdb:jdbc
+```bash
+docker compose up -d postgresql
-(import 'com.apple.foundationdb.JNIUtil)
-(let [method (.getDeclaredMethod com.apple.foundationdb.JNIUtil "loadLibrary" (into-array Class [String]))]
- (.setAccessible method true)
- (.invoke method com.apple.foundationdb.JNIUtil (object-array ["fdb_java"]))
- (.invoke method com.apple.foundationdb.JNIUtil (object-array ["fdb_c"])))
+# default chaos run: 4 workers, 120s active, 90s grace
+clojure -X:dev:jdbc:jepsen intemporal.jepsen.runner/run :workers 4 :duration 120
+# no-kill baseline (should pass all checkers)
+clojure -X:dev:jdbc:jepsen intemporal.jepsen.runner/run :workers 4 :duration 60 :no-kill true
+
+# aggressive
+clojure -X:dev:jdbc:jepsen intemporal.jepsen.runner/run \
+ :workers 6 :duration 180 :nemesis-min-ms 1500 :nemesis-jitter-ms 3000 :min-alive 1 :grace-s 120
```
-# Telemetry
+The runner forks workers via the `:jepsen-worker` alias; both `:jepsen` and `:jepsen-worker`
+are defined in `deps.edn`. The Postgres URL comes from `POSTGRES_JDBC_URI` (default localhost).
+
+### Standalone bug reproducer
-# Get the OT javaagent
+`dev/verify_bugs.clj` runs all five scenarios against JDBC + FDB and prints a pass/fail
+report — a quick end-to-end smoke check:
-```shell
-wget --content-disposition https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v2.21.0/opentelemetry-javaagent.jar
+```bash
+clojure -X:dev:jdbc:fdb verify-bugs/run
```
-Run with the `dev` profile to activate the java agent.
\ No newline at end of file
+
+### Known flaky test
+
+`intemporal.tests.replay-check-test/test-log-once-workflow` can fail under full-suite load
+(`run-once` persists its dedup marker lazily; parallel `async`/`join-all` can re-run the
+thunk). It is **pre-existing** (reproduces on pre-Phase-A commits) and unrelated to the
+signal/cancel work. It passes reliably in isolation.
+
+## REPL Development
+
+```bash
+clojure -A:dev # REPL with dev + test deps
+clojure -A:dev:jdbc # + PostgreSQL/JDBC
+clojure -A:dev:fdb # + FoundationDB
+clojure -M:nrepl # nREPL server on port 7888
+```
+
+## Code Style
+
+- Follow standard Clojure conventions
+- Use `kebab-case` for functions and variables
+- Keep functions small and focused
+- Write tests for new functionality
+- File names use underscores (`signal_test.clj`); namespaces use hyphens (`signal-test`)
+- Always pass `--color=never` to `grep`
diff --git a/README.md b/README.md
index ddb1580..72046d1 100644
--- a/README.md
+++ b/README.md
@@ -55,6 +55,56 @@ Examples:
(println result)))
```
+### Saga / compensations
+
+Create a saga with `intemporal/saga`, register a compensation for each step *after*
+it succeeds with `intemporal/add-compensation`, and roll back from a catch block
+with `intemporal/compensate`. Compensations run in reverse registration order
+(LIFO). A step that fails before its `add-compensation` registers nothing to undo.
+Compensations should themselves call activity stubs so they are durable and
+replay-safe.
+
+Both real failures and workflow cancellation flow through the catch, so the one
+idiom rolls back in either case. Catch `Exception`: the engine's normal
+control-flow *suspensions* subclass `Error`, so they are excluded automatically
+and propagate to the engine untouched.
+
+```clojure
+(defn booking-saga [order]
+ (let [saga (intemporal/saga)
+ book-hotel (intemporal/stub #'book-hotel)
+ book-flight (intemporal/stub #'book-flight)
+ charge-card (intemporal/stub #'charge-card)
+ cancel-hotel (intemporal/stub #'cancel-hotel)
+ cancel-flight (intemporal/stub #'cancel-flight)]
+ (try
+ (let [h (book-hotel order)
+ _ (intemporal/add-compensation saga #(cancel-hotel h))]
+ (let [f (book-flight order)
+ _ (intemporal/add-compensation saga #(cancel-flight f))]
+ ;; if charge-card throws, the catch runs compensate -> cancel-flight then
+ ;; cancel-hotel (LIFO) -> then rethrows so the workflow finalizes :failed
+ (charge-card order)
+ :booked))
+ (catch Exception e
+ (intemporal/compensate saga)
+ (throw e)))))
+```
+
+Cancellation is a catchable `Exception`, so any `(catch Exception ...)` in a
+workflow will intercept it — that is what lets a cancelled saga roll back.
+
+In **ClojureScript** there is no `Error`/`Exception` split (everything is a
+`js/Error`), so `(catch :default e)` would also catch suspensions. There, rethrow
+them explicitly:
+
+```clojure
+ (catch :default e
+ (when (intemporal/suspension? e) (throw e)) ;; engine control flow
+ (intemporal/compensate saga)
+ (throw e))
+```
+
# TODO
- [X] Activites + Workflows
diff --git a/architecture.md b/architecture.md
new file mode 100644
index 0000000..753777d
--- /dev/null
+++ b/architecture.md
@@ -0,0 +1,219 @@
+# (in)temporal Architecture & Design Guide
+
+This document describes the high-level architecture of the **intemporal** library, its core lifecycle states, and the internal execution engine logic for workflow management.
+
+---
+
+## 1. High-Level System Architecture
+
+The library is organized into layered components separating persistence, scheduling, activity execution, and workflow orchestration logic:
+
+```mermaid
+graph TD
+ subgraph Client Application
+ W[Workflow Functions]
+ A[Activity Functions]
+ end
+
+ subgraph Core API Layer [intemporal.core]
+ Start[start-workflow / submit-workflow]
+ Resume[resume-workflow]
+ Stub[stub / stub-protocol]
+ Worker[start-worker Poller]
+ end
+
+ subgraph Execution Engine [intemporal.internal.execution]
+ InternalLoop[run-workflow-internal Loop]
+ Ctx[intemporal.internal.context]
+ Err[intemporal.internal.error]
+ end
+
+ subgraph Platform Abstractions [intemporal.protocol]
+ IStore[(IStore Persistence)]
+ IExec[IActivityExecutor]
+ ISched[IScheduler]
+ IObs[IWorkflowObserver]
+ end
+
+ subgraph Implementations
+ InMemory[InMemoryStore]
+ JDBC[JdbcStore Postgres]
+ FDB[FDBStore FoundationDB]
+ VThread[Virtual Thread Executor]
+ Timer[ScheduledExecutorService]
+ end
+
+ %% Wiring
+ W --> Start
+ A --> Stub
+ Start --> InternalLoop
+ Resume --> InternalLoop
+ Worker --> Resume
+ InternalLoop --> Ctx
+ InternalLoop --> Err
+ InternalLoop --> Platform Abstractions
+
+ IStore --> InMemory
+ IStore --> JDBC
+ IStore --> FDB
+ IExec --> VThread
+ ISched --> Timer
+```
+
+### Component Details
+* **Core API**: Entry points for starting and resuming workflows. A `Worker` runs a continuous polling loop (`start-worker`) that scans for pending workflows, claims ownership, and resumes them automatically.
+* **Orchestration Loop**: `run-workflow-internal` manages execution iterations. It binds dynamic context (`*workflow-context*`) that keeps track of the sequence counter, pending events buffer, and registered protocols.
+* **Persistence (`IStore`)**: Event-sourced history log, signal queue, and state variables are kept in a database or in-memory map.
+* **Activity Executor & Scheduler**: The runtime executor runs activities (usually on Java 21+ Virtual Threads), while the scheduler manages deferred timer callbacks.
+
+---
+
+## 2. Workflow States
+
+Workflows transition through the following states, which are derived from database flags/columns (`status`, `cancelled`) or by scanning event history:
+
+| State | Status Keyword | Description |
+|---------------|----------------|--------------------------------------------------------------------------------------------------------------|
+| **Not Found** | `:not-found` | Workflow has not started yet or has no history logs. |
+| **Running** | `:running` | Active and executing or suspended waiting for an event/timer. |
+| **Completed** | `:completed` | Terminated successfully. Persists a `:workflow-completed` event. |
+| **Failed** | `:failed` | Terminated with a runtime failure. Persists a `:workflow-failed` event. |
+| **Cancelled** | `:cancelled` | Terminated via explicit cancel. Sets `cancelled = true` (and currently persists a `:workflow-failed` event). |
+
+### Internal Suspension Wait States
+When a workflow is in the `:running` status, it may be suspended waiting for external input. The engine tracks these sub-states to determine if a worker needs to wake the workflow:
+* **Timer Wait (`:waiting-timer`)**: Waiting for the clock to reach `fire-at` (timer expiry).
+* **Signal Wait (`:waiting-signal`)**: Blocked waiting for a specific signal name to be delivered.
+* **Signal Timeout Wait (`:waiting-signal-timeout`)**: Waiting for a signal name, or a clock deadline if the signal doesn't arrive.
+* **Async/Join Wait (`:waiting-async`)**: Blocked waiting for parallel async handles to finish execution.
+
+---
+
+## 3. Workflow Execution Flow (`start-workflow`)
+
+The diagram below details the step-by-step lifecycle of `start-workflow` and the internal execution loop in `run-workflow-internal`, including **replay**, **suspension**, **child workflows**, **signals**, and **cancellations**.
+
+```mermaid
+sequenceDiagram
+ autonumber
+ actor Client
+ participant API as "start-workflow API"
+ participant Store as "IStore (DB/Memory)"
+ participant WfInternal as "run-workflow-internal"
+ participant Body as "Workflow Function Body"
+ participant Scheduler as "IScheduler / Executor"
+
+ Client->>API: start-workflow(engine, wf-fn, args)
+ API->>Store: save-event(:workflow-started)
+
+ loop Workflow iteration (up to max-iterations)
+ API->>WfInternal: run-workflow-internal
+ Note over WfInternal: Binding *workflow-context*
Initialize sequence counter = 0
+ WfInternal->>WfInternal: check-cancelled! (frontier sequence point)
+ alt is-cancelled? is true
+ WfInternal-->>API: Throw workflow-cancelled-exception
+ else not cancelled
+ WfInternal->>Body: Apply workflow-fn(args)
+
+ Note over Body: Replaying Cached Steps
+ loop For each step (activities, signals, timers)
+ Note over Body: Sequence counter increments
+ Body->>Store: find-event(type, current-seq)
+ alt Event found in History (Replay)
+ Store-->>Body: Return cached result / error
+ else Event not found (Frontier reached)
+ Note over Body: Frontier: First un-cached operation
+ Body-->>WfInternal: Throw suspension exception
+ end
+ end
+ end
+
+ alt Completed successfully
+ Body-->>WfInternal: Return result value
+ WfInternal->>Store: save-events(pending-events)
+ WfInternal->>Store: save-event(:workflow-completed)
+ WfInternal-->>API: Return {:status :completed :result result}
+ API-->>Client: Final workflow result
+ else Caught suspension (e.g. sleep, wait-for-signal, activity)
+ WfInternal->>Store: save-events(pending-events)
+ Note over WfInternal: Dispatch suspension type
+ alt Type is :activity
+ WfInternal->>Scheduler: Execute activity (Virtual Thread)
+ Note over WfInternal: Action: :continue (activity completed inline or re-scheduled)
+ else Type is :timer (sleep)
+ WfInternal->>Scheduler: schedule-timer(fire-at)
+ Note over WfInternal: Action: :wait-timer
+ else Type is :wait-signal
+ WfInternal->>Store: register-signal-callback()
+ Store->>Store: check if signal available inline
+ alt Signal present inline
+ WfInternal->>Store: consume-signal()
+ Note over WfInternal: Action: :continue
+ else Signal absent
+ Note over WfInternal: Action: :wait-signal
+ end
+ else Type is :child-workflow
+ WfInternal->>WfInternal: run-workflow-internal(child-wf-id)
+ Note over WfInternal: Action: :continue (child runs synchronously for now)
+ end
+
+ alt Action is :continue
+ Note over WfInternal: Increment iteration, loop back
+ else Action is :wait-*
+ WfInternal->>Store: register-wake-callback(wake-fn)
+ WfInternal->>Store: set-wake-at(deadline)
+ WfInternal-->>API: Return action wait status
+ Note over API: Block thread / wait on wake-q (.take)
+ end
+ end
+
+ alt Caught cancellation or failure
+ WfInternal->>Store: save-events(pending-events)
+ WfInternal->>Store: save-event(:workflow-failed)
+ WfInternal-->>API: Return {:status :failed :error error}
+ API-->>Client: Throw Exception
+ end
+ end
+```
+
+## 4. Internal Execution Loop Flowchart (`run-workflow-internal`)
+
+```mermaid
+flowchart TD
+ Start([run-workflow-internal]) --> IterCheck{"iteration >=\nmax-iterations?"}
+ IterCheck -->|Yes| Fail["finalize-failed\n'Replay budget exceeded'"] --> RetFail(["↩ :failed"])
+ IterCheck -->|No| ShutCheck{"executor\nshutting down?"}
+ ShutCheck -->|Yes| RetSusp(["↩ :suspended"])
+ ShutCheck -->|No| Load["Load history · Create context\nbind *workflow-context*\nseq-counter = 0"]
+ Load --> Exec["execute-workflow-fn\n───────────────────\nREPLAY: stubs return cached events\nFRONTIER: stub throws suspension"]
+
+ Exec --> Status{"result status?"}
+
+ Status -->|":completed"| Done["finalize-completed\nsave events + :workflow-completed"] --> RetOK(["↩ :completed"])
+ Status -->|":cancelled"| Canc["finalize-cancelled\nsave :workflow-cancelled"] --> RetCanc(["↩ :cancelled"])
+ Status -->|":failed"| Failed["finalize-failed\nsave :workflow-failed"] --> RetFailed(["↩ :failed"])
+
+ Status -->|":suspended"| Dispatch{"suspension-type?"}
+
+ Dispatch -->|":activity"| HasAsync{"pending-asyncs?"}
+ HasAsync -->|Yes| Parallel["execute all asyncs in parallel\nVirtual Thread pool\nsave :activity-completed/failed"] --> AC[":continue"]
+ HasAsync -->|No| Single["execute-with-retry\nVirtual Thread\nsave :activity-completed/failed"] --> AC
+
+ Dispatch -->|":timer\n:wait-signal\n:wait-signal-timeout\n:child-workflow\n:join-*"| Other["schedule timer / register signal callback\nor run child workflow"] --> OtherAction{"action?"}
+
+ OtherAction -->|":continue"| AC
+ OtherAction -->|":wait-*"| Wait
+
+ AC --> Recur["recur iteration + 1"] --> IterCheck
+
+ Wait["register-wake-callback\nset-wake-at deadline"] --> RetWait(["↩ :waiting-*"])
+```
+
+### Detailed Execution Steps
+1. **Startup**: The `start-workflow` call registers protocol activities and persists a `:workflow-started` event to history.
+2. **Replay Phase**: The engine invokes the workflow function. Each stubbed operation queries the store for existing events matching the current sequence number. If a cached event exists, the result is returned directly, ensuring side-effects are skipped.
+3. **Frontier Phase**: When a step is reached that has no corresponding history event, the stub throws a `suspension` exception to abort execution and yield control back to the engine loop.
+4. **Suspension Dispatch**: The engine catches the suspension, saves any buffered events (e.g., `:activity-scheduled`), and schedules the required task (e.g., timer, signal callback registration, or activity run).
+5. **Resume / Wake**:
+ - When a timer expires or a signal arrives, the database callback or poller fires `wake-fn`, which enqueues a token into a thread-safe `LinkedBlockingQueue` (`wake-q`).
+ - The main execution thread wakes up, clears the queue, and triggers the next iteration of the loop, re-running the workflow function from the beginning.
diff --git a/deps.edn b/deps.edn
index ff0a562..d5a86bb 100644
--- a/deps.edn
+++ b/deps.edn
@@ -55,4 +55,14 @@
:ns-default build}
:test {:jvm-opts ["--enable-native-access=ALL-UNNAMED"]
- :main-opts ["-m" "kaocha.runner"]}}}
+ :main-opts ["-m" "kaocha.runner"]}
+
+ ;; Run the chaos harness against a live Postgres instance:
+ ;; clojure -X:dev:jdbc:jepsen intemporal.jepsen.runner/run
+ :jepsen {:extra-paths ["test" "test/resources"]
+ :jvm-opts ["--enable-native-access=ALL-UNNAMED"]
+ :main-opts ["-m" "intemporal.jepsen.runner"]}
+
+ ;; Entry point for forked worker JVMs launched by db.clj/fork!
+ :jepsen-worker {:extra-paths ["test" "test/resources"]
+ :jvm-opts ["--enable-native-access=ALL-UNNAMED"]}}}
diff --git a/dev/verify_bugs.clj b/dev/verify_bugs.clj
new file mode 100644
index 0000000..8c36ffb
--- /dev/null
+++ b/dev/verify_bugs.clj
@@ -0,0 +1,415 @@
+(ns verify-bugs
+ "Standalone verification of the five structural bugs described in
+ improvements.md. Runs each scenario against the JDBC (Postgres) store
+ and the FoundationDB store and prints a side-by-side report.
+
+ Usage:
+ clojure -X:dev:jdbc:fdb verify-bugs/run
+
+ Environment / files required:
+ Postgres — POSTGRES_JDBC_URI or jdbc:postgresql://localhost:5432/root?user=root&password=root
+ FoundationDB — docker/fdb.cluster (written by the docker-compose foundation service)"
+ (:require [intemporal.core :as intemporal]
+ [intemporal.protocol :as p]
+ [intemporal.store :as mem-store]
+ [intemporal.store.jdbc :as jdbc-store]
+ [intemporal.store.fdb :as fdb-store]
+ [me.vedang.clj-fdb.FDB :as cfdb]
+ [clojure.string :as str]))
+
+;; ── helpers ──────────────────────────────────────────────────────────────────
+
+(def ^:private pg-url
+ (or (System/getenv "POSTGRES_JDBC_URI")
+ "jdbc:postgresql://localhost:5432/root?user=root&password=root"))
+
+(defn- open-fdb []
+ (let [fdb (cfdb/select-api-version 710)]
+ (cfdb/open fdb "docker/fdb.cluster")))
+
+(defn- timeout-ms [ms f]
+ (let [res (future (f))]
+ (deref res ms ::timeout)))
+
+(defn- workflow-completed? [store wf-id]
+ (let [status (p/get-workflow-status store wf-id)]
+ (contains? #{:completed :failed :cancelled} status)))
+
+(defn- print-banner [title]
+ (let [line (apply str (repeat 70 "-"))]
+ (println line)
+ (println (str " " title))
+ (println line)))
+
+;; ── workflow shapes used in scenarios ────────────────────────────────────────
+
+(defn- wait-signal-wf
+ "Suspends on signal 'go' and returns :woke."
+ []
+ (intemporal/wait-for-signal "go")
+ :woke)
+
+(defn- cancel-sleep-wf
+ "Suspends on signal 'wake', which is never sent — relies on cancel."
+ []
+ (intemporal/wait-for-signal "wake")
+ :woke)
+
+(defn- counting-activity [counter]
+ (swap! counter inc)
+ @counter)
+
+(defn- chain-wf [counter n]
+ (let [act (intemporal/stub #'counting-activity)]
+ (dotimes [_ n]
+ (act counter))
+ :done))
+
+;; ── RacingStore: deterministic race injector ──────────────────────────────────
+;;
+;; Wraps any IStore so that the first time consume-signal returns nil for a
+;; specific (workflow-id, signal-name):
+;;
+;; 1. It delivers gate-nil ("race window is open")
+;; 2. It blocks on gate-sent ("sender has injected signal into the window")
+;; 3. Then returns nil, letting process-signal proceed to register-callback
+;;
+;; The sender thread:
+;; 1. Waits on gate-nil
+;; 2. Calls p/add-signal on the INNER store directly, writing the signal row
+;; but firing no callback (none registered yet — we're in the window)
+;; 3. Delivers gate-sent
+;;
+;; After the window closes:
+;; - p/register-signal-callback is called → callback registered in inner store
+;; - Signal is already in inner store; add-signal already ran with empty callbacks
+;; - Callback will never fire retroactively
+;; - Workflow is permanently stuck with an undelivered wake
+;;
+;; Proof of stuck: p/get-pending-signals returns the signal row;
+;; workflow status remains :running; the workflow future times out.
+
+(defrecord RacingStore [inner gate-nil gate-sent armed?]
+ p/IStore
+ (load-history [_ wf-id] (p/load-history inner wf-id))
+ (save-event [_ wf-id ev] (p/save-event inner wf-id ev))
+ (save-events [_ wf-id evs] (p/save-events inner wf-id evs))
+ (find-event [_ wf-id et sq] (p/find-event inner wf-id et sq))
+ (get-pending-signals [_ wf-id] (p/get-pending-signals inner wf-id))
+ (add-signal [_ wf-id sn sd] (p/add-signal inner wf-id sn sd))
+ (register-signal-callback [_ wf-id sn cb] (p/register-signal-callback inner wf-id sn cb))
+ (unregister-signal-callback [_ wf-id sn] (p/unregister-signal-callback inner wf-id sn))
+ (is-cancelled? [_ wf-id] (p/is-cancelled? inner wf-id))
+ (mark-cancelled [_ wf-id] (p/mark-cancelled inner wf-id))
+ (get-workflow-status [_ wf-id] (p/get-workflow-status inner wf-id))
+
+ (consume-signal [_ wf-id sig-name]
+ (let [result (p/consume-signal inner wf-id sig-name)]
+ ;; Only intercept once (armed? tracks first nil-return)
+ (when (and (nil? result) (compare-and-set! armed? true false))
+ (deliver gate-nil {:wf-id wf-id :sig-name sig-name})
+ (deref gate-sent 5000 :timeout-waiting-for-sender))
+ result)))
+
+;; ── Bug scenarios ─────────────────────────────────────────────────────────────
+
+(defn- scenario-1-1
+ "Bug 1.1 — Signal sent via a SECOND store instance (simulating another pod)
+ is not delivered because the callback atom is in the first store's memory.
+
+ Two store instances against the same database: store-a starts the workflow
+ and registers the callback; store-b sends the signal. The signal row lands
+ in the DB but no pod fires the callback."
+ [make-store-a make-store-b label]
+ (let [store-a (make-store-a)
+ store-b (make-store-b)
+ wf-id (str "bug11-" (random-uuid))
+ result (promise)
+ engine-a (intemporal/make-workflow-engine :store store-a :threads 2)]
+ (try
+ ;; Start workflow on store-a in background (it will suspend on signal)
+ (future
+ (try
+ (let [r (intemporal/start-workflow engine-a wait-signal-wf []
+ :workflow-id wf-id)]
+ (deliver result r))
+ (catch Exception e (deliver result {:error (str e)}))))
+ ;; Let it reach the wait-for-signal suspension
+ (Thread/sleep 400)
+ ;; Send signal via store-b (another "process" – empty callback atom)
+ (p/add-signal store-b wf-id "go" {:source :store-b})
+ ;; Wait up to 2 s for the workflow to wake
+ (let [r (deref result 2000 ::timeout)]
+ {:store label
+ :bug? (= ::timeout r)
+ :detail (if (= ::timeout r)
+ "Workflow stuck: signal row written to DB but callback only in store-a"
+ (str "Workflow woke unexpectedly: " r))})
+ (finally
+ (intemporal/shutdown-engine engine-a)
+ (when (instance? java.io.Closeable store-a) (.close store-a))
+ (when (instance? java.io.Closeable store-b) (.close store-b))))))
+
+(defn- scenario-1-2
+ "Bug 1.2 — Concurrent save-events with the same (workflow-id, seq).
+
+ JDBC: ON CONFLICT DO UPDATE silently overwrites — one write is lost, both
+ futures return without exception (silent data loss).
+ FDB: UUID-keyed writes produce DUPLICATE events at the same seq (history
+ has >1 row at seq=0, violating the 'one event per seq' invariant)."
+ [make-store label]
+ (let [store (make-store)
+ wf-id (str "bug12-" (random-uuid))
+ event-a {:event-type :workflow-started :seq 0 :writer "thread-a"
+ :timestamp (System/currentTimeMillis)}
+ event-b {:event-type :workflow-started :seq 0 :writer "thread-b"
+ :timestamp (System/currentTimeMillis)}
+ latch (promise)
+ t-a (future (deref latch) (try (p/save-events store wf-id [event-a]) :ok
+ (catch Exception e {:error (str e)})))
+ t-b (future (deref latch) (try (p/save-events store wf-id [event-b]) :ok
+ (catch Exception e {:error (str e)})))]
+ (deliver latch :go)
+ (let [ra @t-a
+ rb @t-b
+ history (p/load-history store wf-id)
+ seq0 (filter #(= 0 (:seq %)) history)
+ cnt (count seq0)
+ writers (set (keep :writer seq0))
+ ;; JDBC: both writes succeed but only 1 row survives → silent clobber
+ ;; FDB: both writes succeed and 2 rows survive → duplicate seq
+ jdbc-silent-overwrite? (and (= :ok ra) (= :ok rb) (= 1 cnt))
+ fdb-duplicate-seq? (> cnt 1)
+ result {:store label
+ :bug? (or jdbc-silent-overwrite? fdb-duplicate-seq?)
+ :detail (cond
+ jdbc-silent-overwrite?
+ (str "Both writes returned :ok but seq=0 has 1 row (writer="
+ (:writer (first seq0))
+ ") — one write silently clobbered by ON CONFLICT DO UPDATE")
+ fdb-duplicate-seq?
+ (str "seq=0 has " cnt " rows (writers=" writers
+ ") — UUID-keyed inserts produce duplicate-seq history")
+ :else
+ (str "No corruption detected: writes=" [ra rb] " seq0-count=" cnt))
+ :seq0-count cnt}]
+ (when (instance? java.io.Closeable store) (.close store))
+ result)))
+
+(defn- scenario-2-1
+ "Bug 2.1 — Register-then-consume signal race (deterministic via RacingStore).
+
+ process-signal executes:
+ (1) consume-signal → nil (no signal available)
+ (2) register-signal-callback
+
+ The RacingStore intercepts step (1): after consume-signal returns nil it
+ blocks on gate-nil/gate-sent, letting us inject a signal into the INNER
+ store BEFORE step (2) runs. After the sender delivers gate-sent the
+ consume returns nil and process-signal proceeds to register-callback.
+
+ At that point:
+ • Signal row IS in inner store (written by add-signal in the window)
+ • add-signal checked inner callbacks atom → found empty → fired no wake
+ • Callback IS now registered (step 2 ran after the window)
+ • But add-signal already ran with empty callbacks → wake was lost
+ • Callback will never fire retroactively
+ • Workflow is permanently stuck
+
+ Proof:
+ • workflow future times out (stuck)
+ • p/get-pending-signals returns the signal row (unconsumed)
+ • workflow status is :running"
+ [make-inner-store label]
+ (let [inner (make-inner-store)
+ gate-nil (promise)
+ gate-sent (promise)
+ store (->RacingStore inner gate-nil gate-sent (atom true))
+ wf-id (str "bug21-" (random-uuid))
+ result (promise)
+ engine (intemporal/make-workflow-engine :store store :threads 2)]
+ ;; Workflow thread
+ (future
+ (try
+ (let [r (intemporal/start-workflow engine wait-signal-wf []
+ :workflow-id wf-id)]
+ (deliver result r))
+ (catch Exception e (deliver result {:error (str e)}))))
+
+ ;; Wait until consume-signal returned nil (race window is open)
+ (let [gate-info (deref gate-nil 5000 ::timeout)]
+ (if (= ::timeout gate-info)
+ (do (intemporal/shutdown-engine engine)
+ (when (instance? java.io.Closeable inner) (.close inner))
+ {:store label :bug? false
+ :detail "Gate never opened — workflow did not reach consume-signal in time"})
+ (do
+ ;; Inject signal directly into the inner store.
+ ;; At this moment process-signal is parked between consume-check and register-callback.
+ ;; inner.add-signal writes the signal and checks callbacks atom → empty → no wake.
+ (p/add-signal inner wf-id "go" {:source :injected-in-race-window})
+ ;; Release the gate — let consume-signal return nil to process-signal
+ (deliver gate-sent :signal-injected)
+ ;; Give process-signal time to register the callback (step 2)
+ (Thread/sleep 200)
+ ;; Check outcome
+ (let [r (deref result 2000 ::timeout)
+ pending (p/get-pending-signals inner wf-id)
+ status (p/get-workflow-status inner wf-id)]
+ (intemporal/shutdown-engine engine)
+ (when (instance? java.io.Closeable inner) (.close inner))
+ {:store label
+ :bug? (= ::timeout r)
+ :detail (if (= ::timeout r)
+ (str "RACE CONFIRMED — signal injected in race window; "
+ "wake never fired; status=" status
+ "; orphaned signal keys=" (keys pending))
+ (str "Workflow woke (race not reproduced): " r))
+ :pending-signals (keys pending)
+ :final-status status})))))) ; closes: map let[r] do if let[gate-info] outer-let, defn
+
+(defn- scenario-2-3
+ "Bug 2.3 — Cancellation can't reach a sleeping workflow.
+
+ cancel-workflow sets cancelled=true in the store but does NOT call any
+ wake mechanism. A workflow sleeping in wait-for-signal never re-enters
+ the execution loop and therefore never observes the flag."
+ [make-store label]
+ (let [store (make-store)
+ wf-id (str "bug23-" (random-uuid))
+ result (promise)
+ engine (intemporal/make-workflow-engine :store store :threads 2)]
+ (try
+ (future
+ (try
+ (let [r (intemporal/start-workflow engine cancel-sleep-wf []
+ :workflow-id wf-id)]
+ (deliver result r))
+ (catch Exception e (deliver result {:error (str e)}))))
+ ;; Wait for workflow to suspend
+ (Thread/sleep 400)
+ ;; Cancel the workflow (sets the DB flag but sends no wake signal)
+ (intemporal/cancel-workflow store wf-id)
+ ;; Wait up to 2 s for the workflow to observe the cancellation
+ (let [r (deref result 2000 ::timeout)]
+ {:store label
+ :bug? (= ::timeout r)
+ :detail (if (= ::timeout r)
+ "Workflow stuck: cancelled flag set but sleeper never re-entered loop"
+ (str "Workflow woke after cancel (status=" (:status r) ")"))})
+ (finally
+ (intemporal/shutdown-engine engine)
+ (when (instance? java.io.Closeable store) (.close store))))))
+
+(defn- scenario-no-recovery-poller
+ "Bug 1.3 — No recovery poller: resume requires caller to know the function.
+
+ Simulates a pod restart by using TWO separate store instances (store-a for
+ engine-a, store-b for engine-b) pointing at the same backing database.
+ This mirrors a real restart: each JVM gets a fresh store object with an
+ empty callbacks atom.
+
+ After engine-a crashes, engine-b sends the signal via store-b. The signal
+ row lands in the DB, but store-a's callback atom (holding the wake-fn) is
+ gone. Engine-b has no recovery poller to detect the suspended workflow —
+ it must be resumed explicitly."
+ [make-store-a make-store-b label]
+ (let [store-a (make-store-a)
+ wf-id (str "bug13-" (random-uuid))
+ result (promise)
+ engine-a (intemporal/make-workflow-engine :store store-a :threads 2)]
+ (try
+ (future
+ (try
+ (let [r (intemporal/start-workflow engine-a wait-signal-wf []
+ :workflow-id wf-id)]
+ (deliver result r))
+ (catch Exception e (deliver result {:error (str e)}))))
+ ;; Let it suspend and register its callback
+ (Thread/sleep 500)
+ ;; "Crash" engine-a
+ (intemporal/shutdown-engine engine-a)
+ (when (instance? java.io.Closeable store-a) (.close store-a))
+ ;; Create engine-b with a FRESH store instance — simulates pod restart
+ (let [store-b (make-store-b)
+ engine-b (intemporal/make-workflow-engine :store store-b :threads 2)]
+ ;; Send signal via store-b (empty callback atom — just like a new process)
+ (p/add-signal store-b wf-id "go" {:source :engine-b-restart})
+ ;; Wait: engine-b has no poller to pick up the workflow
+ (let [r (deref result 2000 ::timeout)]
+ (intemporal/shutdown-engine engine-b)
+ (when (instance? java.io.Closeable store-b) (.close store-b))
+ {:store label
+ :bug? (= ::timeout r)
+ :detail (if (= ::timeout r)
+ "Engine-b (fresh store) sent signal but workflow never woke — no recovery poller"
+ "Workflow woke unexpectedly after engine restart")}))
+ (finally nil))))
+
+;; ── Store factories ───────────────────────────────────────────────────────────
+
+(defn- make-mem-store [] (mem-store/->InMemoryStore (atom {})))
+
+(defn- make-jdbc-store [] (jdbc-store/make-jdbc-store pg-url))
+
+(defn- make-fdb-store []
+ (let [db (open-fdb)]
+ (fdb-store/make-fdb-store db (str "verify-" (random-uuid)))))
+
+;; ── Report rendering ──────────────────────────────────────────────────────────
+
+(defn- fmt-result [{:keys [store bug? detail]}]
+ (let [icon (if bug? "FAIL ✗" "PASS ✓")]
+ (format " %-10s %s\n %s" store icon detail)))
+
+(defn- print-scenario [bug-id title results]
+ (print-banner (str bug-id " — " title))
+ (doseq [r results]
+ (println (fmt-result r)))
+ (println))
+
+;; ── Main entry point ─────────────────────────────────────────────────────────
+
+(defn run
+ "Entry point: clojure -X:dev:jdbc:fdb verify-bugs/run"
+ [_opts]
+ (println "\n╔══════════════════════════════════════════════════════════════════╗")
+ (println "║ intemporal bug verification — JDBC (Postgres) + FoundationDB ║")
+ (println "╚══════════════════════════════════════════════════════════════════╝\n")
+
+ ;; ----------------------------------------------------------------------------
+ (print-scenario
+ "Bug 1.1" "Lost wake on signal across store instances"
+ [(scenario-1-1 make-jdbc-store make-jdbc-store "JDBC")
+ (scenario-1-1 make-fdb-store make-fdb-store "FDB")])
+
+ ;; ----------------------------------------------------------------------------
+ (print-scenario
+ "Bug 1.2" "Concurrent write corruption at the same seq"
+ [(scenario-1-2 make-jdbc-store "JDBC")
+ (scenario-1-2 make-fdb-store "FDB")])
+
+ ;; ----------------------------------------------------------------------------
+ (print-scenario
+ "Bug 1.3" "No recovery poller — engine restart does not resume workflows"
+ [(scenario-no-recovery-poller make-jdbc-store make-jdbc-store "JDBC")
+ (scenario-no-recovery-poller make-fdb-store make-fdb-store "FDB")])
+
+ ;; ----------------------------------------------------------------------------
+ (print-scenario
+ "Bug 2.1" "Register-then-consume signal race (deterministic)"
+ [(scenario-2-1 make-jdbc-store "JDBC")
+ (scenario-2-1 make-fdb-store "FDB")])
+
+ ;; ----------------------------------------------------------------------------
+ (print-scenario
+ "Bug 2.3" "Cancellation cannot reach a sleeping workflow"
+ [(scenario-2-3 make-jdbc-store "JDBC")
+ (scenario-2-3 make-fdb-store "FDB")])
+
+ ;; ----------------------------------------------------------------------------
+ (println "\nNote: Bug 2.1 uses a latch-synchronized RacingStore to deterministically")
+ (println " inject a signal into the consume-nil→register-callback window.")
+ (println " The race is guaranteed to reproduce on every run.\n")
+
+ (System/exit 0))
diff --git a/doc/intemporal/doc.cljs b/doc/intemporal/doc.cljs
index 028a342..92b66b4 100644
--- a/doc/intemporal/doc.cljs
+++ b/doc/intemporal/doc.cljs
@@ -8,7 +8,7 @@
;;;;
;; main code
(defn activity-fn [a]
- [:some a ])
+ [:some a])
(defprotocol MyActivities
(foo [this a]))
diff --git a/docker/fdb.cluster b/docker/fdb.cluster
index 1e658b5..fc3b14b 100644
--- a/docker/fdb.cluster
+++ b/docker/fdb.cluster
@@ -1 +1 @@
-docker:docker@172.19.0.2:4500
+docker:docker@192.168.107.2:4500
diff --git a/issues.md b/issues.md
new file mode 100644
index 0000000..b7cebba
--- /dev/null
+++ b/issues.md
@@ -0,0 +1,218 @@
+# Intemporal Design Critique — Multi-Pod / k8s Replica Set Context
+
+## Context
+
+The user asked for a design analysis of the `intemporal` library, with explicit focus on:
+- Deadlocks and "lost workflow" failure modes
+- Other criticism
+- Behaviour under a Kubernetes deployment where multiple pods of a replica set run the engine concurrently and can scale up/down
+
+The library positions itself as "Temporal/Cadence-inspired" — an event-sourced workflow engine where workflow state is reconstructed from a persisted event log so that activities don't re-execute after a process crash.
+
+This document is an **analysis deliverable**, not an implementation plan. It is structured to be useful as input for an architecture decision (adopt / fork / replace / contribute fixes).
+
+---
+
+## TL;DR
+
+**`intemporal` is a single-process resilient workflow engine. It is unsafe to run more than one replica against the same store. The README's "not production-ready" disclaimer is accurate — and the gap to a true distributed orchestrator is structural, not cosmetic.**
+
+The two showstoppers for any k8s replica-set deployment:
+
+1. **No durable wake mechanism.** Signal callbacks and timers live in a process-local atom. When the pod that registered them dies, nothing in the system knows to wake the workflow again. The workflow is *persisted but orphaned*.
+2. **No ownership / leasing.** Two pods can race on the same `workflow-id` with no detection. The JDBC store actively masks the race with `ON CONFLICT … DO UPDATE`, silently corrupting the event log.
+
+A third structural issue: there is **no poller / recovery worker**. After a crash, no pod scans for workflows that should be running. Resume only happens if an external actor explicitly calls `resume-workflow` with the right function and args. Durability of state without durability of execution is illusion.
+
+---
+
+## Architecture Summary (what I'm critiquing)
+
+Verified from the source:
+
+- **Engine** (`src/intemporal/core.cljc:445`): a map of `{:store :executor :scheduler :registry :observer}`. No identity, no node id, no clustering primitives.
+- **start-workflow** (`src/intemporal/internal/fns/start_workflow.clj:8-78`): generates a UUID, writes `:workflow-started`, calls `run-workflow-internal` **synchronously on the calling thread**, then loops blocking on a local `promise` until the workflow completes or is interrupted.
+- **resume-workflow** (`src/intemporal/core.cljc:366-390`): the caller must supply `workflow-id`, `workflow-fn`, and `args`. The engine replays history and re-enters execution.
+- **Stores** implement `IStore` (`src/intemporal/protocol.cljc:8-21`). The protocol contains only: history read/write, signal add/consume, callback register/unregister, cancellation flag, and status. **No claim, no lease, no heartbeat, no "list running workflows".**
+- **JDBC store** (`src/intemporal/store/jdbc.clj`): events written under transactions; uses `ON CONFLICT (workflow_id, seq) DO UPDATE` (line 100-103); signal consumption uses `FOR UPDATE SKIP LOCKED` (line 137). Signal callbacks are kept in a **per-process atom** on the store record (line 72, 143-147).
+- **InMemoryStore** (`src/intemporal/store.cljc:8-78`): identical callback semantics — a single atom.
+- **DefaultScheduler** (`src/intemporal/internal/runtime.clj`): timers held in an in-memory `pending-timers` atom. Lost on process exit.
+
+---
+
+## Section 1 — Distributed-Deployment Showstoppers
+
+### 1.1 No wake mechanism survives a pod restart *(severity: critical)*
+
+`register-signal-callback` and `schedule-timer` store their continuation **inside the process** (atom for signals, `ScheduledFuture` for timers). Concretely:
+
+- `JdbcStore` carries `callbacks` as `(atom {})` at construction (`src/intemporal/store/jdbc.clj:72`). Two pods sharing the same Postgres each have their own empty atom.
+- `add-signal` in JDBC (`src/intemporal/store/jdbc.clj:122-132`) writes the signal under a transaction, then does `(when-let [callback (get-in @callbacks [workflow-id signal-name])] (future (callback)))` — **only the pod that registered the callback can fire it**.
+
+Failure trace:
+
+1. Pod A executes workflow X up to `(wait-for-signal :go)`. It writes the suspension to history, registers callback in **pod-A-local atom**, blocks the calling thread on a promise.
+2. Pod A crashes (k8s scale-down, OOM, node failure).
+3. Pod B receives an HTTP request → calls `(send-signal store "X" :go {})`. Postgres now has the signal row. Pod B's local callbacks atom is empty for X → **no wake**.
+4. Workflow X is permanently stuck. Its history is intact, its signal is queued, and no process knows to re-enter execution.
+
+Timers have the same problem with worse blast radius: a 1-hour timer scheduled on pod A *vanishes* the moment pod A dies, even if nobody sends a signal. No row in any table, no scheduled job, no poller.
+
+**This is the "losing workflows" failure mode**, and it does not require any race: it happens on every routine k8s rolling restart.
+
+### 1.2 No ownership → silent concurrent execution *(severity: critical)*
+
+`start-workflow` and `resume-workflow` do not claim anything. Two pods can run the same workflow id concurrently. Specifically:
+
+- `JdbcStore.save-events` uses `ON CONFLICT (workflow_id, seq) DO UPDATE` (`src/intemporal/store/jdbc.clj:100-103`). When two pods append the same seq, **the loser silently wins** (last writer overwrites). The race is masked; the event log is non-deterministic.
+- Activities are re-executed on each pod — at-least-once degrades to at-many-times.
+- The replay invariant ("same input → same event stream") is violated because two engines may emit different events at the same seq.
+
+This becomes very easy to trigger: `start-workflow` blocks the caller. A reverse-proxy retry on a slow `POST /workflows` will re-invoke `start-workflow`, and if the client supplies `:workflow-id` for idempotency the second call appends a duplicate `:workflow-started` event to history rather than rejecting (no uniqueness check; the `seq` for the first event likely overwrites if both pods reach `seq=0` simultaneously).
+
+A safer schema would use `ON CONFLICT DO NOTHING` (or reject) on event inserts, plus a `(workflow_id, owner_lease, lease_expires_at)` claim row.
+
+### 1.3 No recovery poller *(severity: critical)*
+
+There is no background process anywhere in the codebase that scans for workflows requiring execution. `resume-workflow` is **on-demand only** and **requires the caller to know the workflow function and args**.
+
+This couples recovery to application code: every pod that starts up must explicitly enumerate "things that might be suspended" and call `resume-workflow` with the right vars in scope. The library provides no list-by-status query and no API for "given a workflow id, find the function and resume". For a multi-replica deployment, recovery is essentially a problem the user has to solve outside the library.
+
+Temporal solves this with task queues + workers that long-poll the server. Intemporal has neither concept.
+
+---
+
+## Section 2 — Deadlock & Lost-Workflow Scenarios
+
+### 2.1 Register-after-consume race *(severity: high, even on a single pod)*
+
+`process-signal` (`src/intemporal/internal/execution.clj:223-255`) is described by the explore agent as:
+
+```
+1. consume-signal — if present, return
+2. otherwise register-signal-callback
+3. suspend
+```
+
+This is a classic TOCTOU. Between (1) returning nil and (2) writing the callback, another thread (or another pod) can call `add-signal`. The signal lands in the store; the callback fires nothing (it isn't registered yet); the registration completes after the signal write; the workflow suspends forever.
+
+The fix is the standard one: register the callback first, then check, then unregister + consume if a signal was already present.
+
+### 2.2 Signal sent to a workflow not yet started
+
+`send-signal` will happily write a signal for an unknown workflow id (in JDBC, line 124 inserts a workflow row via upsert). If the workflow is later started but the start path doesn't drain pre-existing signals before reaching `wait-for-signal`, the signal may or may not be picked up depending on ordering — worth a targeted test, since signals are addressed by name and the workflow expects FIFO semantics per name.
+
+### 2.3 Cancellation cannot reach a suspended workflow
+
+`cancel-workflow` (`src/intemporal/core.cljc:411-418`) sets a flag. Cancellation is **polled at sequence points** (per the explore agent: `check-cancelled!` before each operation, and at the top of the execution loop). A workflow that is suspended on `wait-for-signal` with no signal will:
+
+- never re-enter the execution loop on its own,
+- never poll the flag,
+- be invisible to cancellation.
+
+So `cancel-workflow` is **not reliable for any workflow that is currently waiting**. The flag is set in the DB but the workflow only sees it next time it wakes — which may be never (see 1.1).
+
+### 2.4 Long-lived `start-workflow` thread
+
+`start-workflow` blocks the calling thread until the workflow either completes or is interrupted (`src/intemporal/internal/fns/start_workflow.clj:67-75`). A workflow that waits 30 days for a signal holds the caller's thread for 30 days. This is incompatible with HTTP request/response in any normal web framework and leaks pod resources at scale. The "right" pattern (return a workflow id immediately; durable wake later) is precisely the pattern that doesn't exist (see 1.1, 1.3).
+
+### 2.5 `max-iterations` foot-gun
+
+Default 1000 replay iterations (`src/intemporal/internal/fns/start_workflow.clj:23`). A workflow with thousands of activities or a long signal-driven loop will silently fail at replay. No clear surface to detect this in production.
+
+### 2.6 In-flight activities at shutdown become "interrupted"
+
+Per the crash-test agent: activities crashed mid-execution are marked `:activity-interrupted` and re-run on resume. This is correct behaviour for at-least-once. **However**, combined with 1.1, the resume never happens automatically — so the activity neither completes nor restarts. Worth distinguishing in docs: "at-least-once if you remember to call resume-workflow".
+
+---
+
+## Section 3 — Other Criticism
+
+### 3.1 JDBC schema lacks the columns it needs
+
+Looking at `resources/migrations/postgres/20260215214002-initial-schema.up.sql` (referenced by the explore agent): three tables, no `status` column on `intemporal_workflows`, no `owner`, no `lease_expires_at`, no `last_heartbeat_at`, no index for "find running workflows". Status is derived by scanning `intemporal_history` and reading `last(event-type)` — O(history-length) per status query.
+
+`ON CONFLICT (workflow_id, seq) DO UPDATE` (line 100-103) is the wrong policy. Two correct writers should not be allowed to coexist; the conflict should be loud (`DO NOTHING` + check `affected`, or a `version` column with CAS).
+
+### 3.2 The IStore protocol is too thin
+
+For a multi-tenant durable orchestrator you'd expect at minimum:
+
+- `claim-workflow [store workflow-id worker-id lease-ttl]` → boolean
+- `renew-lease [store workflow-id worker-id]`
+- `release [store workflow-id worker-id]`
+- `list-runnable [store worker-id batch-size]` (signals arrived / timers due / leases expired)
+- Persistent timer rows (`{workflow_id, seq, fire_at}`)
+- Persistent "needs wake" markers
+
+None of these exist. Adding them is not a small patch; it touches the execution engine's assumption that wakes are local.
+
+### 3.3 Signal callbacks duplicated as in-process state across store impls
+
+Every store maintains its own callback atom (`InMemoryStore` via `:signal-callbacks` in the state map, `JdbcStore` via a separate `(atom {})` field). For the JDBC store this is conceptually wrong: the store is shared, but a process-local atom shadows it. A correct multi-pod implementation would use a notification mechanism the database already provides — Postgres `LISTEN/NOTIFY`, an explicit watch table, or an external pub/sub — and would remove `register-signal-callback` from `IStore` entirely (it isn't really a store concern).
+
+### 3.4 No separation between "orchestrator" and "worker"
+
+`IActivityExecutor` runs activities in the same process that runs the workflow. There is no way to dispatch activities to a separate worker pool (e.g., a "heavy I/O" replica set distinct from "orchestrator" replicas). Heavy activities consume the same thread budget that drives workflows.
+
+### 3.5 Recovery requires the caller to know the workflow function
+
+`resume-workflow` takes `workflow-fn` and `args`. The library has no registry that maps `workflow-id → function var`. Every pod that wants to recover must:
+
+1. Query the store for workflows in `:running` status (no such query exists).
+2. Look up the right function var (no such mapping exists).
+3. Recover the original args (they live in the `:workflow-started` event — accessible, but undocumented).
+
+In practice this means the user writes their own dispatch table and recovery loop. The library does not provide a working recovery story out of the box.
+
+### 3.6 Observer protocol is a good idea, slightly under-spec'd
+
+`IWorkflowObserver` (`src/intemporal/protocol.cljc:45-62`) is clean and gives the right hooks for tracing. Two gaps worth noting:
+
+- No `on-store-write` / `on-suspension-persisted` — useful for "did the durability write succeed before we acked the activity?"
+- No `on-replay-iteration` — useful for diagnosing slow replays.
+
+### 3.7 Documentation gap
+
+The README says "not production-ready" but doesn't enumerate **why**. A short "Operational Caveats" section listing 1.1, 1.2, 1.3 would prevent users from misjudging the library based on the Temporal-flavoured API surface.
+
+---
+
+## Section 4 — What a Multi-Pod-Safe Version Would Need
+
+Not a request to implement; a calibration of how far the library is from the goal.
+
+1. **Lease-based ownership.** Add `claim_workflow(worker_id, ttl)` + `renew` + `release`. Reject all writes from a worker whose lease has expired. Heartbeat from a background thread.
+2. **Persistent timers.** Add a `intemporal_timers (workflow_id, seq, fire_at, claimed_by, claimed_until)` table and a poller (`SELECT … WHERE fire_at <= now() AND claimed_until < now() FOR UPDATE SKIP LOCKED`).
+3. **Persistent wake markers.** When a signal arrives or a timer fires, write a row to `intemporal_runnable (workflow_id)`. Each pod polls this table (or `LISTEN`s on `NOTIFY`).
+4. **Durable workflow registry.** Map `workflow_id → workflow_function_symbol + args`. Store the symbol in the `:workflow-started` event; have every pod register the symbols it can resolve.
+5. **Reject concurrent writers.** Change `ON CONFLICT DO UPDATE` to `DO NOTHING` and fail the workflow run on conflict (lease violation).
+6. **Async `start-workflow`.** Return `{:workflow-id …}` immediately; let the worker loop pick up the new workflow from the runnable queue.
+7. **Fix the signal register-then-consume race** (2.1) — even single-process correctness depends on this.
+8. **Cancellation that wakes a sleeper.** Cancellation should write a runnable marker that forces the workflow to wake and observe the flag (currently it only sets the flag).
+
+Items 1–6 are essentially "build a real distributed workflow engine". Item 7 is a bug fix. Item 8 is a small targeted change.
+
+---
+
+## Section 5 — Pragmatic Recommendations (no code changes implied)
+
+For someone evaluating this library:
+
+- **Safe today**: single process, in-memory store, side-effects inside short-lived activities — i.e., as a structured way to write resumable in-memory orchestrations. Fine for tests, batch jobs, single-node tools.
+- **Risky**: any deployment with `>1` replica, even with the JDBC store. Will not lose data, but **will lose execution liveness** on every pod restart, and **will corrupt history** under concurrent retries.
+- **Don't**: rely on it as a Temporal replacement in k8s without writing significant infrastructure on top (leasing, polling, dispatch, signal fan-out).
+
+---
+
+## Verification (how to confirm the above claims yourself)
+
+Quick reproductions, each ~10–30 minutes:
+
+1. **Lost wake on signal across processes.** Start two REPLs with the same Postgres URL. REPL A: `start-workflow` a workflow that calls `wait-for-signal`. Kill REPL A (`System/exit`). REPL B: `send-signal` for that workflow id. Confirm the signal sits in `intemporal_signals` and nothing happens. Restart REPL A: the workflow only resumes if you explicitly call `resume-workflow`.
+2. **Concurrent start corrupts history.** Two REPLs call `start-workflow` with the same `:workflow-id` simultaneously. Inspect `intemporal_history` — observe duplicate `:workflow-started` rows or silently overwritten events at the same `seq`.
+3. **Lost timer.** Start a workflow that sleeps for 5 minutes. Kill the JVM within 30 seconds. Restart it without calling `resume-workflow`. Confirm the workflow never fires.
+4. **Cancellation cannot reach a sleeper.** Start a workflow that does `(wait-for-signal :go)` and immediately `cancel-workflow`. Observe the cancelled flag is set but the workflow never terminates (it never re-enters the loop to observe the flag).
+5. **Register-then-consume race.** A targeted test that interleaves `wait-for-signal` and `send-signal` on the same workflow id at the consume-then-register window. May require thread sleep instrumentation in `process-signal` to reproduce reliably.
+
+If any of these *don't* reproduce, the analysis is wrong on that point and the relevant section should be revised.
diff --git a/resources/migrations/postgres/20260531000001-add-status.down.sql b/resources/migrations/postgres/20260531000001-add-status.down.sql
new file mode 100644
index 0000000..a4cb13d
--- /dev/null
+++ b/resources/migrations/postgres/20260531000001-add-status.down.sql
@@ -0,0 +1,3 @@
+DROP INDEX IF EXISTS idx_intemporal_workflows_status;
+--;;
+ALTER TABLE intemporal_workflows DROP COLUMN IF EXISTS status;
diff --git a/resources/migrations/postgres/20260531000001-add-status.up.sql b/resources/migrations/postgres/20260531000001-add-status.up.sql
new file mode 100644
index 0000000..96a418a
--- /dev/null
+++ b/resources/migrations/postgres/20260531000001-add-status.up.sql
@@ -0,0 +1,7 @@
+-- Phase B2: O(1) workflow status. Avoids scanning intemporal_history to derive
+-- the current status, and gives the Phase C recovery poller a cheap predicate.
+ALTER TABLE intemporal_workflows
+ ADD COLUMN IF NOT EXISTS status TEXT NOT NULL DEFAULT 'running';
+--;;
+CREATE INDEX IF NOT EXISTS idx_intemporal_workflows_status
+ ON intemporal_workflows (status);
diff --git a/resources/migrations/postgres/20260531000002-ownership.down.sql b/resources/migrations/postgres/20260531000002-ownership.down.sql
new file mode 100644
index 0000000..222f4e4
--- /dev/null
+++ b/resources/migrations/postgres/20260531000002-ownership.down.sql
@@ -0,0 +1,3 @@
+DROP INDEX IF EXISTS idx_intemporal_workflows_owner;
+--;;
+ALTER TABLE intemporal_workflows DROP COLUMN IF EXISTS owner;
diff --git a/resources/migrations/postgres/20260531000002-ownership.up.sql b/resources/migrations/postgres/20260531000002-ownership.up.sql
new file mode 100644
index 0000000..bee9b8c
--- /dev/null
+++ b/resources/migrations/postgres/20260531000002-ownership.up.sql
@@ -0,0 +1,9 @@
+-- Phase C: ownership-based recovery.
+-- A workflow is owned by at most one pod (a stable owner-id). A worker resumes
+-- the non-terminal workflows it owns-or-null; a crashed pod's work is reclaimed
+-- when it restarts with the same owner-id. No time-based leases.
+ALTER TABLE intemporal_workflows
+ ADD COLUMN IF NOT EXISTS owner TEXT;
+--;;
+CREATE INDEX IF NOT EXISTS idx_intemporal_workflows_owner
+ ON intemporal_workflows (owner);
diff --git a/resources/migrations/postgres/20260531000003-wake-at.down.sql b/resources/migrations/postgres/20260531000003-wake-at.down.sql
new file mode 100644
index 0000000..71d3650
--- /dev/null
+++ b/resources/migrations/postgres/20260531000003-wake-at.down.sql
@@ -0,0 +1,3 @@
+DROP INDEX IF EXISTS idx_intemporal_workflows_wake_at;
+--;;
+ALTER TABLE intemporal_workflows DROP COLUMN IF EXISTS wake_at;
diff --git a/resources/migrations/postgres/20260531000003-wake-at.up.sql b/resources/migrations/postgres/20260531000003-wake-at.up.sql
new file mode 100644
index 0000000..3a217aa
--- /dev/null
+++ b/resources/migrations/postgres/20260531000003-wake-at.up.sql
@@ -0,0 +1,13 @@
+-- C2: earliest-wake filter for the ownership scan. A workflow suspended on a
+-- timer (sleep / signal-with-timeout) records when it next needs attention, so
+-- the recovery worker can skip long-sleeping workflows until they are due
+-- instead of replaying them every poll. NULL = always eligible (e.g. waiting on
+-- an external signal, not the clock).
+ALTER TABLE intemporal_workflows
+ ADD COLUMN IF NOT EXISTS wake_at TIMESTAMPTZ;
+--;;
+-- Partial index for the due-scan: only non-terminal rows with a future wake_at
+-- are interesting to the poller's "skip until due" predicate.
+CREATE INDEX IF NOT EXISTS idx_intemporal_workflows_wake_at
+ ON intemporal_workflows (wake_at)
+ WHERE wake_at IS NOT NULL;
diff --git a/src/intemporal/core.cljc b/src/intemporal/core.cljc
index a0502ed..b9a284d 100644
--- a/src/intemporal/core.cljc
+++ b/src/intemporal/core.cljc
@@ -6,6 +6,7 @@
[intemporal.internal.execution :as exec]
[intemporal.internal.logging :as log]
[intemporal.internal.fns.start-workflow :as sw]
+ [intemporal.internal.workflow-registry :as wreg]
[intemporal.protocol :as p]
[intemporal.store :as store]
[intemporal.observer :as obs]
@@ -32,10 +33,8 @@
effective-timeout (or timeout-ms (:timeout-ms activity-info))
effective-retry (or retry-policy (:retry-policy activity-info))]
(fn [& args]
- (let [seq-num (ctx/next-seq!)]
+ (let [seq-num (ctx/next-seq!)] ;; next-seq! already checks cancellation
(log/with-mdc {:activity activity-name :seqnum seq-num}
-
- (ctx/check-cancelled!)
(let [ctx (ctx/current-context)
store (ctx/current-store)
workflow-id (ctx/current-workflow-id)
@@ -290,13 +289,19 @@
existing (p/find-event store workflow-id :timer-fired seq-num)]
(if existing
nil
- (let [fire-at (+ (utils/current-time-ms) ms)]
- (ctx/add-pending-event! {:event-type :timer-scheduled
- :seq seq-num
- :fire-at fire-at
- :duration-ms ms
- :timestamp (utils/current-time-ms)})
- (ctx/notify-observer p/on-timer-scheduled (:workflow-id ctx) seq-num fire-at)
+ ;; Reuse the fire-at from a prior :timer-scheduled event if one was already
+ ;; persisted for this seq. Recomputing (now + ms) on every replay would push
+ ;; the deadline later on each resume (drift) and make a crash-resumed sleep
+ ;; never reliably fire. The fire time must be deterministic across replays.
+ (let [prior (p/find-event store workflow-id :timer-scheduled seq-num)
+ fire-at (or (:fire-at prior) (+ (utils/current-time-ms) ms))]
+ (when-not prior
+ (ctx/add-pending-event! {:event-type :timer-scheduled
+ :seq seq-num
+ :fire-at fire-at
+ :duration-ms ms
+ :timestamp (utils/current-time-ms)})
+ (ctx/notify-observer p/on-timer-scheduled (:workflow-id ctx) seq-num fire-at))
(throw (error/make-suspension :timer {:seq seq-num
:fire-at fire-at}))))))
;; ============================================================================
@@ -363,6 +368,51 @@
[engine workflow-fn args & opts]
(apply sw/start-workflow engine workflow-fn args opts))
+
+#?(:clj
+ (defn submit-workflow
+ "Start a workflow asynchronously and return {:workflow-id id} immediately,
+ without blocking the caller until completion (improvements.md §B4). The
+ workflow runs on a background thread; use await-workflow to wait for the
+ result, or resume-workflow/get-workflow-status to observe it later.
+
+ Accepts the same options as start-workflow (:workflow-id, :observer, …)."
+ [engine workflow-fn args & opts]
+ (let [m (apply hash-map opts)
+ wid (or (:workflow-id m) (str (random-uuid)))
+ opts' (mapcat identity (assoc m :workflow-id wid))]
+ (future
+ (try
+ (apply sw/start-workflow engine workflow-fn args opts')
+ (catch Throwable t
+ (log/warnf t "submit-workflow background run failed"))))
+ {:workflow-id wid})))
+
+#?(:clj
+ (defn await-workflow
+ "Block until the workflow reaches a terminal state (:completed, :failed,
+ :cancelled) and return {:status … :result …}. Polls get-workflow-status;
+ a workflow id that is briefly :not-found (still starting) is tolerated.
+ Returns {:status :timeout} if the deadline elapses first."
+ [{:keys [store]} workflow-id & {:keys [poll-ms timeout-ms]
+ :or {poll-ms 50 timeout-ms 30000}}]
+ (let [deadline (+ (System/currentTimeMillis) timeout-ms)]
+ (loop []
+ (let [st (p/get-workflow-status store workflow-id)]
+ (cond
+ (#{:completed :failed :cancelled} st)
+ {:status st
+ :result (->> (p/load-history store workflow-id)
+ (filter #(= :workflow-completed (:event-type %)))
+ first
+ :result)}
+
+ (> (System/currentTimeMillis) deadline)
+ {:status :timeout :workflow-id workflow-id}
+
+ :else
+ (do (Thread/sleep (long poll-ms)) (recur))))))))
+
(defn resume-workflow
"Resume a waiting workflow (e.g., after signal delivery or timer).
@@ -379,16 +429,100 @@
Options:
- :observer - IWorkflowObserver
- :max-iterations - Maximum replay iterations"
- [{:keys [store executor scheduler registry] :as engine} workflow-id workflow-fn args
- & {:keys [observer max-iterations]
- :or {max-iterations 1000}}]
- (when observer
- (p/on-workflow-resumed observer workflow-id))
- (log/info "Workflow resumed")
- (exec/run-workflow-internal engine workflow-id workflow-fn args
- {:observer observer
- :max-iterations max-iterations}))
-
+ ([{:keys [store] :as engine} workflow-id]
+ ;; Resolve fn + args from the :workflow-started event via the workflow
+ ;; registry (improvements.md §B3). Requires the workflow fn to have been
+ ;; registered in this process (start-workflow does so automatically; a
+ ;; restarted/other process must register its workflow vars at startup).
+ (let [history (p/load-history store workflow-id)
+ started (first (filter #(= :workflow-started (:event-type %)) history))]
+ (when-not started
+ (throw (ex-info "Cannot resume: no :workflow-started event in history"
+ {:workflow-id workflow-id})))
+ ;; resolve-workflow throws a descriptive ex-info if the fn is not registered
+ ;; in this process (e.g. a fresh process that forgot to register its vars).
+ (let [wf-name (:workflow-fn-name started)
+ wf-fn (wreg/resolve-workflow wf-name)]
+ (resume-workflow engine workflow-id wf-fn (vec (:args started))))))
+ ([{:keys [store executor scheduler registry] :as engine} workflow-id workflow-fn args
+ & {:keys [observer max-iterations]
+ :or {max-iterations 1000}}]
+ (when observer
+ (p/on-workflow-resumed observer workflow-id))
+ (log/info "Workflow resumed")
+ (exec/run-workflow-internal engine workflow-id workflow-fn args
+ {:observer observer
+ :max-iterations max-iterations})))
+
+#?(:clj
+ (defn start-worker
+ "Start a background recovery worker (Phase C, ownership model). Each poll it
+ lists the non-terminal workflows this owner may run — its own plus any
+ unowned (`owner = owner-id OR owner IS NULL`) — claims each by stamping
+ ownership, and resumes it by id. This is the cross-pod wake AND the crash
+ recovery: the first poll re-picks this owner's orphaned workflows, and a
+ later poll re-resumes a signalled/cancelled one (replay consumes the
+ signal / observes the cancellation). Workflows are resumed sequentially on
+ the poll thread, so neither cross-pod nor intra-pod double-execution occurs.
+
+ Use a STABLE owner-id per pod (e.g. StatefulSet ordinal / config) so a
+ crashed pod reclaims its own work on restart. Returns a 0-arg stop fn that
+ releases this owner's workflows (so other pods can pick them up).
+
+ The worker resumes via resume-workflow [engine workflow-id], so the workflow
+ function must be registered in this process (start-workflow registers it
+ automatically; a fresh process must register its workflow vars at startup).
+
+ Options:
+ :owner-id stable id for this worker (default: random uuid)
+ :poll-ms poll interval (default 500)
+ :batch-size max workflows scanned per poll (default 100)"
+ [{:keys [store] :as engine}
+ & {:keys [owner-id poll-ms batch-size]
+ :or {owner-id (str (random-uuid)) poll-ms 500 batch-size 100}}]
+ (let [running (atom true)
+ process-one
+ (fn [wf-id]
+ (when (p/claim-owner store wf-id owner-id)
+ (try
+ (resume-workflow engine wf-id)
+ (catch Throwable t
+ (log/warnf t "Worker %s failed resuming %s" owner-id wf-id)))))
+ ;; Exponential backoff on consecutive poll failures so a downed
+ ;; database doesn't get hammered (and the logs flooded). Resets to
+ ;; poll-ms after any successful list-pending query.
+ max-backoff-ms (* poll-ms 60)
+ backoff-ms (atom (long poll-ms))
+ thread
+ (Thread.
+ ^Runnable
+ (fn []
+ (while @running
+ (try
+ (let [ids (p/list-pending store owner-id batch-size)]
+ (reset! backoff-ms (long poll-ms)) ; healthy query: reset backoff
+ (if (seq ids)
+ (doseq [wf-id ids :while @running]
+ (process-one wf-id))
+ (Thread/sleep (long poll-ms))))
+ (catch InterruptedException _ (reset! running false))
+ (catch Throwable t
+ (let [wait @backoff-ms]
+ (log/warnf t "Worker %s loop error; backing off %dms" owner-id wait)
+ (Thread/sleep wait)
+ (swap! backoff-ms #(min max-backoff-ms (* 2 %)))))))))]
+ (doto thread
+ (.setDaemon true)
+ (.setName (str "intemporal-worker-" owner-id))
+ (.start))
+ (fn stop-worker []
+ (reset! running false)
+ (.interrupt thread)
+ ;; Wait briefly for an in-flight resume to finish before releasing
+ ;; ownership, so another pod doesn't pick up a workflow that is still
+ ;; executing here (#7). Bounded so stop never blocks indefinitely.
+ (.join thread (long poll-ms))
+ (p/release-owner store owner-id)))))
(defn send-signal
"Send a signal to a workflow.
@@ -402,6 +536,10 @@
Options:
- :signal-id - Custom signal ID for idempotency"
[store workflow-id signal-name payload & {:keys [signal-id]}]
+ (let [status (p/get-workflow-status store workflow-id)]
+ (when-not (= status :running)
+ (throw (ex-info "Cannot send signal: workflow is not active"
+ {:workflow-id workflow-id :status status}))))
(let [id (or signal-id (str (random-uuid)))]
(log/with-mdc {:workflow-id workflow-id}
(p/add-signal store workflow-id signal-name {:id id :payload payload})
@@ -410,11 +548,18 @@
(defn cancel-workflow
"Cancel a running workflow.
- The workflow will be cancelled at the next suspension point."
+ The workflow is cancelled at the next suspension point. If it is currently
+ suspended (e.g. waiting on a signal), wake-workflow forces it to re-enter its
+ loop so it observes the cancellation flag rather than waiting forever."
[store workflow-id]
(log/with-mdc {:workflow-id workflow-id}
- (p/mark-cancelled store workflow-id)
- (log/debugf "Cancelling workflow"))
+ (let [status (p/get-workflow-status store workflow-id)]
+ (if (#{:completed :failed :cancelled} status)
+ (log/debugf "Cancelling workflow that is already in terminal state %s, skipping" status)
+ (do
+ (p/mark-cancelled store workflow-id)
+ (p/wake-workflow store workflow-id)
+ (log/debugf "Cancelling workflow")))))
{:cancelled true :workflow-id workflow-id})
(defn get-workflow-history
@@ -438,6 +583,86 @@
[proto & opts]
`(im/stub-protocol ~proto ~@opts))
+;; ============================================================================
+;; Saga / Compensations
+;; ============================================================================
+
+(defn suspension?
+ "True if `e` is an internal workflow suspension (the engine's normal control
+ flow for activities, timers, signals, etc.). Mainly needed in ClojureScript,
+ where every throwable is a js/Error and `(catch :default e)` catches
+ suspensions too - a saga catch there must rethrow them via this predicate.
+ On the JVM suspensions subclass Error, so `(catch Exception e)` already
+ excludes them and no guard is needed. See `saga`."
+ [e]
+ (error/suspension? e))
+
+(defn saga
+ "Create a saga: a handle that collects compensation thunks for the steps a
+ workflow has completed. Register compensations as you go with
+ `add-compensation`, and run them with `compensate` from a catch block.
+
+ Both real failures and workflow cancellation flow through the catch (so this
+ rolls back in either case); the engine's normal control-flow suspensions do
+ not. On the JVM, catch `Exception` - suspensions subclass Error and are
+ excluded automatically:
+
+ (let [s (saga)]
+ (try
+ (let [h (book-hotel order)
+ _ (add-compensation s #(cancel-hotel h))]
+ (charge-card order))
+ (catch Exception e
+ (compensate s) ;; rolls back completed steps, LIFO
+ (throw e))))
+
+ In ClojureScript there is no Error/Exception split, so catch :default and
+ rethrow suspensions explicitly:
+
+ (catch :default e
+ (when (suspension? e) (throw e))
+ (compensate s)
+ (throw e))"
+ []
+ {::compensations (atom [])})
+
+(defn add-compensation
+ "Register a 0-arg compensation thunk on `saga`. Compensations run in reverse
+ registration order (LIFO) when `compensate` is called. The thunk should call
+ activity stubs (closing over the step's result) so it is durable / replay-safe.
+ Register a step's compensation only after the step succeeds, so a step that
+ never completed registers nothing to undo."
+ [saga thunk]
+ (swap! (::compensations saga) conj thunk))
+
+(defn compensate
+ "Run `saga`'s registered compensations in reverse (LIFO). Real errors from a
+ compensation are logged and skipped (best-effort rollback); a suspension (a
+ compensating activity running for the first time) is rethrown so the engine
+ schedules and resumes it - on replay already-run compensations return cached
+ results."
+ [saga]
+ (let [comps @(::compensations saga)]
+ (when (seq comps)
+ (ctx/notify-observer p/on-compensation-started (ctx/current-workflow-id)))
+ ;; Suppress the cancellation check so compensating activities can run even
+ ;; when this rollback was triggered by a cancellation (the cancel exception
+ ;; was already caught by the user before calling compensate).
+ (ctx/set-compensating! true)
+ (try
+ (doseq [c (reverse comps)]
+ (try
+ (c)
+ (catch #?(:clj Throwable :cljs js/Error) t
+ (when (error/suspension? t) (throw t))
+ (ctx/notify-observer p/on-compensation-failed
+ (ctx/current-workflow-id) (error/throwable->map t))
+ (log/warnf "Compensation failed, continuing: %s" (ex-message t)))))
+ (finally
+ (ctx/set-compensating! false)))
+ (when (seq comps)
+ (ctx/notify-observer p/on-compensation-completed (ctx/current-workflow-id)))))
+
;; ============================================================================
;; Convenience Functions
;; ============================================================================
@@ -452,26 +677,29 @@
- :scheduler-threads - Number of scheduler threads (default: 2)
- :default-timeout-ms - Default activity timeout (default: 30000)
- :enable-logging - Enable logging observer (default: false)
- - :observer - Custom observer instance (overrides :enable-logging)"
- [& {:keys [store threads scheduler-threads default-timeout-ms enable-logging observer]
+ - :enable-telemetry - Enable OpenTelemetry observer (default: false, JVM only)
+ - :observer - Additional observer instance, composed on top of built-in observers"
+ [& {:keys [store threads scheduler-threads default-timeout-ms enable-logging enable-telemetry observer]
:or {store (store/->InMemoryStore (atom {}))
threads 4
scheduler-threads 2
default-timeout-ms 30000
- enable-logging false}}]
+ enable-logging false
+ enable-telemetry false}}]
(let [registry (a/make-registry)
- log-atom (when enable-logging (atom []))]
+ log-atom (when enable-logging (atom []))
+ logging-observer (when enable-logging (obs/make-logging-observer log-atom))
+ otel-observer #?(:clj (when enable-telemetry
+ ((requiring-resolve 'intemporal.observer.otel/make-otel-observer)))
+ :cljs nil)
+ composite-observer (obs/make-composite-observer [logging-observer otel-observer observer])]
{:store store
:executor (runtime/make-vthreads-executor registry
:threads threads
:default-timeout-ms default-timeout-ms)
:scheduler (runtime/make-scheduler :threads scheduler-threads)
:registry registry
- ;; opts
- :observer (or observer
- (if enable-logging
- (obs/make-logging-observer log-atom)
- (obs/noop-observer)))
+ :observer composite-observer
:log (when enable-logging log-atom)}))
(defn shutdown-engine
diff --git a/src/intemporal/internal/activity.cljc b/src/intemporal/internal/activity.cljc
index 7d42a73..89b0708 100644
--- a/src/intemporal/internal/activity.cljc
+++ b/src/intemporal/internal/activity.cljc
@@ -88,6 +88,7 @@
;; Retry Policy
;; ============================================================================
+;; TODO no need for a record?
(defrecord RetryPolicy [max-attempts
backoff-ms
max-backoff-ms
diff --git a/src/intemporal/internal/context.cljc b/src/intemporal/internal/context.cljc
index 548a790..cd2f561 100644
--- a/src/intemporal/internal/context.cljc
+++ b/src/intemporal/internal/context.cljc
@@ -1,9 +1,11 @@
(ns intemporal.internal.context
(:require [intemporal.internal.error :as error]
+ [intemporal.internal.logging :as log]
[intemporal.protocol :as p]
[promesa.core])
#?(:clj (:require [net.cgrand.macrovich :as macros])
- :cljs (:require-macros [net.cgrand.macrovich :as macros])))
+ :cljs (:require-macros [net.cgrand.macrovich :as macros]
+ [intemporal.internal.logging :as log])))
;; ============================================================================
@@ -20,6 +22,7 @@
:seq-counter (atom 0)
:pending-events pending-events
:pending-asyncs pending-asyncs
+ :compensating? (atom false)
:store store
:registry registry
:observer observer
@@ -34,10 +37,72 @@
(defn current-store []
(:store (current-context)))
+(defn compensating?
+ "True while the workflow is inside intemporal/compensate. Used to suppress the
+ cancellation check so compensating activities can run even though the workflow
+ is being cancelled (the cancel exception was already caught by the user)."
+ []
+ (boolean (some-> (:compensating? (current-context)) deref)))
+
+(defn set-compensating! [v]
+ (some-> (:compensating? (current-context)) (reset! v)))
+
+(declare find-event add-pending-event!)
+
+(defn- seq-has-event?
+ "True if history (or pending events) holds any event at sequence `s`. The
+ pending-events scan is load-bearing: it lets a :workflow-cancelling marker
+ added earlier in the *current* pass count as present, so the frontier op does
+ not record a second marker / throw twice at the same seq within one pass."
+ [ctx s]
+ (or (some #(= (:seq %) s) @(:history ctx))
+ (some #(= (:seq %) s) @(:pending-events ctx))))
+
+(defn replaying?
+ "True when the operation about to run at the current sequence position already
+ has recorded history (it is being replayed, not executed for the first time).
+ Used to defer the cancellation check to the frontier - the first un-cached
+ operation - so that a saga's compensation registrations (which re-run during
+ replay) are rebuilt before cancellation surfaces into the user's catch.
+ Per-seq equality (not max-seq) so that compensation events, which take higher
+ seq numbers, don't make a not-yet-reached forward op look replayed."
+ []
+ (seq-has-event? (current-context) @(:seq-counter (current-context))))
+
+(defn- surface-cancellation!
+ "Decide where a cancellation surfaces into the workflow body, then throw.
+
+ Cancellation must surface deterministically so that a saga's compensations
+ (registered as the body re-runs) are rebuilt before the user's catch runs, and
+ so the compensation seq space stays stable across crashes/resumes. We anchor it
+ to a single frontier sequence number, recorded once as a :workflow-cancelling
+ marker and re-thrown at that same seq on every later pass (like a recorded
+ :activity-failed):
+
+ - marker already at `cur` -> re-throw (deterministic replay frontier);
+ - still replaying cached steps -> return nil so the body advances toward the
+ frontier (re-registering compensations along the way);
+ - frontier (first un-cached op) -> record the marker, then throw."
+ [ctx cur]
+ (cond
+ (find-event @(:history ctx) :workflow-cancelling cur)
+ (throw (error/workflow-cancelled-exception))
+
+ (replaying?)
+ nil
+
+ :else
+ (do
+ (add-pending-event! {:event-type :workflow-cancelling :seq cur})
+ (throw (error/workflow-cancelled-exception)))))
+
(defn check-cancelled! []
(let [ctx (current-context)]
- (when (p/is-cancelled? (:store ctx) (:workflow-id ctx))
- (throw (error/workflow-cancelled-exception)))))
+ ;; Suppress while compensating: the cancel exception was already caught by
+ ;; the user and the compensating activities must run.
+ (when (and (not (compensating?))
+ (p/is-cancelled? (:store ctx) (:workflow-id ctx)))
+ (surface-cancellation! ctx @(:seq-counter ctx)))))
(defn next-seq! []
(check-cancelled!)
@@ -73,7 +138,7 @@
(apply event-fn observer args)
(catch #?(:clj Exception :cljs js/Error) e
;; Don't let observer errors break workflow
- (println "Observer error:" (ex-message e))))))
+ (log/warnf e "Observer error: %s" (ex-message e))))))
;; ============================================================================
;; Context-Aware Macros, cljs only
diff --git a/src/intemporal/internal/error.cljc b/src/intemporal/internal/error.cljc
index f1539eb..048c328 100644
--- a/src/intemporal/internal/error.cljc
+++ b/src/intemporal/internal/error.cljc
@@ -69,12 +69,15 @@
(-> e ex-data :data))))
(defn workflow-cancelled-exception []
- (internal-error "Workflow cancelled" {::cancelled true}))
+ ;; A plain ex-info (catchable by `(catch Exception ...)`) - unlike suspensions,
+ ;; which subclass Error to stay invisible to userland catches. This lets a saga
+ ;; workflow catch cancellation and run compensations to roll completed steps
+ ;; back, while still letting suspensions propagate to the engine untouched.
+ (ex-info "Workflow cancelled" {::cancelled true}))
(defn cancelled-exception? [e]
#?(:clj
- (and (instance? Error e)
- (instance? IExceptionInfo e)
+ (and (instance? IExceptionInfo e)
(::cancelled (ex-data e)))
:cljs
(and (instance? js/Error e)
@@ -112,25 +115,57 @@
:handle-seq handle-seq
:cause cause}))
+(defn exception-kind
+ "Classify an intemporal exception by the marker key in its ex-data, returning a
+ stable keyword (or nil for a plain/unknown exception). Survives JSON round-trips
+ because it is stored explicitly as :exception-kind in the serialized map."
+ [data]
+ (when (map? data)
+ (cond
+ (::cancelled data) :cancelled
+ (::rejected data) :rejected
+ (::activity-timeout data) :activity-timeout
+ (::activity-interrupted data) :activity-interrupted
+ (::activity-failed data) :activity-failed
+ (::async-failed data) :async-failed
+ (::suspension data) :suspension)))
+
(defn throwable->map [t]
(when t
- #?(:clj
- {:type (str (type t))
- :message (ex-message t)
- :data (when (instance? IExceptionInfo t)
- (ex-data t))
- :stack-trace (mapv str (.getStackTrace t))
- :cause (throwable->map (.getCause t))}
- :cljs
- {:type (str (type t))
- :message (.-message t)
- :data (or (.-data t) (ex-data t))
- :stack-trace (when (.-stack t)
- (str/split-lines (.-stack t)))
- :cause (when (.-cause t)
- (throwable->map (.-cause t)))})))
-
-(defn map->exception [m]
+ (let [data #?(:clj (when (instance? IExceptionInfo t) (ex-data t))
+ :cljs (or (.-data t) (ex-data t)))]
+ (cond-> #?(:clj
+ {:type (str (type t))
+ :message (ex-message t)
+ :data data
+ :stack-trace (mapv str (.getStackTrace t))
+ :cause (throwable->map (.getCause t))}
+ :cljs
+ {:type (str (type t))
+ :message (.-message t)
+ :data data
+ :stack-trace (when (.-stack t)
+ (str/split-lines (.-stack t)))
+ :cause (when (.-cause t)
+ (throwable->map (.-cause t)))})
+ (exception-kind data) (assoc :exception-kind (exception-kind data))))))
+
+(defn map->exception
+ "Reconstruct an exception from a serialized map. Dispatches on :exception-kind
+ (added by throwable->map) so type predicates such as cancelled-exception? keep
+ working on replayed/resumed errors; falls back to a generic ex-info otherwise."
+ [m]
(when m
- (ex-info (or (:message m) "Restored exception")
- (merge {:restored true} (:data m)))))
+ (let [{:keys [data]} m
+ activity-name (:activity-name data)]
+ (case (some-> (:exception-kind m) keyword)
+ :cancelled (workflow-cancelled-exception)
+ :rejected (activity-rejected-exception activity-name (:cause data))
+ :activity-timeout (activity-timeout-exception activity-name (:timeout-ms data))
+ :activity-interrupted (activity-interrupted-exception activity-name (:cause data))
+ :activity-failed (activity-failed-exception activity-name
+ (when-let [c (:cause m)] (map->exception c)))
+ :async-failed (async-failed-exception (:handle-seq data) (:cause data))
+ ;; Unknown / plain exception: preserve the original data and mark restored.
+ (ex-info (or (:message m) "Restored exception")
+ (merge {:restored true} data))))))
diff --git a/src/intemporal/internal/execution.clj b/src/intemporal/internal/execution.clj
index 8b7fe8a..7ec41b3 100644
--- a/src/intemporal/internal/execution.clj
+++ b/src/intemporal/internal/execution.clj
@@ -37,6 +37,10 @@
:pending-events @(:pending-events (ctx/current-context))}
:else
+ ;; Real failure. Any saga rollback happens inside the workflow body (the
+ ;; user's catch calls intemporal/compensate); a compensating activity that
+ ;; suspends throws out of compensate and arrives here as a suspension,
+ ;; caught above, so the loop schedules + resumes it.
{:status :failed
:error e
:pending-events @(:pending-events (ctx/current-context))}))))
@@ -159,40 +163,39 @@
now (utils/current-time-ms)
;; Create completion events for both activities and async handles
- completion-events
- (mapcat (fn [{:keys [activity-name activity-seq] :as async-info} result]
- (log/with-mdc {:activity activity-name :seqnum activity-seq}
- (if (= :success (:status result))
- (do
- (-notify p/on-async-completed observer workflow-id (:handle-seq async-info) (:result result))
- (log/tracef "Got completion event: activity succeeded, result: %s" result))
- (do
- (-notify p/on-async-failed observer workflow-id (:handle-seq async-info) (:error result))
- (log/tracef "Got completion event: activity failed, error: %s" (:error result))))
- (if (= :success (:status result))
- [{:event-type :activity-completed
- :seq (:activity-seq async-info)
- :activity-name (:activity-name async-info)
- :result (:result result)
- :duration-ms (:duration result)
- :timestamp now}
- {:event-type :async-completed
- :seq (:handle-seq async-info)
- :last-seq (:activity-seq async-info)
- :result (:result result)
- :timestamp now}]
- ;; else
- [{:event-type :activity-failed
- :seq (:activity-seq async-info)
- :activity-name (:activity-name async-info)
- :error (:error result)
- :timestamp now}
- {:event-type :async-failed
- :seq (:handle-seq async-info)
- :last-seq (:activity-seq async-info)
- :error (:error result)
- :timestamp now}])))
- pending-asyncs results)]
+ completion-events (mapcat (fn [{:keys [activity-name activity-seq] :as async-info} result]
+ (log/with-mdc {:activity activity-name :seqnum activity-seq}
+ (if (= :success (:status result))
+ (do
+ (-notify p/on-async-completed observer workflow-id (:handle-seq async-info) (:result result))
+ (log/tracef "Got completion event: activity succeeded, result: %s" result))
+ (do
+ (-notify p/on-async-failed observer workflow-id (:handle-seq async-info) (:error result))
+ (log/tracef "Got completion event: activity failed, error: %s" (:error result))))
+ (if (= :success (:status result))
+ [{:event-type :activity-completed
+ :seq (:activity-seq async-info)
+ :activity-name (:activity-name async-info)
+ :result (:result result)
+ :duration-ms (:duration result)
+ :timestamp now}
+ {:event-type :async-completed
+ :seq (:handle-seq async-info)
+ :last-seq (:activity-seq async-info)
+ :result (:result result)
+ :timestamp now}]
+ ;; else
+ [{:event-type :activity-failed
+ :seq (:activity-seq async-info)
+ :activity-name (:activity-name async-info)
+ :error (:error result)
+ :timestamp now}
+ {:event-type :async-failed
+ :seq (:handle-seq async-info)
+ :last-seq (:activity-seq async-info)
+ :error (:error result)
+ :timestamp now}])))
+ pending-asyncs results)]
(p/save-events store workflow-id completion-events)))
:continue)
@@ -221,77 +224,81 @@
:wait-timer))))
(defn process-signal [store workflow-id suspension-data pending-events wake-fn observer]
- (let [{:keys [seq signal-name]} suspension-data]
+ (let [{:keys [seq signal-name]} suspension-data
+ save-received (fn [signal-data]
+ (p/save-event store workflow-id {:event-type :signal-received
+ :seq seq
+ :signal-name signal-name
+ :signal-id (:id signal-data)
+ :payload (:payload signal-data)
+ :timestamp (utils/current-time-ms)})
+ (-notify p/on-signal-received observer workflow-id signal-name (:payload signal-data)))]
;; Save pending events
(p/save-events store workflow-id pending-events)
+ ;; Register the wake callback FIRST, then check for an already-available
+ ;; signal (fixes bug 2.1: a signal arriving between the consume-check and
+ ;; the registration could previously be lost). consume-signal is atomic in
+ ;; every store, so exactly one of {the inline check below, the callback}
+ ;; consumes the signal — the other observes nil and no-ops. The callback
+ ;; only wakes if it was the one that consumed, so the inline :continue path
+ ;; never double-executes the workflow.
+ (p/register-signal-callback store workflow-id signal-name
+ (fn []
+ (when-let [signal-data (p/consume-signal store workflow-id signal-name)]
+ (save-received signal-data)
+ (p/unregister-signal-callback store workflow-id signal-name)
+ (when wake-fn (wake-fn)))))
(if-let [signal-data (p/consume-signal store workflow-id signal-name)]
- ;; Signal already available - process immediately
+ ;; We won the race inline: handle the signal and continue synchronously.
(do
- (p/save-event store workflow-id {:event-type :signal-received
- :seq seq
- :signal-name signal-name
- :signal-id (:id signal-data)
- :payload (:payload signal-data)
- :timestamp (utils/current-time-ms)})
- (-notify p/on-signal-received observer workflow-id signal-name (:payload signal-data))
+ (p/unregister-signal-callback store workflow-id signal-name)
+ (save-received signal-data)
:continue)
- ;; ELSE Signal not yet available - register callback and wait
- (do
- (p/register-signal-callback store workflow-id signal-name
- (fn []
- ;; When signal arrives, consume it and save event
- (when-let [signal-data (p/consume-signal store workflow-id signal-name)]
- (p/save-event store workflow-id {:event-type :signal-received
- :seq seq
- :signal-name signal-name
- :signal-id (:id signal-data)
- :payload (:payload signal-data)
- :timestamp (utils/current-time-ms)})
- (-notify p/on-signal-received observer workflow-id signal-name (:payload signal-data)))
- ;; Unregister callback
- (p/unregister-signal-callback store workflow-id signal-name)
- ;; Wake up the workflow
- (when wake-fn (wake-fn))))
- :wait-signal))))
+ ;; No signal yet: stay suspended; the armed callback will wake us.
+ :wait-signal)))
(defn process-signal-with-timeout [store scheduler workflow-id suspension-data
pending-events wake-fn observer]
(let [{:keys [seq signal-name deadline]} suspension-data
- now (utils/current-time-ms)]
+ now (utils/current-time-ms)
+ save-completed (fn [signal-data?]
+ (p/save-event store workflow-id
+ (cond-> {:event-type :signal-wait-completed
+ :seq seq
+ :received (some? signal-data?)
+ :signal-name signal-name
+ :timestamp (utils/current-time-ms)}
+ (some? signal-data?) (assoc :payload (:payload signal-data?))))
+ (when signal-data?
+ (-notify p/on-signal-received observer workflow-id signal-name (:payload signal-data?))))]
(p/save-events store workflow-id pending-events)
;; Check if signal already available
(if-let [signal-data (p/consume-signal store workflow-id signal-name)]
(do
- (p/save-event store workflow-id {:event-type :signal-wait-completed
- :seq seq
- :received true
- :signal-name signal-name
- :payload (:payload signal-data)
- :timestamp now})
- (-notify p/on-signal-received observer workflow-id signal-name (:payload signal-data))
+ (save-completed signal-data)
:continue)
;; ELSE Check if already timed out
(if (>= now deadline)
(do
- (p/save-event store workflow-id {:event-type :signal-wait-completed
- :seq seq
- :received false
- :signal-name signal-name
- :timestamp now})
+ (save-completed nil)
:continue)
- ;; Schedule timeout
+ ;; Register signal callback FIRST (mirrors the process-signal fix for bug 2.1):
+ ;; a signal arriving between the consume-check above and the timer firing would
+ ;; otherwise be silently lost. With the callback armed, exactly one of {the
+ ;; timer callback, the signal callback} wins the atomic consume-signal race.
(do
+ (p/register-signal-callback store workflow-id signal-name
+ (fn []
+ (when-let [signal-data (p/consume-signal store workflow-id signal-name)]
+ (p/unregister-signal-callback store workflow-id signal-name)
+ (p/cancel-timer scheduler workflow-id seq)
+ (save-completed signal-data)
+ (when wake-fn (wake-fn)))))
(p/schedule-timer scheduler workflow-id seq deadline
(fn []
- ;; Check one more time for signal
+ (p/unregister-signal-callback store workflow-id signal-name)
(let [signal-data? (p/consume-signal store workflow-id signal-name)]
- (p/save-event store workflow-id (cond-> {:event-type :signal-wait-completed
- :seq seq
- :received (some? signal-data?)
- :signal-name signal-name
- :timestamp (utils/current-time-ms)}
- (some? signal-data?) (assoc :payload (:payload signal-data?)))))
-
+ (save-completed signal-data?))
(when wake-fn (wake-fn))))
:wait-signal-timeout)))))
@@ -329,6 +336,7 @@
:seq-counter (atom 0)
:pending-events (atom [])
:pending-asyncs (atom [])
+ :compensating? (atom false)
:store store
:registry registry
:observer observer})
@@ -354,18 +362,20 @@
:result result})
(defn finalize-cancelled
- "Save cancellation event and return result as failed."
+ "Save a dedicated cancellation event and return the cancelled result.
+ The history event is :workflow-cancelled (a first-class terminal state), so
+ history and the derived status agree rather than recording cancellation as a
+ failure."
[store workflow-id pending-events observer]
(p/save-events store workflow-id pending-events)
(let [error-map {:type "clojure.lang.ExceptionInfo"
:message "Workflow cancelled"
:data {:workflow-id workflow-id}}]
- (p/save-event store workflow-id {:event-type :workflow-failed
+ (p/save-event store workflow-id {:event-type :workflow-cancelled
:error error-map
:timestamp (utils/current-time-ms)})
(-notify p/on-workflow-cancelled observer workflow-id)
- (-notify p/on-workflow-failed observer workflow-id error-map)
- {:status :failed
+ {:status :cancelled
:workflow-id workflow-id
:error error-map}))
@@ -505,11 +515,19 @@
{:keys [observer max-iterations wake-fn]
:or {max-iterations 1000}}]
(loop [iteration 0]
- (when (>= iteration max-iterations)
- (throw (ex-info "Max iterations exceeded" {:workflow-id workflow-id
- :iterations iteration})))
-
- (log/debugf "Internal loop %d of %d" iteration max-iterations)
+ (if (>= iteration max-iterations)
+ ;; Replay budget exhausted (e.g. a non-terminating workflow loop). Persist a
+ ;; terminal :workflow-failed event so the workflow becomes resolvable instead
+ ;; of staying "running" forever with an un-recorded exception thrown out of
+ ;; the loop.
+ (do
+ (log/warnf "Workflow %s exceeded replay budget of %d iterations" workflow-id max-iterations)
+ (finalize-failed store workflow-id []
+ (ex-info "Replay budget exceeded"
+ {:workflow-id workflow-id :iterations iteration})
+ observer))
+ (do
+ (log/debugf "Internal loop %d of %d" iteration max-iterations)
;; Check if executor is shutting down - stop processing to avoid endless rejections
(if (p/shutdown? executor)
@@ -518,28 +536,11 @@
{:status :suspended
:workflow-id workflow-id})
- ;; Check cancellation at start of each iteration
- (if (p/is-cancelled? store workflow-id)
- (let [error-map {:type "clojure.lang.ExceptionInfo"
- :message "Workflow cancelled"
- :data {:workflow-id workflow-id}}]
-
- (-notify p/on-workflow-cancelled observer workflow-id)
- (p/save-event store workflow-id {:event-type :workflow-failed
- :error error-map
- :timestamp (utils/current-time-ms)})
-
- (log/info "Workflow cancelled, failing")
- (-notify p/on-workflow-failed observer workflow-id error-map)
- {:status :failed
- :workflow-id workflow-id
- :error error-map})
- ;; else
- (let [history (p/load-history store workflow-id)
- ctx (make-workflow-context workflow-id history store registry observer)
- exec-result (binding [ctx/*workflow-context* ctx]
- (log/debugf "Executing workflow function %s..." workflow-fn)
- (execute-workflow-fn workflow-fn args))]
+ (let [history (p/load-history store workflow-id)
+ ctx (make-workflow-context workflow-id history store registry observer)
+ exec-result (binding [ctx/*workflow-context* ctx]
+ (log/debugf "Executing workflow function %s..." workflow-fn)
+ (execute-workflow-fn workflow-fn args))]
(log/debugf "Workflow function executed, got: %s" (:status exec-result))
(case (:status exec-result)
@@ -551,6 +552,9 @@
observer)
:cancelled
+ ;; Cancellation surfaced from the body (a stub's check-cancelled!).
+ ;; Any saga rollback already ran inside the user's catch before the
+ ;; cancel exception was rethrown, so just finalize.
(finalize-cancelled store workflow-id
(:pending-events exec-result)
observer)
@@ -569,13 +573,28 @@
(if (= action :continue)
(recur (inc iteration))
- (action->result action workflow-id)))
+ ;; About to wait: register a generic wake callback so an external
+ ;; actor (e.g. cancel-workflow) can force this workflow to
+ ;; re-enter its loop and observe state such as the cancel flag.
+ (do
+ (when wake-fn
+ (p/register-wake-callback store workflow-id wake-fn))
+ ;; C2: record when this workflow next needs attention so the
+ ;; ownership scan can skip it until due. Timer waits carry a
+ ;; clock deadline; signal/async waits are always eligible (nil).
+ (let [sd (:suspension-data exec-result)
+ wake-at (case action
+ :wait-timer (:fire-at sd)
+ :wait-signal-timeout (:deadline sd)
+ nil)]
+ (p/set-wake-at store workflow-id wake-at))
+ (action->result action workflow-id))))
:failed
(finalize-failed store workflow-id
(:pending-events exec-result)
(:error exec-result)
- observer)))))))
+ observer))))))))
(defn process-child-workflow [{:keys [store executor scheduler registry] :as engine} workflow-id
suspension-data pending-events observer]
diff --git a/src/intemporal/internal/execution.cljs b/src/intemporal/internal/execution.cljs
index 5add223..bbb6051 100644
--- a/src/intemporal/internal/execution.cljs
+++ b/src/intemporal/internal/execution.cljs
@@ -51,6 +51,10 @@
:pending-events @pending-events}
:else
+ ;; Real failure. Saga rollback happens inside the
+ ;; body (user's catch -> intemporal/compensate); a
+ ;; suspending compensation surfaces above as a
+ ;; suspension so the loop schedules + resumes it.
{:status :failed
:error e
:pending-events @pending-events})))))
@@ -247,77 +251,81 @@
:wait-timer))))
(defn process-signal [store workflow-id suspension-data pending-events wake-fn observer]
- (let [{:keys [seq signal-name]} suspension-data]
+ (let [{:keys [seq signal-name]} suspension-data
+ save-received (fn [signal-data]
+ (p/save-event store workflow-id {:event-type :signal-received
+ :seq seq
+ :signal-name signal-name
+ :signal-id (:id signal-data)
+ :payload (:payload signal-data)
+ :timestamp (utils/current-time-ms)})
+ (-notify p/on-signal-received observer workflow-id signal-name (:payload signal-data)))]
;; Save pending events
(p/save-events store workflow-id pending-events)
+ ;; Register the wake callback FIRST, then check for an already-available
+ ;; signal (fixes bug 2.1: a signal arriving between the consume-check and
+ ;; the registration could previously be lost). consume-signal is atomic in
+ ;; every store, so exactly one of {the inline check below, the callback}
+ ;; consumes the signal — the other observes nil and no-ops. The callback
+ ;; only wakes if it was the one that consumed, so the inline :continue path
+ ;; never double-executes the workflow.
+ (p/register-signal-callback store workflow-id signal-name
+ (fn []
+ (when-let [signal-data (p/consume-signal store workflow-id signal-name)]
+ (save-received signal-data)
+ (p/unregister-signal-callback store workflow-id signal-name)
+ (when wake-fn (wake-fn)))))
(if-let [signal-data (p/consume-signal store workflow-id signal-name)]
- ;; Signal already available - process immediately
+ ;; We won the race inline: handle the signal and continue synchronously.
(do
- (p/save-event store workflow-id {:event-type :signal-received
- :seq seq
- :signal-name signal-name
- :signal-id (:id signal-data)
- :payload (:payload signal-data)
- :timestamp (utils/current-time-ms)})
- (-notify p/on-signal-received observer workflow-id signal-name (:payload signal-data))
+ (p/unregister-signal-callback store workflow-id signal-name)
+ (save-received signal-data)
:continue)
- ;; ELSE Signal not yet available - register callback and wait
- (do
- (p/register-signal-callback store workflow-id signal-name
- (fn []
- ;; When signal arrives, consume it and save event
- (when-let [signal-data (p/consume-signal store workflow-id signal-name)]
- (p/save-event store workflow-id {:event-type :signal-received
- :seq seq
- :signal-name signal-name
- :signal-id (:id signal-data)
- :payload (:payload signal-data)
- :timestamp (utils/current-time-ms)})
- (-notify p/on-signal-received observer workflow-id signal-name (:payload signal-data)))
- ;; Unregister callback
- (p/unregister-signal-callback store workflow-id signal-name)
- ;; Wake up the workflow
- (when wake-fn (wake-fn))))
- :wait-signal))))
+ ;; No signal yet: stay suspended; the armed callback will wake us.
+ :wait-signal)))
(defn process-signal-with-timeout [store scheduler workflow-id suspension-data
pending-events wake-fn observer]
(let [{:keys [seq signal-name deadline]} suspension-data
- now (utils/current-time-ms)]
+ now (utils/current-time-ms)
+ save-completed (fn [signal-data?]
+ (p/save-event store workflow-id
+ (cond-> {:event-type :signal-wait-completed
+ :seq seq
+ :received (some? signal-data?)
+ :signal-name signal-name
+ :timestamp (utils/current-time-ms)}
+ (some? signal-data?) (assoc :payload (:payload signal-data?))))
+ (when signal-data?
+ (-notify p/on-signal-received observer workflow-id signal-name (:payload signal-data?))))]
(p/save-events store workflow-id pending-events)
;; Check if signal already available
(if-let [signal-data (p/consume-signal store workflow-id signal-name)]
(do
- (p/save-event store workflow-id {:event-type :signal-wait-completed
- :seq seq
- :received true
- :signal-name signal-name
- :payload (:payload signal-data)
- :timestamp now})
- (-notify p/on-signal-received observer workflow-id signal-name (:payload signal-data))
+ (save-completed signal-data)
:continue)
;; ELSE Check if already timed out
(if (>= now deadline)
(do
- (p/save-event store workflow-id {:event-type :signal-wait-completed
- :seq seq
- :received false
- :signal-name signal-name
- :timestamp now})
+ (save-completed nil)
:continue)
- ;; Schedule timeout
+ ;; Register signal callback FIRST (mirrors the process-signal fix for bug 2.1):
+ ;; a signal arriving between the consume-check above and the timer firing would
+ ;; otherwise be silently lost. With the callback armed, exactly one of {the
+ ;; timer callback, the signal callback} wins the atomic consume-signal race.
(do
+ (p/register-signal-callback store workflow-id signal-name
+ (fn []
+ (when-let [signal-data (p/consume-signal store workflow-id signal-name)]
+ (p/unregister-signal-callback store workflow-id signal-name)
+ (p/cancel-timer scheduler workflow-id seq)
+ (save-completed signal-data)
+ (when wake-fn (wake-fn)))))
(p/schedule-timer scheduler workflow-id seq deadline
(fn []
- ;; Check one more time for signal
+ (p/unregister-signal-callback store workflow-id signal-name)
(let [signal-data? (p/consume-signal store workflow-id signal-name)]
- (p/save-event store workflow-id (cond-> {:event-type :signal-wait-completed
- :seq seq
- :received (some? signal-data?)
- :signal-name signal-name
- :timestamp (utils/current-time-ms)}
- (some? signal-data?) (assoc :payload (:payload signal-data?)))))
-
+ (save-completed signal-data?))
(when wake-fn (wake-fn))))
:wait-signal-timeout)))))
@@ -354,6 +362,7 @@
:seq-counter (atom 0)
:pending-events (atom [])
:pending-asyncs (atom [])
+ :compensating? (atom false)
:store store
:registry registry
:observer observer}
@@ -378,18 +387,20 @@
:result result}))
(defn finalize-cancelled
- "Save cancellation event and return result as failed."
+ "Save a dedicated cancellation event and return the cancelled result.
+ The history event is :workflow-cancelled (a first-class terminal state), so
+ history and the derived status agree rather than recording cancellation as a
+ failure."
[store workflow-id pending-events observer]
(p/save-events store workflow-id pending-events)
(let [error-map {:type "clojure.lang.ExceptionInfo"
:message "Workflow cancelled"
:data {:workflow-id workflow-id}}]
- (p/save-event store workflow-id {:event-type :workflow-failed
+ (p/save-event store workflow-id {:event-type :workflow-cancelled
:error error-map
:timestamp (utils/current-time-ms)})
(-notify p/on-workflow-cancelled observer workflow-id)
- (-notify p/on-workflow-failed observer workflow-id error-map)
- {:status :failed
+ {:status :cancelled
:workflow-id workflow-id
:error error-map}))
@@ -531,11 +542,19 @@
:or {max-iterations 1000}}]
#_{:clj-kondo/ignore [:loop-without-recur]}
(prom/loop [iteration 0]
- (when (>= iteration max-iterations)
- (throw (ex-info "Max iterations exceeded" {:workflow-id workflow-id
- :iterations iteration})))
-
- (log/debugf "Internal loop %d of %d" iteration max-iterations)
+ (if (>= iteration max-iterations)
+ ;; Replay budget exhausted (e.g. a non-terminating workflow loop). Persist a
+ ;; terminal :workflow-failed event so the workflow becomes resolvable instead
+ ;; of staying "running" forever with an un-recorded exception thrown out of
+ ;; the loop.
+ (do
+ (log/warnf "Workflow %s exceeded replay budget of %d iterations" workflow-id max-iterations)
+ (finalize-failed store workflow-id []
+ (ex-info "Replay budget exceeded"
+ {:workflow-id workflow-id :iterations iteration})
+ observer))
+ (do
+ (log/debugf "Internal loop %d of %d" iteration max-iterations)
;; Check if executor is shutting down - stop processing to avoid endless rejections
(if (p/shutdown? executor)
@@ -544,29 +563,12 @@
{:status :suspended
:workflow-id workflow-id})
- ;; Check cancellation at start of each iteration
- (if (p/is-cancelled? store workflow-id)
- (let [error-map {:type "clojure.lang.ExceptionInfo"
- :message "Workflow cancelled"
- :data {:workflow-id workflow-id}}]
-
- (-notify p/on-workflow-cancelled observer workflow-id)
- (p/save-event store workflow-id {:event-type :workflow-failed
- :error error-map
- :timestamp (utils/current-time-ms)})
-
- (log/info "Workflow cancelled, failing")
- (-notify p/on-workflow-failed observer workflow-id error-map)
- {:status :failed
- :workflow-id workflow-id
- :error error-map})
- ;; else
- (let [history (p/load-history store workflow-id)
- ctx (make-workflow-context workflow-id history store registry observer
- :protocols (:protocols engine))
- exec-result (binding [ctx/*workflow-context* ctx]
- (log/debugf "Executing workflow function %s..." workflow-fn)
- (execute-workflow-fn workflow-fn args))
+ (let [history (p/load-history store workflow-id)
+ ctx (make-workflow-context workflow-id history store registry observer
+ :protocols (:protocols engine))
+ exec-result (binding [ctx/*workflow-context* ctx]
+ (log/debugf "Executing workflow function %s..." workflow-fn)
+ (execute-workflow-fn workflow-fn args))
dispatch (fn [exec-result]
(log/debugf "Workflow function executed, got: %s" (:status exec-result))
(case (:status exec-result)
@@ -578,6 +580,10 @@
observer)
:cancelled
+ ;; Cancellation surfaced from the body (a stub's
+ ;; check-cancelled!). Any saga rollback already ran
+ ;; inside the user's catch before the cancel exception
+ ;; was rethrown, so just finalize.
(finalize-cancelled store workflow-id
(:pending-events exec-result)
observer)
@@ -596,7 +602,20 @@
(if (= action :continue)
(prom/recur (inc iteration))
- (action->result action workflow-id)))
+ ;; About to wait: register a generic wake callback so an
+ ;; external actor (e.g. cancel-workflow) can force this
+ ;; workflow to re-enter and observe the cancel flag.
+ (do
+ (when wake-fn
+ (p/register-wake-callback store workflow-id wake-fn))
+ ;; C2: record when this workflow next needs attention.
+ (let [sd (:suspension-data exec-result)
+ wake-at (case action
+ :wait-timer (:fire-at sd)
+ :wait-signal-timeout (:deadline sd)
+ nil)]
+ (p/set-wake-at store workflow-id wake-at))
+ (action->result action workflow-id))))
:failed
(finalize-failed store workflow-id
@@ -606,7 +625,7 @@
;; exec-result may be a Promise if workflow-fn returned a Promise (e.g. from p/let)
(if (prom/promise? exec-result)
(bthen exec-result dispatch)
- (dispatch exec-result)))))))
+ (dispatch exec-result)))))))) ; close inner (if shutdown? let), outer (do) and budget (if)
(defn process-child-workflow [{:keys [store executor scheduler registry] :as engine} workflow-id
suspension-data pending-events observer]
diff --git a/src/intemporal/internal/fns/start_workflow.clj b/src/intemporal/internal/fns/start_workflow.clj
index 29b2b48..344b536 100644
--- a/src/intemporal/internal/fns/start_workflow.clj
+++ b/src/intemporal/internal/fns/start_workflow.clj
@@ -2,8 +2,13 @@
(:require [intemporal.internal.execution :as exec]
[intemporal.internal.logging :as log]
[intemporal.internal.activity :as a]
+ [intemporal.internal.workflow-registry :as wreg]
[intemporal.protocol :as p]
- [intemporal.utils :as utils]))
+ [intemporal.utils :as utils])
+ (:import [java.util.concurrent LinkedBlockingQueue]))
+
+(def ^:private waiting-statuses
+ #{:waiting-timer :waiting-signal :waiting-signal-timeout :waiting-async})
(defn start-workflow
"Start a workflow execution.
@@ -27,52 +32,51 @@
:or {max-iterations 1000}}]
(doseq [[proto impl] protocols]
(a/register-protocol-activities! registry proto impl))
- (let [wf-id (or workflow-id (str (random-uuid)))
- resume-promise-atom (atom nil)
+ (let [wf-id (or workflow-id (str (random-uuid)))
observer (or observer (get engine :observer))
- wake-fn (fn wake-fn-impl []
- (log/with-mdc {:workflow-id wf-id}
- (try
- (when observer
- (p/on-workflow-resumed observer wf-id))
- (log/debugf "Waking workflow for resume")
- (let [old-promise @resume-promise-atom
- new-promise (promise)
- result (exec/run-workflow-internal engine wf-id workflow-fn args
- {:observer observer
- :max-iterations max-iterations
- :wake-fn wake-fn-impl})]
- (reset! resume-promise-atom new-promise)
- (deliver old-promise result))
- (catch Exception e
- (when-let [p @resume-promise-atom]
- (deliver p {:status :failed :error e}))))))]
+ ;; Wake channel. wake-fn (invoked from store signal/timer callbacks and
+ ;; from cancel-workflow via wake-workflow) only enqueues a token — it
+ ;; never runs execution itself. All run-workflow-internal calls happen
+ ;; on THIS thread, in the loop below. This:
+ ;; (a) makes the wake edge-safe: a wake that fires while the workflow
+ ;; is still suspending sits in the queue and is observed by the
+ ;; next take, instead of racing a resume-promise handshake; and
+ ;; (b) prevents two threads from executing the same workflow at once.
+ wake-q (LinkedBlockingQueue.)
+ run-once (fn []
+ (exec/run-workflow-internal engine wf-id workflow-fn args
+ {:observer observer
+ :max-iterations max-iterations
+ :wake-fn (fn wake-fn []
+ (when observer
+ (p/on-workflow-resumed observer wf-id))
+ (.offer wake-q :wake))}))]
+ ;; Record the workflow function under its stable name so the workflow can be
+ ;; resumed later by id alone (resume-workflow [engine wf-id]); the name is
+ ;; stored in the :workflow-started event below. (improvements.md §B3)
(log/with-mdc {:workflow-id wf-id}
- ;; Initialize with first promise
- (reset! resume-promise-atom (promise))
(p/save-event store wf-id {:event-type :workflow-started
:workflow-id wf-id
+ :workflow-fn-name (wreg/register-workflow! workflow-fn)
:args (vec args)
:timestamp (utils/current-time-ms)})
(when observer
(p/on-workflow-started observer wf-id args))
(log/info "Workflow started")
(try
- ;; Execute initial workflow run
- (let [initial-result (exec/run-workflow-internal engine wf-id workflow-fn args
- {:observer observer
- :max-iterations max-iterations
- :wake-fn wake-fn})]
- ;; Loop to handle multiple wait cycles
- (loop [result initial-result]
- (log/infof "Got result %s with status %s" (:result initial-result) (:status initial-result))
- (if (#{:waiting-timer :waiting-signal :waiting-signal-timeout :waiting-async} (:status result))
- (do
- (log/infof "Workflow waiting for promise: %s" (:status result))
- (let [next-promise @resume-promise-atom
- next-result @next-promise]
- (recur next-result)))
- result)))
+ (loop [result (run-once)]
+ (log/infof "Got result %s with status %s" (:result result) (:status result))
+ (if (waiting-statuses (:status result))
+ (do
+ (log/infof "Workflow waiting: %s" (:status result))
+ ;; Block until woken. A token enqueued before this take (signal
+ ;; arrived during suspension setup) returns immediately — no edge
+ ;; is lost. Drain any extra tokens so one re-run covers coalesced
+ ;; wakes; a wake arriving during the re-run queues for next take.
+ (.take wake-q)
+ (.clear wake-q)
+ (recur (run-once)))
+ result))
(catch Exception e
(log/warnf e "Caught exception")
(throw e))))))
diff --git a/src/intemporal/internal/fns/start_workflow.cljs b/src/intemporal/internal/fns/start_workflow.cljs
index e5f0d87..b1175fa 100644
--- a/src/intemporal/internal/fns/start_workflow.cljs
+++ b/src/intemporal/internal/fns/start_workflow.cljs
@@ -93,8 +93,8 @@
:max-iterations max-iterations
:wake-fn wake-fn-impl})
(bthen (fn [result]
- (when (and on-complete (not (waiting-status? result)))
- (on-complete result))))
+ (when (and on-complete (not (waiting-status? result)))
+ (on-complete result))))
(prom/catch js/Error
(fn [e]
(when on-complete
@@ -111,9 +111,9 @@
:max-iterations max-iterations
:wake-fn wake-fn})
(bthen (fn [result]
- (when (and on-complete (not (waiting-status? result)))
- (on-complete result))
- result))
+ (when (and on-complete (not (waiting-status? result)))
+ (on-complete result))
+ result))
(prom/catch js/Error
(fn [e]
(log/warnf e "Caught exception during async workflow start")
diff --git a/src/intemporal/internal/runtime.clj b/src/intemporal/internal/runtime.clj
index 7f781a9..88e3af0 100644
--- a/src/intemporal/internal/runtime.clj
+++ b/src/intemporal/internal/runtime.clj
@@ -14,16 +14,22 @@
pending-timers]
p/IScheduler
(schedule-timer [_ workflow-id seq-num fire-at callback]
- (let [delay-ms (max 0 (- fire-at (System/currentTimeMillis)))
- timer-key [workflow-id seq-num]
- future (.schedule pool
- ^Runnable (fn []
- (swap! pending-timers dissoc timer-key)
- (callback))
- delay-ms
- TimeUnit/MILLISECONDS)]
- (swap! pending-timers assoc timer-key future)
- timer-key))
+ (let [timer-key [workflow-id seq-num]]
+ ;; Idempotent: under the ownership scan a suspended timer workflow is
+ ;; re-resumed every poll, so process-timer may call schedule-timer again
+ ;; for the same [wf,seq]. Scheduling a second future would leak it and
+ ;; risk a duplicate :timer-fired. If one is already armed, keep it.
+ (if (contains? @pending-timers timer-key)
+ timer-key
+ (let [delay-ms (max 0 (- fire-at (System/currentTimeMillis)))
+ future (.schedule pool
+ ^Runnable (fn []
+ (swap! pending-timers dissoc timer-key)
+ (callback))
+ delay-ms
+ TimeUnit/MILLISECONDS)]
+ (swap! pending-timers assoc timer-key future)
+ timer-key))))
(cancel-timer [_ workflow-id seq-num]
(let [timer-key [workflow-id seq-num]]
diff --git a/src/intemporal/internal/runtime.cljs b/src/intemporal/internal/runtime.cljs
index 299d8d5..2d76260 100644
--- a/src/intemporal/internal/runtime.cljs
+++ b/src/intemporal/internal/runtime.cljs
@@ -46,15 +46,20 @@
p/IScheduler
(schedule-timer [_ workflow-id seq-num fire-at callback]
- (let [delay-ms (max 0 (- fire-at (utils/current-time-ms)))
- timer-key [workflow-id seq-num]
- timer-id (js/setTimeout
- (fn []
- (swap! pending-timers dissoc timer-key)
- (callback))
- delay-ms)]
- (swap! pending-timers assoc timer-key timer-id)
- timer-key))
+ (let [timer-key [workflow-id seq-num]]
+ ;; Idempotent: a re-resumed timer workflow may call schedule-timer again
+ ;; for the same [wf,seq]; keep the already-armed timer rather than arming
+ ;; a second one (which would risk a duplicate :timer-fired).
+ (if (contains? @pending-timers timer-key)
+ timer-key
+ (let [delay-ms (max 0 (- fire-at (utils/current-time-ms)))
+ timer-id (js/setTimeout
+ (fn []
+ (swap! pending-timers dissoc timer-key)
+ (callback))
+ delay-ms)]
+ (swap! pending-timers assoc timer-key timer-id)
+ timer-key))))
(cancel-timer [_ workflow-id seq-num]
(let [timer-key [workflow-id seq-num]]
diff --git a/src/intemporal/internal/workflow_registry.cljc b/src/intemporal/internal/workflow_registry.cljc
new file mode 100644
index 0000000..402eb7d
--- /dev/null
+++ b/src/intemporal/internal/workflow_registry.cljc
@@ -0,0 +1,62 @@
+(ns intemporal.internal.workflow-registry
+ "Maps a stable workflow name -> workflow function so a workflow can be resumed
+ knowing only its id (the name + args are recorded in the :workflow-started
+ event). This is what lets a restarted process — or, in a multi-pod
+ deployment, a different pod — resume a workflow it did not itself start
+ (improvements.md §B3, load-bearing for the Phase C worker loop).
+
+ The registry is a process-global atom: each process registers the workflow
+ functions it can resolve (Temporal's model). start-workflow auto-registers
+ the function it is given, which covers same-process resume; for cross-process
+ resume the application must register its workflow vars at startup."
+ #?(:cljs (:require [clojure.string :as str])))
+
+(defonce ^{:doc "Process-global name -> workflow-fn registry."}
+ registry
+ (atom {}))
+
+(defn workflow-name
+ "Stable string name for a workflow function (a var or a top-level fn)."
+ [f]
+ #?(:clj
+ (if (var? f)
+ (subs (str f) 2) ; #'ns/name -> "ns/name"
+ (clojure.lang.Compiler/demunge ; ns$fn_name -> "ns/fn-name"
+ (.getName (class f))))
+ :cljs
+ (if-let [raw (and (fn? f) (.-name f))]
+ (if (str/blank? raw)
+ (str f)
+ (let [parts (str/split raw #"\$")]
+ (if (> (count parts) 1)
+ (str (str/join "." (map #(str/replace % "_" "-") (butlast parts)))
+ "/"
+ (str/replace (last parts) "_" "-"))
+ (str/replace raw "_" "-"))))
+ (str f))))
+
+(defn register-workflow!
+ "Register a workflow function under its derived name (or an explicit name).
+ Accepts a var or a fn. Returns the name used."
+ ([f] (register-workflow! (workflow-name f) f))
+ ([name f]
+ (let [resolved (if (var? f) #?(:clj @f :cljs f) f)]
+ (swap! registry assoc name resolved)
+ name)))
+
+(defn resolve-workflow
+ "Return the registered workflow fn for `name`. Throws a descriptive ex-info if
+ the name is not registered in this process, rather than returning nil and
+ surfacing an obscure NPE deeper in execution. A fresh process must register its
+ workflow vars at startup for cross-process resume to work."
+ [name]
+ (or (get @registry name)
+ (throw (ex-info (str "No workflow function registered for name: " name
+ ". Register the workflow var at startup so it can be resumed by id.")
+ {:workflow-name name
+ :registered (vec (keys @registry))}))))
+
+(defn clear-registry!
+ "Test helper: empties the global registry."
+ []
+ (reset! registry {}))
diff --git a/src/intemporal/observer.cljc b/src/intemporal/observer.cljc
index 6b04252..1035f63 100644
--- a/src/intemporal/observer.cljc
+++ b/src/intemporal/observer.cljc
@@ -112,6 +112,22 @@
(on-workflow-cancelled [_ workflow-id]
(swap! log-atom conj {:event :workflow-cancelled
+ :workflow-id workflow-id
+ :timestamp (utils/current-time-ms)}))
+
+ (on-compensation-started [_ workflow-id]
+ (swap! log-atom conj {:event :compensation-started
+ :workflow-id workflow-id
+ :timestamp (utils/current-time-ms)}))
+
+ (on-compensation-failed [_ workflow-id error]
+ (swap! log-atom conj {:event :compensation-failed
+ :workflow-id workflow-id
+ :error error
+ :timestamp (utils/current-time-ms)}))
+
+ (on-compensation-completed [_ workflow-id]
+ (swap! log-atom conj {:event :compensation-completed
:workflow-id workflow-id
:timestamp (utils/current-time-ms)})))
@@ -139,4 +155,54 @@
(on-signal-received [_ _ _ _])
(on-workflow-completed [_ _ _])
(on-workflow-failed [_ _ _])
- (on-workflow-cancelled [_ _])))
+ (on-workflow-cancelled [_ _])
+ (on-compensation-started [_ _])
+ (on-compensation-failed [_ _ _])
+ (on-compensation-completed [_ _])))
+
+(defn make-composite-observer
+ "Create an observer that fans out all events to a list of observers.
+ Returns a noop-observer if the list is empty."
+ [observers]
+ (let [obs (vec (filter some? observers))]
+ (if (empty? obs)
+ (noop-observer)
+ (reify p/IWorkflowObserver
+ (on-workflow-started [_ workflow-id args]
+ (doseq [o obs] (p/on-workflow-started o workflow-id args)))
+ (on-workflow-suspended [_ workflow-id suspension-type]
+ (doseq [o obs] (p/on-workflow-suspended o workflow-id suspension-type)))
+ (on-workflow-resumed [_ workflow-id]
+ (doseq [o obs] (p/on-workflow-resumed o workflow-id)))
+ (on-activity-scheduled [_ workflow-id seq-num activity-name args]
+ (doseq [o obs] (p/on-activity-scheduled o workflow-id seq-num activity-name args)))
+ (on-activity-started [_ workflow-id seq-num activity-name]
+ (doseq [o obs] (p/on-activity-started o workflow-id seq-num activity-name)))
+ (on-activity-completed [_ workflow-id seq-num activity-name result duration-ms]
+ (doseq [o obs] (p/on-activity-completed o workflow-id seq-num activity-name result duration-ms)))
+ (on-activity-failed [_ workflow-id seq-num activity-name error duration-ms]
+ (doseq [o obs] (p/on-activity-failed o workflow-id seq-num activity-name error duration-ms)))
+ (on-async-started [_ workflow-id seq-num]
+ (doseq [o obs] (p/on-async-started o workflow-id seq-num)))
+ (on-async-completed [_ workflow-id seq-num result]
+ (doseq [o obs] (p/on-async-completed o workflow-id seq-num result)))
+ (on-async-failed [_ workflow-id seq-num error]
+ (doseq [o obs] (p/on-async-failed o workflow-id seq-num error)))
+ (on-timer-scheduled [_ workflow-id seq-num fire-at]
+ (doseq [o obs] (p/on-timer-scheduled o workflow-id seq-num fire-at)))
+ (on-timer-fired [_ workflow-id seq-num]
+ (doseq [o obs] (p/on-timer-fired o workflow-id seq-num)))
+ (on-signal-received [_ workflow-id signal-name payload]
+ (doseq [o obs] (p/on-signal-received o workflow-id signal-name payload)))
+ (on-workflow-completed [_ workflow-id result]
+ (doseq [o obs] (p/on-workflow-completed o workflow-id result)))
+ (on-workflow-failed [_ workflow-id error]
+ (doseq [o obs] (p/on-workflow-failed o workflow-id error)))
+ (on-workflow-cancelled [_ workflow-id]
+ (doseq [o obs] (p/on-workflow-cancelled o workflow-id)))
+ (on-compensation-started [_ workflow-id]
+ (doseq [o obs] (p/on-compensation-started o workflow-id)))
+ (on-compensation-failed [_ workflow-id error]
+ (doseq [o obs] (p/on-compensation-failed o workflow-id error)))
+ (on-compensation-completed [_ workflow-id]
+ (doseq [o obs] (p/on-compensation-completed o workflow-id)))))))
\ No newline at end of file
diff --git a/src/intemporal/observer/otel.clj b/src/intemporal/observer/otel.clj
index 7509a42..fb3e2f1 100644
--- a/src/intemporal/observer/otel.clj
+++ b/src/intemporal/observer/otel.clj
@@ -126,7 +126,25 @@
:event {:name "workflow.cancelled"
:attributes {:intemporal/cancelled true}}})
(otspan/end-span! {:context span-ctx})
- (swap! spans-atom update :workflows dissoc workflow-id))))
+ (swap! spans-atom update :workflows dissoc workflow-id)))
+
+ ;; Compensations run before the workflow span is ended by on-workflow-failed/
+ ;; -cancelled, so we add events to the still-open workflow span.
+ (on-compensation-started [_ workflow-id]
+ (when-let [span-ctx (get-in @spans-atom [:workflows workflow-id])]
+ (otspan/add-span-data! {:context span-ctx
+ :event {:name "compensation.started"}})))
+
+ (on-compensation-failed [_ workflow-id error]
+ (when-let [span-ctx (get-in @spans-atom [:workflows workflow-id])]
+ (otspan/add-span-data! {:context span-ctx
+ :event {:name "compensation.failed"
+ :attributes {:intemporal/error (pr-str error)}}})))
+
+ (on-compensation-completed [_ workflow-id]
+ (when-let [span-ctx (get-in @spans-atom [:workflows workflow-id])]
+ (otspan/add-span-data! {:context span-ctx
+ :event {:name "compensation.completed"}}))))
(defn make-otel-observer
"Create an OpenTelemetry observer that emits traces for workflows and activities"
diff --git a/src/intemporal/protocol.cljc b/src/intemporal/protocol.cljc
index 9ee6f07..0bfa389 100644
--- a/src/intemporal/protocol.cljc
+++ b/src/intemporal/protocol.cljc
@@ -16,9 +16,28 @@
(consume-signal [store workflow-id signal-name] "Consume and remove a signal")
(register-signal-callback [store workflow-id signal-name callback] "Register callback to be invoked when signal arrives")
(unregister-signal-callback [store workflow-id signal-name] "Unregister signal callback")
+ (register-wake-callback [store workflow-id callback] "Register a generic wake callback, fired by wake-workflow to force the workflow to re-enter its execution loop (e.g. to observe cancellation)")
+ (wake-workflow [store workflow-id] "Fire the registered wake callback for a workflow, forcing it to re-enter its loop and re-evaluate state such as the cancellation flag. No-op if none registered.")
(is-cancelled? [store workflow-id] "Check if workflow is cancelled")
(mark-cancelled [store workflow-id] "Mark workflow as cancelled")
- (get-workflow-status [store workflow-id] "Get current workflow status"))
+ (get-workflow-status [store workflow-id] "Get current workflow status")
+
+ ;; --- Phase C: ownership-based recovery (opt-in; single-process callers ignore) ---
+ (claim-owner [store workflow-id owner-id]
+ "Atomically stamp ownership: UPDATE owner=owner-id WHERE owner IS NULL OR
+ owner=owner-id. Returns true iff the workflow is now owned by owner-id. The
+ exclusivity gate — only one pod can claim an unowned workflow.")
+ (list-pending [store owner-id limit]
+ "Return up to `limit` workflow-ids that are NON-TERMINAL, DUE (wake-at is null
+ or in the past), and (owner=owner-id OR owner IS NULL): the workflows this
+ owner may resume right now. Used for both the live poll and startup recovery.")
+ (release-owner [store owner-id]
+ "Clear ownership (owner=NULL) for this owner's non-terminal workflows, so
+ other pods may pick them up. Called on clean shutdown.")
+ (set-wake-at [store workflow-id wake-at-ms]
+ "Record the earliest time (epoch ms) this workflow next needs attention, or
+ nil for 'always eligible' (waiting on an external event, not the clock).
+ list-pending skips workflows whose wake-at is still in the future (C2)."))
(defprotocol IActivityExecutor
"Protocol for executing activities"
@@ -59,4 +78,7 @@
(on-signal-received [observer workflow-id signal-name payload])
(on-workflow-completed [observer workflow-id result])
(on-workflow-failed [observer workflow-id error])
- (on-workflow-cancelled [observer workflow-id]))
+ (on-workflow-cancelled [observer workflow-id])
+ (on-compensation-started [observer workflow-id])
+ (on-compensation-failed [observer workflow-id error])
+ (on-compensation-completed [observer workflow-id]))
diff --git a/src/intemporal/store.cljc b/src/intemporal/store.cljc
index 277128f..8ef6af4 100644
--- a/src/intemporal/store.cljc
+++ b/src/intemporal/store.cljc
@@ -1,5 +1,8 @@
(ns intemporal.store
- (:require [intemporal.protocol :as p]))
+ (:require [intemporal.protocol :as p]
+ [intemporal.utils :as utils]))
+
+(def ^:private terminal-status? #{:completed :failed :cancelled})
;; ============================================================================
;; In-Memory Store Implementation
@@ -11,14 +14,31 @@
(get-in @state [:workflows workflow-id :history] []))
(save-event [_ workflow-id event]
- (swap! state update-in [:workflows workflow-id :history]
- (fnil conj []) event)
+ (swap! state
+ (fn [s]
+ (let [s (update-in s [:workflows workflow-id :history] (fnil conj []) event)]
+ (case (:event-type event)
+ :workflow-completed (assoc-in s [:workflows workflow-id :status] :completed)
+ :workflow-failed (assoc-in s [:workflows workflow-id :status] :failed)
+ :workflow-cancelled (assoc-in s [:workflows workflow-id :status] :cancelled)
+ s))))
event)
(save-events [_ workflow-id events]
(when (seq events)
- (swap! state update-in [:workflows workflow-id :history]
- (fnil into []) events))
+ (swap! state
+ (fn [s]
+ (let [s (update-in s [:workflows workflow-id :history] (fnil into []) events)
+ ;; Phase B2: cache terminal status for O(1) reads.
+ term (some #(case (:event-type %)
+ :workflow-completed :completed
+ :workflow-failed :failed
+ :workflow-cancelled :cancelled
+ nil)
+ events)]
+ (if term
+ (assoc-in s [:workflows workflow-id :status] term)
+ s)))))
events)
(find-event [this worfklow-id event-type seq-num]
@@ -31,12 +51,12 @@
(get-pending-signals [_ workflow-id]
(get-in @state [:workflows workflow-id :signals] {}))
- (add-signal [this workflow-id signal-name signal-data]
+ (add-signal [_ workflow-id signal-name signal-data]
(swap! state update-in [:workflows workflow-id :signals signal-name]
(fnil conj []) signal-data)
- ;; Check if there's a callback registered for this signal
+ ;; In-process wake for an embedded (no-worker) engine in THIS process.
+ ;; Worker mode picks the workflow up via the ownership scan (list-pending).
(when-let [callback (get-in @state [:workflows workflow-id :signal-callbacks signal-name])]
- ;; Invoke callback asynchronously
#?(:clj (future (callback))
:cljs (js/setTimeout callback 0)))
signal-data)
@@ -60,6 +80,14 @@
(unregister-signal-callback [_ workflow-id signal-name]
(swap! state update-in [:workflows workflow-id :signal-callbacks] dissoc signal-name))
+ (register-wake-callback [_ workflow-id callback]
+ (swap! state assoc-in [:workflows workflow-id :wake-callback] callback))
+
+ (wake-workflow [_ workflow-id]
+ (when-let [callback (get-in @state [:workflows workflow-id :wake-callback])]
+ #?(:clj (future (callback))
+ :cljs (js/setTimeout callback 0))))
+
(is-cancelled? [_ workflow-id]
(get-in @state [:workflows workflow-id :cancelled] false))
@@ -69,10 +97,56 @@
(get-workflow-status [_ workflow-id]
(let [wf (get-in @state [:workflows workflow-id])]
(cond
+ ;; Check terminal status first: a late mark-cancelled must not override
+ ;; a workflow that already completed or failed.
+ (#{:completed :failed :cancelled} (:status wf)) (:status wf) ; Phase B2 O(1) fast path
(:cancelled wf) :cancelled
(empty? (:history wf)) :not-found
:else (let [last-event (last (:history wf))]
(case (:event-type last-event)
:workflow-completed :completed
:workflow-failed :failed
- :running))))))
+ :workflow-cancelled :cancelled
+ :running)))))
+
+ ;; --- Phase C: ownership-based recovery ---
+ (claim-owner [_ workflow-id owner-id]
+ (let [ok (atom false)]
+ (swap! state
+ (fn [s]
+ (let [cur (get-in s [:workflows workflow-id :owner])]
+ (if (or (nil? cur) (= cur owner-id))
+ (do (reset! ok true)
+ (assoc-in s [:workflows workflow-id :owner] owner-id))
+ s))))
+ @ok))
+
+ (list-pending [_ owner-id limit]
+ (let [now (utils/current-time-ms)]
+ (->> (:workflows @state)
+ (filter (fn [[_ wf]]
+ (and (seq (:history wf))
+ (not (terminal-status? (:status wf)))
+ (not (:cancelled wf))
+ ;; C2: skip workflows not yet due to wake
+ (let [wa (:wake-at wf)] (or (nil? wa) (<= wa now)))
+ (let [o (:owner wf)] (or (nil? o) (= o owner-id))))))
+ (map first)
+ (take limit)
+ vec)))
+
+ (release-owner [_ owner-id]
+ (swap! state
+ (fn [s]
+ (reduce (fn [s [wid wf]]
+ (if (and (= owner-id (:owner wf))
+ (not (terminal-status? (:status wf))))
+ (update-in s [:workflows wid] dissoc :owner)
+ s))
+ s
+ (:workflows s))))
+ nil)
+
+ (set-wake-at [_ workflow-id wake-at-ms]
+ (swap! state assoc-in [:workflows workflow-id :wake-at] wake-at-ms)
+ nil))
diff --git a/src/intemporal/store/fdb.clj b/src/intemporal/store/fdb.clj
index 1bbb1a2..90e4a23 100644
--- a/src/intemporal/store/fdb.clj
+++ b/src/intemporal/store/fdb.clj
@@ -21,6 +21,34 @@
(defn ->tuple [v]
(Tuple/from (into-array Object (map #(if (keyword? %) (name %) %) v))))
+;; ============================================================================
+;; Ownership index (Phase C)
+;;
+;; FDB cannot SQL-scan by owner, so non-terminal workflows are indexed under
+;; ["wf-owner" ]. list-pending scans the owner's bucket
+;; plus the unowned ("") bucket. The entry is added when a workflow starts,
+;; moved on claim-owner / release-owner, and removed when it terminates.
+;;
+;; The index entry VALUE carries the C2 wake-at (epoch ms, or nil = always due),
+;; so list-pending can skip not-yet-due timer workflows without a separate read.
+;; Bucket moves (claim-owner / release-owner) preserve the value.
+;; ============================================================================
+
+(defn- read-owner [tx root-subspace workflow-id]
+ (<-bytes (fdb-core/get tx root-subspace (->tuple ["owner" workflow-id]))))
+
+(defn- owner-index-key [root-subspace bucket workflow-id]
+ (->tuple ["wf-owner" bucket workflow-id]))
+
+(defn- maintain-owner-index! [tx root-subspace workflow-id events]
+ (let [started? (some #(= :workflow-started (:event-type %)) events)
+ terminal? (some #(#{:workflow-completed :workflow-failed :workflow-cancelled} (:event-type %)) events)
+ bucket (or (read-owner tx root-subspace workflow-id) "")]
+ (cond
+ terminal? (fdb-core/clear tx root-subspace (owner-index-key root-subspace bucket workflow-id))
+ started? (fdb-core/set tx root-subspace (owner-index-key root-subspace bucket workflow-id)
+ (->bytes {:wake-at nil})))))
+
;; ============================================================================
;; FDB Store Implementation
;; ============================================================================
@@ -47,21 +75,41 @@
(save-event [_ workflow-id event]
(let [history-sub (fsub/get root-subspace (->tuple ["history" workflow-id]))
seq-num (:seq event (System/currentTimeMillis))
- key (->tuple [seq-num (str (java.util.UUID/randomUUID))])]
+ key (->tuple [seq-num (str (java.util.UUID/randomUUID))])
+ term (case (:event-type event)
+ :workflow-completed "completed"
+ :workflow-failed "failed"
+ :workflow-cancelled "cancelled"
+ nil)]
(ftr/run db
(fn [tx]
- (fdb-core/set tx history-sub key (->bytes event))))
+ (fdb-core/set tx history-sub key (->bytes event))
+ ;; Phase B2: cache terminal status for O(1) reads.
+ (when term
+ (fdb-core/set tx root-subspace (->tuple ["state" workflow-id "status"]) (->bytes term)))
+ ;; Phase C: keep the ownership index in sync.
+ (maintain-owner-index! tx root-subspace workflow-id [event])))
event))
(save-events [_ workflow-id events]
(when (seq events)
- (let [history-sub (fsub/get root-subspace (->tuple ["history" workflow-id]))]
+ (let [history-sub (fsub/get root-subspace (->tuple ["history" workflow-id]))
+ term (some #(case (:event-type %)
+ :workflow-completed "completed"
+ :workflow-failed "failed"
+ :workflow-cancelled "cancelled"
+ nil)
+ events)]
(ftr/run db
(fn [tx]
(doseq [event events]
(let [seq-num (:seq event (System/currentTimeMillis))
key (->tuple [seq-num (str (java.util.UUID/randomUUID))])]
- (fdb-core/set tx history-sub key (->bytes event))))))))
+ (fdb-core/set tx history-sub key (->bytes event))))
+ (when term
+ (fdb-core/set tx root-subspace (->tuple ["state" workflow-id "status"]) (->bytes term)))
+ ;; Phase C: keep the ownership index in sync.
+ (maintain-owner-index! tx root-subspace workflow-id events)))))
events)
(find-event [this workflow-id event-type seq-num]
@@ -82,14 +130,15 @@
{}
r))))))
- (add-signal [this workflow-id signal-name signal-data]
+ (add-signal [_ workflow-id signal-name signal-data]
(let [signals-sub (fsub/get root-subspace (->tuple ["signals" workflow-id signal-name]))
key (->tuple [(System/currentTimeMillis) (str (java.util.UUID/randomUUID))])]
(ftr/run db
(fn [tx]
(fdb-core/set tx signals-sub key (->bytes signal-data))))
- ;; Invoke callback asynchronously
+ ;; In-process fast path for an embedded (no-worker) engine in THIS process.
+ ;; Worker mode picks the workflow up via the ownership scan (list-pending).
(when-let [callback (get-in @callbacks [workflow-id signal-name])]
(future (callback)))
@@ -112,6 +161,13 @@
(unregister-signal-callback [_ workflow-id signal-name]
(swap! callbacks update workflow-id dissoc signal-name))
+ (register-wake-callback [_ workflow-id callback]
+ (swap! callbacks assoc-in [workflow-id ::wake] callback))
+
+ (wake-workflow [_ workflow-id]
+ (when-let [callback (get-in @callbacks [workflow-id ::wake])]
+ (future (callback))))
+
(is-cancelled? [_ workflow-id]
(ftr/run db
(fn [tx]
@@ -120,19 +176,94 @@
(mark-cancelled [_ workflow-id]
(ftr/run db
(fn [tx]
- (fdb-core/set tx root-subspace (->tuple ["state" workflow-id "cancelled"]) (->bytes true)))))
+ (fdb-core/set tx root-subspace (->tuple ["state" workflow-id "cancelled"]) (->bytes true))
+ ;; Drop the workflow out of the ownership scan immediately so list-pending
+ ;; stops re-listing a cancelled-but-not-yet-finalized workflow. The entry
+ ;; lives under the workflow's current owner bucket (or "" if unowned).
+ (let [bucket (or (read-owner tx root-subspace workflow-id) "")]
+ (fdb-core/clear tx root-subspace (owner-index-key root-subspace bucket workflow-id))))))
(get-workflow-status [this workflow-id]
- (if (p/is-cancelled? this workflow-id)
- :cancelled
- (let [history (p/load-history this workflow-id)]
- (if (empty? history)
- :not-found
- (let [last-event (last history)]
- (case (:event-type last-event)
- :workflow-completed :completed
- :workflow-failed :failed
- :running)))))))
+ ;; Read both status and cancelled flag in one transaction so that a late
+ ;; mark-cancelled cannot override a workflow that already completed or failed.
+ (let [[cached cancelled?]
+ (ftr/run db
+ (fn [tx]
+ [(<-bytes (fdb-core/get tx root-subspace (->tuple ["state" workflow-id "status"])))
+ (boolean (<-bytes (fdb-core/get tx root-subspace (->tuple ["state" workflow-id "cancelled"]))))]))]
+ (cond
+ ;; Check terminal status first: takes precedence over the cancelled flag.
+ (#{"completed" "failed" "cancelled"} cached) (keyword cached)
+ cancelled? :cancelled
+ :else (let [history (p/load-history this workflow-id)]
+ (if (empty? history)
+ :not-found
+ (let [last-event (last history)]
+ (case (:event-type last-event)
+ :workflow-completed :completed
+ :workflow-failed :failed
+ :workflow-cancelled :cancelled
+ :running)))))))
+
+ ;; --- Phase C: ownership-based recovery (serializable read-modify-write) ---
+ (claim-owner [_ workflow-id owner-id]
+ (ftr/run db
+ (fn [tx]
+ (let [k (->tuple ["owner" workflow-id])
+ cur (<-bytes (fdb-core/get tx root-subspace k))]
+ (if (or (nil? cur) (= cur owner-id))
+ (let [old-bucket (or cur "")
+ ;; preserve the index value (carries C2 wake-at) across the move
+ entry (or (<-bytes (fdb-core/get tx root-subspace
+ (owner-index-key root-subspace old-bucket workflow-id)))
+ {:wake-at nil})]
+ (fdb-core/set tx root-subspace k (->bytes owner-id))
+ (fdb-core/clear tx root-subspace (owner-index-key root-subspace old-bucket workflow-id))
+ (fdb-core/set tx root-subspace (owner-index-key root-subspace owner-id workflow-id)
+ (->bytes entry))
+ true)
+ false)))))
+
+ (list-pending [_ owner-id limit]
+ (ftr/run db
+ (fn [tx]
+ (let [now (System/currentTimeMillis)
+ due? (fn [v] (let [wa (:wake-at v)] (or (nil? wa) (<= wa now))))
+ scan (fn [bucket]
+ (let [sub (fsub/get root-subspace (->tuple ["wf-owner" bucket]))]
+ (->> (fdb-core/get-range tx (fsub/range sub))
+ (keep (fn [[key value]]
+ (when (due? (<-bytes value))
+ (nth key (dec (count key)))))))))]
+ (->> (concat (scan owner-id) (scan ""))
+ distinct
+ (take limit)
+ vec)))))
+
+ (release-owner [_ owner-id]
+ (ftr/run db
+ (fn [tx]
+ (let [sub (fsub/get root-subspace (->tuple ["wf-owner" owner-id]))
+ entries (->> (fdb-core/get-range tx (fsub/range sub))
+ (mapv (fn [[key value]]
+ [(nth key (dec (count key))) (<-bytes value)])))]
+ (doseq [[wid entry] entries]
+ ;; entries in the owner bucket are non-terminal by construction;
+ ;; preserve the index value (C2 wake-at) when moving to the "" bucket
+ (fdb-core/clear tx root-subspace (->tuple ["owner" wid]))
+ (fdb-core/clear tx root-subspace (owner-index-key root-subspace owner-id wid))
+ (fdb-core/set tx root-subspace (owner-index-key root-subspace "" wid)
+ (->bytes (or entry {:wake-at nil})))))))
+ nil)
+
+ (set-wake-at [_ workflow-id wake-at-ms]
+ (ftr/run db
+ (fn [tx]
+ (let [bucket (or (read-owner tx root-subspace workflow-id) "")
+ k (owner-index-key root-subspace bucket workflow-id)
+ entry (or (<-bytes (fdb-core/get tx root-subspace k)) {})]
+ (fdb-core/set tx root-subspace k (->bytes (assoc entry :wake-at wake-at-ms))))))
+ nil))
(defn make-fdb-store [db subspace-name]
(let [root (fsub/create (->tuple [subspace-name]))]
diff --git a/src/intemporal/store/jdbc.clj b/src/intemporal/store/jdbc.clj
index 30be0e1..31914ad 100644
--- a/src/intemporal/store/jdbc.clj
+++ b/src/intemporal/store/jdbc.clj
@@ -92,7 +92,10 @@
;; Ensure workflow exists
(jdbc/execute! tx ["INSERT INTO intemporal_workflows (id) VALUES (?) ON CONFLICT (id) DO NOTHING"
workflow-id])
- ;; Insert events
+ ;; Insert events. DO UPDATE keeps the write idempotent under normal
+ ;; replay (the engine re-writes the same seq with identical data on
+ ;; each pass). Concurrent execution is prevented by exclusive ownership
+ ;; (claim-owner) + the worker resuming owned workflows one at a time.
(doseq [event events]
(let [seq-num (:seq event)
event-type (name (:event-type event))
@@ -100,7 +103,16 @@
(jdbc/execute! tx ["INSERT INTO intemporal_history (workflow_id, seq, event_type, data)
VALUES (?, ?, ?, ?)
ON CONFLICT (workflow_id, seq) DO UPDATE SET event_type = EXCLUDED.event_type, data = EXCLUDED.data"
- workflow-id seq-num event-type data])))))
+ workflow-id seq-num event-type data])))
+ ;; Phase B2: maintain the O(1) status column on terminal events.
+ (when-let [term (some (fn [e] (case (:event-type e)
+ :workflow-completed "completed"
+ :workflow-failed "failed"
+ :workflow-cancelled "cancelled"
+ nil))
+ events)]
+ (jdbc/execute! tx ["UPDATE intemporal_workflows SET status = ? WHERE id = ?"
+ term workflow-id]))))
events)
(find-event [_ workflow-id event-type seq-num]
@@ -119,14 +131,14 @@
{}
rows)))
- (add-signal [this workflow-id signal-name signal-data]
+ (add-signal [_ workflow-id signal-name signal-data]
(jdbc/with-transaction [tx datasource]
(jdbc/execute! tx ["INSERT INTO intemporal_workflows (id) VALUES (?) ON CONFLICT (id) DO NOTHING"
workflow-id])
(jdbc/execute! tx ["INSERT INTO intemporal_signals (workflow_id, signal_name, payload) VALUES (?, ?, ?)"
workflow-id signal-name signal-data]))
-
- ;; Trigger callback if registered
+ ;; In-process fast path for an embedded (no-worker) engine in THIS process.
+ ;; Worker mode picks the workflow up via the ownership scan (list-pending).
(when-let [callback (get-in @callbacks [workflow-id signal-name])]
(future (callback)))
signal-data)
@@ -146,6 +158,13 @@
(unregister-signal-callback [_ workflow-id signal-name]
(swap! callbacks update workflow-id dissoc signal-name))
+ (register-wake-callback [_ workflow-id callback]
+ (swap! callbacks assoc-in [workflow-id ::wake] callback))
+
+ (wake-workflow [_ workflow-id]
+ (when-let [callback (get-in @callbacks [workflow-id ::wake])]
+ (future (callback))))
+
(is-cancelled? [_ workflow-id]
(let [row (jdbc/execute-one! datasource
["SELECT cancelled FROM intemporal_workflows WHERE id = ?"
@@ -160,11 +179,16 @@
(get-workflow-status [this workflow-id]
(let [wf-row (jdbc/execute-one! datasource
- ["SELECT cancelled FROM intemporal_workflows WHERE id = ?"
- workflow-id])]
+ ["SELECT cancelled, status FROM intemporal_workflows WHERE id = ?"
+ workflow-id])
+ status (:intemporal_workflows/status wf-row)]
(cond
(nil? wf-row) :not-found
+ ;; Check terminal status first: a late mark-cancelled must not override
+ ;; a workflow that already completed or failed.
+ (#{"completed" "failed" "cancelled"} status) (keyword status)
(:intemporal_workflows/cancelled wf-row) :cancelled
+ ;; Otherwise (running / pre-migration) derive from history as before.
:else (let [history (p/load-history this workflow-id)]
(if (empty? history)
:not-found
@@ -172,7 +196,44 @@
(case (:event-type last-event)
:workflow-completed :completed
:workflow-failed :failed
- :running))))))))
+ :workflow-cancelled :cancelled
+ :running)))))))
+
+ ;; --- Phase C: ownership-based recovery ---
+ (claim-owner [_ workflow-id owner-id]
+ (let [res (jdbc/execute-one! datasource
+ ["UPDATE intemporal_workflows SET owner = ?
+ WHERE id = ? AND (owner IS NULL OR owner = ?)"
+ owner-id workflow-id owner-id])]
+ (pos? (or (:next.jdbc/update-count res) 0))))
+
+ (list-pending [_ owner-id limit]
+ (let [rows (jdbc/execute! datasource
+ ["SELECT id FROM intemporal_workflows
+ WHERE status NOT IN ('completed','failed','cancelled')
+ AND cancelled = FALSE
+ AND (wake_at IS NULL OR wake_at <= now())
+ AND (owner = ? OR owner IS NULL)
+ ORDER BY created_at
+ LIMIT ?"
+ owner-id limit])]
+ (mapv :intemporal_workflows/id rows)))
+
+ (release-owner [_ owner-id]
+ (jdbc/execute! datasource
+ ["UPDATE intemporal_workflows SET owner = NULL
+ WHERE owner = ? AND status NOT IN ('completed','failed','cancelled')"
+ owner-id])
+ nil)
+
+ (set-wake-at [_ workflow-id wake-at-ms]
+ (jdbc/execute! datasource
+ ["UPDATE intemporal_workflows
+ SET wake_at = CASE WHEN ?::bigint IS NULL THEN NULL
+ ELSE to_timestamp(?::bigint / 1000.0) END
+ WHERE id = ?"
+ wake-at-ms wake-at-ms workflow-id])
+ nil))
;; TODO use more complete opts
(defn make-jdbc-store
diff --git a/src2/intemporal/error.cljc b/src2/intemporal/error.cljc
deleted file mode 100644
index 85ca2a6..0000000
--- a/src2/intemporal/error.cljc
+++ /dev/null
@@ -1,27 +0,0 @@
-(ns intemporal.error
- #?(:clj (:import [java.lang InterruptedException]
- [java.util.concurrent RejectedExecutionException])))
-
-
-(defn interrupted? [e]
- #?(:clj (instance? InterruptedException e)
- :cljs false))
-
-(defn rejected? [e]
- #?(:clj (instance? RejectedExecutionException e)
- :cljs false))
-
-(defn internal-error? [ex]
- (when-let [t (-> ex ex-data ::type)]
- (or (= :internal t)
- (= :panic t))))
-
-(defn panic? [ex]
- (and (instance? #?(:clj Exception :cljs js/Error) ex)
- (= :panic (-> ex ex-data ::type))))
-
-(defn internal-error [msg data]
- (ex-info msg (merge data {::type :internal})))
-
-(defn panic [msg]
- (ex-info msg {::type :panic}))
diff --git a/src2/intemporal/macros.cljc b/src2/intemporal/macros.cljc
deleted file mode 100644
index 3cb4a47..0000000
--- a/src2/intemporal/macros.cljc
+++ /dev/null
@@ -1,227 +0,0 @@
-(ns intemporal.macros
- (:require [cljs.analyzer.api :as api]
- [intemporal.workflow :as w]
- [intemporal.workflow.internal :as i]
- [md5.core :as md5]
- [promesa.core :as p]
- [taoensso.telemere :as t])
- #?(:clj (:require [net.cgrand.macrovich :as macros]
- [intemporal.workflow.internal :refer [trace! trace-async! add-event!]])
- :cljs (:require-macros [net.cgrand.macrovich :as macros]
- [intemporal.workflow.internal :refer [trace! trace-async! add-event!]]
- [intemporal.macros :refer [env-let defn-workflow stub-function stub-protocol]])))
-
-(def cljs-available?
- #?(:cljs
- false
- :clj
- (try
- (require '[cljs.analyzer])
- ;; Ensure clojurescript is recent enough:
- (-> 'cljs.analyzer/var-meta resolve boolean)
- (catch Exception _ false))))
-
-;;;;
-;; userland
-
-
-;; utility function: since stubs return promises,
-;; we want to use p/let
-;; but p/let runs a thunk that is not env-aware so we fix that
-;; actually, since js promises can't block, we need a new fn
-;; to chain the value, hence we will always need to wrap any thunk
-;; in a `(with-env...)
-(defmacro env-let
- "Only useful for clojurescript. Wraps the `body` and each `bindings` val with a `(with-env current-env val)`), ensuring
- that if the binding value is function stub, its value will be unrapped
- with the same environment at the callsite.
-
- Uses `(promesa.core.cljc/let ...` under the hood so promises are resolved via
- a thunk with the current environment."
- [bindings & body]
- (let [env-sym (gensym)
- wrap-vals (fn [i b]
- (if (even? i)
- b
- `(w/with-env ~env-sym ~b)))
- wrapped (map-indexed wrap-vals bindings)]
- `(let [~env-sym (w/current-env)]
- (p/let ~wrapped
- (w/with-env ~env-sym
- ~@body)))))
-
-(defmacro vthread
- "Runs `body` within a virtual thread, returning a promise."
- [& body]
- `(binding [i/*env* (assoc i/*env* :vthread? true)]
- (do ~@body)))
-
-(defmacro defn-workflow
- "Defines a workflow. Workflows are functions that are resillient to crashes, as
- long as side-effects are run via activities."
- [sym argv & body]
- (let [wname (symbol (str sym "-"))
- sig (md5/string->md5-hex (str body))]
- ;; TODO save signature
- `(do
- (defn- ~wname ~argv (do ~@body))
- (defn ~sym ~argv
- ;; workflow should be called within a with-env block:
- ;; (with-env {:store ..}
- ;; (my-workflow ...
- ;; TODO: fixme: task id generator must be deterministic for a given workflow
- (assert (some? (:store i/*env*)) "Environment does not have a `:store`, did you call `\n(with-env {:store ..}\n\t(my-workflow ...` ?")
- (let [id# (or (:id i/*env*) (i/random-id))
- fvar# #'~wname
- ;; #'my-workflown-fn- => my-workflow-fn
- orig# (subs (str fvar#) 2 (dec (count (str fvar#))))]
- (trace! {:name (format "workflow: %s" orig#) :attributes {:task-id id#}}
- (let [ref# (:ref i/*env*)
- root# (:root i/*env*)
- ;; id can be passed by env if we're dequeuing a task from store
- task# (i/create-workflow-task ref# root# (symbol fvar#) (macros/case :cljs fvar# :clj (var-get fvar#)) ~argv id#)]
- (t/log! {:level :debug :_data {:env i/*env* :task task#}} ["Invoking task with id" (:id task#)])
- (add-event! ::w/enqueue-and-wait {})
- (w/enqueue-and-wait i/*env* task#))))))))
-
-(defmacro stub-function
- "Stubs `f`, wrapping it in an activity-aware function."
- [f]
- `(fn [& argv#]
- (assert (some? (:next-id i/*env*)) "No next-id function, are you inside `defn-workflow`?")
- (let [ref# (:ref i/*env*)
- root# (:root i/*env*)
- fvar# (var ~f)]
- ;; TODO we can use &form to determine eg checksum of activity
-
- ;; prepare call
- (let [store# (:store i/*env*)
- protos# (:protos i/*env*)
- id# ((:next-id i/*env*))
- ref# nil ;; no enqueued task => no ref
- task# (i/create-activity-task ref# root# (symbol fvar#) (macros/case :cljs fvar# :clj (var-get fvar#)) argv# id#)]
-
- ;; an embedded workflow engine doesn't need to have a task per invocation
- (t/log! {:level :debug :_data {:env i/*env* :task task#}} ["Invoking task with id " id#])
- (trace! {:name (format "activity: %s" (symbol fvar#)) :attributes {:task-id id#}}
- (w/enqueue-and-wait i/*env* task#)
- #_(let [res# (i/resume-task i/*env* store# protos# task#)]
- (macros/case
- :cljs res#
- :clj (deref res#))))))))
-
-(defmacro stub-protocol
- "Stub a protocol definition. Opts are currently unused.
- Example: `(stub-protocol EventHandler {:some-opts true})`"
- [proto & opts]
- (macros/case
- :cljs
- (when cljs-available?
- (let [resolved (api/resolve &env proto)
- curr-ns (:name (:ns &env))
- proto-ns (:ns resolved)
- in-proto-ns? (= curr-ns proto-ns)
- sig+args (-> (for [[sig val] (:sigs resolved)
- :let [arglist (:arglists val)
- qname (str (name proto-ns) "/" (name sig))
- invname (if in-proto-ns?
- (name sig)
- (str (namespace proto) "/" (name sig)))]]
- [(name sig) arglist (symbol invname) (symbol qname) (str (:name resolved))])
- (doall))]
- ;; TODO we can use &form to determine eg checksum of proto def
- `(reify ~proto
- ~@(for [[mname arglist invname qname pname] sig+args
- :let [sname (symbol mname)
- args (rest (first arglist))]]
- ;; implement ~sname
- `(~sname [this# ~@args]
- (let [aid# '~qname
- act-opts# ~(first opts)
- ref# (:ref i/*env*)
- root# (:root i/*env*)]
-
- ;; prepare call
- (let [store# (:store i/*env*)
- protos# (:protos i/*env*)
- id# ((:next-id i/*env*))
- ref# nil ;; no task => no ref
- task# (i/create-proto-activity-task
- (symbol ~pname)
- ref#
- root#
- (symbol aid#)
- ;aid# ;; >> doesn't work!
- ;; protos are not reified like in clj https://clojurescript.org/about/differences#_protocols
- ;; we create a "fake" fvar that can be invokeable just like the real thing
- (fn [& impl+args#] (apply ~qname impl+args#))
- [~@args]
- id#)]
-
- (t/log! {:level :debug :_data {:env i/*env* :task task#}} ["Invoking task with id" id#])
- ;(i/resume-task i/*env* store# protos# task#))))))))
- (w/enqueue-and-wait i/*env* task#))))))))
-
- :clj
- #_{:clj-kondo/ignore [:unresolved-symbol]}
- (let [proto-var (var-get (resolve proto))
- curr-ns (name (ns-name *ns*))
- proto-ns (namespace (symbol (subs (str (:var proto-var)) 2)))
- in-proto-ns? (= curr-ns proto-ns)
- sig+args (-> (for [[sig val] (:sigs proto-var)
- :let [arglist (:arglists val)
- qname (str (name proto-ns) "/" (name sig))
- invname (if in-proto-ns?
- (name sig)
- (str (namespace proto) "/" (name sig)))]]
- [(name sig) arglist (symbol invname) (symbol qname)])
- (doall))]
- `(reify ~proto
- ~@(for [[mname arglist invname qname] sig+args
- :let [sname (symbol mname)
- args (rest (first arglist))]]
- ;; implement ~sname
- `(~sname [this# ~@args]
- (let [aid# '~qname
- act-opts# ~(first opts)
- ref# (:ref i/*env*)
- root# (:root i/*env*)]
-
- ;; prepare call
- (let [store# (:store i/*env*)
- protos# (:protos i/*env*)
- id# ((:next-id i/*env*))
- ref# nil ;; no task => no ref
- task# (i/create-proto-activity-task
- (-> ~proto :var symbol)
- ref#
- root#
- (symbol aid#)
- (var-get (requiring-resolve aid#))
- [~@args]
- id#)]
-
- (t/log! {:level :debug :_data {:env i/*env* :task task#}} ["Invoking task with id" id#])
- (if (:vthread? i/*env*)
-
- (trace-async! {:name (format "activity: %s" aid#) :attributes {:task-id id# :protocol (-> ~proto :var symbol)}}
- @(i/resume-task i/*env* store# protos# task#))
- #_
- (trace! {:name (format "activity: %s" aid#) :attributes {:task-id id# :protocol (-> ~proto :var symbol)}}
- (w/enqueue-and-wait i/*env* task#))
- (trace! {:name (format "activity: %s" aid#) :attributes {:task-id id# :protocol (-> ~proto :var symbol)}}
- ;@(i/resume-task i/*env* store# protos# task#)))))))))))
- (w/enqueue-and-wait i/*env* task#)))))))))))
-
-(defmacro with-failure
- "Runs `body`, ensuring that if it fails, compensation will always run.
- - if `body` fails, `binding` will have the value `intemporal.activity/failure`.
- - if `body` succeeds, but compensation is invoked later (eg other activity failure), `binding` will have its return value
-
- (with-failure [v (book-hotel stub \"hotel\")]
- (cancel-hotel stub v n))
- "
- [[binding body] comp-fn]
- `(let [val# (atom :intemporal.activity/failure)]
- (w/add-compensation (fn [] (let [~binding @val#] (do ~comp-fn))))
- (reset! val# (do ~body))))
\ No newline at end of file
diff --git a/src2/intemporal/store.cljc b/src2/intemporal/store.cljc
deleted file mode 100644
index 3548f32..0000000
--- a/src2/intemporal/store.cljc
+++ /dev/null
@@ -1,340 +0,0 @@
-(ns intemporal.store
- (:require [clojure.tools.reader.edn :as edn]
- [intemporal.store.internal :as si]
- [promesa.core :as p]
- [taoensso.telemere :as t]
- #?(:clj [clojure.java.io :as io])
- #?(:clj [net.cgrand.macrovich :as macros]))
- #?(:cljs (:require-macros
- [net.cgrand.macrovich :as macros]
- [intemporal.store :refer [bfn]]))
- #?(:clj (:import [java.io File])))
-
-#?(:clj (set! *warn-on-reflection* true))
-
-(defmacro bfn
- "Like bound-fn on JVM; falls back to fn on CLJS."
- [args & body]
- (macros/case
- :clj `(clojure.core/bound-fn ~args ~@body)
- :cljs `(fn ~args ~@body)))
-
-;;;;
-;; main protos
-
-(defprotocol TaskStore
- (list-tasks [this] "Lists all tasks")
- (task<-panic [this task-id error]
- "Terminates the task via panic; events should not be stored")
- (task<-event [this task-id event-descr]
- "Transitions the task. The task should be dequeued beforehand. Returns the event.
- `event-descr` is one of:
- `{:sym 'ns/f :args [1]}`
- `{:sym 'ns/f :result :ok}`
- `{:sym 'ns/f :error }`
- ")
- (watch-task [this id callback]
- "Observes state changes, calling `callback` for any task that matches `predicate`.")
- (await-task [this id] [this id opts]
- "Waits for workflow to finish. Returns a deref'able value. Can throw.
- Opts include
- - `timeout-ms`: timeout for task await")
- (find-task [this id]
- "Finds the task on the db by id")
- (reenqueue-pending-tasks [this callback]
- "Marks all pending tasks belonging to the store's `owner` (or `nil` owner) as `new`")
- (release-pending-tasks [this]
- "Disowns all tasks that are pending for the store's `owner` (or `nil` owner), making them available")
- (enqueue-task [this task]
- "Atomically enqueues a protocol, workflow or activity task execution")
- (dequeue-task [this] [this opts]
- "Atomically dequeues some workflow, protocol or activity task.
- For deterministic purposes, should dequeue the oldest task first.
- If the task was deserialized, its `fvar` attribute must be a `fn`
- Opts:
- * `lease-ms`- duration of lease for dequeue. After lease expires, the task is eligible for dequeueing again")
- (clear-tasks [this]
- "Deletes all tasks"))
-
-(defprotocol HistoryStore
- (list-events [this] "Lists all events")
- (save-event [this task-id event] "Saves the event for the given task id. Returns the saved event")
- (all-events [this task-id] "Returns all the events for a given task id")
- (clear-events [this] "Deletes all events"))
-
-(defprotocol InternalVarStore
- (register [this sym var] "Register the symbol with the var")
- (lookup [this sym] "Finds the var for the given symbol"))
-
-;;;;
-;; helpers
-
-(defn now []
- #?(:clj (System/currentTimeMillis)
- :cljs (.getTime (js/Date.))))
-
-(def default-lease "Default lease time in millis - 15mins"
- (* 15 60 1000))
-
-(defn sym->var [store {:keys [sym fvar] :as task}]
- #?(:clj (or fvar (requiring-resolve sym))
- :cljs (or fvar (lookup store sym))))
-
-(defn- edn-exists? [file]
- #?(:clj (.exists (File. ^String file))
- :cljs (seq (.getItem (.-localStorage js/window) file))))
-
-(defn read-edn [file readers]
- #?(:clj (with-open [f (io/reader file)]
- (edn/read-string {:readers readers} (slurp f)))
- :cljs (let [f (.getItem (.-localStorage js/window) file)]
- (edn/read-string {:readers readers} f))))
-
-(defn write-edn [file val]
- #?(:clj (spit file val)
- :cljs (.setItem (.-localStorage js/window) file (pr-str val))))
-
-;;;;
-;; main impl
-;;
-
-(def default-owner "intemporal")
-
-(defn make-store
- "Creates a new memory-based store. All workflows will belong to the store's owner.
- When calling `release-pending-tasks` or `reenqueue-pending-tasks`, only tasks that either belong to the
- store's `owner` or have `owner = nil` will be picked up."
- ([]
- (make-store nil))
- ([{:keys [owner file readers failures]
- :or {owner default-owner
- failures {:validation 0}}}]
- ;; TODO use single atom?
- (let [tasks (atom {})
- history (atom {})
- counter (atom 0)
- pcounter (atom 0)
- ecounter (atom 0)
- tcounter (atom 0)
- vars (atom {})
- maybe-fail! (fn []
- (when (< (rand-int 100)
- (* 100 (get failures :validation)))
- (throw (ex-info "Forced error via failure rate" {:intemporal.workflow.internal/type :internal}))))
-
- ;;persistence
- persist! (fn [k ref old new]
- (when (and file (not= old new))
- (t/log! :debug ["Persisting store to file" file])
- (write-edn file {:tasks @tasks
- :history @history
- :counter @counter
- :pcounter @pcounter
- :ecounter @ecounter})))
-
- find-task (fn [this id]
- (get @tasks id))
-
- update-task (fn [this id attrs]
- (when-let [w (find-task this id)]
- (maybe-fail!)
- (si/validate-transition! w attrs)
- (->> (merge w attrs)
- (si/validate-task!)
- (swap! tasks assoc id))))]
-
- ;; deser the db
- (when file
- ;; add hooks to persist on change
- (add-watch tasks :persist persist!)
- (add-watch history :persist persist!)
- (add-watch counter :persist persist!)
- (add-watch pcounter :persist persist!)
- (add-watch ecounter :persist persist!)
-
- (when (edn-exists? file)
- (t/log! :info ["Reading store file" file])
- (let [data (read-edn file readers)]
- (reset! tasks (or (:tasks data) {}))
- (reset! history (or (:history data) {}))
- (reset! counter (or (:counter data) 0))
- (reset! pcounter (or (:pcounter data) 0))
- (reset! ecounter (or (:ecounter data) 0)))))
-
- (reify
- InternalVarStore
- (register [this sym var]
- #?(:cljs (swap! vars assoc sym var)))
- (lookup [this sym]
- #?(:clj (requiring-resolve sym)
- :cljs (get @vars sym)))
-
- HistoryStore
- (list-events [this]
- (apply concat (vals @history)))
- (save-event [this task-id event]
- (let [evt+id (assoc event :id (swap! counter inc))]
- (si/validate-event! evt+id)
- (swap! history (fn [v]
- (assoc v task-id (-> (or (get v task-id) [])
- (conj evt+id)))))
- evt+id))
- (all-events [this task-id]
- (get @history task-id))
-
- (clear-events [this]
- (reset! history {}))
-
- TaskStore
- (list-tasks [this]
- (filter #(or (= owner (:owner %))
- (nil? (:owner %)))
- (vals @tasks)))
-
- (task<-panic [this task-id error]
- (update-task this task-id {:result error}))
-
- (task<-event [this task-id {:keys [id ref root type sym args result error] :as event-descr}]
- ;; some redundancy between :result in task and event
- ;; note that we save the event first, because update-task can trigger some watchers
- ;; and they would expect the event to be present in the history
- (cond
- (some? args)
- (let [evt {:ref ref :root root :type type :sym sym :args args :error nil :result nil}]
- (when-not id
- (save-event this task-id evt))
- (update-task this task-id {:state :pending})
- evt)
-
- (some? error)
- (let [evt {:ref ref :root root :type type :sym sym :args nil :error error :result nil}]
- (when-not id
- (save-event this task-id evt))
- (update-task this task-id {:state :failure :result error})
- evt)
-
- ;;(some? result) ;result can be nil
- :else
- (let [evt {:ref ref :root root :type type :sym sym :args nil :error nil :result result}]
- (when-not id
- (save-event this task-id evt))
- (update-task this task-id {:state :success :result result})
- evt)))
-
- (find-task [this id]
- (->> (vals @tasks)
- (filter #(= (:id %) id))
- (first)))
-
- (watch-task [this id f]
- (let [k (keyword (str "watcher-" (swap! pcounter inc)))
- watchfn (fn [k atm old new]
- (let [xf (comp
- (filter #(= id (:id %)))
- (filter #(not= (get old (:id %)) %))
- (take 1))
- changeset (transduce xf conj (vals new))]
-
- (when (and (first changeset)
- (f (first changeset)))
- (remove-watch tasks k))))]
- (add-watch tasks k watchfn)))
-
- (await-task [this id]
- (await-task this id {:timeout-ms default-lease}))
-
- (await-task [this id {:keys [timeout-ms] :as opts}]
- (maybe-fail!)
- (let [task (find-task this id)
- deferred (p/deferred)
- wrap-result (fn [{:keys [result] :as task}]
- (cond
- (si/success? task) (p/resolved result)
- (si/failure? task) (p/rejected result)
- :else (p/rejected (ex-info "Unknown state" {:task task}))))]
-
- (if (si/terminal? task)
- (wrap-result task)
- ;;else
- (do
- (watch-task this id (bfn [task]
- (when (si/terminal? task)
- (p/resolve! deferred task)
- true)))
- ;; wait for resolution
- ;; remember: js doesnt have blocking op so we need to chain
- (-> (p/timeout deferred timeout-ms ::timeout)
- (p/then (bfn [resolved]
- (if (= ::timeout resolved)
- (throw (ex-info "Timeout waiting for task to be completed" {:task task}))
- (wrap-result resolved)))))))))
-
- (release-pending-tasks [this]
- (swap! tasks
- update-vals
- (fn [{:keys [state] :as task}]
- (cond-> task
- (and (= :pending state)
- (= (:owner task) owner))
- (assoc :owner nil)))))
-
- (reenqueue-pending-tasks [this f]
- (let [task->run? (atom #{})]
- (swap! tasks
- update-vals
- (fn [{:keys [state] :as task}]
- (if (and (= :pending state)
- (or (= (:owner task) owner)
- (nil? (:owner task))))
- (try
- ;; ensure we only run f once - swap! might run the fn multiple times
- (assoc task :state :new :owner owner)
- ;; TODO log reenqueued task
- (finally
- (when-not (contains? @task->run? task)
- (try
- (f task)
- (finally
- (swap! task->run? conj task))))))
- ;; else
- task)))))
-
- (enqueue-task [this task]
- (maybe-fail!)
- (let [task+owner (assoc task :owner owner :order (swap! tcounter inc))]
- (si/validate-task! task+owner)
- (swap! tasks assoc (:id task) task+owner)
- #?(:cljs (register this (:sym task+owner) (:fvar task+owner)))
- task+owner))
-
- (dequeue-task [this]
- (dequeue-task this {:lease-ms nil}))
-
- (dequeue-task [this {:keys [lease-ms]}]
- (let [first-new (fn [v] (->> (vals v)
- (filter #(and
- (or (= owner (:owner %)) (nil? (:owner %)))
- (or (= :new (:state %))
- (some-> (:lease-end %)
- (< (now))))))
- (sort-by :order)
- (first)))
- found? (atom nil)]
-
- (swap-vals! tasks
- (fn [v] (let [found (first-new v)]
- (if found
- (->> (assoc found :state :pending
- :fvar (sym->var this found)
- ;; watch for overflow?
- :lease-end (when lease-ms
- (+ (now) lease-ms)))
- (reset! found?)
- (assoc v (:id found)))
- v))))
- ;; highest first
- (->> @found?)))
-
- (clear-tasks [this]
- (reset! tasks {}))))))
-
diff --git a/src2/intemporal/store/foundationdb.clj b/src2/intemporal/store/foundationdb.clj
deleted file mode 100644
index a12dbf0..0000000
--- a/src2/intemporal/store/foundationdb.clj
+++ /dev/null
@@ -1,255 +0,0 @@
-(ns intemporal.store.foundationdb
- (:require [intemporal.store :as store]
- [intemporal.workflow.internal :as i]
- [intemporal.store.internal :as si :refer [resolve-fvar serialize deserialize next-id]]
- [me.vedang.clj-fdb.FDB :as cfdb]
- [me.vedang.clj-fdb.core :as fc]
- [me.vedang.clj-fdb.transaction :as ftr]
- [me.vedang.clj-fdb.subspace.subspace :as fsub]
- [promesa.core :as p])
- (:import [com.apple.foundationdb FDB FDBTransaction KeyValue]
- [com.apple.foundationdb.tuple Tuple]))
-
-;; FDB is a KV store; this store impl will use the subspace feature for namespacing
-;; => task
-;; => event
-;; event ids are scoped to a task
-;; values are (de)serialized via nippy
-
-(def fdb-api-version cfdb/clj-fdb-api-version)
-
-(defmacro with-tx [binding & body]
- (let [[tx-sym db-sym] binding
- database (with-meta db-sym {:tag 'com.apple.foundationdb.Database})]
- ;; TODO type hint Closeable?
- `(with-open [db# ~database]
- (ftr/run db#
- (fn [~tx-sym] (do ~@body))))))
-
-(defn make-store
- ([]
- (make-store nil))
- ([{:keys [owner cluster-file-path]
- :or {owner store/default-owner}}]
- (let [^FDB fdb (cfdb/select-api-version fdb-api-version)
- open-db #(if cluster-file-path
- (cfdb/open fdb cluster-file-path)
- (cfdb/open fdb))
- subspace-tasks (fsub/create ["tasks"])
- subspace-owned-tasks (fsub/create [(str owner "_tasks")])
- subspace-history (fsub/create ["history"])]
- (reify
- store/InternalVarStore
- (register [this sym var])
- (lookup [this sym]
- (requiring-resolve sym))
-
- store/HistoryStore
- (list-events [this]
- (-> (with-tx [tx (open-db)]
- (fc/get-range tx subspace-history {:valfn deserialize}))
- (vals)))
-
- (save-event [this task-id {:keys [type ref root sym args result] :as event}]
- (si/validate-serializable! args "Event args should be serializable")
- (si/validate-serializable! result "Event result should be serializable")
- (let [evt-id (next-id)
- evt+id (assoc event :id evt-id)]
- (si/validate-serializable! evt+id "Event should be serializable")
- (si/validate-event! evt+id)
-
- (with-tx [tx (open-db)]
- (fc/set tx subspace-history [task-id evt-id] (serialize evt+id)))
- evt+id))
-
- (all-events [this task-id]
- (-> (with-tx [tx (open-db)]
- (fc/get-range tx subspace-history [task-id] {:valfn deserialize}))
- (vals)))
-
- (clear-events [this]
- (with-tx [tx (open-db)]
- (fc/clear-range tx subspace-history)))
-
- store/TaskStore
- (list-tasks [this]
- (let [owned (-> (with-tx [tx (open-db)]
- (fc/get-range tx subspace-owned-tasks {:valfn (comp resolve-fvar deserialize)}))
- (vals))
- free (-> (with-tx [tx (open-db)]
- (fc/get-range tx subspace-tasks {:valfn (comp resolve-fvar deserialize)}))
- (vals))]
- (into owned free)))
-
- (task<-panic [this task-id error]
- (with-tx [tx (open-db)]
- (let [task (fc/get tx subspace-owned-tasks task-id {:valfn (comp resolve-fvar deserialize)})
- updated-task (assoc task :result error)]
- (when task
- (si/validate-task! updated-task)
- (fc/set tx subspace-owned-tasks task-id (serialize updated-task))))))
-
- (task<-event [this task-id {:keys [id ref root type sym args result error] :as event-descr}]
- ;; some redundancy between :result in task and event
- ;; note that we save the event first, because update-task can trigger some watchers
- ;; and they would expect the event to be present in the history
- (with-tx [tx (open-db)]
- (let [task (fc/get tx subspace-owned-tasks task-id {:valfn (comp resolve-fvar deserialize)})
- evt {:ref ref :root root :type type :sym sym :args args}
- updated-task (cond
- (some? args) (assoc task :state :pending)
- (some? error) (assoc task :state :failure :result error)
- :else (assoc task :state :success :result result))
- updated-evt (cond
- (some? args) (assoc evt :args args)
- (some? error) (assoc evt :error error)
- :else (assoc evt :result result))]
- (si/validate-serializable! task "Task should be serializable")
- (when-not id
- (store/save-event this task-id updated-evt))
- ;; not every invocation will come from a persisted task
- (when task
- (si/validate-task! updated-task)
- (si/validate-transition! task updated-task)
- (fc/set tx subspace-owned-tasks task-id (serialize updated-task)))
- updated-evt)))
-
- (find-task [this id]
- (with-tx [^FDBTransaction tx (open-db)]
- (when-let [task? (fc/get tx subspace-owned-tasks id)]
- (resolve-fvar (deserialize task?)))))
-
- (watch-task [this id f]
- (let [watch? (atom true)]
- (i/libthread (format "Watcher-%s" id)
- (while @watch?
- @(with-tx [^FDBTransaction tx (open-db)]
- (when (fc/get tx subspace-owned-tasks id)
- (.watch tx (fsub/pack subspace-owned-tasks (Tuple/from (object-array [id]))))))
-
- (with-tx [^FDBTransaction tx (open-db)]
- (when-let [task? (fc/get tx subspace-owned-tasks id)]
- (when (f (resolve-fvar (deserialize task?)))
- (reset! watch? false))))))))
-
- (await-task [this id]
- (store/await-task this id {:timeout-ms store/default-lease}))
-
- (await-task [this id {:keys [timeout-ms] :as opts}]
- (let [task (store/find-task this id)
- deferred (p/deferred)
- wrap-result (fn [{:keys [state result] :as task}]
- (cond
- (si/success? task) (p/resolved result)
- (si/failure? task) (p/rejected result)
- :else (p/rejected (ex-info "Unknown state" {:task task}))))]
-
- (if (si/terminal? task)
- (wrap-result task)
- ;;else
- (do
- (store/watch-task this id (fn [{:keys [state] :as task}]
- (when (si/terminal? task)
- (p/resolve! deferred task)
- true)))
- ;; wait for resolution
- (-> (p/timeout deferred timeout-ms ::timeout)
- (p/then (fn [resolved]
- (if (= ::timeout resolved)
- (throw (ex-info "Timeout waiting for task to be completed" {:task task}))
- (wrap-result resolved)))))))))
-
- (release-pending-tasks [this]
- (with-tx [tx (open-db)]
- (let [owned-tasks @(.asList (ftr/get-range tx (fsub/range subspace-owned-tasks)))]
- (doseq [kv owned-tasks]
- (let [task (-> kv .getValue deserialize resolve-fvar)]
- (when (= :pending (:state task))
- (fc/set tx subspace-tasks [(:id task)] (serialize (assoc task :owner nil)))
- (fc/clear tx subspace-owned-tasks (:id task))))))))
-
- (reenqueue-pending-tasks [this f]
- (with-tx [tx (open-db)]
- (let [owned-tasks @(.asList (ftr/get-range tx (fsub/range subspace-owned-tasks)))
- free-tasks @(.asList (ftr/get-range tx (fsub/range subspace-tasks)))]
- (doseq [kv owned-tasks]
- (let [task (-> kv .getValue deserialize resolve-fvar)]
- (when (= :pending (:state task))
- (f task)
- (fc/set tx subspace-owned-tasks [(:id task)] (serialize (assoc task :state :new))))))
-
- (doseq [kv free-tasks]
- (let [task (-> kv .getValue deserialize resolve-fvar)]
- (when (= :pending (:state task))
- (f task)
- (fc/clear tx subspace-tasks (:id task))
- (fc/set tx subspace-owned-tasks [(:id task)] (serialize (assoc task :state :new :owner owner)))))))))
-
- (enqueue-task [this task]
- (let [task+owner (assoc task :owner owner)
- task-id (:id task+owner)]
- (si/validate-serializable! task+owner "Task should be serializable")
- (si/validate-task! task+owner)
-
- (with-tx [tx (open-db)]
- (fc/set tx subspace-owned-tasks [task-id] (serialize (dissoc task+owner :fvar))))
- task+owner))
-
- (dequeue-task [this]
- (store/dequeue-task this {:lease-ms nil}))
-
- (dequeue-task [this {:keys [lease-ms]}]
- (let [dequeuable? (fn [{:keys [state lease-end]}]
- (or (= :new state)
- (some-> lease-end
- (< (store/now)))))
- update-task (fn [task]
- (assoc task
- :owner owner
- :state :pending
- :fvar (store/sym->var this task)
- :lease-end (when lease-ms (+ (store/now) lease-ms))))
- found? (with-tx [tx (open-db)]
- (reduce
- (fn [_ ^KeyValue kv]
- (let [task (-> kv .getValue deserialize resolve-fvar)]
- (when (dequeuable? task)
- (let [updated-task (update-task task)]
- (fc/set tx subspace-owned-tasks [(:id task)] (serialize (dissoc updated-task :fvar)))
- (reduced updated-task)))))
- nil
- (ftr/get-range tx (fsub/range subspace-owned-tasks))))]
-
- ;; if we cant find any task that we own,
- ;; try the tasks that were released
- (if found?
- found?
- (with-tx [tx (open-db)]
- (reduce
- (fn [_ ^KeyValue kv]
- (let [task (-> kv .getValue deserialize resolve-fvar)]
- (when (dequeuable? task)
- (let [updated-task (update-task task)]
- (fc/clear tx subspace-tasks (:id task))
- (fc/set tx subspace-owned-tasks [(:id task)] (serialize (dissoc updated-task :fvar)))
- (reduced updated-task)))))
- nil
- (ftr/get-range tx (fsub/range subspace-tasks)))))))
-
- (clear-tasks [this]
- (with-tx [tx (open-db)]
- (fc/clear-range tx subspace-owned-tasks)))))))
-
-
-(comment
- (def s (make-store {:cluster-file-path "docker/fdb.cluster"}))
- (def t (i/create-workflow-task "ref#" "root#" 'clojure.core/+ (var-get #'+) [] 1))
-
- (store/save-event s 1 {:a 1})
- (store/list-events s)
- (store/list-tasks s)
-
- (store/enqueue-task s t)
- (store/dequeue-task s))
-
-;(store/watch-task s 1 (partial println ">>>"))
\ No newline at end of file
diff --git a/src2/intemporal/store/internal.cljc b/src2/intemporal/store/internal.cljc
deleted file mode 100644
index 3c302b6..0000000
--- a/src2/intemporal/store/internal.cljc
+++ /dev/null
@@ -1,139 +0,0 @@
-(ns intemporal.store.internal
- #?(:clj (:require [intemporal.error :as error]
- [taoensso.nippy :as nippy]
- [malli.core :as m])
- :cljs (:require [clojure.edn :as edn]
- [intemporal.error :as error]
- [malli.core :as m])))
-
-(defn next-id []
- #?(:clj (System/currentTimeMillis)
- :cljs (.getTime (js/Date.))))
-
-;;;;
-;; serialization
-
-(defn resolve-fvar [{:keys [sym] :as task}]
- ;; TODO does it work in cljs?
- (assoc task :fvar #?(:clj (requiring-resolve sym) :cljs nil)))
-
-(defn serializable?
- "Indicates if an object is serializable"
- [x]
- #?(:clj (nippy/freezable? x {:allow-java-serializable? true?})
- :cljs true))
-
-(defn serialize
- "Serializes an object"
- [x]
- (when x
- #?(:clj (nippy/freeze x)
- :cljs (pr-str x))))
-
-(defn deserialize
- "Deserializes an object"
- [x]
- (when x
- #?(:clj (nippy/thaw x)
- :cljs (edn/read x))))
-
-;;;;
-;; validation
-
-#_:clj-kondo/ignore
-#?(:clj (when (= "true" (System/getenv "DEV"))
- ((requiring-resolve 'malli.dev/start!))))
-
-;;;;
-;; validation
-(def registry
- (merge
- (m/class-schemas)
- (m/comparator-schemas)
- (m/base-schemas)
- (m/type-schemas)
- {:var (m/-simple-schema {:type :var, :pred #(or (fn? %) (var? %))})}))
-
-(def ^:private RuntimeConfig
- [:map {:closed false}
- [:timeout-ms {:optional true} :int]
- [:telemetry-context {:optional true} [:maybe :map]]])
-
-(def ^:private Task
- [:map {:closed true}
- [:id [:or :string :uuid]]
- [:owner [:maybe :string]]
- [:sym :symbol]
- [:ref [:maybe :string]]
- [:root [:maybe :string]]
- [:proto {:optional true} :symbol]
- [:fvar :var]
- [:args {:optional true} [:maybe [:sequential :any]]]
- [:result :any]
- [:state [:enum :new :pending :failure :success]]
- [:type [:enum :workflow :activity :proto-activity]]
- [:lease-end {:optional true} [:maybe :int]]
- [:order {:optional true} :int]
- [:runtime {:optional true} RuntimeConfig]])
-
-(def ^:private Event
- [:map {:closed true}
- [:id :int]
- [:ref [:maybe :string]]
- [:root [:maybe :string]]
- [:type [:enum
- :intemporal.workflow/invoke :intemporal.workflow/success :intemporal.workflow/failure
- :intemporal.activity/invoke :intemporal.activity/success :intemporal.activity/failure
- :intemporal.protocol/invoke :intemporal.protocol/success :intemporal.protocol/failure
- :intemporal.workflow.internal/failure]]
- [:sym :symbol]
- [:args {:optional true} [:maybe [:sequential :any]]]
- [:result {:optional true} :any]
- [:error {:optional true} :any]])
-
-;; valid task states
-(def valid-state-transitions {:new #{:pending}
- :pending #{:new :success :failure}})
-
-(defn validate-transition!
- "Ensures that the task's new `:state`, if any, is allowed.
- Useful to implement compare-and-swap semantics"
- [{:keys [state id]} attrs]
- (let [next-states (get valid-state-transitions state)]
- ;; if we are updating state
- ;; and the new state is not allowed
- ;; error out
- (when (and (contains? attrs :state)
- (not= (:state attrs) state)
- (not (contains? next-states (:state attrs))))
- (throw (ex-info (str "Cannot update task with id " id " from state " state " to " (:state attrs)) {:task-id id
- :state state
- :next-state (:state attrs)})))))
-(def validate-task!
- "Throws if the task is not valid"
- (m/coercer Task nil {:registry registry}))
-
-(def validate-event!
- "Throws if the event is not valid"
- (m/coercer Event nil {:registry registry}))
-
-(defn validate-serializable!
- "Throws if the object is not serializable"
- ([obj]
- (validate-serializable! obj "Object is not serializable"))
- ([obj msg]
- (when-not (serializable? obj)
- (throw (ex-info msg {:object obj})))))
-
-
-(defn success? [{:keys [state] :as task}]
- (= :success state))
-
-(defn failure? [{:keys [state result] :as task}]
- (or (= :failure state)
- (and (= :pending state)
- (error/panic? result))))
-
-(defn terminal? [task]
- (or (success? task)
- (failure? task)))
\ No newline at end of file
diff --git a/src2/intemporal/store/jdbc.clj b/src2/intemporal/store/jdbc.clj
deleted file mode 100644
index 07dc171..0000000
--- a/src2/intemporal/store/jdbc.clj
+++ /dev/null
@@ -1,290 +0,0 @@
-(ns intemporal.store.jdbc
- (:require [hikari-cp.core :as hikari]
- [intemporal.store :as store]
- [intemporal.workflow.internal :as i]
- [intemporal.store.internal :as si :refer [serialize deserialize]]
- [migratus.core :as migratus]
- [next.jdbc :as jdbc]
- [next.jdbc.sql.builder :as builder]
- [next.jdbc.result-set :as rs]
- [promesa.core :as p])
- (:import [java.sql Timestamp]
- [java.util Date]))
-
-(comment
- (let [cfg {:store :database
- :migration-dir "migrations/postgres"
- :watch-polling-ms 100
- :db {:jdbcUrl "jdbc:postgresql://localhost:5432/root?user=root&password=root"}}]
- (migratus/rollback cfg)
- (migratus/migrate cfg))
- "")
-
-;;;;
-;; utilities
-
-(defn- kw->db [kw]
- (when kw
- (if (keyword? kw)
- (.substring (str kw) 1)
- (name kw))))
-
-(defn- db->kw [v]
- (when v (keyword v)))
-
-(defn- db->task [{:keys [id proto type ref root sym args result state lease_end runtime owner] :as task}]
- (let [dargs (deserialize args)
- dresult (deserialize result)
- druntime (deserialize runtime)
- ssym (symbol sym)
- sproto (when proto (symbol proto))
- kstate (db->kw state)]
- (cond-> (condp = type
- "workflow" (i/create-workflow-task ref root ssym (resolve ssym) dargs id dresult kstate druntime)
- "activity" (i/create-activity-task ref root ssym (resolve ssym) dargs id dresult kstate druntime)
- "proto-activity" (i/create-proto-activity-task sproto ref root ssym (resolve ssym) dargs id dresult kstate druntime))
- lease_end (assoc :lease-end lease_end)
- owner (assoc :owner owner))))
-
-(defn- db->event [{:keys [id type ref root sym args result] :as event}]
- (let [dargs (deserialize args)
- dresult (deserialize result)]
- (assoc event
- :type (db->kw type)
- :ref ref
- :root root
- :sym (symbol sym)
- :args dargs
- :result dresult)))
-
-;;;;
-;; main
-
-(defn make-store
- "Creates a new Postgres-based store."
- [{:keys [owner migration-dir migrate? watch-polling-ms jdbcUrl]
- :or {owner store/default-owner migrate? true watch-polling-ms 100} :as opts}]
- (let [db-spec (-> opts
- (dissoc :migration-dir :migrate? :watch-polling-ms)
- (assoc :jdbc-url jdbcUrl))
- datasource (hikari/make-datasource db-spec)
- config {:store :database
- :migration-dir migration-dir
- :db db-spec}
- default-opts {:builder-fn rs/as-unqualified-lower-maps}]
-
- (when migrate?
- (migratus/migrate config))
-
- (reify
- store/InternalVarStore
- (register [this sym var])
- (lookup [this sym]
- (requiring-resolve sym))
-
- store/HistoryStore
- (list-events [this]
- (->> (jdbc/with-transaction [tx datasource]
- (jdbc/execute! tx ["select * from events"] default-opts))
- (map db->event)))
-
- (save-event [this task-id {:keys [type ref root sym args result] :as event}]
- (si/validate-serializable! args "Event args should be serializable")
- (si/validate-serializable! result "Event result should be serializable")
- (si/validate-event! (assoc event :id Integer/MAX_VALUE))
-
- (let [args (serialize args)
- result (serialize result)
- res (jdbc/with-transaction [tx datasource]
- (jdbc/execute-one! tx ["INSERT INTO events(type, ref, root, sym, args, result) values (?,?,?,?,?,?) RETURNING id"
- (kw->db type) ref root (str sym) args result]
- default-opts))]
- (assoc event :id (:id res))))
-
- (all-events [this task-id]
- (->> (jdbc/with-transaction [tx datasource]
- (jdbc/execute! tx ["select * from events where ref=?" task-id] default-opts))
- (map db->event)))
-
- (clear-events [this]
- (jdbc/with-transaction [tx datasource]
- (jdbc/execute! tx ["delete from events"])))
-
- store/TaskStore
- (list-tasks [this]
- (->> (jdbc/with-transaction [tx datasource]
- (jdbc/execute! tx ["select * from tasks where (owner is null or owner=?)" owner] default-opts))
- (map db->task)))
-
- (task<-panic [this task-id error]
- (jdbc/with-transaction [tx datasource]
- (let [updated-task {:result (serialize error)}]
- (jdbc/execute-one! tx (builder/for-update "tasks" updated-task {:id task-id} default-opts)))))
-
- (task<-event [this task-id {:keys [id ref root type sym args result error] :as event-descr}]
- ;; some redundancy between :result in task and event
- ;; note that we save the event first, because update-task can trigger some watchers
- ;; and they would expect the event to be present in the history
- (jdbc/with-transaction [tx datasource]
- (let [evt {:ref ref :root root :type type :sym sym :args args}
- expected-state (cond
- (some? args) :new
- (or (some? result) (some? error)) :pending
- :else :unknown)
- updated-task (cond
- (some? args) {:state (kw->db :pending) :args (serialize args)}
- (some? error) {:state (kw->db :failure) :result (serialize error)}
- :else {:state (kw->db :success) :result (serialize result)})
- updated-evt (cond
- (some? args) (assoc evt :args args)
- (some? error) (assoc evt :error error)
- :else (assoc evt :result result))]
-
- (when-not id
- (store/save-event this task-id updated-evt))
- ;; cant really validate because its a partial task
- ;(validate-task! updated-task)
- (let [updated (jdbc/execute-one! tx (builder/for-update "tasks" updated-task {:id task-id :state (name expected-state)} default-opts))]
- (when (empty? updated)
- (throw (ex-info (format "Cannot update task with id %s, expected state %s did not match" id expected-state)
- {:task-id id :expected-state expected-state})))
- updated-evt))))
-
- (find-task [this id]
- (some-> (jdbc/with-transaction [tx datasource]
- (jdbc/execute-one! tx ["select * from tasks where id=?" id] default-opts))
- (db->task)))
-
- (watch-task [this id f]
- (let [query-state! (fn []
- (jdbc/with-transaction [tx datasource]
- (jdbc/execute-one! tx ["select state from tasks where id=?" id] default-opts)))
- state (query-state!)
- watch? (atom true)]
- (i/libthread (format "Watcher-%s" id)
- (while (and @watch? state)
- (Thread/sleep (long watch-polling-ms))
- (when (not= state (query-state!))
- (let [task (some-> (jdbc/with-transaction [tx datasource]
- (jdbc/execute-one! tx ["select * from tasks where id=?" id] default-opts))
- (db->task))]
- (when (and task (f task))
- (reset! watch? false))))))))
-
- (await-task [this id]
- (store/await-task this id {:timeout-ms store/default-lease}))
-
- (await-task [this id {:keys [timeout-ms] :as opts}]
- ;; TODO use owner
- ;; TODO use promise if available
- (let [task (store/find-task this id)
- deferred (p/deferred)
- wrap-result (fn [{:keys [state result] :as task}]
- (cond
- (si/success? task) (p/resolved result)
- (si/failure? task) (p/rejected result)
- :else (p/rejected (ex-info "Unknown state" {:task task}))))]
-
- (if (si/terminal? task)
- (wrap-result task)
- ;;else
- (do
- (store/watch-task this id (fn [task]
- (when (si/terminal? task)
- (p/resolve! deferred task)
- true)))
- ;; wait for resolution
- (-> (p/timeout deferred timeout-ms ::timeout)
- (p/then (fn [resolved]
- (if (= ::timeout resolved)
- (throw (ex-info "Timeout waiting for task to be completed" {:task task}))
- (wrap-result resolved)))))))))
-
- (release-pending-tasks [this]
- (jdbc/with-transaction [tx datasource]
- (jdbc/execute-one! tx ["update tasks set owner=null where owner=?" owner])))
-
- (reenqueue-pending-tasks [this f]
- (let [tasks? (jdbc/with-transaction [tx datasource]
- (let [tasks (jdbc/execute! tx ["select * from tasks where state='pending' and (owner is null or owner=?)" owner] default-opts)]
- (jdbc/execute-one! tx ["update tasks set state='new', owner=? where id = ANY(?)" owner
- (into-array String (mapv :id tasks))])
- (doseq [row tasks]
- (f (db->task row)))
- tasks))]
- tasks?))
-
- (enqueue-task [this {:keys [id proto type ref root sym args result state lease-end runtime] :as task}]
- (assert (or (nil? proto) (some? (:on proto)) "Task protocol not valid, missing :on attribute"))
-
- (let [task+owner (assoc task :owner owner)]
- (si/validate-serializable! args "Task args should be serializable")
- (si/validate-serializable! result "Task result should be serializable")
- (si/validate-serializable! runtime "Task runtime should be serializable")
- (si/validate-task! task+owner)
-
- (let [proto? (cond (symbol? proto) (str proto)
- (some? (:on proto)) (str (:on proto))
- (string? proto) proto)
- args (serialize args)
- result (serialize result)
- runtime (serialize runtime)]
- (jdbc/with-transaction [tx datasource]
- (jdbc/execute! tx ["INSERT INTO tasks(id,owner,proto,type,ref,root,sym,args,result,state,lease_end,runtime) values (?,?,?,?,?,?,?,?,?,?,?,?) RETURNING id"
- id owner proto? (kw->db type) (kw->db ref) (kw->db root) (str sym) args result (kw->db state) lease-end runtime])))
- task+owner))
-
- (dequeue-task [this]
- (store/dequeue-task this {:lease-ms nil}))
-
- (dequeue-task [this {:keys [lease-ms]}]
- ;; TODO check owner
- ;; TODO select for update skip locked
- (let [query "select * from tasks where (owner=? or owner is null) and (state='new' or lease_end < now()) order by id asc limit 1"
- found? (jdbc/with-transaction [tx datasource]
- (when-let [task (some-> (jdbc/execute-one! tx [query owner] default-opts)
- (db->task))]
- (let [lease-epoch (when lease-ms
- (* 1000 (+ (store/now) lease-ms)))
- lease-ts (when lease-epoch
- (-> (Date. (long lease-epoch))
- (.toInstant)
- (Timestamp/from)))]
- (jdbc/execute-one! tx ["update tasks set state='pending', lease_end=? where id=?" lease-ts (:id task)])
- (assoc task
- :state :pending
- :fvar (store/sym->var this task)
- :lease-end (when lease-epoch
- (/ lease-epoch 1000))))))]
- found?))
-
- (clear-tasks [this]
- (jdbc/with-transaction [tx datasource]
- (jdbc/execute! tx ["delete from tasks"]))))))
-
-#_:clj-kondo/ignore
-(comment
- (require '[intemporal.workflow.internal :as i])
- (defprotocol TestProto
- (a [this] "x"))
- (def s (make-store {:jdbcUrl "jdbc:postgresql://localhost:5432/root?user=root&password=root"
- :migration-dir "migrations/postgres"}))
-
- (def t (i/create-workflow-task nil nil 'clojure.core/+ (var-get #'+) [{:a 1}]))
- (def t2 (i/create-proto-activity-task TestProto (:id t) (:id t) 'clojure.core/+ (var-get #'+) [{:b 2 :c #inst "2011-01-01"}]))
-
- (store/clear-events s)
- (store/clear-tasks s)
-
- (store/enqueue-task s t)
- (store/enqueue-task s t2)
-
- (store/list-tasks s)
-
- (store/watch-task s (:id t) (fn [task] (println "CALLBACK" task)))
-
- (prn (store/dequeue-task s))
- (store/save-event s 1 {:type :intemporal.workflow/invoke
- :ref (:id t) :root (:id t) :sym 'clojure.core/+ :args [] :result [1]})
-
- (store/list-events s))
diff --git a/src2/intemporal/workflow.cljc b/src2/intemporal/workflow.cljc
deleted file mode 100644
index 1d05a24..0000000
--- a/src2/intemporal/workflow.cljc
+++ /dev/null
@@ -1,222 +0,0 @@
-(ns intemporal.workflow
- (:require [intemporal.store :as store]
- [intemporal.workflow.internal :as internal]
- [intemporal.error :as error]
- [promesa.core :as p]
- [taoensso.telemere :as t])
- #?(:cljs (:require-macros
- #_:clj-kondo/ignore
- [intemporal.workflow.internal :refer [with-env-internal trace! trace-async!]]
- [intemporal.workflow :refer [with-env]]))
- #?(:clj (:require [intemporal.error :as error]
- [intemporal.workflow.internal :refer [trace! trace-async!]]
- [steffan-westcott.clj-otel.context :as otctx]))
- #?(:clj (:import [java.util.concurrent Executors TimeUnit]
- [java.lang AutoCloseable])))
-
-#?(:clj (set! *warn-on-reflection* true))
-
-;;;;
-;; runtime
-
-(defmacro with-env
- "Creates a new environment for workflow execution. Options:
- - `:store`: the underlying store to persist workflow metadata
- - `:id`: optional workflow id
- - `:timeout-ms`: optional timeout for workflow execution
- "
- [m & body]
- `(internal/with-env-internal ~m (do ~@body)))
-
-(defn current-env
- "Returns the workflow execution environment for the current thread"
- []
- (assert (some? internal/*env*) "No workflow env detected, should only be called within a workflow function")
- internal/*env*)
-
-(defn workflow-id
- "Returns the current workflow uuid"
- []
- (assert (some? internal/*env*) "No workflow env detected, should only be called within a workflow function")
- (-> internal/*env* :root))
-
-;;;;
-;; worker
-(defprotocol ITaskExecutor
- (submit [this f] "Submits the function `f` for execution")
- (shutdown [this grace-period-ms] "Shuts down the task executor")
- (terminated? [this] "Indicates if the executor has terminated")
- (shutting-down? [this] "Indicates if the executor has entered shutdown state"))
-
-(defn make-task-executor
- "Creates an object that satisfies `ITaskExecutor`."
- []
- (let [terminated? (atom false)
- shutdown? (atom false)]
- #?(:cljs
- (reify ITaskExecutor
- (submit [_ f]
- (when (not @terminated?)
- (p/vthread (f))))
- (shutdown [_ grace-period-ms]
- (t/log! {:level :debug} ["Executor shutdown"])
- (reset! terminated? true)
- (reset! shutdown? true))
- (terminated? [_] @terminated?)
- (shutting-down? [_] @shutdown?))
- :clj
- (let [factory (-> (Thread/ofVirtual)
- (.name "Task Thread")
- (.factory))
- exec (Executors/newThreadPerTaskExecutor factory)]
- (reify
- ITaskExecutor
- (submit [_ f]
- (.submit exec ^Runnable f))
- (shutdown [_ grace-period-ms]
- (try
- ;; reject tasks
- (.shutdown exec)
- (reset! shutdown? true)
- (t/log! {:level :debug} ["Executor shutdown"])
- ;; await ongoing tasks
- (when-not (.awaitTermination exec grace-period-ms TimeUnit/MILLISECONDS)
- (t/log! {:level :debug} ["Executor shutdown grace period over, shutting down NOW"])
- (.shutdownNow exec))
- ;; in case we got interrupted exception, make sure to set the flag
- ;; so ongoing ops fail
- (finally
- (reset! terminated? true))))
- (terminated? [_]
- @terminated?)
- (shutting-down? [_]
- @shutdown?)
- ;; allow expressions like (with-open [executor (w/start-poller ....
- AutoCloseable
- (close [this]
- (shutdown this 0)))))))
-
-(defn- worker-execute-fn
- "Executes a given protocol, activity or workflow `task`"
- [store protocols {:keys [type id root runtime fvar] :as task} task-counter terminated? shutting-down?]
- (let [runtime (:runtime task)
- base-env {:store store
- :type type
- :ref id
- :id id
- :root (or root id)
- :protos protocols
- :next-id (fn [] (str (or root id) "-" (swap! task-counter inc)))
- :terminated? terminated?
- :shutdown? shutting-down?}
- internal-env (merge internal/default-env base-env runtime)]
- ;; root task: we only enqueue workflows
- (with-env internal-env
- (t/log! {:level :debug :data {:sym (:sym task)}} ["Resuming task with id" (:id task)])
- ;; this span creation is required in order for
- ;; subsequent workflow traces to have a "parent" span, otherwise
- ;; they won't show up correctly in jaeger
- ;; TODO test with eg loki
- (trace-async! {:name "worker: worker-execute-fn" :attributes {:task-id (:id task)}}
- #?(:cljs (internal/resume-task internal-env store protocols task)
- :clj (otctx/bind-context! (otctx/headers->merged-context (:telemetry-context runtime))
- (internal/resume-task internal-env store protocols task)))))))
-
-(defn- worker-poll-fn
- "Continously polls for task while `task-executor` is active."
- [store protocols task-executor polling-ms]
- (let [task-counter (atom 0)
- stopped? (fn [] (terminated? task-executor))
- shutdown? (fn [] (shutting-down? task-executor))]
- #_{:clj-kondo/ignore [:loop-without-recur :invalid-arity]}
- @(p/loop []
- (-> (p/delay polling-ms)
- (p/chain (fn [_]
- (loop []
- ;(t/log! {:level :debug} ["Polling for tasks..."])
- ;; TODO add another check for shutting-down?
- (when-let [task (and
- (not (shutting-down? task-executor))
- (store/dequeue-task store))]
- (t/log! {:level :debug :_data {:task task}} ["Dequeued task with id" (:id task)])
- (try
- (submit task-executor (fn []
- (worker-execute-fn store protocols task task-counter stopped? shutdown?)))
- (catch #?(:clj Exception :cljs js/Error) e
- ;; dequeued updated the state atomically (so other txs dont do the same)
- ;; but if the executor stopped in the meantime we need to revert the task's state
- (when (error/rejected? e)
- (t/log! {:level :warn} ["Task execution rejected, reverting state to :new"])
- (store/enqueue-task store (assoc task :state :new)))))
- (when-not (stopped?)
- (recur))))
- (when-not (stopped?)
- (p/recur))))
- (p/catch (fn [e]
- (t/log! {:level :warn :data {:exception e}} ["Caught error during task polling, continuing"])
- (when-not (stopped?)
- (p/recur))))))))
-
-(defn start-poller!
- "Starts a poller that will submit tasks to the `task-executor`.
- Protocol implementations are resolved via a map of `:protocols {my.ns Impl}`
- Returns an `ITaskExecutor` that can be shutdown.
- For clj runtimes, task-executor should be `(Executors/newVirtualThreadPerTaskExecutor)`, as
- each execution will be blocked while they await for a given task dependencie's execution."
- ([store {:keys [protocols polling-ms] :or {protocols {} polling-ms 100} :as opts}]
- (start-poller! store (make-task-executor) opts))
- ([store task-executor & {:keys [protocols polling-ms] :or {protocols {} polling-ms 100}}]
- (assert (satisfies? ITaskExecutor task-executor) "Supplied task executor does not satisfy ITaskExecutor")
- ;; start poller in a out-of-executor thread so it doesnt prevent the executor from shutting down
- ;; the only way to stop the poller is via shutdown
- (p/vthread
- (worker-poll-fn store protocols task-executor polling-ms))
- task-executor))
-
-(defn start-worker!
- "Starts a single worker thread that periodically polls for tasks and executes them in a
- separate thread. Mostly used for testing purposes."
- ([store]
- (start-worker! store {}))
- ([store & {:keys [protocols polling-ms] :or {protocols {} polling-ms 100}}]
- (let [run? (atom true)
- task-counter (atom 0)]
- (internal/libthread "Worker"
- #_{:clj-kondo/ignore [:loop-without-recur :invalid-arity]}
- @(p/loop []
- (-> (p/delay polling-ms)
- (p/chain (fn [_]
- (when-let [task (store/dequeue-task store)]
- (t/log! {:level :debug :data {:sym (:sym task)}} ["Dequeued task with id" (:id task)])
- (internal/libthread (str "Worker-" (:id task))
- (worker-execute-fn store protocols task task-counter
- (fn [] (not @run?))
- (fn [] (not @run?)))))
-
- (when @run?
- (p/recur)))))))
- (fn []
- (t/log! {:level :info} ["Stopping worker"])
- (reset! run? false)))))
-
-(defn enqueue-and-wait
- "Adds the task to the internal queue, awaits for its execution.
- Task might be fulfilled by other threads"
- [{:keys [store] :as opts} task]
- (t/log! {:level :debug :data {:sym (:sym task)}} ["Enqueuing task with id" (:id task)])
- (internal/enqueue-and-wait opts task))
-
-(defn add-compensation
- "Adds a compensation action to the current workflow."
- [thunk]
- (assert (ifn? thunk) "Compensation action should implement IFn")
- (swap! (:compensations internal/*env*) conj thunk))
-
-(defn compensate
- "Runs compensation in program order. A failure of the compensation action will stop running other compensations."
- []
- (let [thunks (-> internal/*env* :compensations)]
- (trace! {:name "compensations" :attributes {:fn-count (count @thunks)}}
- (doseq [f @thunks]
- (swap! thunks pop)
- (f)))))
\ No newline at end of file
diff --git a/src2/intemporal/workflow/internal.cljc b/src2/intemporal/workflow/internal.cljc
deleted file mode 100644
index 28f8009..0000000
--- a/src2/intemporal/workflow/internal.cljc
+++ /dev/null
@@ -1,362 +0,0 @@
-(ns ^:private intemporal.workflow.internal
- "Private namespace for workflow support."
- (:require [intemporal.store :as store]
- [intemporal.error :as error]
- [promesa.core :as p]
- [taoensso.telemere :as t])
- #?(:clj (:require [steffan-westcott.clj-otel.context :as otctx]
- [steffan-westcott.clj-otel.api.trace.span :as otspan]
- [net.cgrand.macrovich :as macros]
- [intemporal.store :refer [bfn]]))
- #?(:cljs (:require-macros
- [net.cgrand.macrovich :as macros]
- [intemporal.workflow.internal :refer [trace! trace-async!]]
- [intemporal.store :refer [bfn]]))
- #?(:clj (:import [java.util.concurrent CompletableFuture])))
-
-#?(:clj (set! *warn-on-reflection* true))
-
-;;;;
-;; utils
-
-(defmacro libthread
- "Creates a thread for internal usage. Client code should not rely on this.
- Returns a promise."
- [label & body]
- `(p/vthread ~@body))
-
-;;;;
-;; runtime
-
-(def ^:dynamic *env* nil)
-(def default-env {:compensations (atom '())
- :timeout-ms (* 15 60 1000)})
-
-(defn- env->runtime
- "Derives the `runtime` attrs from the current env."
- []
- (select-keys *env* [:timeout-ms :telemetry-context]))
-
-(defn random-id
- "Generates a random id. if env var `DEV` is defined, generates a two-word human-readable id."
- []
- ;; debugging purposes only
- ;; https://github.com/moby/moby/blob/master/pkg/namesgenerator/names-generator.go
- ;; TODO use https://github.com/adzerk-oss/env ?
- (if #?(:clj (= "true" (System/getenv "DEV"))
- :cljs false)
- (let [left ["admiring" "adoring" "affectionate" "agitated" "amazing" "angry" "awesome" "beautiful" "blissful" "bold" "boring" "brave" "busy" "charming" "clever" "compassionate" "competent" "condescending" "confident" "cool" "cranky" "crazy" "dazzling" "determined" "distracted" "dreamy" "eager" "ecstatic" "elastic" "elated" "elegant" "eloquent" "epic" "exciting" "fervent" "festive" "flamboyant" "focused" "friendly" "frosty" "funny" "gallant" "gifted" "goofy" "gracious" "great" "happy" "hardcore" "heuristic" "hopeful" "hungry" "infallible" "inspiring" "intelligent" "interesting" "jolly" "jovial" "keen" "kind" "laughing" "loving" "lucid" "magical" "modest" "musing" "mystifying" "naughty" "nervous" "nice" "nifty" "nostalgic" "objective" "optimistic" "peaceful" "pedantic" "pensive" "practical" "priceless" "quirky" "quizzical" "recursing" "relaxed" "reverent" "romantic" "sad" "serene" "sharp" "silly" "sleepy" "stoic" "strange" "stupefied" "suspicious" "sweet" "tender" "thirsty" "trusting" "unruffled" "upbeat" "vibrant" "vigilant" "vigorous" "wizardly" "wonderful" "xenodochial" "youthful" "zealous" "zen"]
- right ["agnesi" "albattani" "allen" "almeida" "antonelli" "archimedes" "ardinghelli" "aryabhata" "austin" "babbage" "banach" "banzai" "bardeen" "bartik" "bassi" "beaver" "bell" "benz" "bhabha" "bhaskara" "black" "blackburn" "blackwell" "bohr" "booth" "borg" "bose" "bouman" "boyd" "brahmagupta" "brattain" "brown" "buck" "burnell" "cannon" "carson" "cartwright" "carver" "cerf" "chandrasekhar" "chaplygin" "chatelet" "chatterjee" "chaum" "chebyshev" "clarke" "cohen" "colden" "cori" "cray" "curie" "curran" "darwin" "davinci" "dewdney" "dhawan" "diffie" "dijkstra" "dirac" "driscoll" "dubinsky" "easley" "edison" "einstein" "elbakyan" "elgamal" "elion" "ellis" "engelbart" "euclid" "euler" "faraday" "feistel" "fermat" "fermi" "feynman" "franklin" "gagarin" "galileo" "galois" "ganguly" "gates" "gauss" "germain" "goldberg" "goldstine" "goldwasser" "golick" "goodall" "gould" "greider" "grothendieck" "haibt" "hamilton" "haslett" "hawking" "heisenberg" "hellman" "hermann" "herschel" "hertz" "heyrovsky" "hodgkin" "hofstadter" "hoover" "hopper" "hugle" "hypatia" "ishizaka" "jackson" "jang" "jemison" "jennings" "jepsen" "johnson" "joliot" "jones" "kalam" "kapitsa" "kare" "keldysh" "keller" "kepler" "khayyam" "khorana" "kilby" "kirch" "knuth" "kowalevski" "lalande" "lamarr" "lamport" "leakey" "leavitt" "lederberg" "lehmann" "lewin" "lichterman" "liskov" "lovelace" "lumiere" "mahavira" "margulis" "matsumoto" "maxwell" "mayer" "mccarthy" "mcclintock" "mclaren" "mclean" "mcnulty" "meitner" "mendel" "mendeleev" "meninsky" "merkle" "mestorf" "mirzakhani" "montalcini" "moore" "morse" "moser" "murdock" "napier" "nash" "neumann" "newton" "nightingale" "nobel" "noether" "northcutt" "noyce" "panini" "pare" "pascal" "pasteur" "payne" "perlman" "pike" "poincare" "poitras" "proskuriakova" "ptolemy" "raman" "ramanujan" "rhodes" "ride" "ritchie" "robinson" "roentgen" "rosalind" "rubin" "saha" "sammet" "sanderson" "satoshi" "shamir" "shannon" "shaw" "shirley" "shockley" "shtern" "sinoussi" "snyder" "solomon" "spence" "stonebraker" "sutherland" "swanson" "swartz" "swirles" "taussig" "tesla" "tharp" "thompson" "torvalds" "tu" "turing" "varahamihira" "vaughan" "villani" "visvesvaraya" "volhard" "wescoff" "wilbur" "wiles" "williams" "williamson" "wilson" "wing" "wozniak" "wright" "wu" "yalow" "yonath" "zhukovsky"]]
- (str (rand-nth left) "-" (rand-nth right)))
- (str (random-uuid))))
-
-(defmacro with-env-internal
- "Merges `m` on top of the current internal environment, then runs `body` with the new environment."
- [m & body]
- `(binding [*env* (merge default-env ~m)]
- (do ~@body)))
-
-;;;;
-;; telemetry
-
-(defn ->telemetry-context []
- #?(:clj (otctx/->headers)
- :cljs {}))
-
-(defmacro trace!
- "Wraps body in a tracing context. "
- [{:keys [name attributes] :as attrs} & body]
- (macros/case
- ;; cljs: no telemetry
- :cljs `(do ~@body)
- :clj `(let [attrs# (do ~attrs)]
- (otspan/with-span! attrs#
- (with-env-internal (merge *env* {:telemetry-context (->telemetry-context)})
- (let [res# (do ~@body)]
- res#))))))
-
-(defmacro trace-async!
- "Wraps body in a tracing context. "
- [{:keys [name attributes] :as attrs} & body]
- (macros/case
- ;; cljs: no telemetry
- :cljs `(do ~@body)
- :clj `(let [attrs# (do ~attrs)
- span# (otspan/new-span! attrs#)]
- ;(otspan/async-bound-cf-span attrs#)
- ;(with-env-internal (merge *env* {:telemetry-context (->telemetry-context)}))
- (let [res# (do ~@body)]
- (.whenComplete ^CompletableFuture res#
- (fn [t# e#]
- (when e# (otspan/add-exception! {:context span#} e#))
- (otspan/end-span! {:context span#})))))))
-
-(defn add-event!
- ([task ename attrs]
- #?(:clj (when-let [ctx (-> task :runtime :telemetry-context)]
- (otctx/with-context! (otctx/headers->merged-context ctx)
- (add-event! ename attrs)))))
- ([ename attrs]
- #?(:clj (otspan/add-event! ename attrs))))
-
-;;;;
-;; task definitions
-
-;;;;
-;; Tasks
-
-(defn create-workflow-task
- ([ref root sym fvar args id]
- (create-workflow-task ref root sym fvar args id nil :new nil))
- ([ref root sym fvar args id result state runtime]
- (let [runtime (or runtime (env->runtime))]
- {:type :workflow :id id :ref ref :root root :sym sym :fvar fvar :args args :result result :state state
- :runtime runtime})))
-
-(defn create-activity-task
- ([ref root sym fvar args id]
- (create-activity-task ref root sym fvar args id nil :new nil))
- ([ref root sym fvar args id result state runtime]
- (let [runtime (or runtime (env->runtime))]
- {:type :activity :id id :ref ref :root root :sym sym :fvar fvar :args args :result result :state state
- :runtime runtime})))
-
-(defn create-proto-activity-task
- ([proto ref root sym fvar args id]
- (create-proto-activity-task proto ref root sym fvar args id nil :new nil))
- ([proto ref root sym fvar args id result state runtime]
- (let [runtime (or runtime (env->runtime))]
- {:type :proto-activity :proto proto :id id :ref ref :root root :sym sym :fvar fvar :args args :result result :state state
- :runtime runtime})))
-
-(defn event-matches? [{t :type s :sym} {t2 :type s2 :sym}]
- (and (= t t2) (= s s2)))
-
-;;;;
-;; traced store fns
-
-(defn- all-events [store id]
- (add-event! ::store/all-events {:task-id id})
- (store/all-events store id))
-
-(defn- task<-event [store task-id event-descr]
- ;; TODO patch this to use a compare-and-swap
- ;; must send the expected state as arg
- (add-event! (:type event-descr) {:task-id task-id})
- (store/task<-event store task-id event-descr))
-
-(defn- task<-panic [store task-id error]
- (add-event! ::store/task<-panic {:task-id task-id})
- (store/task<-panic store task-id error))
-
-(defn- find-task [store task-id]
- (add-event! ::store/find-task {:task-id task-id})
- (store/find-task store task-id))
-
-(defn- enqueue-task [store task]
- (add-event! ::store/enqueue-task {:task-id (:id task)})
- (store/enqueue-task store task))
-
-(defn- await-task [store task-id opts]
- (add-event! ::store/await-task {:task-id task-id})
- (store/await-task store task-id opts))
-
-;;;;
-;; task execution/replay
-
-(defn resume-fn-task
- "Resumes a generic fn call task"
- [{:keys [vthread? terminated? shutdown?] :as env} store protos {:keys [type proto id root sym fvar args] :as task} [invoke success failure]]
- (when (and (= :proto-activity type)
- (nil? (get protos proto)))
- (throw (ex-info (str "Protocol implementation for "
- (pr-str proto)
- " not found; available protocols:"
- (pr-str protos)
- ". Make sure to pass `:protocols` key when starting poller or worker ")
- {::type :internal
- :protocols protos
- :required proto})))
- ;; do we have invocation and result events for this task?
- (t/log! {:level :debug :sym sym} ["Resuming try/catch task with id" id])
-
- (try
- (let [terminated? (fn [] (and (ifn? terminated?) (terminated?)))
- [inv? res?] (all-events store id)]
-
- ;; mark invoke/replay
- (let [next-event {:ref id :root (or root id) :type invoke :sym sym :args args}]
- (when inv?
- (t/log! {:level :debug :data {:sym (:sym task)}} ["Found replay event for task with id" (:id task)]))
- (when res?
- (t/log! {:level :debug :data {:sym (:sym task)}} ["Found result event for task with id" (:id task)]))
-
- (cond
- ;; do we have an invocation event? if not, save this one
- (not inv?)
- (task<-event store id next-event)
-
- ;; we do have an invocation event, is it a match of the above?
- (not (event-matches? inv? next-event))
- (throw (error/internal-error "Transition unexpected" {:got (:type inv?)
- :expected invoke}))))
-
- ;; mark success/failure or replay
- (let [next-event {:ref id :root (or root id) :type success :sym sym}
- next-failure (assoc next-event :type failure)
- handle-ok (bfn [r]
- ;; TODO assert r is serializable!
- ;; we check for shutdown because in js runtime, there is no thread interruption
- ;; at this point, if we are shutting down it means we exhausted the grace period
- (let [panic? (terminated?)]
- (try
- (if panic?
- (task<-panic store id (error/panic "Worker shutting down during invocation result handling"))
- (let [new-event (assoc next-event :result r)]
- #?(:clj (otspan/add-span-data! {:attributes {:replayed false :result r}}))
- (task<-event store id new-event)
- r))
- (finally
- (if panic?
- (t/log! {:level :debug :data {:sym sym :result r}} ["Shutting down, interrupted result" id])
- (t/log! {:level :debug :data {:sym sym :result r}} ["Got actual function result for task" id]))))))
- handle-fail (bfn [e]
- (cond
- ;; if its a java.lang.InterruptedException it means
- ;; we killed the executor
- ;; - we must leave the task pending (assuming its idempotent)
- (error/interrupted? e)
- (t/log! {:level :debug :data {:sym sym}} ["InterruptedException caught during actual function invocation for task" id])
-
- ;; executor has terminated, it means we exhausted the graceful shutdown period
- ;; panic the task
- (terminated?)
- (do
- (t/log! {:level :warn :data {:exception e}} ["Exception caught during shutdown, panicking task"])
- (task<-panic store id (error/panic "Worker shutting down during invocation failure handling")))
-
- ;; regular task failure
- :else
- (do
- (t/log! {:level :debug :data {:sym sym :exception e}} ["Exception caught during actual function invocation for task" id])
- (task<-event store id (cond-> (assoc next-failure :error e)
- (error/internal-error? e) (assoc :type ::failure)))))
- ;; finally, return error
- (p/rejected e))
- retval (cond
- ;; are we replaying a result?
- (some? res?)
- (let [success? (contains? res? :result)
- retval (if success? (:result res?) (:error? res?))
- ;; we need to ensure replay events return the same type
- ;; as if they were called via a vthread
- wrapped (if vthread?
- (p/vthread retval)
- retval)]
- #?(:clj (otspan/add-span-data! {:attributes {:replayed true :result retval}}))
- (task<-event store id res?)
- (if success?
- (p/resolved wrapped)
- (p/rejected wrapped)))
-
- ;; no replay, lets do the actual call
- (not res?)
- (-> (let [;; if we're calling a prototype, we need to prepend the
- ;; prot impl and then its args
- args' (if (= :proto-activity type)
- (cons (get protos proto) args)
- args)
- ;; this is the result
- r (binding [*env* (merge default-env env)]
- (t/log! {:level :debug :data {:sym sym :args args'}} ["Calling actual function for task" id])
- ;; vthread calls are special because we only want to process its
- ;; result when deref is called, to ensure determinism:
- ;; - first we must save all events
- ;; - then we can process the underlying impl call
- (if vthread?
- (let [inner (p/create (fn [res rej]
- (-> (p/vthread ;TODO: user thread
- (binding [*env* (-> env
- (dissoc :vhtread?)
- (assoc :telemetry-context (->telemetry-context)))]
- ;(trace! {:id sym})
- #?(:clj (otctx/bind-context! (otctx/headers->merged-context (:telemetry-context env))
- (apply fvar args'))
- :cljs (apply fvar args'))))
- (p/then res)
- (p/catch rej))))]
- ;; in cljs we dont need delay bc its single threaded
- ;; in clj, the delayed value will be deref'd
- ;; but at this point we ensure that any other eg vthread calls have been saved in history
- (#?(:cljs do :clj delay)
- (-> inner
- (p/then handle-ok)
- (p/catch handle-fail))))
- ;; ensure handle-fail always has a chance to catch any fvar
- ;; exceptions
- (-> nil
- (p/then (fn [_] (binding [*env* env]
- ;(trace! {:id sym})
- #?(:clj (otctx/bind-context! (otctx/headers->merged-context (:telemetry-context env))
- (apply fvar args'))
- :cljs (apply fvar args')))))
- (p/then' handle-ok)
- (p/catch handle-fail))))]
- ;; r can be a value or a promise
- r))
-
- (not (or (event-matches? res? next-event) ;; replay success
- (event-matches? res? next-failure))) ;; replay failure
- (throw (error/internal-error "Transition unexpected" {:got (:type res?)
- :expected [success failure]})))]
- (t/log! {:level :debug :data {:sym sym :retval retval}} ["Finished internal execution for task" id])
- ;; if userland called a vthread, retval will be delayed
- retval))
- ;; ensure we terminate the fn call, even if the next event wouldnt be the expected type
- (catch #?(:clj Exception :cljs js/Error) e
- ;; TODO at this point we should just panic, "userland" exceptions should be handled in the handle-fail
- ;; on theory there is no other way for a user exception to bubble out
- (let [wrapped (ex-info "Internal error while resuming execution" {::type :internal} e)]
- (task<-event store id {:ref id :root (or root id) :type ::failure :sym sym :error wrapped}))
- (p/rejected e))))
-
-#?(:clj (ns-unmap *ns* 'resume-task))
-(defmulti resume-task
- "Continues a task that has been queued for execution. Replays events if they exist."
- (fn [env store protos task]
- (:type task)))
-
-(defmethod resume-task :workflow
- [env store protos {:keys [id root sym fvar args] :as task}]
- (resume-fn-task env store protos task [:intemporal.workflow/invoke :intemporal.workflow/success :intemporal.workflow/failure]))
-
-(defmethod resume-task :activity
- [env store protos {:keys [id root sym fvar args] :as task}]
- (resume-fn-task env store protos task [:intemporal.activity/invoke :intemporal.activity/success :intemporal.activity/failure]))
-
-(defmethod resume-task :proto-activity
- [env store protos {:keys [id root sym fvar args] :as task}]
- (resume-fn-task env store protos task [:intemporal.protocol/invoke :intemporal.protocol/success :intemporal.protocol/failure]))
-
-(defn enqueue-and-wait
- "Enqueues `task` onto the store and awaits its execution.
- If the exact task is alread present (eg we are resuming a crashed workflow),
- the existing task will be awaited instead."
- [{:keys [store] :as opts} {:keys [id] :as task}]
- ;; because execution engine is supposed to be deterministic,
- ;; we can safely assume that if an identic task exists at this point
- ;; we are replaying some events
- (assert (some? store) "Store should exist")
- (assert (some? task) "Task should exist")
-
- ;; TODO trace if we pick the task from the db?
- ;; the db task should have a telemetry context already no?
- ;; (trace! {:name (format "workflow: %s" orig#) :attributes {:task-id id#}}
- (let [db-task (or (find-task store id)
- (enqueue-task store task))
-
- _ (add-event! :intemporal.workflow.internal.enqueue-and-wait/db-task {})
- prom (await-task store (:id db-task) opts)]
-
- #?(:clj (deref prom)
- :cljs prom)))
diff --git a/test/intemporal/jepsen/README.md b/test/intemporal/jepsen/README.md
new file mode 100644
index 0000000..8bb1f16
--- /dev/null
+++ b/test/intemporal/jepsen/README.md
@@ -0,0 +1,199 @@
+# Jepsen Chaos Test for `intemporal`
+
+## Context
+
+**What this tests.** `intemporal` is a Clojure workflow engine inspired by Temporal/Cadence.
+This chaos harness runs multiple worker JVMs against a shared Postgres store, injects
+SIGKILL/SIGTERM faults, and checks four correctness invariants after a quiesce period.
+
+**Why a chaos test.** The library's event-sourcing design gives strong single-process
+resilience, but several structural bugs make it unsafe under multi-process deployment
+(see `improvements.md`). Existing unit tests and crash tests cover the happy path;
+this harness exercises the failure path by combining real process kills with
+concurrent access to the same Postgres schema.
+
+**Scope.** Local-only / on-demand — not in CI. Run with `clojure -X:dev:jdbc:jepsen`.
+Each "node" is a forked JVM, not a Docker/SSH container. We use our own orchestrator
+rather than the `jepsen/jepsen` library (same rationale as the ablauf Jepsen tests:
+the safety properties are DB-mediated, not OS-mediated).
+
+---
+
+## Bugs under test
+
+| Bug | improvements.md ref | Description |
+|-----|---------------------|-------------|
+| 1.1 | §1.1 | No wake mechanism survives pod restart. Signal callbacks live in a process-local atom; a dead worker's callbacks are gone forever. |
+| 1.2 | §1.2 | No ownership / silent concurrent execution. `ON CONFLICT DO UPDATE` masks concurrent writes to `intemporal_history`. |
+| 1.3 | §1.3 | No recovery poller. Restarting a worker does not resume the workflows it was running. |
+| 2.1 | §2.1 | Register-then-consume signal race. Between the consume-check and register-callback call, a concurrent sender's signal is dropped. |
+| 2.3 | §2.3 | Cancellation cannot reach a sleeping workflow. The cancelled flag is set but never observed by a workflow blocked in `wait-for-signal`. |
+
+---
+
+## Architecture
+
+```
+ ┌─────────────────────────────────────┐
+ │ runner.clj (host JVM) │
+ │ - generator (submit/cancel/signal) │
+ │ - nemesis (kill/restart/signal) │
+ │ - checker (4 invariants) │
+ └──────────┬──────────────────────────┘
+ │ writes jepsen_work_queue
+ │ reads intemporal_* tables
+ ┌───────────────┼─────────────────────┐
+ │ │ │
+ ┌───▼───┐ ┌───▼───┐ ... ┌───▼───┐
+ │worker0│ │worker1│ │workerN│
+ │JVM │ │JVM │ │JVM │
+ └───┬───┘ └───┬───┘ └───┬───┘
+ └───────────────┴────────────────────┘
+ ▼
+ Postgres (docker or local)
+ ┌──────────────────────────────┐
+ │ intemporal_workflows │
+ │ intemporal_history │
+ │ intemporal_signals │
+ │ jepsen_work_queue │ ← test coordination
+ │ jepsen_invocations │ ← side-channel
+ │ jepsen_signals_sent │
+ │ jepsen_cancels_sent │
+ └──────────────────────────────┘
+```
+
+**Process model.** Each worker is a JVM forked by `ProcessBuilder` from the runner.
+`destroyForcibly()` (SIGKILL) skips the JVM shutdown hook, destroying the
+process-local `callbacks` atom in `JdbcStore` — reproducing bug 1.1.
+
+---
+
+## Files
+
+| File | Role |
+|------|------|
+| [runner.clj](runner.clj) | Orchestrator: phases 1–5, entry point |
+| [worker.clj](worker.clj) | Forked-JVM entry: engine, work-queue poll loop |
+| [db.clj](db.clj) | Subprocess registry: `fork!`, `kill!`, `alive?`, schema setup |
+| [client.clj](client.clj) | Test operations: submit, signal, cancel, observe, concurrent-start |
+| [nemesis.clj](nemesis.clj) | Fault injector: kill/restart workers, signal dead workflows |
+| [checker.clj](checker.clj) | Post-quiesce invariants (4 checkers) |
+| [workflows.clj](workflows.clj) | Workflow shapes W1–W4, side-channel activity |
+| [test/resources/migrations/jepsen/postgres/](../../../../resources/migrations/jepsen/postgres/) | Side-channel table migrations |
+
+---
+
+## Workflow shapes
+
+| Shape | Type | Bug probed |
+|-------|------|------------|
+| W1 `signal-wait-workflow` | Records `:before`, waits on signal `"go"`, records `:after` | **1.1** lost wake |
+| W2 `activity-chain-workflow` | Runs N activities in sequence | **1.3** no recovery poller |
+| W3 `cancel-sleep-workflow` | Records `:started`, waits on `"wake"` forever | **2.3** cancel can't reach sleeper |
+| W4 `rapid-signal-workflow` | Suspends immediately on `"immediate"` | **2.1** signal race |
+
+---
+
+## Checkers
+
+All checkers run after the quiesce phase. Each returns `{:valid? bool :violations [...] :stats {...}}`.
+
+**1. Liveness** (bugs 1.1, 1.3)
+Every submitted workflow must be in a terminal state (`workflow-completed`, `workflow-failed`).
+Workflows stuck as `:running` after quiesce + grace are violations.
+
+**2. Signal consumed** (bug 2.1)
+Every signal row written to `intemporal_signals` by the test must eventually be consumed.
+Orphaned rows after quiesce flag either the lost-callback (1.1) or the register-then-consume
+race (2.1) — the distinction is visible in the nemesis history (was the worker alive?).
+
+**3. History integrity** (bug 1.2)
+For workflows started via the `concurrent-start` op, `intemporal_history` must contain
+`seq=0` with `event_type = 'workflow-started'` only. If a concurrent writer's
+`ON CONFLICT DO UPDATE` clobbered it with a different event type, the violation is recorded.
+
+**4. Cancellation liveness** (bug 2.3)
+Workflows with `cancelled = TRUE` in `intemporal_workflows` must have a terminal last event.
+If the workflow is still `:running` (last event not `:workflow-completed/failed/cancelled`),
+the cancel flag was never observed — the workflow is stuck sleeping.
+
+---
+
+## Expected results with the current (unfixed) codebase
+
+| Checker | Expected result | Reason |
+|---------|-----------------|--------|
+| liveness | **FAIL** | Workers crash; no auto-resume; W1/W2 workflows stuck |
+| signal-consumed | **FAIL** (intermittent) | Signals sent to dead workers land in DB; callbacks gone |
+| history-integrity | **FAIL** (if concurrent-start runs) | `DO UPDATE` silently clobbers seq=0 |
+| cancellation-liveness | **FAIL** | `cancel-workflow` sets flag but never wakes sleeper |
+
+After the Phase A + B + C fixes from `improvements.md`, all four should **PASS**.
+
+**Smoke checks:**
+
+- Run with `:no-kill true` — all checkers should pass (no chaos, happy path).
+- Run normally — checkers should fail as documented above.
+- After implementing A1 (signal race fix) — checker 2 should pass.
+- After implementing A2 (cancellation wake) — checker 4 should pass.
+- After implementing Phase C (distributed primitives) — checkers 1 and 2 should pass.
+
+---
+
+## Side-channel
+
+`jepsen_invocations` records every activity invocation with `:begin`/`:end`/`:fail` phases
+using a **separate auto-commit Hikari pool** (`*side-ds*`), so rows survive a SIGKILL.
+`jepsen_signals_sent` and `jepsen_cancels_sent` track what the test issued, enabling the
+checker to cross-reference intent vs. outcome.
+
+Dynamic vars (`*side-ds*`, `*test-run*`, `*owner*`) are bound in `worker.clj` before
+calling `start-workflow` or `resume-workflow`, so activities can write to the side-channel
+without the workflow function carrying a non-serialisable connection reference.
+
+---
+
+## Running
+
+```bash
+# 1. Start Postgres (if not already running)
+docker run -d --name intemporal-pg \
+ -e POSTGRES_USER=root -e POSTGRES_PASSWORD=root \
+ -p 5432:5432 postgres:16
+
+# 2. Run the chaos test (4 workers, 120s active, 90s grace)
+clojure -X:dev:jdbc:jepsen intemporal.jepsen.runner/run \
+ :workers 4 :duration 120
+
+# 3. No-kill baseline (should pass all checkers)
+clojure -X:dev:jdbc:jepsen intemporal.jepsen.runner/run \
+ :workers 4 :duration 60 :no-kill true
+
+# 4. Aggressive run (more workers, faster kills)
+clojure -X:dev:jdbc:jepsen intemporal.jepsen.runner/run \
+ :workers 6 :duration 180 :nemesis-min-ms 1500 :nemesis-jitter-ms 3000 \
+ :min-alive 1 :grace-s 120
+```
+
+The JDBC URL defaults to `POSTGRES_JDBC_URI` env var or `localhost:5432/root`.
+
+---
+
+## Risks / limitations
+
+1. **No jepsen/jepsen library.** We implement our own orchestrator (same approach as
+ the ablauf Jepsen tests). The history format is compatible with jepsen.history for
+ future migration. Adding sshd/containers would mostly be a `db.clj` swap.
+
+2. **Worker classpath boot time.** Each `clojure -X:...` invocation takes 10–30s to
+ compile on first run due to AOT. Subsequent runs are faster if the dep cache is warm.
+ Increase `boot-timeout-ms` in `db/fork!` if workers time out during setup.
+
+3. **Bug 1.2 detection is approximate.** We inject a sentinel `event_type` from the
+ second concurrent writer. The real damage (last-writer-wins on seq=0) is masked by
+ `DO UPDATE` — a production incident would manifest as non-deterministic replay, not
+ a visible row. The checker catches the sentinel as a proxy for the real corruption.
+
+4. **Bug 2.1 is intermittent.** The register-then-consume race requires precise timing.
+ The `rapid-signal` workflow + 50ms signal loop creates high contention, but the race
+ window is narrow. Run multiple times or increase `:submit-rps` to improve hit rate.
diff --git a/test/intemporal/jepsen/checker.clj b/test/intemporal/jepsen/checker.clj
new file mode 100644
index 0000000..c066828
--- /dev/null
+++ b/test/intemporal/jepsen/checker.clj
@@ -0,0 +1,244 @@
+(ns intemporal.jepsen.checker
+ "Post-quiesce invariant checkers. Each fn returns
+ {:valid? bool :violations [...] :stats {...}}.
+
+ All four checkers operate on DB state after the quiesce phase: the generator
+ has stopped, the nemesis is paused, all workers have been restarted (so each
+ one's startup ran), and a grace period has elapsed.
+
+ Checkers are mapped to specific bugs in improvements.md:
+
+ 1. liveness — bugs 1.1, 1.3: workflows never complete after
+ the owning worker crashes
+ 2. signal-consumed — bug 2.1: register-then-consume race leaves
+ orphaned signal rows
+ 3. history-integrity — bug 1.2: concurrent writers corrupt event log
+ via ON CONFLICT DO UPDATE
+ 4. cancellation-liveness — bug 2.3: cancel-workflow can't wake a sleeper
+
+ Expected post-quiesce state for the CURRENT (buggy) codebase:
+ checker 1 (liveness) -> FAIL (workflows stuck without resume)
+ checker 2 (signal-consumed) -> FAIL (if race is hit; intermittent)
+ checker 3 (history-integrity) -> FAIL (if concurrent-start ran)
+ checker 4 (cancellation-liveness) -> FAIL (cancelled sleepers never wake)"
+ (:require [next.jdbc :as jdbc]
+ [next.jdbc.result-set :as rs]
+ [clojure.string :as str]
+ [taoensso.telemere :as log]))
+
+(def ^:private jdbc-opts {:builder-fn rs/as-unqualified-maps})
+
+;; ---------------------------------------------------------------------------
+;; Helper: submitted workflow-ids from history
+
+(defn- submitted-ids
+ "Set of workflow-ids that the generator successfully submitted."
+ [history]
+ (->> @history
+ (filter #(and (= :submit (:f %)) (= :ok (:type %))))
+ (keep #(get-in % [:value :workflow-id]))
+ set))
+
+(defn- cancelled-ids
+ "Set of workflow-ids for which cancel ops succeeded."
+ [history]
+ (->> @history
+ (filter #(and (= :cancel (:f %)) (= :ok (:type %))))
+ (keep #(get-in % [:value :workflow-id]))
+ set))
+
+(defn- concurrent-start-ids
+ "Set of workflow-ids from concurrent-start ops."
+ [history]
+ (->> @history
+ (filter #(and (= :concurrent-start (:f %)) (= :ok (:type %))))
+ (keep #(get-in % [:value :workflow-id]))
+ set))
+
+;; ---------------------------------------------------------------------------
+;; Checker 1: Liveness (bugs 1.1, 1.3)
+;;
+;; Every submitted workflow must reach a terminal state (:completed, :failed,
+;; :cancelled). Workflows stuck in :running after the quiesce + grace period
+;; mean that no worker auto-resumed them after its crash.
+
+(defn liveness-checker
+ "1. Every submitted workflow is in a terminal state after quiesce."
+ [db-spec history]
+ (let [ids (submitted-ids history)]
+ (if (empty? ids)
+ {:valid? true :violations [] :stats {:submitted 0}}
+ (let [in-clause (str/join "," (repeat (count ids) "?"))
+ stuck (jdbc/execute! db-spec
+ (into [(str "SELECT w.id,
+ w.cancelled,
+ h.event_type AS last_event
+ FROM intemporal_workflows w
+ LEFT JOIN LATERAL (
+ SELECT event_type
+ FROM intemporal_history
+ WHERE workflow_id = w.id
+ ORDER BY id DESC LIMIT 1
+ ) h ON TRUE
+ WHERE w.id IN (" in-clause ")
+ AND w.cancelled = FALSE
+ AND (h.event_type IS NULL
+ OR h.event_type NOT IN
+ ('workflow-completed','workflow-failed','workflow-cancelled'))")]
+ ids)
+ jdbc-opts)]
+ {:valid? (empty? stuck)
+ :violations (vec stuck)
+ :stats {:submitted (count ids)
+ :stuck (count stuck)}}))))
+
+;; ---------------------------------------------------------------------------
+;; Checker 2: Signal consumed (bug 2.1)
+;;
+;; Every signal the test client wrote (via jepsen_signals_sent or the nemesis's
+;; signal-dead-workflows!) should eventually be consumed by the workflow. An
+;; unconsumed row in intemporal_signals after quiesce + grace either means:
+;; a) the owning worker died and its callback atom was empty (bug 1.1), or
+;; b) the signal arrived between consume-check and register-callback (bug 2.1).
+;;
+;; This checker flags both; the distinction is visible in the nemesis history
+;; (was the worker alive when the signal was sent?).
+
+(defn signal-consumed-checker
+ "2. No orphaned signal rows remain after quiesce."
+ [db-spec test-run]
+ (let [orphans (jdbc/execute! db-spec
+ ["SELECT s.workflow_id, s.signal_name
+ FROM intemporal_signals s
+ JOIN jepsen_signals_sent ss
+ ON ss.workflow_id = s.workflow_id
+ AND ss.signal_name = s.signal_name
+ WHERE ss.test_run = ?"
+ test-run]
+ jdbc-opts)
+ total-sent (or (:c (jdbc/execute-one! db-spec
+ ["SELECT COUNT(*) AS c FROM jepsen_signals_sent WHERE test_run = ?"
+ test-run]
+ jdbc-opts))
+ 0)]
+ {:valid? (empty? orphans)
+ :violations (vec orphans)
+ :stats {:signals-sent total-sent
+ :orphaned-signals (count orphans)}}))
+
+;; ---------------------------------------------------------------------------
+;; Checker 3: History integrity (bug 1.2)
+;;
+;; Concurrent calls to start-workflow with the same workflow-id use
+;; ON CONFLICT (workflow_id, seq) DO UPDATE, silently overwriting events.
+;; Symptoms:
+;; a) Multiple :workflow-started events at seq=0 (last writer wins silently).
+;; b) Two workers produce different event_type at the same seq — detected by
+;; comparing event_type vs the "canonical" value stored in the first write.
+;;
+;; We detect this by looking for workflows where seq 0 has a non-canonical
+;; event type, or where the history contains duplicate seq numbers that were
+;; overwritten (the DO UPDATE mask hides them, but if two writers raced and
+;; produced DIFFERENT event_types at the same seq, one version is lost).
+;;
+;; We approximate: for any workflow that had a concurrent-start op, check
+;; whether intemporal_history has a :workflow-started at seq=0. If the
+;; second writer overwrote seq=0 with a different event_type (our sentinel
+;; "workflow-started-duplicate"), that row proves a race.
+
+(defn history-integrity-checker
+ "3. No concurrent-write corruption in intemporal_history."
+ [db-spec history]
+ (let [cs-ids (concurrent-start-ids history)]
+ (if (empty? cs-ids)
+ {:valid? true :violations [] :stats {:concurrent-start-workflows 0}}
+ (let [in-clause (str/join "," (repeat (count cs-ids) "?"))
+ ;; Look for evidence of the silent overwrite: seq=0 with the
+ ;; sentinel event_type means the second writer clobbered the first.
+ corrupted (jdbc/execute! db-spec
+ (into [(str "SELECT workflow_id, event_type
+ FROM intemporal_history
+ WHERE workflow_id IN (" in-clause ")
+ AND seq = 0
+ AND event_type = 'workflow-started-duplicate'")]
+ cs-ids)
+ jdbc-opts)
+ ;; Also look for seq=0 that is NOT workflow-started (any other
+ ;; winner in the race is also corruption).
+ unexpected (jdbc/execute! db-spec
+ (into [(str "SELECT workflow_id, event_type
+ FROM intemporal_history
+ WHERE workflow_id IN (" in-clause ")
+ AND seq = 0
+ AND event_type <> 'workflow-started'")]
+ cs-ids)
+ jdbc-opts)]
+ {:valid? (and (empty? corrupted) (empty? unexpected))
+ :violations {:overwritten-by-duplicate (vec corrupted)
+ :unexpected-seq0 (vec unexpected)}
+ :stats {:concurrent-start-workflows (count cs-ids)
+ :corrupted (+ (count corrupted) (count unexpected))}}))))
+
+;; ---------------------------------------------------------------------------
+;; Checker 4: Cancellation liveness (bug 2.3)
+;;
+;; After cancel-workflow is called on a workflow that is blocked in
+;; wait-for-signal, the cancelled flag is set in intemporal_workflows but the
+;; workflow never observes it (no re-entry to the execution loop). The checker
+;; looks for workflows where:
+;; - cancelled = TRUE in intemporal_workflows
+;; - The last history event is NOT workflow-completed / workflow-failed /
+;; workflow-cancelled
+;; These are workflows that are "cancelled on paper" but still stuck.
+
+(defn cancellation-liveness-checker
+ "4. All cancelled workflows have reached a terminal state."
+ [db-spec history]
+ (let [c-ids (cancelled-ids history)]
+ (if (empty? c-ids)
+ {:valid? true :violations [] :stats {:cancelled-submitted 0}}
+ (let [in-clause (str/join "," (repeat (count c-ids) "?"))
+ stuck (jdbc/execute! db-spec
+ (into [(str "SELECT w.id,
+ h.event_type AS last_event
+ FROM intemporal_workflows w
+ LEFT JOIN LATERAL (
+ SELECT event_type
+ FROM intemporal_history
+ WHERE workflow_id = w.id
+ ORDER BY id DESC LIMIT 1
+ ) h ON TRUE
+ WHERE w.id IN (" in-clause ")
+ AND w.cancelled = TRUE
+ AND (h.event_type IS NULL
+ OR h.event_type NOT IN
+ ('workflow-completed',
+ 'workflow-failed',
+ 'workflow-cancelled'))")]
+ c-ids)
+ jdbc-opts)]
+ {:valid? (empty? stuck)
+ :violations (vec stuck)
+ :stats {:cancelled-submitted (count c-ids)
+ :stuck (count stuck)}}))))
+
+;; ---------------------------------------------------------------------------
+;; Compose
+
+(defn check-all
+ "Runs all four checkers and returns a composed result."
+ [{:keys [db-spec history test-run]}]
+ (log/log! :info "[checker] running post-quiesce invariants")
+ (let [c1 (liveness-checker db-spec history)
+ c2 (signal-consumed-checker db-spec test-run)
+ c3 (history-integrity-checker db-spec history)
+ c4 (cancellation-liveness-checker db-spec history)
+ valid? (every? :valid? [c1 c2 c3 c4])]
+ (when-not valid?
+ (log/log! :warn "[checker] INVARIANT VIOLATION(S) DETECTED"))
+ {:valid? valid?
+ :checkers
+ {:liveness c1
+ :signal-consumed c2
+ :history-integrity c3
+ :cancellation-liveness c4}}))
diff --git a/test/intemporal/jepsen/client.clj b/test/intemporal/jepsen/client.clj
new file mode 100644
index 0000000..a3c119c
--- /dev/null
+++ b/test/intemporal/jepsen/client.clj
@@ -0,0 +1,164 @@
+(ns intemporal.jepsen.client
+ "Operations issued by the test orchestrator against the shared store.
+
+ All operations talk directly to Postgres via next.jdbc (no HTTP layer).
+ Workers pick up submitted workflows by polling jepsen_work_queue.
+
+ Op types:
+ :submit — inserts a workflow spec into jepsen_work_queue
+ :signal — calls add-signal directly on the JDBC store
+ :cancel — calls mark-cancelled directly on the JDBC store
+ :observe — reads workflow status from intemporal_workflows + history
+ :concurrent-start — inserts the same workflow-id twice (different wf types
+ accepted by different workers) to trigger bug 1.2
+
+ History entries are plain EDN maps compatible with jepsen.history format:
+ {:process :type (:ok|:fail|:info) :f :value {...} :time }"
+ (:require [next.jdbc :as jdbc]))
+
+(defn now-ms [] (System/currentTimeMillis))
+
+(defn record-op!
+ "Appends an op to the atom-wrapped history vector."
+ [history op]
+ (swap! history conj (assoc op :time (now-ms)))
+ op)
+
+;; ---------------------------------------------------------------------------
+;; Helpers
+
+(defn- wf-status
+ "Reads workflow status directly from the DB without going through the store
+ object (avoids creating a new engine just to read status)."
+ [db-spec workflow-id]
+ (let [wf (jdbc/execute-one! db-spec
+ ["SELECT cancelled FROM intemporal_workflows WHERE id = ?"
+ workflow-id])
+ last-evt (jdbc/execute-one! db-spec
+ ["SELECT event_type FROM intemporal_history
+ WHERE workflow_id = ?
+ ORDER BY id DESC LIMIT 1"
+ workflow-id])]
+ (cond
+ (nil? wf) :not-found
+ (:intemporal_workflows/cancelled wf) :cancelled
+ (nil? last-evt) :not-found
+ (= "workflow-completed"
+ (:intemporal_history/event_type last-evt)) :completed
+ (= "workflow-failed"
+ (:intemporal_history/event_type last-evt)) :failed
+ (= "workflow-cancelled"
+ (:intemporal_history/event_type last-evt)) :cancelled
+ :else :running)))
+
+;; ---------------------------------------------------------------------------
+;; Client operations
+
+(defn invoke-submit
+ "Picks a random workflow type, inserts it into jepsen_work_queue, and
+ returns {:type :ok :value {:workflow-id ... :wf-type ...}}."
+ [db-spec test-run]
+ (let [wf-type (rand-nth [:signal-wait :activity-chain :cancel-sleep :rapid-signal])
+ wf-id (str (random-uuid))
+ nonce (str (random-uuid))
+ steps-arg (when (= wf-type :activity-chain) {:steps 5})]
+ (try
+ (jdbc/execute! db-spec
+ ["INSERT INTO jepsen_work_queue
+ (test_run, workflow_id, wf_type, nonce, args)
+ VALUES (?,?,?,?,?::jsonb)"
+ test-run wf-id (name wf-type) nonce
+ (if steps-arg (pr-str steps-arg) "{}")])
+ {:type :ok :value {:workflow-id wf-id :wf-type wf-type :nonce nonce}}
+ (catch Throwable t
+ {:type :fail :error (str t)}))))
+
+(defn invoke-signal
+ "Writes a signal directly to the store. Does NOT go through a worker — this
+ models a separate process (e.g. an HTTP endpoint) calling send-signal.
+
+ When the owning worker is alive, its callback atom fires and the workflow
+ wakes. When the worker is dead, the signal row persists in intemporal_signals
+ but no callback fires (bug 1.1)."
+ [db-spec test-run workflow-id signal-name]
+ (try
+ (jdbc/execute! db-spec
+ ["INSERT INTO intemporal_signals (workflow_id, signal_name, payload)
+ VALUES (?,?,'{}'::jsonb)"
+ workflow-id signal-name])
+ (jdbc/execute! db-spec
+ ["INSERT INTO jepsen_signals_sent (test_run, workflow_id, signal_name)
+ VALUES (?,?,?)"
+ test-run workflow-id signal-name])
+ {:type :ok :value {:workflow-id workflow-id :signal signal-name}}
+ (catch Throwable t
+ {:type :fail :error (str t)})))
+
+(defn invoke-cancel
+ "Sets the cancelled flag on the workflow. If the workflow is sleeping on
+ wait-for-signal the flag will be set but the workflow will never observe it
+ (bug 2.3)."
+ [db-spec test-run workflow-id]
+ (try
+ (jdbc/execute! db-spec
+ ["INSERT INTO intemporal_workflows (id, cancelled) VALUES (?,TRUE)
+ ON CONFLICT (id) DO UPDATE SET cancelled = TRUE"
+ workflow-id])
+ (jdbc/execute! db-spec
+ ["INSERT INTO jepsen_cancels_sent (test_run, workflow_id) VALUES (?,?)"
+ test-run workflow-id])
+ {:type :ok :value {:workflow-id workflow-id}}
+ (catch Throwable t
+ {:type :fail :error (str t)})))
+
+(defn invoke-observe
+ "Reads the workflow status for reporting in the history."
+ [db-spec workflow-id]
+ (try
+ (let [status (wf-status db-spec workflow-id)]
+ {:type :ok :value {:workflow-id workflow-id :status status}})
+ (catch Throwable t
+ {:type :fail :error (str t)})))
+
+(defn invoke-concurrent-start
+ "Inserts the same workflow-id into the queue TWICE so that two workers race
+ to run it concurrently. The UNIQUE constraint on workflow_id in the queue
+ prevents a second claim via the normal path, so we bypass the queue and
+ directly write to intemporal_history from two threads to reproduce bug 1.2.
+
+ Returns a map of {:workflow-id ... :threads-launched 2}."
+ [db-spec test-run]
+ (let [wf-id (str (random-uuid))
+ nonce (str (random-uuid))
+ result (promise)
+ write! (fn [seq-num event-type]
+ (try
+ (jdbc/with-transaction [tx db-spec]
+ (jdbc/execute! tx
+ ["INSERT INTO intemporal_workflows (id) VALUES (?)
+ ON CONFLICT (id) DO NOTHING"
+ wf-id])
+ (jdbc/execute! tx
+ ["INSERT INTO intemporal_history
+ (workflow_id, seq, event_type, data)
+ VALUES (?,?,?,'{}'::jsonb)
+ ON CONFLICT (workflow_id, seq) DO UPDATE
+ SET event_type = EXCLUDED.event_type,
+ data = EXCLUDED.data"
+ wf-id seq-num event-type]))
+ :ok
+ (catch Throwable t (str "error: " t))))
+ ;; Fire two threads simultaneously.
+ t1 (Thread/startVirtualThread
+ (fn [] (deliver result (write! 0 "workflow-started"))))
+ t2 (Thread/startVirtualThread
+ (fn [] (write! 0 "workflow-started-duplicate")))]
+ (.join ^Thread t1 5000)
+ (.join ^Thread t2 5000)
+ (jdbc/execute! db-spec
+ ["INSERT INTO jepsen_work_queue
+ (test_run, workflow_id, wf_type, nonce, args, completed)
+ VALUES (?,?,?,?,'{}'::jsonb, TRUE)"
+ test-run wf-id "concurrent-start" nonce])
+ {:type :ok
+ :value {:workflow-id wf-id :nonce nonce :threads-launched 2}}))
diff --git a/test/intemporal/jepsen/db.clj b/test/intemporal/jepsen/db.clj
new file mode 100644
index 0000000..07d3ce7
--- /dev/null
+++ b/test/intemporal/jepsen/db.clj
@@ -0,0 +1,130 @@
+(ns intemporal.jepsen.db
+ "Subprocess lifecycle for forked worker JVMs.
+
+ Each 'node' (owner-id) maps to a forked Process whose classpath is set by
+ `-X:dev:jdbc:jepsen-worker`. We use ProcessBuilder for real SIGKILL
+ semantics: destroyForcibly() skips the JVM shutdown hook, exactly modelling
+ a hard crash. This destroys the process-local signal-callback atom,
+ reproducing bug 1.1.
+
+ Process model deviation: we don't use SSH/sshd containers (local-only).
+ The jepsen library is not required here; we implement our own lightweight
+ orchestrator."
+ (:require [clojure.java.io :as io]
+ [next.jdbc :as jdbc]
+ [migratus.core :as migratus]
+ [taoensso.telemere :as log])
+ (:import [java.io BufferedReader InputStreamReader]
+ [java.util.concurrent TimeUnit]))
+
+(def ^:private registry (atom {}))
+
+;; ---------------------------------------------------------------------------
+;; I/O pumps
+
+(defn- pump-stdout
+ "Forwards child stdout to logger line-by-line. Delivers :ready on the
+ ready-promise the first time 'READY ' appears."
+ [^Process p owner ready-prom]
+ (Thread/startVirtualThread
+ (fn []
+ (with-open [r (BufferedReader. (InputStreamReader. (.getInputStream p)))]
+ (loop []
+ (when-let [line (.readLine r)]
+ (log/log! :info (str "[worker:" owner "] " line))
+ (when (and (not (realized? ready-prom))
+ (.startsWith ^String line (str "READY " owner)))
+ (deliver ready-prom :ready))
+ (recur)))))))
+
+(defn- pump-stderr [^Process p owner]
+ (Thread/startVirtualThread
+ (fn []
+ (with-open [r (BufferedReader. (InputStreamReader. (.getErrorStream p)))]
+ (loop []
+ (when-let [line (.readLine r)]
+ (log/log! :warn (str "[worker:" owner "/err] " line))
+ (recur)))))))
+
+;; ---------------------------------------------------------------------------
+;; Lifecycle
+
+(defn alive?
+ "True iff a worker process is registered and alive."
+ [owner]
+ (boolean (some-> @registry (get owner) :process (.isAlive))))
+
+(defn fork!
+ "Forks a worker JVM via `clojure -X:dev:jdbc:jepsen-worker`.
+ Blocks up to boot-timeout-ms waiting for the READY handshake on stdout."
+ [{:keys [owner db-url test-run boot-timeout-ms repo-root]
+ :or {boot-timeout-ms 90000 repo-root "."}}]
+ (when (alive? owner)
+ (throw (ex-info "Worker already alive for this owner" {:owner owner})))
+ (let [args ["clojure" "-X:dev:jdbc:jepsen-worker"
+ "intemporal.jepsen.worker/run"
+ ":owner" (pr-str owner)
+ ":db-url" (pr-str db-url)
+ ":test-run" (pr-str test-run)]
+ pb (doto (ProcessBuilder. ^java.util.List args)
+ (.directory (io/file repo-root))
+ (.redirectErrorStream false))
+ proc (.start pb)
+ ready (promise)]
+ (pump-stdout proc owner ready)
+ (pump-stderr proc owner)
+ (let [v (deref ready boot-timeout-ms ::timeout)]
+ (when (= v ::timeout)
+ (.destroyForcibly proc)
+ (throw (ex-info "Worker boot timed out"
+ {:owner owner :timeout-ms boot-timeout-ms}))))
+ (let [entry {:process proc :owner owner}]
+ (swap! registry assoc owner entry)
+ (log/log! :info (str "Forked worker " owner " pid=" (.pid proc)))
+ entry)))
+
+(defn kill!
+ "Sends a signal to the worker.
+ :sigkill -> destroyForcibly (no shutdown hook, models hard crash)
+ :sigterm -> destroy (shutdown hook fires, models graceful stop)"
+ [owner signal]
+ (when-let [{:keys [^Process process]} (get @registry owner)]
+ (case signal
+ :sigterm (.destroy process)
+ :sigkill (.destroyForcibly process))
+ (.waitFor process 30 TimeUnit/SECONDS)
+ (swap! registry dissoc owner)
+ (log/log! :info (str "Killed worker " owner " with " (name signal)
+ " exit=" (try (.exitValue process) (catch Exception _ "?"))))))
+
+(defn kill-all! []
+ (doseq [owner (keys @registry)]
+ (try (kill! owner :sigkill)
+ (catch Throwable t
+ (log/log! :warn (str "kill-all failed for " owner ": " t))))))
+
+;; ---------------------------------------------------------------------------
+;; Schema setup
+
+(defn migrate-all!
+ "Runs intemporal migrations and Jepsen side-channel migrations against
+ the given db-spec."
+ [db-spec]
+ (doseq [[dir table] [["migrations/postgres" "migrations"]
+ ["migrations/jepsen/postgres" "jepsen_migrations"]]]
+ (migratus/migrate {:store :database
+ :migration-dir dir
+ :migration-table-name table
+ :db db-spec})))
+
+(defn truncate-all!
+ "Clears all intemporal and Jepsen tables between runs."
+ [db-spec]
+ (doseq [table ["jepsen_cancels_sent"
+ "jepsen_signals_sent"
+ "jepsen_invocations"
+ "jepsen_work_queue"
+ "intemporal_signals"
+ "intemporal_history"
+ "intemporal_workflows"]]
+ (jdbc/execute! db-spec [(str "DELETE FROM " table)])))
diff --git a/test/intemporal/jepsen/nemesis.clj b/test/intemporal/jepsen/nemesis.clj
new file mode 100644
index 0000000..94295c6
--- /dev/null
+++ b/test/intemporal/jepsen/nemesis.clj
@@ -0,0 +1,139 @@
+(ns intemporal.jepsen.nemesis
+ "Fault injector. Periodically picks a random worker and kills it (SIGKILL or
+ SIGTERM), then later restarts it. Maintains a min-alive floor so at least N
+ workers can make progress.
+
+ Also provides 'signal-dead-workers': sends signals to all W1/W3 workflows
+ whose owner is currently dead. This is the primary way to trigger bug 1.1:
+ the signal row lands in intemporal_signals but no callback fires because the
+ owning worker is gone."
+ (:require [intemporal.jepsen.db :as db]
+ [next.jdbc :as jdbc]
+ [clojure.string :as str]
+ [taoensso.telemere :as log]))
+
+(defn- pick-victim
+ "Returns a random alive owner-id to kill, respecting min-alive."
+ [owners min-alive]
+ (let [alive (filter db/alive? owners)]
+ (when (> (count alive) min-alive)
+ (rand-nth alive))))
+
+(defn- pick-dead
+ "Returns a random dead owner-id to revive."
+ [owners]
+ (let [dead (remove db/alive? owners)]
+ (when (seq dead) (rand-nth dead))))
+
+;; ---------------------------------------------------------------------------
+;; Per-tick fault
+
+(defn step!
+ "One nemesis tick. Chooses an action:
+ 50% SIGKILL a random alive worker
+ 25% SIGTERM a random alive worker
+ 25% Start a random dead worker
+
+ Records the op in `history`."
+ [{:keys [owners history db-url test-run repo-root min-alive]
+ :or {min-alive 2}}]
+ (let [r (rand)]
+ (cond
+ (< r 0.50)
+ (if-let [victim (pick-victim owners min-alive)]
+ (do
+ (log/log! :info (str "[nemesis] SIGKILL " victim))
+ (db/kill! victim :sigkill)
+ (swap! history conj {:process :nemesis :type :info
+ :f :kill-9 :value victim
+ :time (System/currentTimeMillis)}))
+ (swap! history conj {:process :nemesis :type :info
+ :f :noop :value :min-alive-floor
+ :time (System/currentTimeMillis)}))
+
+ (< r 0.75)
+ (if-let [victim (pick-victim owners min-alive)]
+ (do
+ (log/log! :info (str "[nemesis] SIGTERM " victim))
+ (db/kill! victim :sigterm)
+ (swap! history conj {:process :nemesis :type :info
+ :f :kill-15 :value victim
+ :time (System/currentTimeMillis)}))
+ (swap! history conj {:process :nemesis :type :info
+ :f :noop :value :min-alive-floor
+ :time (System/currentTimeMillis)}))
+
+ :else
+ (if-let [revive (pick-dead owners)]
+ (do
+ (log/log! :info (str "[nemesis] restart " revive))
+ (db/fork! {:owner revive :db-url db-url :test-run test-run
+ :repo-root repo-root})
+ (swap! history conj {:process :nemesis :type :info
+ :f :start :value revive
+ :time (System/currentTimeMillis)}))
+ (swap! history conj {:process :nemesis :type :info
+ :f :noop :value :all-alive
+ :time (System/currentTimeMillis)})))))
+
+;; ---------------------------------------------------------------------------
+;; Signal-while-dead: exercises bug 1.1.
+
+(defn signal-dead-workflows!
+ "Finds W1 (signal-wait) and W3 (cancel-sleep) workflows whose owning worker
+ is currently dead, then sends the expected signal to each. The signal row
+ lands in intemporal_signals but no callback fires — the workflow is stuck.
+
+ Records each signal in jepsen_signals_sent for the checker."
+ [{:keys [db-spec test-run owners history]}]
+ (let [dead-owners (->> owners (remove db/alive?) set)
+ ;; Find claimed-but-not-completed W1/W3 workflows owned by dead workers.
+ rows (when (seq dead-owners)
+ (jdbc/execute! db-spec
+ (into [(str "SELECT workflow_id, wf_type
+ FROM jepsen_work_queue
+ WHERE test_run = ?
+ AND completed = FALSE
+ AND wf_type IN ('signal-wait','cancel-sleep','rapid-signal')
+ AND claimed_by IN ("
+ (str/join "," (repeat (count dead-owners) "?"))
+ ")")
+ test-run]
+ dead-owners)))]
+ (doseq [{:jepsen_work_queue/keys [workflow_id wf_type]} rows]
+ (let [signal-name (case wf_type
+ "signal-wait" "go"
+ "cancel-sleep" "wake"
+ "rapid-signal" "immediate"
+ nil)]
+ (when signal-name
+ (try
+ (jdbc/execute! db-spec
+ ["INSERT INTO intemporal_signals (workflow_id, signal_name, payload)
+ VALUES (?,?,'{}'::jsonb)"
+ workflow_id signal-name])
+ (jdbc/execute! db-spec
+ ["INSERT INTO jepsen_signals_sent (test_run, workflow_id, signal_name)
+ VALUES (?,?,?)"
+ test-run workflow_id signal-name])
+ (swap! history conj {:process :nemesis :type :info
+ :f :signal-dead :value {:workflow-id workflow_id
+ :signal signal-name}
+ :time (System/currentTimeMillis)})
+ (log/log! :info (str "[nemesis] signalled dead workflow "
+ workflow_id " signal=" signal-name))
+ (catch Throwable t
+ (log/log! :warn (str "[nemesis] signal-dead-workflows! error: " t)))))))))
+
+;; ---------------------------------------------------------------------------
+;; Quiesce helper
+
+(defn ensure-all-alive!
+ "Revives all dead workers during the quiesce phase. Since intemporal has no
+ auto-resume on restart (bug 1.3), restarting workers here does NOT cause
+ stuck workflows to complete — it only proves that point."
+ [{:keys [owners db-url test-run repo-root]}]
+ (doseq [owner owners
+ :when (not (db/alive? owner))]
+ (log/log! :info (str "[quiesce] reviving " owner))
+ (db/fork! {:owner owner :db-url db-url :test-run test-run :repo-root repo-root})))
diff --git a/test/intemporal/jepsen/runner.clj b/test/intemporal/jepsen/runner.clj
new file mode 100644
index 0000000..34a8685
--- /dev/null
+++ b/test/intemporal/jepsen/runner.clj
@@ -0,0 +1,250 @@
+(ns intemporal.jepsen.runner
+ "Top-level orchestrator for the intemporal chaos test.
+
+ Phases:
+ 1. setup — migrate schema, truncate state, fork N worker JVMs
+ 2. active — generator submits/cancels/signals workflows;
+ nemesis kills & restarts workers;
+ nemesis also fires signals at dead workers (bug 1.1 probe)
+ 3. quiesce — nemesis stops; all workers restarted (proves bug 1.3: no
+ auto-resume on restart); grace period elapses
+ 4. check — run all four invariant checkers against final DB state
+ 5. teardown — kill all workers
+
+ Run:
+ clojure -X:dev:jdbc:jepsen intemporal.jepsen.runner/run \\
+ :workers 4 :duration 120
+
+ No-kill baseline (useful to confirm happy-path correctness):
+ clojure -X:dev:jdbc:jepsen intemporal.jepsen.runner/run \\
+ :workers 4 :duration 60 :no-kill true
+
+ Expected outcome with the current (unfixed) codebase:
+ checker liveness -> FAIL (bug 1.1 / 1.3)
+ checker signal-consumed -> FAIL (bug 2.1, intermittent)
+ checker history-integrity -> FAIL (bug 1.2, if concurrent-start runs)
+ checker cancellation-liveness -> FAIL (bug 2.3)
+
+ After the Phase A + B + C fixes from improvements.md, all four should PASS."
+ (:require [intemporal.jepsen.db :as db]
+ [intemporal.jepsen.client :as client]
+ [intemporal.jepsen.nemesis :as nemesis]
+ [intemporal.jepsen.checker :as checker]
+ [clojure.pprint :as pp]
+ [taoensso.telemere :as log])
+ (:import [java.util.concurrent Executors TimeUnit]))
+
+;; ---------------------------------------------------------------------------
+;; Defaults
+
+(def ^:private default-db-url
+ (or (System/getenv "POSTGRES_JDBC_URI")
+ "jdbc:postgresql://localhost:5432/root?user=root&password=root"))
+
+(defn- jdbc-spec [db-url]
+ {:dbtype "postgresql" :connection-uri db-url :jdbcUrl db-url})
+
+;; ---------------------------------------------------------------------------
+;; Generator
+
+(defn- start-generator!
+ "Launches 3 submit threads + 1 cancel thread + 1 observe thread.
+ Returns a 0-arity stop fn."
+ [{:keys [db-spec history test-run submit-rps]
+ :or {submit-rps 5}}]
+ (let [pool (Executors/newFixedThreadPool 5)
+ running? (atom true)
+ submit-period (long (/ 1000 (max 1 submit-rps)))]
+
+ ;; 3 submit threads
+ (dotimes [i 3]
+ (.submit pool ^Runnable
+ (fn []
+ (while @running?
+ (try
+ (let [op (client/invoke-submit db-spec test-run)]
+ (client/record-op! history (assoc op :process i :f :submit)))
+ (catch Throwable t (log/log! :warn (str "submit failed: " t))))
+ (Thread/sleep submit-period)))))
+
+ ;; 1 cancel thread — cancels a recently-submitted workflow every ~3s
+ (.submit pool ^Runnable
+ (fn []
+ (while @running?
+ (try
+ (let [candidates (->> @history
+ (filter #(and (= :submit (:f %))
+ (= :ok (:type %))
+ (= :cancel-sleep
+ (get-in % [:value :wf-type]))))
+ (keep #(get-in % [:value :workflow-id]))
+ seq)]
+ (when candidates
+ (let [wf-id (rand-nth candidates)
+ op (client/invoke-cancel db-spec test-run wf-id)]
+ (client/record-op! history (assoc op :process 98 :f :cancel)))))
+ (catch Throwable t (log/log! :warn (str "cancel failed: " t))))
+ (Thread/sleep 3000))))
+
+ ;; 1 rapid-signal thread — immediately signals rapid-signal workflows
+ (.submit pool ^Runnable
+ (fn []
+ (while @running?
+ (try
+ (let [candidates (->> @history
+ (filter #(and (= :submit (:f %))
+ (= :ok (:type %))
+ (= :rapid-signal
+ (get-in % [:value :wf-type]))))
+ (keep #(get-in % [:value :workflow-id]))
+ seq)]
+ (when candidates
+ (let [wf-id (rand-nth candidates)
+ op (client/invoke-signal db-spec test-run wf-id "immediate")]
+ (client/record-op! history (assoc op :process 97 :f :signal)))))
+ (catch Throwable t (log/log! :warn (str "rapid-signal failed: " t))))
+ ;; Very short sleep to maximise chance of hitting the race window.
+ (Thread/sleep 50))))
+
+ (fn stop-gen []
+ (reset! running? false)
+ (.shutdown pool)
+ (.awaitTermination pool 10 TimeUnit/SECONDS)
+ (.shutdownNow pool))))
+
+;; ---------------------------------------------------------------------------
+;; Nemesis loop
+
+(defn- start-nemesis!
+ [{:keys [owners history db-url db-spec test-run repo-root no-kill?
+ nemesis-min-ms nemesis-jitter-ms min-alive]
+ :or {nemesis-min-ms 3000 nemesis-jitter-ms 6000 min-alive 2}}]
+ (let [running? (atom (not no-kill?))
+ thread (Thread/startVirtualThread
+ (fn []
+ (while @running?
+ (try
+ (Thread/sleep (long (+ nemesis-min-ms
+ (rand-int nemesis-jitter-ms))))
+ (when @running?
+ ;; Occasionally inject a concurrent-start to probe bug 1.2.
+ (when (< (rand) 0.15)
+ (let [op (client/invoke-concurrent-start db-spec test-run)]
+ (client/record-op! history
+ (assoc op :process :nemesis :f :concurrent-start))))
+ ;; Main kill/restart fault.
+ (nemesis/step! {:owners owners
+ :history history
+ :db-url db-url
+ :test-run test-run
+ :repo-root repo-root
+ :min-alive min-alive})
+ ;; After any kill, signal the dead workflows (bug 1.1 probe).
+ (nemesis/signal-dead-workflows! {:db-spec db-spec
+ :test-run test-run
+ :owners owners
+ :history history}))
+ (catch InterruptedException _ (reset! running? false))
+ (catch Throwable t
+ (log/log! :warn (str "nemesis step error: " t)))))))]
+ (fn stop-nem []
+ (reset! running? false)
+ (.join thread 5000))))
+
+;; ---------------------------------------------------------------------------
+;; Entry point
+
+(defn run
+ "Top-level entry. Options (all optional):
+ :workers number of forked worker JVMs (default 4)
+ :duration active phase length in seconds (default 120)
+ :db-url JDBC url (default from POSTGRES_JDBC_URI or localhost)
+ :no-kill disable nemesis (baseline mode)
+ :submit-rps submit rate per thread (3 threads, default 5 → 15 RPS)
+ :nemesis-min-ms minimum gap between nemesis ticks (default 3000)
+ :nemesis-jitter-ms random extra gap per tick (default 6000)
+ :min-alive floor on simultaneously-alive workers (default 2)
+ :grace-s quiesce drain time before checks (default 90)
+ :repo-root working directory for forked workers (default \".\")"
+ [{:keys [workers duration db-url no-kill repo-root submit-rps
+ nemesis-min-ms nemesis-jitter-ms min-alive grace-s]
+ :or {workers 4 duration 120 db-url default-db-url
+ repo-root "." submit-rps 5
+ nemesis-min-ms 3000 nemesis-jitter-ms 6000
+ min-alive 2 grace-s 90}}]
+ (let [test-run (str "run-" (System/currentTimeMillis))
+ owners (mapv #(format "jepsen-%02d-%s" % test-run) (range workers))
+ db-spec (jdbc-spec db-url)
+ history (atom [])]
+
+ (println "\n=== intemporal Jepsen run" test-run "===")
+ (println (format "workers=%d duration=%ds no-kill=%s grace=%ds"
+ workers duration (boolean no-kill) grace-s))
+
+ ;; --- 1. setup ---
+ (println "[setup] migrating + truncating")
+ (db/migrate-all! db-spec)
+ (db/truncate-all! db-spec)
+ (println "[setup] forking workers")
+ (doseq [owner owners]
+ (db/fork! {:owner owner :db-url db-url :test-run test-run
+ :repo-root repo-root}))
+
+ (try
+ ;; --- 2. active phase ---
+ (println (format "[active] running %ds with chaos=%s" duration (not no-kill)))
+ (let [stop-gen (start-generator! {:db-spec db-spec
+ :history history
+ :test-run test-run
+ :submit-rps submit-rps})
+ stop-nem (start-nemesis! {:owners owners
+ :history history
+ :db-url db-url
+ :db-spec db-spec
+ :test-run test-run
+ :repo-root repo-root
+ :no-kill? no-kill
+ :nemesis-min-ms nemesis-min-ms
+ :nemesis-jitter-ms nemesis-jitter-ms
+ :min-alive min-alive})]
+ (Thread/sleep (* 1000 duration))
+ (println "[active->quiesce] stopping generator and nemesis")
+ (stop-gen)
+ (stop-nem))
+
+ ;; --- 3. quiesce ---
+ ;; Restart every worker. This proves bug 1.3: restarting does NOT
+ ;; auto-resume workflows — no recovery poller exists.
+ (println "[quiesce] restarting all workers (proves no auto-resume on restart)")
+ (nemesis/ensure-all-alive! {:owners owners :db-url db-url
+ :test-run test-run :repo-root repo-root})
+ (println (format "[quiesce] grace period: %ds" grace-s))
+ (Thread/sleep (* 1000 grace-s))
+
+ ;; --- 4. check ---
+ (println "[check] running invariants")
+ (let [result (checker/check-all {:db-spec db-spec
+ :history history
+ :test-run test-run})]
+ (println "\n=== RESULTS ===")
+ (pp/pprint result)
+ (println "===============\n")
+ (println (format "Ops in history: %d" (count @history)))
+ (println (format "Submitted: %d"
+ (count (filter #(and (= :submit (:f %))
+ (= :ok (:type %)))
+ @history))))
+ (if (:valid? result)
+ (println "ALL INVARIANTS PASSED")
+ (println "INVARIANTS VIOLATED — see results above"))
+ result)
+
+ (finally
+ ;; --- 5. teardown ---
+ (println "[teardown] killing workers")
+ (db/kill-all!)))))
+
+(defn -main [& args]
+ (let [opts (when (seq args) (read-string (first args)))
+ r (run (or opts {}))]
+ (System/exit (if (:valid? r) 0 1))))
diff --git a/test/intemporal/jepsen/worker.clj b/test/intemporal/jepsen/worker.clj
new file mode 100644
index 0000000..05ec031
--- /dev/null
+++ b/test/intemporal/jepsen/worker.clj
@@ -0,0 +1,144 @@
+(ns intemporal.jepsen.worker
+ "Forked-JVM entry point for a single intemporal worker node.
+
+ Lifecycle:
+ 1. Starts an intemporal engine backed by the shared Postgres store.
+ 2. Polls jepsen_work_queue for unclaimed workflow specs (FOR UPDATE SKIP LOCKED).
+ 3. Claims each spec and runs start-workflow in a virtual thread.
+ 4. Prints 'READY ' once the poll loop is running.
+
+ Signal semantics:
+ SIGTERM -> JVM shutdown hook fires -> engine shutdown -> graceful stop.
+ SIGKILL -> no hook runs. Process-local signal-callback atom in the JDBC
+ store is destroyed. Workflows waiting on a signal will never
+ wake on another worker (bug 1.1).
+
+ The worker does NOT call resume-workflow for previously-running workflows
+ on restart — this reproduces bug 1.3 (no recovery poller)."
+ (:require [intemporal.core :as intemporal]
+ [intemporal.store.jdbc :as jdbc-store]
+ [intemporal.jepsen.workflows :as wf]
+ [next.jdbc :as jdbc]
+ [hikari-cp.core :as hikari]
+ [taoensso.telemere :as log])
+ (:gen-class))
+
+;; ---------------------------------------------------------------------------
+;; Connection pools
+
+(defn- make-pool [db-url pool-size auto-commit?]
+ (hikari/make-datasource {:jdbc-url db-url
+ :maximum-pool-size pool-size
+ :auto-commit auto-commit?}))
+
+;; ---------------------------------------------------------------------------
+;; Work queue polling
+
+(def ^:private poll-interval-ms 200)
+
+(defn- claim-work-item!
+ "Claims one unclaimed queue item for this owner. Returns the row or nil."
+ [main-ds test-run owner]
+ (jdbc/with-transaction [tx main-ds]
+ (let [row (jdbc/execute-one! tx
+ ["SELECT id, workflow_id, wf_type, nonce, args
+ FROM jepsen_work_queue
+ WHERE test_run = ? AND claimed_by IS NULL AND completed = FALSE
+ ORDER BY id ASC
+ FOR UPDATE SKIP LOCKED
+ LIMIT 1"
+ test-run])]
+ (when row
+ (jdbc/execute! tx
+ ["UPDATE jepsen_work_queue SET claimed_by = ?, claimed_at = NOW()
+ WHERE id = ?"
+ owner (:jepsen_work_queue/id row)])
+ row))))
+
+(defn- mark-completed! [main-ds queue-id]
+ (jdbc/execute! main-ds
+ ["UPDATE jepsen_work_queue SET completed = TRUE WHERE id = ?" queue-id]))
+
+(defn- run-one-workflow!
+ "Starts a workflow in the current thread (intended to be called from a
+ virtual thread). Binds *side-ds*, *test-run*, and *owner* so workflow
+ activities can record to the side-channel."
+ [engine main-ds side-ds test-run owner row]
+ (let [workflow-id (:jepsen_work_queue/workflow_id row)
+ wf-type (:jepsen_work_queue/wf_type row)
+ args (wf/build-args row)
+ wf-fn (wf/workflow-fn-for wf-type)]
+ (binding [wf/*side-ds* side-ds
+ wf/*test-run* test-run
+ wf/*owner* owner]
+ (try
+ (log/log! :info (str "[" owner "] starting " wf-type " wf=" workflow-id))
+ (intemporal/start-workflow engine wf-fn args :workflow-id workflow-id)
+ (mark-completed! main-ds (:jepsen_work_queue/id row))
+ (log/log! :info (str "[" owner "] completed wf=" workflow-id))
+ (catch InterruptedException _
+ (log/log! :info (str "[" owner "] interrupted wf=" workflow-id)))
+ (catch Throwable t
+ (log/log! :warn (str "[" owner "] failed wf=" workflow-id " err=" t)))))))
+
+(defn- start-poll-loop!
+ "Starts the background work-queue poll loop. Returns a 0-arity stop fn."
+ [engine main-ds side-ds test-run owner]
+ (let [running? (atom true)]
+ (Thread/startVirtualThread
+ (fn []
+ (while @running?
+ (try
+ (if-let [row (claim-work-item! main-ds test-run owner)]
+ ;; Start workflow in its own virtual thread so the poll loop
+ ;; remains responsive.
+ (Thread/startVirtualThread
+ #(run-one-workflow! engine main-ds side-ds test-run owner row))
+ ;; Nothing in queue — sleep briefly.
+ (Thread/sleep poll-interval-ms))
+ (catch InterruptedException _
+ (reset! running? false))
+ (catch Throwable t
+ (log/log! :warn (str "[" owner "] poll loop error: " t))
+ (Thread/sleep poll-interval-ms))))))
+ (fn [] (reset! running? false))))
+
+;; ---------------------------------------------------------------------------
+;; Public entry point
+
+(defn run
+ "deps.edn -X entry point. Boots the engine, starts polling, and parks until
+ SIGKILL or SIGTERM.
+
+ Args (EDN keyword map):
+ :owner — node identifier (stamped on side-channel rows)
+ :db-url — JDBC URL for the shared Postgres instance
+ :test-run — run id matching the current jepsen_work_queue rows"
+ [{:keys [owner db-url test-run]}]
+ (assert owner ":owner required")
+ (assert db-url ":db-url required")
+ (assert test-run ":test-run required")
+
+ (let [store (jdbc-store/make-jdbc-store db-url)
+ main-ds (:datasource store)
+ side-ds (make-pool db-url 2 true) ; auto-commit for side-channel
+ engine (intemporal/make-workflow-engine :store store :threads 8)
+ stop-fn (start-poll-loop! engine main-ds side-ds test-run owner)]
+
+ (.addShutdownHook
+ (Runtime/getRuntime)
+ (Thread.
+ ^Runnable
+ (fn []
+ (log/log! :info (str "[" owner "] shutdown hook: stopping engine"))
+ (stop-fn)
+ (intemporal/shutdown-engine engine 5)
+ (hikari/close-datasource side-ds))))
+
+ (println "READY" owner)
+ (flush)
+ @(promise))) ; park until killed
+
+(defn -main [& args]
+ (let [[owner db-url test-run] args]
+ (run {:owner owner :db-url db-url :test-run (or test-run "default")})))
diff --git a/test/intemporal/jepsen/workflows.clj b/test/intemporal/jepsen/workflows.clj
new file mode 100644
index 0000000..ed915b2
--- /dev/null
+++ b/test/intemporal/jepsen/workflows.clj
@@ -0,0 +1,138 @@
+(ns intemporal.jepsen.workflows
+ "Workflow shapes (W1–W4) submitted by the chaos test, plus the side-channel
+ recording activity.
+
+ Side-channel writes go through *side-ds* (a separate auto-commit Hikari pool),
+ so rows are durable even if the worker JVM is SIGKILLed mid-activity.
+
+ Each workflow shape probes a specific bug from improvements.md:
+ W1 (signal-wait) — bug 1.1: lost wake on signal when worker is dead
+ W2 (activity-chain) — bug 1.3: no recovery poller; activities not re-run
+ W3 (cancel-sleep) — bug 2.3: cancellation can't reach a sleeping workflow
+ W4 (rapid-signal) — bug 2.1: register-then-consume signal race"
+ (:require [intemporal.core :as intemporal]
+ [next.jdbc :as jdbc]
+ [taoensso.telemere :as log]))
+
+;; ---------------------------------------------------------------------------
+;; Dynamic bindings set by the worker before calling start-workflow / resume-workflow.
+
+(def ^:dynamic *side-ds* nil) ; auto-commit JDBC pool for side-channel writes
+(def ^:dynamic *test-run* nil) ; test-run id stamped on every side-channel row
+(def ^:dynamic *owner* nil) ; worker owner-id for attribution
+
+;; ---------------------------------------------------------------------------
+;; Side-channel recording.
+
+(defn- record!
+ "Inserts one row into jepsen_invocations. Never throws — a side-channel
+ failure must not crash the workflow."
+ [workflow-id step nonce phase]
+ (when *side-ds*
+ (try
+ (jdbc/execute! *side-ds*
+ ["INSERT INTO jepsen_invocations (test_run, workflow_id, step, nonce, phase, owner)
+ VALUES (?,?,?,?,?,?)"
+ *test-run* workflow-id step nonce (name phase) *owner*])
+ (catch Throwable t
+ (log/log! :warn (str "jepsen side-channel write failed: " t))))))
+
+;; ---------------------------------------------------------------------------
+;; Activities.
+
+(defn jepsen-activity
+ "Side-channel–recording activity. Sleeps briefly to widen the crash window,
+ then records :begin / :end / :fail rows. Longer sleep for non-trivial steps
+ so the nemesis can land a SIGKILL while the activity is in-flight."
+ [workflow-id step nonce]
+ (record! workflow-id step nonce :begin)
+ (try
+ (Thread/sleep (long (+ 100 (rand-int 150))))
+ (record! workflow-id step nonce :end)
+ :ok
+ (catch Throwable t
+ (record! workflow-id step nonce :fail)
+ (throw t))))
+
+;; ---------------------------------------------------------------------------
+;; W1: signal-wait — probes bug 1.1 (lost wake on signal across processes).
+;;
+;; Registers a wait-for-signal :go. If the worker is killed while waiting and
+;; someone sends the signal from another process, the workflow should resume.
+;; With the current implementation it will NOT: the callback is in a dead atom.
+
+(defn signal-wait-workflow
+ "Records :before, suspends on signal 'go', records :after."
+ [workflow-id nonce]
+ (let [act (intemporal/stub #'jepsen-activity)]
+ (act workflow-id "before" nonce)
+ (intemporal/wait-for-signal "go")
+ (act workflow-id "after" nonce)))
+
+;; ---------------------------------------------------------------------------
+;; W2: activity-chain — probes bug 1.3 (no recovery poller).
+;;
+;; Runs a chain of activities. If the worker crashes mid-chain and never
+;; explicitly calls resume-workflow, the remaining activities never run.
+
+(defn activity-chain-workflow
+ "Runs `steps` activities in sequence."
+ [workflow-id nonce steps]
+ (let [act (intemporal/stub #'jepsen-activity)]
+ (dotimes [i steps]
+ (act workflow-id (str "step-" i) nonce))))
+
+;; ---------------------------------------------------------------------------
+;; W3: cancel-sleep — probes bug 2.3 (cancellation can't reach a sleeper).
+;;
+;; Records :started, then waits for signal 'wake' forever. The test client
+;; cancels the workflow via cancel-workflow. With the current implementation
+;; the workflow never observes the cancellation because it never re-enters
+;; the execution loop.
+
+(defn cancel-sleep-workflow
+ "Records :started, then blocks on signal 'wake'."
+ [workflow-id nonce]
+ (let [act (intemporal/stub #'jepsen-activity)]
+ (act workflow-id "started" nonce)
+ (intemporal/wait-for-signal "wake")
+ (act workflow-id "woke" nonce)))
+
+;; ---------------------------------------------------------------------------
+;; W4: rapid-signal — probes bug 2.1 (register-then-consume signal race).
+;;
+;; Immediately waits for signal 'immediate'. The test client sends the signal
+;; at nearly the same time, trying to hit the window between the consume-check
+;; and the register-callback call in process-signal.
+
+(defn rapid-signal-workflow
+ "Suspends immediately on signal 'immediate', records :completed after."
+ [workflow-id nonce]
+ (let [act (intemporal/stub #'jepsen-activity)]
+ (intemporal/wait-for-signal "immediate")
+ (act workflow-id "completed" nonce)))
+
+;; ---------------------------------------------------------------------------
+;; Registry: maps wf-type keyword -> {:fn workflow-fn :signal name-or-nil}.
+
+(def ^:private wf-registry
+ {:signal-wait {:wf-fn #'signal-wait-workflow :signal "go"}
+ :activity-chain {:wf-fn #'activity-chain-workflow :signal nil}
+ :cancel-sleep {:wf-fn #'cancel-sleep-workflow :signal "wake"}
+ :rapid-signal {:wf-fn #'rapid-signal-workflow :signal "immediate"}})
+
+(defn workflow-fn-for [wf-type]
+ (or (get-in wf-registry [(keyword wf-type) :wf-fn])
+ (throw (ex-info "Unknown workflow type" {:wf-type wf-type}))))
+
+(defn signal-name-for [wf-type]
+ (get-in wf-registry [(keyword wf-type) :signal]))
+
+(defn build-args
+ "Reconstructs the arg vector for a workflow from the queue row."
+ [{:jepsen_work_queue/keys [workflow_id nonce wf_type args]}]
+ (case (keyword wf_type)
+ :signal-wait [workflow_id nonce]
+ :activity-chain [workflow_id nonce (or (:steps args) 5)]
+ :cancel-sleep [workflow_id nonce]
+ :rapid-signal [workflow_id nonce]))
diff --git a/test/intemporal/tests/bench/fdb_test.clj b/test/intemporal/tests/bench/fdb_test.clj
index 8395cd5..a2b024f 100644
--- a/test/intemporal/tests/bench/fdb_test.clj
+++ b/test/intemporal/tests/bench/fdb_test.clj
@@ -1,13 +1,13 @@
(ns ^:integration intemporal.tests.bench.fdb-test
- (:require [clojure.test :refer [deftest testing is]]
+ (:require [clojure.test :refer [deftest testing]]
[intemporal.store.fdb :as fdb-store]
[intemporal.tests.bench.test-suite :as suite]
[me.vedang.clj-fdb.FDB :as cfdb]))
(deftest fdb-store-test
(testing "FoundationDB Store Implementation"
- (let [db (cfdb/select-api-version 730)
- db (cfdb/open db)]
+ (let [db (cfdb/select-api-version 710)
+ db (cfdb/open db "docker/fdb.cluster")]
;; Run shared suite
(with-open [store (fdb-store/make-fdb-store db "intemporal-tests")]
@@ -19,5 +19,5 @@
;; 1k => ~1s
;; 10k => ~5s
;; 100k => 6GB, 53s
- (suite/run-store-tests (fdb-store/make-fdb-store (cfdb/open (cfdb/select-api-version 730)) "intemporal-tests") 100000))
+ (suite/run-store-tests (fdb-store/make-fdb-store (cfdb/open (cfdb/select-api-version 710) "docker/fdb.cluster") "intemporal-tests") 100000))
"")
\ No newline at end of file
diff --git a/test/intemporal/tests/bench/memory_test.clj b/test/intemporal/tests/bench/memory_test.clj
index cd2210b..bbd5d02 100644
--- a/test/intemporal/tests/bench/memory_test.clj
+++ b/test/intemporal/tests/bench/memory_test.clj
@@ -11,5 +11,5 @@
(comment
(time
;; 100k => 3GB, 10s
- (run-store-tests (store/->InMemoryStore (atom {})) 100000))
+ (suite/run-store-tests (store/->InMemoryStore (atom {})) 100000))
"")
\ No newline at end of file
diff --git a/test/intemporal/tests/cancellation_test.clj b/test/intemporal/tests/cancellation_test.clj
index d81f9c0..9848dfc 100644
--- a/test/intemporal/tests/cancellation_test.clj
+++ b/test/intemporal/tests/cancellation_test.clj
@@ -1,5 +1,6 @@
(ns intemporal.tests.cancellation-test
(:require [intemporal.core :as intemporal]
+ [intemporal.protocol :as p]
[clojure.test :refer [deftest is testing]]
[matcher-combinators.test :refer [match?]]
[matcher-combinators.matchers :as m]))
@@ -39,7 +40,7 @@
;; Workflow should fail with cancellation error
(let [result @result-future]
- (is (match? {:status :failed
+ (is (match? {:status :cancelled
:workflow-id wf-id
:error (m/embeds {:message #"cancelled"})}
result)))))))
@@ -57,7 +58,7 @@
(intemporal/cancel-workflow (:store engine) wf-id)
(let [result @result-future]
- (is (match? {:status :failed
+ (is (match? {:status :cancelled
:workflow-id wf-id
:error (m/embeds {:message #"cancelled"})}
result)))))))
@@ -73,7 +74,7 @@
(let [result (intemporal/start-workflow engine
cancellable-flow [1]
:workflow-id wf-id)]
- (is (match? {:status :failed
+ (is (match? {:status :cancelled
:workflow-id wf-id
:error (m/embeds {:message #"cancelled"})}
result)))))))
@@ -91,7 +92,28 @@
;; Check result indicates failure with cancellation
(let [result @result-future]
- (is (match? {:status :failed
+ (is (match? {:status :cancelled
:workflow-id wf-id
:error (m/embeds {:message #"cancelled"})}
result)))))))
+
+(deftest test-cancel-completed-workflow-is-noop
+ (testing "cancel-workflow on an already-completed workflow is a no-op"
+ (intemporal/with-workflow-engine [engine {}]
+ (let [wf-id "cancel-completed-test"]
+ (intemporal/start-workflow engine (fn [] :done) [] :workflow-id wf-id)
+ (is (= :completed (p/get-workflow-status (:store engine) wf-id)))
+ (intemporal/cancel-workflow (:store engine) wf-id)
+ (is (= :completed (p/get-workflow-status (:store engine) wf-id)))))))
+
+(deftest test-cancel-idempotent
+ (testing "cancel-workflow called twice does not throw"
+ (intemporal/with-workflow-engine [engine {:threads 2}]
+ (let [wf-id "cancel-twice-test"
+ result-future (future
+ (intemporal/start-workflow engine long-flow [1]
+ :workflow-id wf-id))]
+ (Thread/sleep 100)
+ (intemporal/cancel-workflow (:store engine) wf-id)
+ (intemporal/cancel-workflow (:store engine) wf-id)
+ (is (match? {:status :cancelled} @result-future))))))
\ No newline at end of file
diff --git a/test/intemporal/tests/cancellation_test.cljs b/test/intemporal/tests/cancellation_test.cljs
index 4c59467..1ed8f25 100644
--- a/test/intemporal/tests/cancellation_test.cljs
+++ b/test/intemporal/tests/cancellation_test.cljs
@@ -38,7 +38,7 @@
200)
(with-result [result (intemporal/start-workflow engine long-flow [1]
:workflow-id wf-id)]
- (is (match? {:status :failed
+ (is (match? {:status :cancelled
:workflow-id wf-id
:error (m/embeds {:message #"cancelled"})}
result))))))
@@ -53,7 +53,7 @@
150)
(with-result [result (intemporal/start-workflow engine cancellable-flow [1]
:workflow-id wf-id)]
- (is (match? {:status :failed
+ (is (match? {:status :cancelled
:workflow-id wf-id
:error (m/embeds {:message #"cancelled"})}
result))))))
@@ -66,7 +66,7 @@
(intemporal/cancel-workflow (:store engine) wf-id)
(with-result [result (intemporal/start-workflow engine cancellable-flow [1]
:workflow-id wf-id)]
- (is (match? {:status :failed
+ (is (match? {:status :cancelled
:workflow-id wf-id
:error (m/embeds {:message #"cancelled"})}
result))))))
@@ -80,7 +80,7 @@
100)
(with-result [result (intemporal/start-workflow engine long-flow [1]
:workflow-id wf-id)]
- (is (match? {:status :failed
+ (is (match? {:status :cancelled
:workflow-id wf-id
:error (m/embeds {:message #"cancelled"})}
result))))))
diff --git a/test/intemporal/tests/context_macros_test.cljs b/test/intemporal/tests/context_macros_test.cljs
index e74ed70..80615cd 100644
--- a/test/intemporal/tests/context_macros_test.cljs
+++ b/test/intemporal/tests/context_macros_test.cljs
@@ -2,7 +2,7 @@
(:require [intemporal.core :as intemporal]
[cljs.test :refer [deftest is testing]]
[intemporal.internal.context :as ctx]
- [intemporal.tests.utils :refer [with-trace-logging]]
+ [intemporal.tests.utils :as utils]
[promesa.core :as p])
(:require-macros [intemporal.internal.context :as ctx :refer [blet bthen bfinally]]
[intemporal.tests.utils :refer [with-result]]))
diff --git a/test/intemporal/tests/crash/saga_compensation_crash_test.clj b/test/intemporal/tests/crash/saga_compensation_crash_test.clj
new file mode 100644
index 0000000..7414927
--- /dev/null
+++ b/test/intemporal/tests/crash/saga_compensation_crash_test.clj
@@ -0,0 +1,95 @@
+(ns ^:crash intemporal.tests.crash.saga-compensation-crash-test
+ "Crash recovery test for saga compensations.
+ A compensation suspends mid-way (waiting on a signal) to simulate a crash
+ between compensating activities. After resume, each compensating activity
+ must run exactly once and the workflow finalizes :failed."
+ (:require [intemporal.core :as intemporal]
+ [intemporal.store :as store]
+ [intemporal.protocol :as p]
+ [clojure.test :refer [deftest is testing]]))
+
+;; ============================================================================
+;; Activities - count actual executions (replays don't re-run the fn)
+;; ============================================================================
+
+(def exec-counts (atom {}))
+(defn- bump! [k] (swap! exec-counts update k (fnil inc 0)))
+
+(defn book-hotel [order] (bump! :book-hotel) {:hotel order})
+(defn book-flight [order] (bump! :book-flight) {:flight order})
+(defn charge-card-fails [order]
+ (bump! :charge-card)
+ (throw (ex-info "card declined" {:order order})))
+
+(defn cancel-hotel [_] (bump! :cancel-hotel) :hotel-cancelled)
+(defn cancel-flight [_] (bump! :cancel-flight) :flight-cancelled)
+
+;; The flight compensation cancels the flight, then waits for a signal. The
+;; missing signal is our deterministic "crash" point: the workflow suspends
+;; mid-compensation and is resumed by a fresh engine in phase 2.
+(defn crash-saga [order]
+ (let [s (intemporal/saga)
+ hotel (intemporal/stub #'book-hotel)
+ flight (intemporal/stub #'book-flight)
+ charge (intemporal/stub #'charge-card-fails)
+ chotel (intemporal/stub #'cancel-hotel)
+ cflight (intemporal/stub #'cancel-flight)]
+ (try
+ (let [h (hotel order)
+ _ (intemporal/add-compensation s #(chotel h))
+ f (flight order)
+ ;; flight compensation cancels the flight, then waits for a signal -
+ ;; the deterministic "crash" point mid-compensation.
+ _ (intemporal/add-compensation s #(do (cflight f)
+ (intemporal/wait-for-signal "continue-compensation")))]
+ (charge order)
+ :booked)
+ (catch Exception e
+ (intemporal/compensate s)
+ (throw e)))))
+
+(defn- count-events [store workflow-id event-type]
+ (->> (p/load-history store workflow-id)
+ (filter #(= event-type (:event-type %)))
+ count))
+
+;; ============================================================================
+;; Test
+;; ============================================================================
+
+(deftest test-compensation-survives-crash
+ (testing "Compensation suspended mid-way resumes and runs each step exactly once"
+ (reset! exec-counts {})
+ (let [workflow-id "saga-crash-1"
+ persistent-store (store/->InMemoryStore (atom {}))]
+
+ ;; Phase 1: run until the flight compensation suspends waiting for a signal
+ (testing "Phase 1: fails, begins compensation, suspends mid-compensation"
+ (let [engine-1 (intemporal/make-workflow-engine :store persistent-store :threads 2)
+ fut (future
+ (intemporal/start-workflow engine-1 crash-saga ["o1"]
+ :workflow-id workflow-id))]
+ ;; Give it time to: book hotel+flight, fail charge, cancel-flight,
+ ;; then suspend at wait-for-signal.
+ (Thread/sleep 300)
+ (future-cancel fut)
+ (intemporal/shutdown-engine engine-1)
+
+ ;; flight was cancelled, hotel not yet (we suspend before its comp)
+ (is (= 1 (get @exec-counts :cancel-flight)))
+ (is (nil? (get @exec-counts :cancel-hotel)))
+ ;; not yet finalized
+ (is (zero? (count-events persistent-store workflow-id :workflow-failed)))))
+
+ ;; Phase 2: fresh engine, signal + resume -> finishes compensating, fails
+ (testing "Phase 2: resume completes compensation and finalizes :failed"
+ (let [engine-2 (intemporal/make-workflow-engine :store persistent-store :threads 2)]
+ (intemporal/send-signal persistent-store workflow-id "continue-compensation" {})
+ (let [result (intemporal/resume-workflow engine-2 workflow-id crash-saga ["o1"])]
+ (is (= :failed (:status result)))
+ ;; each compensating activity ran exactly once across the crash
+ (is (= 1 (get @exec-counts :cancel-flight)))
+ (is (= 1 (get @exec-counts :cancel-hotel)))
+ ;; exactly one terminal failure event
+ (is (= 1 (count-events persistent-store workflow-id :workflow-failed)))
+ (intemporal/shutdown-engine engine-2)))))))
diff --git a/test/intemporal/tests/jepsen/bug_1_1_test.clj b/test/intemporal/tests/jepsen/bug_1_1_test.clj
new file mode 100644
index 0000000..462d54b
--- /dev/null
+++ b/test/intemporal/tests/jepsen/bug_1_1_test.clj
@@ -0,0 +1,87 @@
+(ns intemporal.tests.jepsen.bug-1-1-test
+ "Bug 1.1 — Wake on signal across pods. REGRESSION GUARD.
+
+ Root cause (improvements.md §1.1) — now FIXED (Phase C):
+ Wake callbacks lived in a process-local atom on the store record, so a signal
+ delivered through a DIFFERENT store instance (another pod) never woke the
+ workflow — it was persisted but orphaned.
+
+ The fix: add-signal writes a durable runnable marker (C3); a worker (C4) on
+ any pod claims the marker, leases the workflow (C1), and resumes it by id
+ (B3). The wake no longer depends on the process that started the workflow.
+
+ These tests assert the FIXED behaviour: a signal written through a SEPARATE
+ store instance, with a worker running, resumes the workflow to completion.
+ InMemory models a shared store by having both instances share one state atom;
+ JDBC and FDB use two store objects over the same backing."
+ (:require [clojure.test :refer [deftest is testing]]
+ [intemporal.core :as intemporal]
+ [intemporal.protocol :as p]
+ [intemporal.store :as mem]
+ [intemporal.store.jdbc :as jdbc-store]
+ [intemporal.store.fdb :as fdb-store]
+ [me.vedang.clj-fdb.FDB :as cfdb]
+ [intemporal.internal.workflow-registry :as wreg]))
+
+(defn sig-act [x] (* x 2))
+
+(defn sig-wf [x]
+ (let [a (intemporal/stub #'sig-act)
+ r (a x)]
+ (intemporal/wait-for-signal "go")
+ (+ r 100)))
+
+(defn- await-status [store wf-id terminal timeout-ms]
+ (let [deadline (+ (System/currentTimeMillis) timeout-ms)]
+ (loop []
+ (let [s (p/get-workflow-status store wf-id)]
+ (if (or (= terminal s) (> (System/currentTimeMillis) deadline))
+ s
+ (do (Thread/sleep 50) (recur)))))))
+
+(defn- run-scenario
+ "store-a runs the workflow (suspends on signal); store-b (a separate instance
+ over the same backing) delivers the signal; a worker resumes it."
+ [store-a store-b]
+ (wreg/clear-registry!)
+ (let [wid (str "bug11-" (random-uuid))]
+ (let [e1 (intemporal/make-workflow-engine :store store-a :threads 2)
+ f1 (future (intemporal/start-workflow e1 sig-wf [6] :workflow-id wid))]
+ (Thread/sleep 300)
+ (future-cancel f1)
+ (intemporal/shutdown-engine e1))
+ (let [e2 (intemporal/make-workflow-engine :store store-b :threads 2)
+ stop (intemporal/start-worker e2 :poll-ms 50 :owner-id "bug11-w")]
+ (try
+ ;; Signal delivered through the SECOND store instance.
+ (intemporal/send-signal store-b wid "go" {})
+ {:status (await-status store-b wid :completed 5000)
+ :result (intemporal/get-workflow-result store-b wid)}
+ (finally (stop) (intemporal/shutdown-engine e2))))))
+
+(defn- assert-woke [{:keys [status result]}]
+ (is (= :completed status) "cross-instance signal woke the workflow via durable marker (bug 1.1 fixed)")
+ (is (= 112 result) "6*2 + 100 = 112"))
+
+(deftest signal-across-instances-in-memory
+ (testing "InMemoryStore sharing one backing atom"
+ (let [state (atom {})]
+ (assert-woke (run-scenario (mem/->InMemoryStore state) (mem/->InMemoryStore state))))))
+
+(deftest ^:integration signal-across-instances-jdbc
+ (testing "two JdbcStore instances over the same Postgres"
+ (let [url (or (System/getenv "DATABASE_URL")
+ "jdbc:postgresql://localhost:5432/root?user=root&password=root")
+ store-a (jdbc-store/make-jdbc-store url)
+ store-b (jdbc-store/make-jdbc-store url)]
+ (try (assert-woke (run-scenario store-a store-b))
+ (finally (.close store-a) (.close store-b))))))
+
+(deftest ^:integration signal-across-instances-fdb
+ (testing "two FDBStore instances over the same FoundationDB"
+ (let [root (str "bug11-" (random-uuid))
+ fdb (cfdb/select-api-version 710)
+ db (.open fdb "docker/fdb.cluster")
+ store-a (fdb-store/make-fdb-store db root)
+ store-b (fdb-store/make-fdb-store db root)]
+ (assert-woke (run-scenario store-a store-b)))))
diff --git a/test/intemporal/tests/jepsen/bug_1_2_test.clj b/test/intemporal/tests/jepsen/bug_1_2_test.clj
new file mode 100644
index 0000000..ef15c39
--- /dev/null
+++ b/test/intemporal/tests/jepsen/bug_1_2_test.clj
@@ -0,0 +1,58 @@
+(ns intemporal.tests.jepsen.bug-1-2-test
+ "Bug 1.2 — Concurrent execution corrupting history. REGRESSION GUARD.
+
+ Root cause (improvements.md §1.2) — now FIXED (Phase C, ownership model):
+ Two pods could run the same workflow and both write history; JDBC's
+ ON CONFLICT DO UPDATE silently overwrote, FDB produced duplicate-seq rows.
+ Nothing stopped two concurrent writers.
+
+ The fix: an ownership column. claim-owner atomically stamps
+ `owner WHERE owner IS NULL OR owner = me`, so exactly one pod can own (and
+ therefore run) a workflow; the worker resumes owned workflows one at a time.
+ No two writers execute concurrently, so history cannot be corrupted.
+
+ These tests assert the FIXED behaviour: of two pods racing to claim one
+ unowned workflow, exactly one succeeds; the loser cannot run it."
+ (:require [clojure.test :refer [deftest is testing]]
+ [intemporal.protocol :as p]
+ [intemporal.store :as mem]
+ [intemporal.store.jdbc :as jdbc-store]
+ [intemporal.store.fdb :as fdb-store]
+ [me.vedang.clj-fdb.FDB :as cfdb]))
+
+(defn- run-scenario
+ "Two owners race to claim one unowned workflow. Returns
+ {:a-claimed? :b-claimed? :pending-for-loser}."
+ [store]
+ (let [wid (str "bug12-" (random-uuid))]
+ (p/save-event store wid {:event-type :workflow-started :workflow-id wid :args []})
+ (let [a (p/claim-owner store wid "owner-A")
+ b (p/claim-owner store wid "owner-B")] ; A already owns it -> B must fail
+ {:a-claimed? a
+ :b-claimed? b
+ ;; scope to this wid — the shared DB may hold unowned rows from prior runs
+ :wid-pending-for-b? (contains? (set (p/list-pending store "owner-B" 1000)) wid)})))
+
+(defn- assert-fixed [{:keys [a-claimed? b-claimed? wid-pending-for-b?]}]
+ (is a-claimed? "owner-A claimed the unowned workflow")
+ (is (false? b-claimed?) "owner-B could NOT claim A's workflow — exclusive ownership (bug 1.2 fixed)")
+ (is (not wid-pending-for-b?) "the workflow is not runnable by B, so B never executes it"))
+
+(deftest claim-is-exclusive-in-memory
+ (testing "InMemoryStore"
+ (assert-fixed (run-scenario (mem/->InMemoryStore (atom {}))))))
+
+(deftest ^:integration claim-is-exclusive-jdbc
+ (testing "JdbcStore"
+ (let [url (or (System/getenv "DATABASE_URL")
+ "jdbc:postgresql://localhost:5432/root?user=root&password=root")
+ store (jdbc-store/make-jdbc-store url)]
+ (try (assert-fixed (run-scenario store)) (finally (.close store))))))
+
+(deftest ^:integration claim-is-exclusive-fdb
+ (testing "FDBStore"
+ (let [root (str "bug12-" (random-uuid))
+ fdb (cfdb/select-api-version 710)
+ db (.open fdb "docker/fdb.cluster")
+ store (fdb-store/make-fdb-store db root)]
+ (assert-fixed (run-scenario store)))))
diff --git a/test/intemporal/tests/jepsen/bug_1_3_test.clj b/test/intemporal/tests/jepsen/bug_1_3_test.clj
new file mode 100644
index 0000000..5a85457
--- /dev/null
+++ b/test/intemporal/tests/jepsen/bug_1_3_test.clj
@@ -0,0 +1,84 @@
+(ns intemporal.tests.jepsen.bug-1-3-test
+ "Bug 1.3 — Recovery after restart. REGRESSION GUARD.
+
+ Root cause (improvements.md §1.3) — now FIXED (Phase C):
+ There was no background process that resumed workflows after a restart, and
+ resume required the caller to know the workflow fn + args. A workflow whose
+ engine crashed stayed suspended forever.
+
+ The fix: durable runnable markers (C3) written on every signal, a lease (C1)
+ so only one worker runs a workflow, the workflow registry (B3) so a workflow
+ can be resumed by id alone, and start-worker (C4) which polls markers, claims
+ the lease, and resumes. A restarted process running a worker recovers
+ workflows it never started.
+
+ These tests assert the FIXED behaviour: after the engine crashes, a worker on a
+ fresh engine (same shared store) resumes the workflow to completion once the
+ signal arrives. InMemory shares one state atom to model a shared store; JDBC and
+ FDB use the same backing."
+ (:require [clojure.test :refer [deftest is testing]]
+ [intemporal.core :as intemporal]
+ [intemporal.protocol :as p]
+ [intemporal.store :as mem]
+ [intemporal.store.jdbc :as jdbc-store]
+ [intemporal.store.fdb :as fdb-store]
+ [me.vedang.clj-fdb.FDB :as cfdb]
+ [intemporal.internal.workflow-registry :as wreg]))
+
+(defn rec-act [x] (* x 10))
+
+(defn recover-wf [x]
+ (let [a (intemporal/stub #'rec-act)
+ r (a x)]
+ (intemporal/wait-for-signal "go")
+ (+ r 7)))
+
+(defn- await-status [store wf-id terminal timeout-ms]
+ (let [deadline (+ (System/currentTimeMillis) timeout-ms)]
+ (loop []
+ (let [s (p/get-workflow-status store wf-id)]
+ (if (or (= terminal s) (> (System/currentTimeMillis) deadline))
+ s
+ (do (Thread/sleep 50) (recur)))))))
+
+(defn- run-scenario
+ "Start on engine-a (suspends on signal), crash it, then a worker on engine-b
+ resumes after a signal. Returns the terminal status + result."
+ [store]
+ (wreg/clear-registry!)
+ (let [wid (str "bug13-" (random-uuid))]
+ (let [e1 (intemporal/make-workflow-engine :store store :threads 2)
+ f1 (future (intemporal/start-workflow e1 recover-wf [4] :workflow-id wid))]
+ (Thread/sleep 300)
+ (future-cancel f1)
+ (intemporal/shutdown-engine e1))
+ (let [e2 (intemporal/make-workflow-engine :store store :threads 2)
+ stop (intemporal/start-worker e2 :poll-ms 50 :owner-id "bug13-w")]
+ (try
+ (intemporal/send-signal store wid "go" {})
+ {:status (await-status store wid :completed 5000)
+ :result (intemporal/get-workflow-result store wid)}
+ (finally (stop) (intemporal/shutdown-engine e2))))))
+
+(defn- assert-recovered [{:keys [status result]}]
+ (is (= :completed status) "worker on a fresh engine resumed the crashed workflow (bug 1.3 fixed)")
+ (is (= 47 result) "4*10 + 7 = 47"))
+
+(deftest engine-restart-recovers-in-memory
+ (testing "shared InMemoryStore: worker recovers after crash"
+ (assert-recovered (run-scenario (mem/->InMemoryStore (atom {}))))))
+
+(deftest ^:integration engine-restart-recovers-jdbc
+ (testing "JdbcStore: worker recovers after crash"
+ (let [url (or (System/getenv "DATABASE_URL")
+ "jdbc:postgresql://localhost:5432/root?user=root&password=root")
+ store (jdbc-store/make-jdbc-store url)]
+ (try (assert-recovered (run-scenario store)) (finally (.close store))))))
+
+(deftest ^:integration engine-restart-recovers-fdb
+ (testing "FDBStore: worker recovers after crash"
+ (let [root (str "bug13-" (random-uuid))
+ fdb (cfdb/select-api-version 710)
+ db (.open fdb "docker/fdb.cluster")
+ store (fdb-store/make-fdb-store db root)]
+ (assert-recovered (run-scenario store)))))
diff --git a/test/intemporal/tests/jepsen/bug_2_1_test.clj b/test/intemporal/tests/jepsen/bug_2_1_test.clj
new file mode 100644
index 0000000..fa36a71
--- /dev/null
+++ b/test/intemporal/tests/jepsen/bug_2_1_test.clj
@@ -0,0 +1,112 @@
+(ns intemporal.tests.jepsen.bug-2-1-test
+ "Bug 2.1 — Register-then-consume signal race in process-signal. REGRESSION GUARD.
+
+ Root cause (improvements.md §2.1) — now FIXED (Phase A1):
+ process-signal previously did consume-check THEN register-callback. A
+ signal arriving in that window fired into an empty callbacks atom and was
+ lost, stranding the workflow forever.
+
+ The fix (execution.clj/.cljs process-signal) reverses the order: register
+ the callback FIRST, then consume-check. consume-signal is atomic, so
+ exactly one of {the inline check, the callback} consumes the signal; the
+ callback only wakes if it consumed, so the inline path never double-runs.
+
+ Mechanism:
+ RacingStore (intemporal.tests.jepsen.racing-store) deterministically pins
+ the executing thread at the consume-check and lets the test inject a signal
+ at exactly the adversarial moment. Because the callback is now registered
+ BEFORE that consume-check, inner.add-signal finds it and fires it — the
+ workflow wakes and completes on every run.
+
+ These tests assert the FIXED behaviour: the workflow wakes, completes, and
+ leaves no orphaned signal. They will fail again if the race is reintroduced."
+ (:require [clojure.test :refer [deftest is testing]]
+ [intemporal.core :as intemporal]
+ [intemporal.protocol :as p]
+ [intemporal.store :as mem]
+ [intemporal.store.jdbc :as jdbc-store]
+ [intemporal.store.fdb :as fdb-store]
+ [me.vedang.clj-fdb.FDB :as cfdb]
+ [intemporal.tests.jepsen.racing-store :refer [->RacingStore]]))
+
+;; ── Shared workflow ───────────────────────────────────────────────────────────
+
+(defn- wait-signal-wf []
+ (intemporal/wait-for-signal "go")
+ :woke)
+
+;; ── Shared scenario ───────────────────────────────────────────────────────────
+
+(defn- run-scenario
+ "Drives the race against any store via RacingStore. Returns a map:
+ :result — the start-workflow result map, or ::timeout if it never woke
+ :pending — pending-signal names still in the store after the race
+ :status — workflow status from the store"
+ [inner]
+ (let [gate-nil (promise)
+ gate-sent (promise)
+ store (->RacingStore inner gate-nil gate-sent (atom true))
+ wf-id (str "bug21-" (random-uuid))
+ result (promise)
+ engine (intemporal/make-workflow-engine :store store :threads 2)]
+ (future
+ (try
+ (deliver result (intemporal/start-workflow engine wait-signal-wf []
+ :workflow-id wf-id))
+ (catch Exception e (deliver result {:error (str e)}))))
+
+ (let [gate-info (deref gate-nil 5000 ::timeout)]
+ (when (= ::timeout gate-info)
+ (intemporal/shutdown-engine engine)
+ (throw (ex-info "Race gate never opened" {:wf-id wf-id})))
+ ;; Gate open: the callback is already registered (Phase A1). Inject the
+ ;; signal in the window — inner.add-signal finds the callback and fires it.
+ (p/add-signal inner wf-id "go" {:source :injected-in-race-window})
+ (deliver gate-sent :signal-injected)
+ (let [r (deref result 3000 ::timeout)
+ pending (p/get-pending-signals inner wf-id)
+ status (p/get-workflow-status inner wf-id)]
+ (intemporal/shutdown-engine engine)
+ {:result r
+ ;; Count remaining signal *values*, not keys: InMemoryStore leaves an
+ ;; empty vector under the signal name after consuming, while JDBC/FDB
+ ;; delete the row. Both mean "nothing left to deliver".
+ :pending-count (reduce + 0 (map count (vals pending)))
+ :status status}))))
+
+(defn- assert-woke [{:keys [result pending-count status]}]
+ (is (not= ::timeout result)
+ "Workflow woke and completed — the in-window signal was delivered (bug 2.1 fixed)")
+ (is (= :completed (:status result))
+ "start-workflow returned a :completed result")
+ (is (zero? pending-count)
+ "No signal left pending — it was consumed exactly once")
+ (is (= :completed status)
+ "Workflow status is :completed"))
+
+;; ── In-memory (always runs) ───────────────────────────────────────────────────
+
+(deftest signal-delivered-in-register-consume-window-in-memory
+ (testing "RacingStore on InMemoryStore: in-window signal wakes the workflow"
+ (assert-woke (run-scenario (mem/->InMemoryStore (atom {}))))))
+
+;; ── JDBC (requires Postgres) ──────────────────────────────────────────────────
+
+(deftest ^:integration signal-delivered-in-register-consume-window-jdbc
+ (testing "RacingStore on JdbcStore: in-window signal wakes the workflow"
+ (let [url (or (System/getenv "DATABASE_URL")
+ "jdbc:postgresql://localhost:5432/root?user=root&password=root")
+ inner (jdbc-store/make-jdbc-store url)]
+ (try
+ (assert-woke (run-scenario inner))
+ (finally (.close inner))))))
+
+;; ── FDB (requires FoundationDB) ───────────────────────────────────────────────
+
+(deftest ^:integration signal-delivered-in-register-consume-window-fdb
+ (testing "RacingStore on FDBStore: in-window signal wakes the workflow"
+ (let [root (str "bug21-" (random-uuid))
+ fdb (cfdb/select-api-version 710)
+ db (.open fdb "docker/fdb.cluster")
+ inner (fdb-store/make-fdb-store db root)]
+ (assert-woke (run-scenario inner)))))
diff --git a/test/intemporal/tests/jepsen/bug_2_3_test.clj b/test/intemporal/tests/jepsen/bug_2_3_test.clj
new file mode 100644
index 0000000..0b8fc0f
--- /dev/null
+++ b/test/intemporal/tests/jepsen/bug_2_3_test.clj
@@ -0,0 +1,98 @@
+(ns intemporal.tests.jepsen.bug-2-3-test
+ "Bug 2.3 — Cancellation reaching a workflow sleeping in wait-for-signal. REGRESSION GUARD.
+
+ Root cause (improvements.md §2.3) — now FIXED (Phase A2):
+ cancel-workflow set the cancelled flag but did nothing to wake a workflow
+ parked on wait-for-signal. Such a workflow never re-entered its loop, so
+ check-cancelled! never fired and the cancellation was silently ignored —
+ the workflow (and its thread) stayed alive forever.
+
+ The fix adds IStore/wake-workflow plus a generic wake callback registered
+ whenever a workflow suspends (execution.clj/.cljs run-workflow-internal).
+ cancel-workflow now calls mark-cancelled THEN wake-workflow, forcing the
+ sleeper to re-enter, observe the flag at the loop-top cancel check, and
+ finalize.
+
+ These tests assert the FIXED behaviour:
+ • the workflow TERMINATES (start-workflow returns; no longer stuck)
+ • is-cancelled? is true
+ • get-workflow-status is :cancelled (finalize-cancelled writes a first-class
+ :workflow-cancelled terminal event)
+ They will fail again if cancel stops waking sleepers."
+ (:require [clojure.test :refer [deftest is testing]]
+ [intemporal.core :as intemporal]
+ [intemporal.protocol :as p]
+ [intemporal.store :as mem]
+ [intemporal.store.jdbc :as jdbc-store]
+ [intemporal.store.fdb :as fdb-store]
+ [me.vedang.clj-fdb.FDB :as cfdb]))
+
+;; ── Shared workflow ───────────────────────────────────────────────────────────
+
+(defn- cancel-sleep-wf []
+ (intemporal/wait-for-signal "wake")
+ :woke)
+
+;; ── Shared scenario ───────────────────────────────────────────────────────────
+
+(defn- run-scenario
+ "Starts a workflow that sleeps on a signal, cancels it, and observes whether
+ the cancellation actually terminates it. Returns
+ :terminated? :cancelled-flag-set? :status."
+ [store]
+ (let [wf-id (str "bug23-" (random-uuid))
+ result (promise)
+ engine (intemporal/make-workflow-engine :store store :threads 2)]
+ (future
+ (try
+ (deliver result (intemporal/start-workflow engine cancel-sleep-wf []
+ :workflow-id wf-id))
+ (catch Exception e (deliver result {:error (str e)}))))
+ ;; Wait for the workflow to suspend and register its wake callback
+ (Thread/sleep 400)
+ ;; Cancel: sets the flag AND wakes the sleeper (Phase A2)
+ (intemporal/cancel-workflow store wf-id)
+ (let [r (deref result 2000 :stuck)
+ flag? (p/is-cancelled? store wf-id)
+ status (p/get-workflow-status store wf-id)]
+ (intemporal/shutdown-engine engine)
+ {:terminated? (not= :stuck r)
+ :cancelled-flag-set? flag?
+ :status status})))
+
+(defn- assert-cancelled [{:keys [terminated? cancelled-flag-set? status]}]
+ (is terminated?
+ "Workflow terminated after cancel — wake-workflow forced loop re-entry (bug 2.3 fixed)")
+ (is cancelled-flag-set?
+ "Cancelled flag is set in the store")
+ ;; A finalized cancelled workflow has a first-class :workflow-cancelled event in
+ ;; history (finalize-cancelled), so the derived terminal status is :cancelled.
+ (is (= :cancelled status)
+ "Workflow status is :cancelled after finalization (first-class cancelled terminal event)"))
+
+;; ── In-memory (always runs) ───────────────────────────────────────────────────
+
+(deftest cancellation-reaches-sleeping-workflow-in-memory
+ (testing "cancel-workflow terminates a signal-sleeping workflow (InMemoryStore)"
+ (assert-cancelled (run-scenario (mem/->InMemoryStore (atom {}))))))
+
+;; ── JDBC (requires Postgres) ──────────────────────────────────────────────────
+
+(deftest ^:integration cancellation-reaches-sleeping-workflow-jdbc
+ (testing "cancel-workflow terminates a signal-sleeping workflow (JdbcStore)"
+ (let [url (or (System/getenv "DATABASE_URL")
+ "jdbc:postgresql://localhost:5432/root?user=root&password=root")
+ store (jdbc-store/make-jdbc-store url)]
+ (try
+ (assert-cancelled (run-scenario store))
+ (finally (.close store))))))
+
+;; ── FDB (requires FoundationDB) ───────────────────────────────────────────────
+
+(deftest ^:integration cancellation-reaches-sleeping-workflow-fdb
+ (testing "cancel-workflow terminates a signal-sleeping workflow (FDBStore)"
+ (let [root (str "bug23-" (random-uuid))
+ fdb (cfdb/select-api-version 710)
+ db (.open fdb "docker/fdb.cluster")
+ store (fdb-store/make-fdb-store db root)]
+ (assert-cancelled (run-scenario store)))))
diff --git a/test/intemporal/tests/jepsen/racing_store.clj b/test/intemporal/tests/jepsen/racing_store.clj
new file mode 100644
index 0000000..cf1ca92
--- /dev/null
+++ b/test/intemporal/tests/jepsen/racing_store.clj
@@ -0,0 +1,59 @@
+(ns intemporal.tests.jepsen.racing-store
+ "RacingStore: a store wrapper that deterministically injects the
+ register-then-consume signal race described in improvements.md §2.1.
+
+ The race window in process-signal (execution.clj:227-255) is:
+
+ (1) (p/consume-signal store wf-id name) → nil ; no signal present
+ ← WINDOW: sender calls add-signal here →
+ (2) (p/register-signal-callback store wf-id name f) ; now registers wake-fn
+
+ add-signal in step (1.5) checks the callbacks atom and finds it empty —
+ the wake-fn fires into nothing. After step (2) the callback is registered
+ but add-signal has already run; it will never re-fire retroactively.
+ The workflow is permanently stuck with the signal row sitting in the store.
+
+ RacingStore widens and synchronises this window with two promises:
+ gate-nil — delivered by RacingStore after consume-signal returns nil
+ gate-sent — delivered by the test after add-signal is called
+
+ Usage:
+ (let [gate-nil (promise)
+ gate-sent (promise)
+ inner (make-your-store)
+ store (->RacingStore inner gate-nil gate-sent (atom true))]
+ ;; start workflow using `store` …
+ (deref gate-nil 5000 :timeout) ; wait: consume-check done
+ (p/add-signal inner wf-id sig payload) ; inject signal into window
+ (deliver gate-sent :go) ; close window, let process-signal continue
+ …)"
+ (:require [intemporal.protocol :as p]))
+
+(defrecord RacingStore [inner gate-nil gate-sent armed?]
+ p/IStore
+ (load-history [_ wf-id] (p/load-history inner wf-id))
+ (save-event [_ wf-id ev] (p/save-event inner wf-id ev))
+ (save-events [_ wf-id evs] (p/save-events inner wf-id evs))
+ (find-event [_ wf-id et sq] (p/find-event inner wf-id et sq))
+ (get-pending-signals [_ wf-id] (p/get-pending-signals inner wf-id))
+ (add-signal [_ wf-id sn sd] (p/add-signal inner wf-id sn sd))
+ (register-signal-callback [_ wf-id sn cb] (p/register-signal-callback inner wf-id sn cb))
+ (unregister-signal-callback [_ wf-id sn] (p/unregister-signal-callback inner wf-id sn))
+ (register-wake-callback [_ wf-id cb] (p/register-wake-callback inner wf-id cb))
+ (wake-workflow [_ wf-id] (p/wake-workflow inner wf-id))
+ (is-cancelled? [_ wf-id] (p/is-cancelled? inner wf-id))
+ (mark-cancelled [_ wf-id] (p/mark-cancelled inner wf-id))
+ (get-workflow-status [_ wf-id] (p/get-workflow-status inner wf-id))
+ (claim-owner [_ wf-id o] (p/claim-owner inner wf-id o))
+ (list-pending [_ o lim] (p/list-pending inner o lim))
+ (release-owner [_ o] (p/release-owner inner o))
+ (set-wake-at [_ wf-id wa] (p/set-wake-at inner wf-id wa))
+
+ (consume-signal [_ wf-id sig-name]
+ (let [result (p/consume-signal inner wf-id sig-name)]
+ ;; Only intercept the FIRST nil return (armed? guards re-entrant calls).
+ (when (and (nil? result) (compare-and-set! armed? true false))
+ (deliver gate-nil {:wf-id wf-id :sig-name sig-name})
+ ;; Block until the test has injected the signal into the window.
+ (deref gate-sent 5000 :timeout-in-racing-store))
+ result)))
diff --git a/test/intemporal/tests/saga_test.clj b/test/intemporal/tests/saga_test.clj
new file mode 100644
index 0000000..9ef74c6
--- /dev/null
+++ b/test/intemporal/tests/saga_test.clj
@@ -0,0 +1,236 @@
+(ns intemporal.tests.saga-test
+ "Tests for saga / compensation support (saga + add-compensation + compensate).
+ A compensation registered for a successful step runs (in reverse order)
+ when the workflow later fails and the catch block calls compensate."
+ (:require [intemporal.core :as intemporal]
+ [intemporal.tests.utils :refer [with-result]]
+ [clojure.test :refer [deftest is testing]]
+ [matcher-combinators.matchers :as m]
+ [matcher-combinators.test :refer [match?]]))
+
+;; ============================================================================
+;; Activities - record execution order + args into a shared atom
+;; ============================================================================
+
+(def events (atom []))
+(defn- record! [e] (swap! events conj e))
+
+(defn book-hotel [order] (record! [:book-hotel order]) {:hotel order})
+(defn book-flight [order] (record! [:book-flight order]) {:flight order})
+(defn charge-card [order] (record! [:charge-card order]) {:charge order})
+
+(defn charge-card-fails [order]
+ (record! [:charge-card order])
+ (throw (ex-info "card declined" {:order order})))
+
+(defn book-flight-fails [order]
+ (record! [:book-flight order])
+ (throw (ex-info "no seats" {:order order})))
+
+(defn cancel-hotel [v] (record! [:cancel-hotel v]) :hotel-cancelled)
+(defn cancel-flight [v] (record! [:cancel-flight v]) :flight-cancelled)
+
+(defn failing-cancel-flight [v]
+ (record! [:cancel-flight v])
+ (throw (ex-info "refund provider down" {:v v})))
+
+(defn slow-step [x] (record! [:slow x]) (Thread/sleep 50) x)
+
+;; ============================================================================
+;; Workflows
+;; ============================================================================
+
+(defn happy-saga [order]
+ (let [s (intemporal/saga)
+ hotel (intemporal/stub #'book-hotel)
+ flight (intemporal/stub #'book-flight)
+ charge (intemporal/stub #'charge-card)
+ chotel (intemporal/stub #'cancel-hotel)
+ cflight (intemporal/stub #'cancel-flight)]
+ (try
+ (let [h (hotel order)
+ _ (intemporal/add-compensation s #(chotel h))
+ f (flight order)
+ _ (intemporal/add-compensation s #(cflight f))]
+ (charge order)
+ :booked)
+ (catch Exception e
+ (intemporal/compensate s)
+ (throw e)))))
+
+(defn failing-saga [order]
+ (let [s (intemporal/saga)
+ hotel (intemporal/stub #'book-hotel)
+ flight (intemporal/stub #'book-flight)
+ charge (intemporal/stub #'charge-card-fails)
+ chotel (intemporal/stub #'cancel-hotel)
+ cflight (intemporal/stub #'cancel-flight)]
+ (try
+ (let [h (hotel order)
+ _ (intemporal/add-compensation s #(chotel h))
+ f (flight order)
+ _ (intemporal/add-compensation s #(cflight f))]
+ (charge order)
+ :booked)
+ (catch Exception e
+ (intemporal/compensate s)
+ (throw e)))))
+
+(defn fail-on-flight-saga [order]
+ (let [s (intemporal/saga)
+ hotel (intemporal/stub #'book-hotel)
+ flight (intemporal/stub #'book-flight-fails)
+ chotel (intemporal/stub #'cancel-hotel)
+ cflight (intemporal/stub #'cancel-flight)]
+ (try
+ (let [h (hotel order)
+ _ (intemporal/add-compensation s #(chotel h))
+ ;; flight fails before its compensation is registered
+ f (flight order)
+ _ (intemporal/add-compensation s #(cflight f))]
+ :booked)
+ (catch Exception e
+ (intemporal/compensate s)
+ (throw e)))))
+
+;; Books hotel + flight, then stays busy in a loop so a cancel arrives after the
+;; bookings have completed (mirrors cancellation-test/long-flow).
+(defn cancel-rollback-saga [order]
+ (let [s (intemporal/saga)
+ hotel (intemporal/stub #'book-hotel)
+ flight (intemporal/stub #'book-flight)
+ chotel (intemporal/stub #'cancel-hotel)
+ cflight (intemporal/stub #'cancel-flight)
+ slow (intemporal/stub #'slow-step)]
+ (try
+ (let [h (hotel order)
+ _ (intemporal/add-compensation s #(chotel h))
+ f (flight order)
+ _ (intemporal/add-compensation s #(cflight f))]
+ (loop [i 0]
+ (if (< i 40) (do (slow i) (recur (inc i))) :booked)))
+ (catch Exception e
+ (intemporal/compensate s)
+ (throw e)))))
+
+;; Cancel lands before any compensation is registered (busy first).
+(defn cancel-early-saga [order]
+ (let [s (intemporal/saga)
+ hotel (intemporal/stub #'book-hotel)
+ chotel (intemporal/stub #'cancel-hotel)
+ slow (intemporal/stub #'slow-step)]
+ (try
+ (loop [i 0]
+ (when (< i 3) (slow i) (recur (inc i))))
+ (let [h (hotel order)
+ _ (intemporal/add-compensation s #(chotel h))]
+ :booked)
+ (catch Exception e
+ (intemporal/compensate s)
+ (throw e)))))
+
+;; Compensation activity itself fails -> swallowed, others still run.
+(defn failing-comp-saga [order]
+ (let [s (intemporal/saga)
+ hotel (intemporal/stub #'book-hotel)
+ flight (intemporal/stub #'book-flight)
+ charge (intemporal/stub #'charge-card-fails)
+ chotel (intemporal/stub #'cancel-hotel)
+ cflight (intemporal/stub #'failing-cancel-flight)]
+ (try
+ (let [h (hotel order)
+ _ (intemporal/add-compensation s #(chotel h))
+ f (flight order)
+ _ (intemporal/add-compensation s #(cflight f))]
+ (charge order)
+ :booked)
+ (catch Exception e
+ (intemporal/compensate s)
+ (throw e)))))
+
+;; ============================================================================
+;; Tests
+;; ============================================================================
+
+(deftest test-happy-path-no-compensation
+ (testing "When the workflow succeeds, no compensation runs"
+ (reset! events [])
+ (intemporal/with-workflow-engine [engine {:threads 2}]
+ (with-result [result (intemporal/start-workflow engine happy-saga ["o1"])]
+ (is (match? {:status :completed :result :booked} result))
+ (is (= [[:book-hotel "o1"] [:book-flight "o1"] [:charge-card "o1"]]
+ @events))))))
+
+(deftest test-compensation-runs-lifo-on-failure
+ (testing "On a later failure, compensations run in reverse order with the forward result"
+ (reset! events [])
+ (intemporal/with-workflow-engine [engine {:threads 2}]
+ (with-result [result (intemporal/start-workflow engine failing-saga ["o2"])]
+ (is (match? {:status :failed} result))
+ ;; forward steps, the failing charge, then compensations in reverse (LIFO)
+ (is (= [[:book-hotel "o2"]
+ [:book-flight "o2"]
+ [:charge-card "o2"]
+ [:cancel-flight {:flight "o2"}]
+ [:cancel-hotel {:hotel "o2"}]]
+ @events))))))
+
+(deftest test-failed-step-registers-no-compensation
+ (testing "A step whose own body fails registers no compensation; earlier steps still compensate"
+ (reset! events [])
+ (intemporal/with-workflow-engine [engine {:threads 2}]
+ (with-result [result (intemporal/start-workflow engine fail-on-flight-saga ["o3"])]
+ (is (match? {:status :failed} result))
+ ;; flight failed -> no :cancel-flight; only hotel compensates
+ (is (= [[:book-hotel "o3"]
+ [:book-flight "o3"]
+ [:cancel-hotel {:hotel "o3"}]]
+ @events))
+ (is (not (some #(= :cancel-flight (first %)) @events)))))))
+
+(defn- compensations [events]
+ (filterv #(#{:cancel-flight :cancel-hotel} (first %)) events))
+
+(deftest test-cancellation-rolls-back-completed-steps
+ (testing "Cancelling a running saga runs compensations (LIFO) for completed steps"
+ (reset! events [])
+ (intemporal/with-workflow-engine [engine {:threads 2}]
+ (let [wf-id "saga-cancel-1"
+ fut (future (intemporal/start-workflow engine cancel-rollback-saga ["c1"]
+ :workflow-id wf-id))]
+ ;; let hotel + flight + a few slow steps run, then cancel
+ (Thread/sleep 250)
+ (intemporal/cancel-workflow (:store engine) wf-id)
+ (let [result @fut]
+ (is (match? {:status :cancelled
+ :workflow-id wf-id
+ :error (m/embeds {:message #"cancelled"})}
+ result))
+ ;; both completed steps rolled back, in reverse order, with their values
+ (is (= [[:cancel-flight {:flight "c1"}]
+ [:cancel-hotel {:hotel "c1"}]]
+ (compensations @events))))))))
+
+(deftest test-cancellation-with-no-completed-steps
+ (testing "Cancelling before any with-failure step completes runs no compensations"
+ (reset! events [])
+ (intemporal/with-workflow-engine [engine {:threads 2}]
+ (let [wf-id "saga-cancel-2"
+ fut (future (intemporal/start-workflow engine cancel-early-saga ["c2"]
+ :workflow-id wf-id))]
+ (Thread/sleep 60) ;; mid first slow step, before the with-failure
+ (intemporal/cancel-workflow (:store engine) wf-id)
+ (let [result @fut]
+ (is (match? {:status :cancelled :workflow-id wf-id} result))
+ (is (empty? (compensations @events))))))))
+
+(deftest test-observer-compensation-lifecycle
+ (testing "Observer sees compensation-started/-completed, and -failed for a failing compensation"
+ (reset! events [])
+ (intemporal/with-workflow-engine [engine {:threads 2 :enable-logging true}]
+ (with-result [_ (intemporal/start-workflow engine failing-comp-saga ["c3"])]
+ (let [evs (set (map :event @(:log engine)))]
+ (is (contains? evs :compensation-started))
+ (is (contains? evs :compensation-completed))
+ ;; failing-cancel-flight throws -> swallowed + surfaced to the observer
+ (is (contains? evs :compensation-failed)))))))
diff --git a/test/intemporal/tests/saga_test.cljs b/test/intemporal/tests/saga_test.cljs
new file mode 100644
index 0000000..0a28a89
--- /dev/null
+++ b/test/intemporal/tests/saga_test.cljs
@@ -0,0 +1,126 @@
+(ns intemporal.tests.saga-test
+ "Tests for saga / compensation support (saga + add-compensation + compensate)."
+ (:require [intemporal.core :as intemporal]
+ [intemporal.tests.utils :refer [with-result]]
+ [cljs.test :as t :refer [deftest is testing]]
+ [matcher-combinators.test :refer [match?]])
+ (:require-macros [intemporal.tests.utils :refer [with-result]]))
+
+;; ============================================================================
+;; Activities - record execution order + args into a shared atom
+;; ============================================================================
+
+(def events (atom []))
+(defn- record! [e] (swap! events conj e))
+
+(defn book-hotel [order] (record! [:book-hotel order]) {:hotel order})
+(defn book-flight [order] (record! [:book-flight order]) {:flight order})
+(defn charge-card [order] (record! [:charge-card order]) {:charge order})
+
+(defn charge-card-fails [order]
+ (record! [:charge-card order])
+ (throw (ex-info "card declined" {:order order})))
+
+(defn book-flight-fails [order]
+ (record! [:book-flight order])
+ (throw (ex-info "no seats" {:order order})))
+
+(defn cancel-hotel [v] (record! [:cancel-hotel v]) :hotel-cancelled)
+(defn cancel-flight [v] (record! [:cancel-flight v]) :flight-cancelled)
+
+;; ============================================================================
+;; Workflows
+;; ============================================================================
+
+(defn happy-saga [order]
+ (let [s (intemporal/saga)
+ hotel (intemporal/stub #'book-hotel)
+ flight (intemporal/stub #'book-flight)
+ charge (intemporal/stub #'charge-card)
+ chotel (intemporal/stub #'cancel-hotel)
+ cflight (intemporal/stub #'cancel-flight)]
+ (try
+ (let [h (hotel order)
+ _ (intemporal/add-compensation s #(chotel h))
+ f (flight order)
+ _ (intemporal/add-compensation s #(cflight f))]
+ (charge order)
+ :booked)
+ (catch :default e
+ (when (intemporal/suspension? e) (throw e))
+ (intemporal/compensate s)
+ (throw e)))))
+
+(defn failing-saga [order]
+ (let [s (intemporal/saga)
+ hotel (intemporal/stub #'book-hotel)
+ flight (intemporal/stub #'book-flight)
+ charge (intemporal/stub #'charge-card-fails)
+ chotel (intemporal/stub #'cancel-hotel)
+ cflight (intemporal/stub #'cancel-flight)]
+ (try
+ (let [h (hotel order)
+ _ (intemporal/add-compensation s #(chotel h))
+ f (flight order)
+ _ (intemporal/add-compensation s #(cflight f))]
+ (charge order)
+ :booked)
+ (catch :default e
+ (when (intemporal/suspension? e) (throw e))
+ (intemporal/compensate s)
+ (throw e)))))
+
+(defn fail-on-flight-saga [order]
+ (let [s (intemporal/saga)
+ hotel (intemporal/stub #'book-hotel)
+ flight (intemporal/stub #'book-flight-fails)
+ chotel (intemporal/stub #'cancel-hotel)
+ cflight (intemporal/stub #'cancel-flight)]
+ (try
+ (let [h (hotel order)
+ _ (intemporal/add-compensation s #(chotel h))
+ f (flight order)
+ _ (intemporal/add-compensation s #(cflight f))]
+ :booked)
+ (catch :default e
+ (when (intemporal/suspension? e) (throw e))
+ (intemporal/compensate s)
+ (throw e)))))
+
+;; ============================================================================
+;; Tests
+;; ============================================================================
+
+(deftest test-happy-path-no-compensation
+ (testing "When the workflow succeeds, no compensation runs"
+ (reset! events [])
+ (let [engine (intemporal/make-workflow-engine :threads 2)]
+ (with-result [result (intemporal/start-workflow engine happy-saga ["o1"])]
+ (is (match? {:status :completed :result :booked} result))
+ (is (= [[:book-hotel "o1"] [:book-flight "o1"] [:charge-card "o1"]]
+ @events))))))
+
+(deftest test-compensation-runs-lifo-on-failure
+ (testing "On a later failure, compensations run in reverse order with the forward result"
+ (reset! events [])
+ (let [engine (intemporal/make-workflow-engine :threads 2)]
+ (with-result [result (intemporal/start-workflow engine failing-saga ["o2"])]
+ (is (match? {:status :failed} result))
+ (is (= [[:book-hotel "o2"]
+ [:book-flight "o2"]
+ [:charge-card "o2"]
+ [:cancel-flight {:flight "o2"}]
+ [:cancel-hotel {:hotel "o2"}]]
+ @events))))))
+
+(deftest test-failed-step-registers-no-compensation
+ (testing "A step whose own body fails registers no compensation; earlier steps still compensate"
+ (reset! events [])
+ (let [engine (intemporal/make-workflow-engine :threads 2)]
+ (with-result [result (intemporal/start-workflow engine fail-on-flight-saga ["o3"])]
+ (is (match? {:status :failed} result))
+ (is (= [[:book-hotel "o3"]
+ [:book-flight "o3"]
+ [:cancel-hotel {:hotel "o3"}]]
+ @events))
+ (is (not (some #(= :cancel-flight (first %)) @events)))))))
diff --git a/test/intemporal/tests/signal_test.clj b/test/intemporal/tests/signal_test.clj
index 14b76cb..8aefe38 100644
--- a/test/intemporal/tests/signal_test.clj
+++ b/test/intemporal/tests/signal_test.clj
@@ -72,18 +72,29 @@
(deftest test-multiple-signals
(testing "Multiple signals can be sent to same workflow"
(intemporal/with-workflow-engine [engine {:threads 2}]
- (let [wf-id "multi-signal-test"]
- ;; Send signals before workflow starts
- (intemporal/send-signal (:store engine) wf-id "approval" {:user "alice"})
- (intemporal/send-signal (:store engine) wf-id "approval" {:user "bob"})
+ ;; Two independent workflow runs, each waiting for a signal
+ (let [wf-id-1 "multi-signal-test-1"
+ wf-id-2 "multi-signal-test-2"
+ fut1 (future (intemporal/start-workflow engine signal-flow [100]
+ :workflow-id wf-id-1))
+ fut2 (future (intemporal/start-workflow engine signal-flow [200]
+ :workflow-id wf-id-2))]
+ (Thread/sleep 100)
+ (intemporal/send-signal (:store engine) wf-id-1 "approval" {:user "alice"})
+ (intemporal/send-signal (:store engine) wf-id-2 "approval" {:user "bob"})
+ (is (match? {:result {:approved {:user "alice"}}} @fut1))
+ (is (match? {:result {:approved {:user "bob"}}} @fut2))))))
- ;; First workflow run consumes first signal
- (let [result1 (intemporal/start-workflow engine
- signal-flow [100]
- :workflow-id wf-id)]
- (is (match? {:result {:approved {:user "alice"}}} result1)))
+(deftest test-send-signal-not-found
+ (testing "send-signal throws when workflow does not exist"
+ (intemporal/with-workflow-engine [engine {}]
+ (is (thrown-with-msg? clojure.lang.ExceptionInfo #"not active"
+ (intemporal/send-signal (:store engine) "no-such-wf" "approval" {}))))))
- ;; Second run on same workflow-id consumes second signal
- (let [result2 (intemporal/resume-workflow engine wf-id
- signal-flow [100])]
- (is (match? {:result {:approved {:user "alice"}}} result2)))))))
+(deftest test-send-signal-to-completed-workflow
+ (testing "send-signal throws when workflow is already completed"
+ (intemporal/with-workflow-engine [engine {}]
+ (let [wf-id "completed-signal-test"]
+ (intemporal/start-workflow engine (fn [] :done) [] :workflow-id wf-id)
+ (is (thrown-with-msg? clojure.lang.ExceptionInfo #"not active"
+ (intemporal/send-signal (:store engine) wf-id "approval" {})))))))
diff --git a/test/intemporal/tests/signal_test.cljs b/test/intemporal/tests/signal_test.cljs
index 6d2e5fa..d94fd8e 100644
--- a/test/intemporal/tests/signal_test.cljs
+++ b/test/intemporal/tests/signal_test.cljs
@@ -3,7 +3,7 @@
[intemporal.tests.utils :refer [with-result]]
[cljs.test :as t :refer [deftest is testing]]
[matcher-combinators.test :refer [match?]]
- [promesa.core :as p])
+)
(:require-macros [intemporal.tests.utils :refer [with-result]]
[intemporal.internal.context :refer [blet]]))
@@ -68,17 +68,18 @@
(deftest test-multiple-signals
(testing "Multiple signals can be sent to same workflow"
- (let [wf-id "multi-signal-test"
- engine (intemporal/make-workflow-engine :threads 2)]
- ;; Send signals before workflow starts
- (intemporal/send-signal (:store engine) wf-id "approval" {:user "alice"})
- (intemporal/send-signal (:store engine) wf-id "approval" {:user "bob"})
- ;; First workflow run consumes first signal, then second
+ (let [engine (intemporal/make-workflow-engine :threads 2)
+ wf-id-1 "multi-signal-test-1"
+ wf-id-2 "multi-signal-test-2"]
+ ;; Each workflow gets its own delayed signal. blet sequences r1 then r2,
+ ;; so each setTimeout fires after its respective workflow is suspended.
+ (js/setTimeout #(intemporal/send-signal (:store engine) wf-id-1 "approval" {:user "alice"}) 100)
(with-result [[result1 result2]
(blet [r1 (intemporal/start-workflow engine signal-flow [100]
- :workflow-id wf-id)
- r2 (intemporal/resume-workflow engine wf-id
- signal-flow [100])]
+ :workflow-id wf-id-1)
+ _ (js/setTimeout #(intemporal/send-signal (:store engine) wf-id-2 "approval" {:user "bob"}) 100)
+ r2 (intemporal/start-workflow engine signal-flow [200]
+ :workflow-id wf-id-2)]
[r1 r2])]
(is (match? {:result {:approved {:user "alice"}}} result1))
- (is (match? {:result {:approved {:user "alice"}}} result2))))))
+ (is (match? {:result {:approved {:user "bob"}}} result2))))))
diff --git a/test/intemporal/tests/status_test.clj b/test/intemporal/tests/status_test.clj
new file mode 100644
index 0000000..9111704
--- /dev/null
+++ b/test/intemporal/tests/status_test.clj
@@ -0,0 +1,54 @@
+(ns intemporal.tests.status-test
+ "Phase B2 — get-workflow-status reflects lifecycle via the cached status
+ column/key (O(1) for terminal states), across InMemory + JDBC + FDB."
+ (:require [clojure.test :refer [deftest is testing]]
+ [intemporal.core :as intemporal]
+ [intemporal.protocol :as p]
+ [intemporal.store :as store]
+ [intemporal.store.jdbc :as jdbc-store]
+ [intemporal.store.fdb :as fdb-store]
+ [me.vedang.clj-fdb.FDB :as cfdb]))
+
+(defn dbl [x] (* x 2))
+(defn done-wf [x] (let [a (intemporal/stub #'dbl)] (a x)))
+(defn sleep-wf [] (intemporal/wait-for-signal "go"))
+
+(defn- check-status [store]
+ ;; unknown id
+ (is (= :not-found (p/get-workflow-status store (str (random-uuid)))))
+ ;; completed (terminal -> cached fast path)
+ (let [e (intemporal/make-workflow-engine :store store :threads 2)]
+ (try
+ (let [{:keys [workflow-id]} (intemporal/submit-workflow e done-wf [21])]
+ (is (= {:status :completed :result 42}
+ (intemporal/await-workflow e workflow-id :timeout-ms 5000)))
+ (is (= :completed (p/get-workflow-status store workflow-id))))
+ ;; A cancelled workflow is first-class: finalize-cancelled writes a
+ ;; :workflow-cancelled terminal event, so the derived status is :cancelled
+ ;; both during the mark-cancelled window and after finalization.
+ (let [wid (str "cancel-" (random-uuid))
+ f (future (intemporal/start-workflow e sleep-wf [] :workflow-id wid))]
+ (Thread/sleep 300)
+ (intemporal/cancel-workflow store wid)
+ @f
+ (is (= :cancelled (p/get-workflow-status store wid))))
+ (finally (intemporal/shutdown-engine e)))))
+
+(deftest status-in-memory
+ (testing "status lifecycle on InMemoryStore"
+ (check-status (store/->InMemoryStore (atom {})))))
+
+(deftest ^:integration status-jdbc
+ (testing "status lifecycle on JdbcStore"
+ (let [url (or (System/getenv "DATABASE_URL")
+ "jdbc:postgresql://localhost:5432/root?user=root&password=root")
+ store (jdbc-store/make-jdbc-store url)]
+ (try (check-status store) (finally (.close store))))))
+
+(deftest ^:integration status-fdb
+ (testing "status lifecycle on FDBStore"
+ (let [root (str "status-" (random-uuid))
+ fdb (cfdb/select-api-version 710)
+ db (.open fdb "docker/fdb.cluster")
+ store (fdb-store/make-fdb-store db root)]
+ (check-status store))))
diff --git a/test/intemporal/tests/store/fdb_test.clj b/test/intemporal/tests/store/fdb_test.clj
index aadad6f..1d459bb 100644
--- a/test/intemporal/tests/store/fdb_test.clj
+++ b/test/intemporal/tests/store/fdb_test.clj
@@ -1,13 +1,13 @@
(ns ^:integration intemporal.tests.store.fdb-test
- (:require [clojure.test :refer [deftest testing is]]
+ (:require [clojure.test :refer [deftest testing]]
[intemporal.store.fdb :as fdb-store]
[intemporal.tests.store.test-suite :as suite]
[me.vedang.clj-fdb.FDB :as cfdb]))
(deftest fdb-store-test
(testing "FoundationDB Store Implementation"
- (let [db (cfdb/select-api-version 730)
- db (cfdb/open db)]
+ (let [db (cfdb/select-api-version 710)
+ db (cfdb/open db "docker/fdb.cluster")]
;; Run shared suite
(with-open [store (fdb-store/make-fdb-store db "intemporal-tests")]
diff --git a/test/intemporal/tests/submit_workflow_test.clj b/test/intemporal/tests/submit_workflow_test.clj
new file mode 100644
index 0000000..bd6636a
--- /dev/null
+++ b/test/intemporal/tests/submit_workflow_test.clj
@@ -0,0 +1,39 @@
+(ns intemporal.tests.submit-workflow-test
+ "Phase B4 — async submit-workflow + await-workflow.
+
+ submit-workflow returns {:workflow-id …} immediately (the workflow runs on a
+ background thread); await-workflow blocks until the workflow reaches a
+ terminal state and returns its result."
+ (:require [clojure.test :refer [deftest is testing]]
+ [intemporal.core :as intemporal]
+ [intemporal.store :as store]))
+
+(defn dbl [x] (* x 2))
+
+(defn submit-wf [x]
+ (let [a (intemporal/stub #'dbl)]
+ (a x)))
+
+(deftest submit-returns-id-then-await-completes
+ (testing "submit-workflow returns an id immediately; await-workflow yields the result"
+ (let [st (store/->InMemoryStore (atom {}))
+ e (intemporal/make-workflow-engine :store st :threads 2)]
+ (try
+ (let [{:keys [workflow-id]} (intemporal/submit-workflow e submit-wf [21])]
+ (is (string? workflow-id) "submit-workflow returns a workflow-id immediately")
+ (let [r (intemporal/await-workflow e workflow-id :timeout-ms 5000)]
+ (is (= :completed (:status r)) "await sees the workflow reach terminal state")
+ (is (= 42 (:result r)) "21*2 = 42")))
+ (finally (intemporal/shutdown-engine e))))))
+
+(deftest submit-honours-explicit-id
+ (testing "submit-workflow uses a caller-supplied :workflow-id"
+ (let [st (store/->InMemoryStore (atom {}))
+ e (intemporal/make-workflow-engine :store st :threads 2)]
+ (try
+ (let [{:keys [workflow-id]} (intemporal/submit-workflow e submit-wf [50]
+ :workflow-id "explicit-1")]
+ (is (= "explicit-1" workflow-id))
+ (is (= {:status :completed :result 100}
+ (intemporal/await-workflow e "explicit-1" :timeout-ms 5000))))
+ (finally (intemporal/shutdown-engine e))))))
diff --git a/test/intemporal/tests/timer_recovery_test.clj b/test/intemporal/tests/timer_recovery_test.clj
new file mode 100644
index 0000000..f361cb1
--- /dev/null
+++ b/test/intemporal/tests/timer_recovery_test.clj
@@ -0,0 +1,161 @@
+(ns intemporal.tests.timer-recovery-test
+ "Milestone 4 (C2) — persistent / cross-pod timers.
+
+ Three properties, each across InMemory + JDBC + FDB:
+ 1. fire-at determinism — a crash-resumed sleep keeps its original deadline
+ (the persisted :timer-scheduled fire-at is reused, not recomputed);
+ 2. timer recovery — a workflow that sleeps, then loses its engine, is driven
+ to completion by a worker on a fresh engine when the timer comes due;
+ 3. wake_at filtering — list-pending skips a workflow whose wake-at is still in
+ the future, and returns it once due."
+ (:require [clojure.test :refer [deftest is testing]]
+ [intemporal.core :as intemporal]
+ [intemporal.protocol :as p]
+ [intemporal.store :as store]
+ [intemporal.store.jdbc :as jdbc-store]
+ [intemporal.store.fdb :as fdb-store]
+ [me.vedang.clj-fdb.FDB :as cfdb]
+ [intemporal.internal.workflow-registry :as wreg]))
+
+(defn t-act [x] (* x 3))
+
+(defn sleeper-wf [x ms]
+ (let [a (intemporal/stub #'t-act)
+ r (a x)]
+ (intemporal/sleep ms)
+ (+ r 1)))
+
+(defn- fire-at-for [store wf-id]
+ (->> (p/load-history store wf-id)
+ (filter #(= :timer-scheduled (:event-type %)))
+ first
+ :fire-at))
+
+(defn- await-status [store wf-id terminal timeout-ms]
+ (let [deadline (+ (System/currentTimeMillis) timeout-ms)]
+ (loop []
+ (let [s (p/get-workflow-status store wf-id)]
+ (cond
+ (= terminal s) s
+ (> (System/currentTimeMillis) deadline) s
+ :else (do (Thread/sleep 50) (recur)))))))
+
+;; ── 1. fire-at determinism across a crash-resume ────────────────────────────────
+
+(defn- check-determinism [store]
+ (wreg/clear-registry!)
+ (let [wid (str "det-" (random-uuid))]
+ ;; Start with a long sleep so it suspends on the timer, then crash.
+ (let [e1 (intemporal/make-workflow-engine :store store :threads 2)
+ f1 (future (intemporal/start-workflow e1 sleeper-wf [7 60000] :workflow-id wid))]
+ (Thread/sleep 300)
+ (future-cancel f1)
+ (intemporal/shutdown-engine e1))
+ (let [fire-at-1 (fire-at-for store wid)]
+ (is (some? fire-at-1) "a :timer-scheduled fire-at was persisted")
+ ;; Resume on a fresh engine; it re-suspends on the same timer.
+ (let [e2 (intemporal/make-workflow-engine :store store :threads 2)
+ f2 (future (intemporal/resume-workflow e2 wid sleeper-wf [7 60000]))]
+ (Thread/sleep 300)
+ (future-cancel f2)
+ (intemporal/shutdown-engine e2))
+ (let [fire-at-2 (fire-at-for store wid)]
+ (is (= fire-at-1 fire-at-2)
+ "fire-at is identical across resume — no deadline drift (C2 determinism)")))))
+
+(deftest fire-at-deterministic-in-memory
+ (testing "InMemoryStore"
+ (check-determinism (store/->InMemoryStore (atom {})))))
+
+(deftest ^:integration fire-at-deterministic-jdbc
+ (testing "JdbcStore"
+ (let [url (or (System/getenv "DATABASE_URL")
+ "jdbc:postgresql://localhost:5432/root?user=root&password=root")
+ store (jdbc-store/make-jdbc-store url)]
+ (try (check-determinism store) (finally (.close store))))))
+
+(deftest ^:integration fire-at-deterministic-fdb
+ (testing "FDBStore"
+ (let [root (str "det-" (random-uuid))
+ fdb (cfdb/select-api-version 710)
+ db (.open fdb "docker/fdb.cluster")
+ store (fdb-store/make-fdb-store db root)]
+ (check-determinism store))))
+
+;; ── 2. timer recovery: worker drives a crashed sleeper to completion ────────────
+
+(defn- check-timer-recovery [store]
+ (wreg/clear-registry!)
+ (let [wid (str "trec-" (random-uuid))]
+ ;; Short sleep (300ms) so the timer becomes due quickly after the crash.
+ (let [e1 (intemporal/make-workflow-engine :store store :threads 2)
+ f1 (future (intemporal/start-workflow e1 sleeper-wf [8 300] :workflow-id wid))]
+ (Thread/sleep 150) ; suspend on the timer, before it fires
+ (future-cancel f1)
+ (intemporal/shutdown-engine e1))
+ (is (= :running (p/get-workflow-status store wid))
+ "workflow is durably suspended on the timer after the crash")
+ ;; A worker on a fresh engine picks it up once the timer is due.
+ (let [e2 (intemporal/make-workflow-engine :store store :threads 2)
+ stop (intemporal/start-worker e2 :poll-ms 50 :owner-id "trec-w")]
+ (try
+ (is (= :completed (await-status store wid :completed 5000))
+ "worker resumed the crashed timer workflow once it came due (C2 recovery)")
+ (is (= 25 (intemporal/get-workflow-result store wid)) "8*3 + 1 = 25")
+ (finally (stop) (intemporal/shutdown-engine e2))))))
+
+(deftest timer-recovery-in-memory
+ (testing "InMemoryStore"
+ (check-timer-recovery (store/->InMemoryStore (atom {})))))
+
+(deftest ^:integration timer-recovery-jdbc
+ (testing "JdbcStore"
+ (let [url (or (System/getenv "DATABASE_URL")
+ "jdbc:postgresql://localhost:5432/root?user=root&password=root")
+ store (jdbc-store/make-jdbc-store url)]
+ (try (check-timer-recovery store) (finally (.close store))))))
+
+(deftest ^:integration timer-recovery-fdb
+ (testing "FDBStore"
+ (let [root (str "trec-" (random-uuid))
+ fdb (cfdb/select-api-version 710)
+ db (.open fdb "docker/fdb.cluster")
+ store (fdb-store/make-fdb-store db root)]
+ (check-timer-recovery store))))
+
+;; ── 3. wake_at filtering: list-pending skips not-yet-due workflows ──────────────
+
+(defn- check-wake-at-filter [store]
+ (let [wid (str "wake-" (random-uuid))]
+ (p/save-event store wid {:event-type :workflow-started :workflow-id wid :args []})
+ ;; Far-future wake-at -> not due -> excluded from list-pending.
+ (p/set-wake-at store wid (+ (System/currentTimeMillis) 3600000))
+ (is (not (contains? (set (p/list-pending store "any-owner" 1000)) wid))
+ "a workflow whose wake-at is in the future is skipped (C2 filtering)")
+ ;; Past wake-at -> due -> included.
+ (p/set-wake-at store wid (- (System/currentTimeMillis) 1000))
+ (is (contains? (set (p/list-pending store "any-owner" 1000)) wid)
+ "a workflow whose wake-at has passed is returned")
+ ;; nil wake-at -> always eligible -> included.
+ (p/set-wake-at store wid nil)
+ (is (contains? (set (p/list-pending store "any-owner" 1000)) wid)
+ "a workflow with nil wake-at is always eligible")))
+
+(deftest wake-at-filter-in-memory
+ (testing "InMemoryStore"
+ (check-wake-at-filter (store/->InMemoryStore (atom {})))))
+
+(deftest ^:integration wake-at-filter-jdbc
+ (testing "JdbcStore"
+ (let [url (or (System/getenv "DATABASE_URL")
+ "jdbc:postgresql://localhost:5432/root?user=root&password=root")
+ store (jdbc-store/make-jdbc-store url)]
+ (try (check-wake-at-filter store) (finally (.close store))))))
+
+(deftest ^:integration wake-at-filter-fdb
+ (testing "FDBStore"
+ (let [root (str "wake-" (random-uuid))
+ fdb (cfdb/select-api-version 710)
+ db (.open fdb "docker/fdb.cluster")
+ store (fdb-store/make-fdb-store db root)]
+ (check-wake-at-filter store))))
diff --git a/test/intemporal/tests/timer_test.cljs b/test/intemporal/tests/timer_test.cljs
index d481349..0dba50b 100644
--- a/test/intemporal/tests/timer_test.cljs
+++ b/test/intemporal/tests/timer_test.cljs
@@ -3,7 +3,7 @@
[intemporal.tests.utils :refer [with-result]]
[cljs.test :as t :refer [deftest is testing]]
[matcher-combinators.test :refer [match?]]
- [promesa.core :as p])
+)
(:require-macros [intemporal.tests.utils :refer [with-result]]
[intemporal.internal.context :refer [blet]]))
diff --git a/test/intemporal/tests/worker_test.clj b/test/intemporal/tests/worker_test.clj
new file mode 100644
index 0000000..9343226
--- /dev/null
+++ b/test/intemporal/tests/worker_test.clj
@@ -0,0 +1,108 @@
+(ns intemporal.tests.worker-test
+ "Phase C (ownership model) — claim exclusivity + the recovery worker.
+
+ Proves the durable, cross-pod recovery model WITHOUT leases:
+ - a workflow whose original engine crashed is resumed by a worker (the
+ ownership scan is both the live wake and the crash recovery);
+ - claim-owner is the exclusivity gate: only one owner can claim a workflow,
+ so concurrent execution (and history corruption) cannot occur (bug 1.2)."
+ (:require [clojure.test :refer [deftest is testing]]
+ [intemporal.core :as intemporal]
+ [intemporal.protocol :as p]
+ [intemporal.store :as store]
+ [intemporal.store.jdbc :as jdbc-store]
+ [intemporal.store.fdb :as fdb-store]
+ [me.vedang.clj-fdb.FDB :as cfdb]
+ [intemporal.internal.workflow-registry :as wreg]))
+
+(defn w-act [x] (* x 10))
+
+(defn worker-wf [x]
+ (let [a (intemporal/stub #'w-act)
+ r (a x)]
+ (intemporal/wait-for-signal "go")
+ (+ r 1)))
+
+(defn- await-status [store wf-id terminal timeout-ms]
+ (let [deadline (+ (System/currentTimeMillis) timeout-ms)]
+ (loop []
+ (let [s (p/get-workflow-status store wf-id)]
+ (cond
+ (= terminal s) s
+ (> (System/currentTimeMillis) deadline) s
+ :else (do (Thread/sleep 50) (recur)))))))
+
+;; ── recovery: worker resumes a crashed workflow via the ownership scan ──────────
+
+(defn- check-worker-recovery [store]
+ (wreg/clear-registry!)
+ (let [wid (str "worker-" (random-uuid))]
+ ;; Phase 1: start, suspend on signal, then crash (no signal sent).
+ (let [e1 (intemporal/make-workflow-engine :store store :threads 2)
+ f1 (future (intemporal/start-workflow e1 worker-wf [5] :workflow-id wid))]
+ (Thread/sleep 300)
+ (future-cancel f1)
+ (intemporal/shutdown-engine e1))
+ (is (= :running (p/get-workflow-status store wid))
+ "workflow is durably suspended, not terminal, after the crash")
+ ;; Phase 2: a worker (fresh engine) + a signal delivered via the shared store.
+ (let [e2 (intemporal/make-workflow-engine :store store :threads 2)
+ stop (intemporal/start-worker e2 :poll-ms 50 :owner-id "w2")]
+ (try
+ (intemporal/send-signal store wid "go" {})
+ (is (= :completed (await-status store wid :completed 5000))
+ "worker scan claimed ownership and resumed the workflow to completion")
+ (is (= 51 (intemporal/get-workflow-result store wid)) "5*10 + 1 = 51")
+ (finally (stop) (intemporal/shutdown-engine e2))))))
+
+(deftest worker-recovery-in-memory
+ (testing "shared InMemoryStore: worker resumes a crashed, then-signalled workflow"
+ (check-worker-recovery (store/->InMemoryStore (atom {})))))
+
+(deftest ^:integration worker-recovery-jdbc
+ (testing "JdbcStore: worker resumes via the ownership scan"
+ (let [url (or (System/getenv "DATABASE_URL")
+ "jdbc:postgresql://localhost:5432/root?user=root&password=root")
+ store (jdbc-store/make-jdbc-store url)]
+ (try (check-worker-recovery store) (finally (.close store))))))
+
+(deftest ^:integration worker-recovery-fdb
+ (testing "FDBStore: worker resumes via the ownership scan"
+ (let [root (str "worker-" (random-uuid))
+ fdb (cfdb/select-api-version 710)
+ db (.open fdb "docker/fdb.cluster")
+ store (fdb-store/make-fdb-store db root)]
+ (check-worker-recovery store))))
+
+;; ── exclusivity: claim-owner lets exactly one owner run a workflow ──────────────
+
+(defn- check-claim-exclusivity [store]
+ (let [wid (str "claim-" (random-uuid))]
+ (p/save-event store wid {:event-type :workflow-started :workflow-id wid :args []})
+ (is (p/claim-owner store wid "owner-A") "A claims the unowned workflow")
+ (is (p/claim-owner store wid "owner-A") "A re-claims its own (idempotent)")
+ (is (false? (p/claim-owner store wid "owner-B")) "B cannot claim A's workflow")
+ ;; scope to this wid — the shared DB may hold unowned rows from prior runs
+ (is (contains? (set (p/list-pending store "owner-A" 1000)) wid) "the workflow is pending for A")
+ (is (not (contains? (set (p/list-pending store "owner-B" 1000)) wid)) "and not pending for B")
+ (p/release-owner store "owner-A")
+ (is (p/claim-owner store wid "owner-B") "B claims after A releases")))
+
+(deftest claim-exclusivity-in-memory
+ (testing "InMemoryStore claim-owner exclusivity"
+ (check-claim-exclusivity (store/->InMemoryStore (atom {})))))
+
+(deftest ^:integration claim-exclusivity-jdbc
+ (testing "JdbcStore claim-owner exclusivity"
+ (let [url (or (System/getenv "DATABASE_URL")
+ "jdbc:postgresql://localhost:5432/root?user=root&password=root")
+ store (jdbc-store/make-jdbc-store url)]
+ (try (check-claim-exclusivity store) (finally (.close store))))))
+
+(deftest ^:integration claim-exclusivity-fdb
+ (testing "FDBStore claim-owner exclusivity"
+ (let [root (str "claim-" (random-uuid))
+ fdb (cfdb/select-api-version 710)
+ db (.open fdb "docker/fdb.cluster")
+ store (fdb-store/make-fdb-store db root)]
+ (check-claim-exclusivity store))))
diff --git a/test/intemporal/tests/workflow_registry_test.clj b/test/intemporal/tests/workflow_registry_test.clj
new file mode 100644
index 0000000..68a626f
--- /dev/null
+++ b/test/intemporal/tests/workflow_registry_test.clj
@@ -0,0 +1,57 @@
+(ns intemporal.tests.workflow-registry-test
+ "Phase B3 — workflow registry + resume-by-id.
+
+ Verifies that resume-workflow [engine workflow-id] (no fn, no args) can resolve
+ both the workflow function and its original arguments from the :workflow-started
+ event via the process-global registry, and resume to completion without
+ re-running already-completed activities."
+ (:require [clojure.test :refer [deftest is testing]]
+ [intemporal.core :as intemporal]
+ [intemporal.store :as store]
+ [intemporal.internal.workflow-registry :as wreg]))
+
+(def exec-count (atom 0))
+
+(defn reg-activity [x]
+ (swap! exec-count inc)
+ (* x 2))
+
+(defn reg-workflow [a b]
+ (let [act (intemporal/stub #'reg-activity)
+ r1 (act a)]
+ (intemporal/wait-for-signal "go")
+ (+ r1 (act b))))
+
+(deftest registry-basic-ops
+ (testing "register-workflow! / resolve-workflow / clear-registry!"
+ (wreg/clear-registry!)
+ (let [nm (wreg/register-workflow! #'reg-workflow)]
+ (is (= "intemporal.tests.workflow-registry-test/reg-workflow" nm))
+ (is (= @#'reg-workflow (wreg/resolve-workflow nm)))
+ (wreg/clear-registry!)
+ (is (thrown-with-msg? clojure.lang.ExceptionInfo #"No workflow function registered"
+ (wreg/resolve-workflow nm))))))
+
+(deftest resume-by-id-resolves-fn-and-args
+ (testing "resume-workflow [engine wf-id] resolves fn+args from history"
+ (reset! exec-count 0)
+ (wreg/clear-registry!)
+ (let [st (store/->InMemoryStore (atom {}))
+ wid "reg-resume-1"]
+ ;; Phase 1: start, run until it suspends on signal, then simulate a crash.
+ (let [e1 (intemporal/make-workflow-engine :store st :threads 2)
+ f1 (future (intemporal/start-workflow e1 reg-workflow [10 5]
+ :workflow-id wid))]
+ (Thread/sleep 300)
+ (future-cancel f1)
+ (intemporal/shutdown-engine e1))
+ (is (= 1 @exec-count) "only the first activity ran before suspension")
+ ;; Phase 2: fresh engine, deliver signal, resume BY ID ONLY.
+ (let [e2 (intemporal/make-workflow-engine :store st :threads 2)]
+ (intemporal/send-signal st wid "go" {})
+ (let [r (intemporal/resume-workflow e2 wid)] ; no fn, no args
+ (is (= :completed (:status r)) "resumed-by-id workflow completes")
+ (is (= 30 (:result r)) "10*2 + 5*2 = 30")
+ (is (= 2 @exec-count)
+ "second activity ran once on resume; first not re-executed"))
+ (intemporal/shutdown-engine e2)))))
diff --git a/test/resources/migrations/jepsen/postgres/20260529000001-jepsen-tables.down.sql b/test/resources/migrations/jepsen/postgres/20260529000001-jepsen-tables.down.sql
new file mode 100644
index 0000000..b7b58cf
--- /dev/null
+++ b/test/resources/migrations/jepsen/postgres/20260529000001-jepsen-tables.down.sql
@@ -0,0 +1,7 @@
+DROP TABLE IF EXISTS jepsen_cancels_sent;
+--;;
+DROP TABLE IF EXISTS jepsen_signals_sent;
+--;;
+DROP TABLE IF EXISTS jepsen_invocations;
+--;;
+DROP TABLE IF EXISTS jepsen_work_queue;
diff --git a/test/resources/migrations/jepsen/postgres/20260529000001-jepsen-tables.up.sql b/test/resources/migrations/jepsen/postgres/20260529000001-jepsen-tables.up.sql
new file mode 100644
index 0000000..c10a9e5
--- /dev/null
+++ b/test/resources/migrations/jepsen/postgres/20260529000001-jepsen-tables.up.sql
@@ -0,0 +1,57 @@
+-- Jepsen test side-channel tables.
+-- Applied by the Jepsen runner (not by make-jdbc-store) against the same
+-- Postgres instance as intemporal itself.
+
+-- Work queue: the test client inserts workflow specs here; worker JVMs poll and
+-- claim items with FOR UPDATE SKIP LOCKED.
+CREATE TABLE IF NOT EXISTS jepsen_work_queue (
+ id BIGSERIAL PRIMARY KEY,
+ test_run TEXT NOT NULL,
+ workflow_id TEXT NOT NULL UNIQUE,
+ wf_type TEXT NOT NULL, -- signal-wait | activity-chain | cancel-sleep | rapid-signal
+ nonce TEXT NOT NULL,
+ args JSONB,
+ claimed_by TEXT,
+ claimed_at TIMESTAMP WITH TIME ZONE,
+ completed BOOLEAN DEFAULT FALSE,
+ created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
+);
+--;;
+CREATE INDEX IF NOT EXISTS idx_jepsen_work_queue_unclaimed
+ ON jepsen_work_queue (test_run, claimed_by, id)
+ WHERE claimed_by IS NULL AND completed = FALSE;
+--;;
+-- Side-channel: one row per activity invocation. Written with auto-commit so
+-- rows survive a SIGKILL between :begin and :end.
+CREATE TABLE IF NOT EXISTS jepsen_invocations (
+ id BIGSERIAL PRIMARY KEY,
+ test_run TEXT NOT NULL,
+ workflow_id TEXT NOT NULL,
+ step TEXT NOT NULL,
+ nonce TEXT,
+ phase TEXT NOT NULL, -- begin | end | fail
+ owner TEXT,
+ ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
+);
+--;;
+CREATE INDEX IF NOT EXISTS idx_jepsen_invocations_lookup
+ ON jepsen_invocations (test_run, workflow_id, nonce);
+--;;
+-- Signals sent by the test client. Used by the checker to verify that every
+-- sent signal was eventually consumed.
+CREATE TABLE IF NOT EXISTS jepsen_signals_sent (
+ id BIGSERIAL PRIMARY KEY,
+ test_run TEXT NOT NULL,
+ workflow_id TEXT NOT NULL,
+ signal_name TEXT NOT NULL,
+ sent_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
+);
+--;;
+-- Cancels issued by the client. Used by the checker to verify that cancelled
+-- workflows eventually reach a terminal state.
+CREATE TABLE IF NOT EXISTS jepsen_cancels_sent (
+ id BIGSERIAL PRIMARY KEY,
+ test_run TEXT NOT NULL,
+ workflow_id TEXT NOT NULL,
+ sent_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
+);
diff --git a/test2/intemporal/failures_test.cljc b/test2/intemporal/failures_test.cljc
deleted file mode 100644
index b0f0e69..0000000
--- a/test2/intemporal/failures_test.cljc
+++ /dev/null
@@ -1,47 +0,0 @@
-(ns intemporal.failures-test
- #?(:cljs (:require [cljs.test :as t :refer-macros [deftest is testing]]
- [intemporal.store :as store]
- [intemporal.workflow :as w]
- [intemporal.test-utils :as tu]
- [promesa.core :as p])
- :clj (:require [clojure.test :as t :refer [deftest is testing]]
- [intemporal.store :as store]
- [intemporal.workflow :as w]
- [intemporal.test-utils :as tu]))
- #?(:cljs (:require-macros [intemporal.macros :refer [stub-protocol defn-workflow]]
- [intemporal.test-utils :refer [with-result]])
- :clj (:require [intemporal.macros :refer [stub-protocol defn-workflow]]
- [intemporal.test-utils :refer [with-result]])))
-
-(t/use-fixtures :once tu/with-trace-logging)
-
-(defprotocol MyActivities
- (foo [this a])
- (forced-failure [this]))
-
-(defrecord MyActivitiesImpl []
- MyActivities
- (foo [this a] [:proto a])
- (forced-failure [this] (throw (ex-info "Forced" {:a 1}))))
-
-(defn-workflow my-workflow [k]
- (let [stub (stub-protocol MyActivities {})
- prr (if (= :ok k)
- (foo stub :pr)
- (forced-failure stub))]
-
- ;; chain values: ensure tests work under cljs too
- #_:clj-kondo/ignore
- (#?(:clj let :cljs p/let) [res prr]
- res)))
-
-;;;; test proper
-
-(deftest activity-failure-test
- (testing "failure: activity throws"
- (let [mstore (store/make-store)
- ex (w/start-poller! mstore {:protocols {`MyActivities (->MyActivitiesImpl)}})]
- (with-result [res (w/with-env {:store mstore}
- (my-workflow :nok))]
- (is (instance? #?(:clj Exception :cljs js/Error) res))
- (w/shutdown ex 1000)))))
diff --git a/test2/intemporal/internal_failures_test.cljc b/test2/intemporal/internal_failures_test.cljc
deleted file mode 100644
index ecf6c98..0000000
--- a/test2/intemporal/internal_failures_test.cljc
+++ /dev/null
@@ -1,47 +0,0 @@
-(ns intemporal.internal-failures-test
- #?(:cljs (:require [cljs.test :as t :refer-macros [deftest is testing]]
- [intemporal.store :as store]
- [intemporal.workflow :as w]
- [intemporal.test-utils :as tu]
- [promesa.core :as p])
- :clj (:require [clojure.test :as t :refer [deftest is testing]]
- [intemporal.store :as store]
- [intemporal.workflow :as w]
- [intemporal.test-utils :as tu]))
- #?(:cljs (:require-macros [intemporal.macros :refer [stub-protocol defn-workflow]]
- [intemporal.test-utils :refer [with-result]])
- :clj (:require [intemporal.macros :refer [stub-protocol defn-workflow]]
- [intemporal.test-utils :refer [with-result]])))
-
-(t/use-fixtures :once tu/with-trace-logging)
-
-(defprotocol MyActivities
- (foo [this a]))
-
-(defrecord MyActivitiesImpl []
- MyActivities
- (foo [this a] [:proto a]))
-
-(defn-workflow my-workflow [k]
- (let [stub (stub-protocol MyActivities {})
- prr (foo stub :pr)]
-
- ;; chain values: ensure tests work under cljs too
- #_:clj-kondo/ignore
- (#?(:clj let :cljs p/let) [res prr]
- res)))
-
-;;;; test proper
-
-(deftest store-failure-test
- (testing "failure: task validation fails"
- (let [mstore (store/make-store {:failures {:validation 1.0}})
- ex (w/start-poller! mstore {:protocols {`MyActivities (->MyActivitiesImpl)}})]
-
- (with-result [res (w/with-env {:store mstore}
- (my-workflow :ok))]
- (is (instance? #?(:clj Exception :cljs js/Error) res))
- (is (= {:intemporal.workflow.internal/type :internal} (ex-data (or (ex-cause res) res))))
- (w/shutdown ex 1000)))))
-
-;(cljs.test/run-tests *ns*)
\ No newline at end of file
diff --git a/test2/intemporal/matchers.cljc b/test2/intemporal/matchers.cljc
deleted file mode 100644
index 5a41ae6..0000000
--- a/test2/intemporal/matchers.cljc
+++ /dev/null
@@ -1,20 +0,0 @@
-(ns intemporal.matchers
- (:require [matcher-combinators.core :as mc]
- [matcher-combinators.result :as result]))
-
-(defrecord Nilable []
- mc/Matcher
- (-matcher-for [this] this)
- (-matcher-for [this _] this)
- (-match [this actual]
- (if (or (nil? actual)
- (= :matcher-combinators.core/missing actual))
- {::result/type :match
- ::result/value nil
- ::result/weight 1}
- {::result/type :mismatch
- ::result/value actual
- ::result/weight 1}))
- (-base-name [_] 'nilable))
-
-(def nilable? (->Nilable))
diff --git a/test2/intemporal/recovery_failure.edn b/test2/intemporal/recovery_failure.edn
deleted file mode 100644
index eeeae44..0000000
--- a/test2/intemporal/recovery_failure.edn
+++ /dev/null
@@ -1,10 +0,0 @@
-{:tasks {"elegant-robinson" {:args [1], :ref nil, :type :workflow, :state :pending, :sym intemporal.recovery-failure-test/my-workflow-,
- :root nil, :owner "intemporal", :id "elegant-robinson", :runtime {:timeout-ms 900000}, :order 1, :lease-end nil}},
- :history {"elegant-robinson"
- [{:ref "elegant-robinson", :root "elegant-robinson", :type :intemporal.workflow/invoke, :sym intemporal.recovery-failure-test/my-workflow-, :args [1], :error nil, :result nil, :id 1}]
-
- "elegant-robinson-1"
- [{:ref "elegant-robinson-1", :root "elegant-robinson", :type :intemporal.activity/failure, :sym intemporal.recovery-failure-test/activity-fn, :args (1), :error nil, :result nil, :id 2}]}
- :counter 2,
- :pcounter 1,
- :ecounter 0}
diff --git a/test2/intemporal/recovery_failure_test.clj b/test2/intemporal/recovery_failure_test.clj
deleted file mode 100644
index f882ebf..0000000
--- a/test2/intemporal/recovery_failure_test.clj
+++ /dev/null
@@ -1,66 +0,0 @@
-(ns intemporal.recovery-failure-test
- (:require [clojure.test :as t :refer [deftest is testing]]
- [intemporal.store :as store]
- [intemporal.workflow :as w]
- [intemporal.test-utils :as tu]
- [clojure.java.io :as io]
- [intemporal.macros :refer [stub-function stub-protocol defn-workflow]]))
-
-(t/use-fixtures :once tu/with-trace-logging)
-
-(defn nested-fn [a]
- [a :nested])
-
-(defn activity-fn [a]
- (let [f (stub-function nested-fn)]
- (f :sub)))
-
-(defprotocol MyActivities
- (foo [this a]))
-
-(defrecord MyActivitiesImpl []
- MyActivities
- (foo [this a] [:proto a]))
-
-(defn-workflow my-workflow [i]
- (let [sf (stub-function activity-fn)
- pr (stub-protocol MyActivities {})
- sfr (sf 1)
- prr (foo pr :pr)]
-
- ;; chain values: ensure tests work under cljs too
- #_:clj-kondo/ignore
- (let [v1 sfr
- v2 prr]
- [:root v1 v2])))
-
-;;;; test proper
-
-(deftest recovery-failure-test
- ;; make a backup of the db to allow replay
- (io/copy (io/file "./test/intemporal/recovery_failure.edn")
- (io/file "/tmp/recovery_failure.edn"))
- (testing "workflow"
- (let [mstore (store/make-store {:file "/tmp/recovery_failure.edn"})
- [task] (store/list-tasks mstore)
- ex (w/start-poller! mstore {:protocols {`MyActivities (->MyActivitiesImpl)}})]
-
- (try
- (store/reenqueue-pending-tasks mstore println)
- (tu/wait-for-task mstore (:id task))
-
- (testing "workflow failed with unexpected transition"
- (let [[task] (store/list-tasks mstore)
- [_ _ crash-ev last-ev] (->> (store/list-events mstore)
- (sort-by :id))]
-
- (is (= :failure (:state task)))
- (is (= :intemporal.workflow.internal/failure (:type crash-ev)))
- (is (= :intemporal.workflow/failure (:type last-ev)))))
- (finally
- (tu/print-tables mstore)
- (w/shutdown ex 1000))))))
-
-#_:clj-kondo/ignore
-(comment
- (cljs.test/run-tests *ns*))
diff --git a/test2/intemporal/shutdown_restart_test.clj b/test2/intemporal/shutdown_restart_test.clj
deleted file mode 100644
index 91ab757..0000000
--- a/test2/intemporal/shutdown_restart_test.clj
+++ /dev/null
@@ -1,81 +0,0 @@
-(ns intemporal.shutdown-restart-test
- (:require [clojure.test :as t :refer [deftest is testing]]
- [intemporal.store :as store]
- [intemporal.workflow :as w]
- [matcher-combinators.test :refer [match?]]
- [intemporal.macros :refer [stub-protocol defn-workflow]]
- [intemporal.test-utils :as tu :refer [with-result]]
- [intemporal.test-executor :as te])
- (:import (java.util.concurrent CountDownLatch)))
-
-;(t/use-fixtures :once tu/with-trace-logging)
-
-(def activity-invoked? (CountDownLatch. 1))
-(def executor-shutdown? (CountDownLatch. 1))
-
-(defprotocol MyActivities
- (foo [this a])
- (foo2 [this a]))
-
-(defrecord MyActivitiesImpl []
- MyActivities
- (foo [this a]
- (.countDown activity-invoked?)
- (.await executor-shutdown?)
- :foo)
- (foo2 [this a] a))
-
-(defn-workflow my-workflow [k]
- (let [stub (stub-protocol MyActivities {})
- r1 (foo stub :pr)
- r2 (foo2 stub :pr)]
- [r1 r2]))
-
-;;;; test proper
-
-(deftest shutdown-restart-test
- (testing "failure: task validation fails"
- (let [mstore (store/make-store {})
- custom-ex (te/make-test-executor (fn [] (.countDown executor-shutdown?)) nil)
- executor (w/start-poller! mstore custom-ex {:protocols {`MyActivities (->MyActivitiesImpl)}
- :polling-ms 500})]
-
- (testing "shutdown of ongoing workflow"
- (future
- ;; ensure activity is inflight
- (.await activity-invoked?)
- ;; immediately countdown the latch
- (w/shutdown executor 0)
- (is (w/shutting-down? executor)))
-
- (with-result [res (w/with-env {:store mstore}
- (my-workflow :ok))]
-
- (is (instance? Exception res))
-
- (testing "workflow is not in failed state"
- (tu/print-tables mstore)
-
- (testing "workflow task"
- (let [tasks (store/list-tasks mstore)
- [w1] tasks]
- (is (match? {:type :workflow :sym 'intemporal.shutdown-restart-test/my-workflow- :state :pending} w1))
-
- (testing "workflow events: workflow has not finished"
- (let [[e1 e2] (store/list-events mstore)]
- (is (match? {:type :intemporal.workflow/invoke :sym 'intemporal.shutdown-restart-test/my-workflow-} e1))
- (is (match? {:type :intemporal.protocol/invoke :sym 'intemporal.shutdown-restart-test/foo} e2))
-
- (let [[w1] (store/list-tasks mstore)]
- (is (match? {:type :workflow :sym 'intemporal.shutdown-restart-test/my-workflow- :state :pending} w1)))))
-
- (testing "workflow resumes"
- (with-open [_ (w/start-poller! mstore {:protocols {`MyActivities (->MyActivitiesImpl)}
- :polling-ms 100})]
- (store/reenqueue-pending-tasks mstore (constantly nil))
- (tu/wait-for-task mstore (:id w1))
- (tu/print-tables mstore)
-
- (testing "workflow succeeded"
- (let [[w1] (store/list-tasks mstore)]
- (is (match? {:type :workflow :sym 'intemporal.shutdown-restart-test/my-workflow- :state :success} w1))))))))))))))
diff --git a/test2/intemporal/shutdown_restart_test.cljs b/test2/intemporal/shutdown_restart_test.cljs
deleted file mode 100644
index 0f24a45..0000000
--- a/test2/intemporal/shutdown_restart_test.cljs
+++ /dev/null
@@ -1,78 +0,0 @@
-(ns intemporal.shutdown-restart-test
- (:require [cljs.test :as t :refer-macros [deftest is testing]]
- [intemporal.error :as error]
- [intemporal.store :as store]
- [intemporal.workflow :as w]
- [intemporal.test-utils :as tu]
- [matcher-combinators.test :refer [match?]]
- [promesa.core :as p])
- (:require-macros [intemporal.macros :refer [stub-protocol defn-workflow]]
- [intemporal.test-utils :refer [with-result]]))
-
-(t/use-fixtures :once tu/with-trace-logging)
-
-(declare stop-worker)
-(defprotocol MyActivities
- (foo [this a]))
-
-(defrecord MyActivitiesImpl []
- MyActivities
- (foo [this a]
- (stop-worker)
- :foo))
-
-(defn-workflow my-workflow [k]
- (let [stub (stub-protocol MyActivities {})
- prr (foo stub :pr)]
- ;; chain values: ensure tests work under cljs too
- (p/let [res prr]
- res)))
-
-;;;; test proper
-
-(def mstore (store/make-store {}))
-(def ex (w/start-poller! mstore {:protocols {`MyActivities (->MyActivitiesImpl)}
- :polling-ms 10}))
-
-(defn stop-worker []
- (w/shutdown ex 1000))
-
-(deftest executor-shutdown-test
- (testing "shutdown of ongoing workflow"
-
- (with-result [res (w/with-env {:store mstore}
- (my-workflow :ok))]
- (w/shutdown ex 1000)
-
- (is (instance? js/Error res))
- (is (error/panic? res))
-
- (testing "Workflow is not in failed state"
- (tu/print-tables mstore)
-
- (testing "workflow task"
- (let [[w1] (store/list-tasks mstore)]
- (is (match? {:type :workflow :sym 'intemporal.shutdown-restart-test/my-workflow- :state :pending} w1))))
-
- (testing "workflow events"
- (let [[e1 e2 e3] (store/list-events mstore)]
- (is (match? {:type :intemporal.workflow/invoke :sym 'intemporal.shutdown-restart-test/my-workflow-} e1)
- (is (match? {:type :intemporal.protocol/invoke :sym 'intemporal.shutdown-restart-test/foo} e2)))
- (is (nil? e3))))))))
-
-#_(deftest executor-shutdown-resume-test
- (testing "workflow resumes"
- (let [stop-worker (w/start-worker mstore {:protocols {`MyActivities (->MyActivitiesImpl)}
- :polling-ms 10})]
- (store/reenqueue-pending-tasks mstore (constantly nil))
- (with-result [_ (p/delay 2000)]
-
- (tu/print-tables mstore)
-
- (testing "workflow succeeded"
- (let [[w1] (store/list-tasks mstore)]
- (is (match? {:type :workflow :sym 'intemporal.shutdown-restart-test/my-workflow- :state :success} w1))))
-
- (stop-worker)))))
-
-;(cljs.test/run-tests *ns*)
\ No newline at end of file
diff --git a/test2/intemporal/store_test.cljc b/test2/intemporal/store_test.cljc
deleted file mode 100644
index a1c3638..0000000
--- a/test2/intemporal/store_test.cljc
+++ /dev/null
@@ -1,91 +0,0 @@
-(ns intemporal.store-test
- (:require [clojure.test :as t :refer [deftest is testing]]
- [intemporal.test-utils :as tu]
- [intemporal.store :as s]
- [matcher-combinators.test :refer [match?]]
- [promesa.core :as p]))
-
-(t/use-fixtures :once tu/with-trace-logging)
-
-(defn- is-promise-ok [prom]
- (-> prom
- (p/then (fn [_] (is true)))
- (p/catch (fn [err] (is false (prn-str err))))))
-
-(defn- to-map [rec]
- (into {} rec))
-
-(deftest basic-store-tests
-
- (testing "enqueue/dequeue"
- (let [store (s/make-store)
- task (tu/make-workflow-task)]
- (s/enqueue-task store task)
-
- (testing "dequeueing updates state"
- (is (match? (to-map (assoc task :state :pending))
- (to-map (s/dequeue-task store)))))))
-
- (testing "enqueue/dequeue with lease"
- (let [store (s/make-store)
- task (tu/make-workflow-task)]
- (s/enqueue-task store task)
-
- (testing "dequeueing with lease"
- (is (match? (to-map (assoc task :state :pending))
- (to-map (s/dequeue-task store {:lease-ms 100}))))
- ;; wait for expire
- #?(:clj
- (do
- @(p/delay 1000)
- (is (match?
- (to-map (assoc task :state :pending))
- (to-map (s/dequeue-task store)))))
-
- :cljs
- (t/async done
- (p/finally (p/delay 1000)
- (fn [_ c]
- (t/is (nil? c))
- (is (match?
- (to-map (assoc task :state :pending))
- (to-map (s/dequeue-task store))))
- (done))))))))
-
- (testing "await task"
- (let [store (s/make-store)
- task (tu/make-workflow-task)
- prom (p/vthread
- (s/await-task store (:id task) {:timeout-ms 1000}))]
-
- (is-promise-ok prom)))
-
- (testing "watch task"
- (let [store (s/make-store)
- task (tu/make-workflow-task)
- evt {:ref "some-ref" :root "some-root" :type :intemporal.workflow/invoke :sym 'identity :args []}
- called? (p/deferred)]
-
- (is-promise-ok (p/timeout called? 1000))
- ;; if the watch doesnt happen, the test times out
- (s/watch-task store (:id task) #(p/resolve! called? %))
- (s/enqueue-task store task)
-
- (testing "apply fn event"
- (s/task<-event store (:id task) evt)
-
- (testing "task state updated"
- (let [db-task (s/find-task store (:id task))]
- (is (= (dissoc db-task :id)
- {:type :workflow
- :owner "intemporal"
- :ref "some-ref"
- :root "some-root"
- :sym 'identity
- :fvar #'clojure.core/identity
- :args []
- :result nil
- :state :pending
- :order 1
- :runtime {}}))))))))
-
diff --git a/test2/intemporal/stores/basic_test.clj b/test2/intemporal/stores/basic_test.clj
deleted file mode 100644
index 5b2ee44..0000000
--- a/test2/intemporal/stores/basic_test.clj
+++ /dev/null
@@ -1,204 +0,0 @@
-(ns ^:integration ^:fdb ^:sql intemporal.stores.basic-test
- (:require [clojure.test :as t :refer [deftest is testing]]
- [intemporal.store :as store]
- [intemporal.store.foundationdb :as fdb]
- [intemporal.store.jdbc :as jdbc]
- [intemporal.test-utils :as tu]
- [intemporal.workflow.internal :as internal]
- [intemporal.matchers :refer [nilable?]]
- [matcher-combinators.test :refer [match?]]))
-
-(t/use-fixtures :once tu/with-trace-logging)
-
-(def stores (delay {:memory (store/make-store)
- :fdb (fdb/make-store {:cluster-file-path "docker/fdb.cluster"})
- :postgres (jdbc/make-store {:jdbcUrl "jdbc:postgresql://localhost:5432/root?user=root&password=root"
- :migration-dir "migrations/postgres"})}))
-
-(deftest basic-test
- (doseq [[label store] @stores]
- (testing (format "store: %s" label)
- (let [evt {:ref "some-ref" :root "some-root"
- :type :intemporal.activity/invoke
- :sym 'clojure.core/+
- :args [1 "a"]}]
-
- (testing "clear"
- (store/clear-events store)
- (store/clear-tasks store))
-
- (testing "event store"
- (testing "event save"
-
- (let [t1 (internal/create-workflow-task nil nil 'clojure.core/+ (var-get #'+) nil "1")
- t1 (store/enqueue-task store t1)
-
- t2 (internal/create-workflow-task (:id t1) (:id t1) 'clojure.core/+ (var-get #'+) nil "2")
- t2 (store/enqueue-task store t2)
-
- ev (store/save-event store 1 (assoc evt :ref (:id t2) :root (:id t2)))]
-
- (testing "events list"
- (is (match? [(assoc evt
- :id (:id ev)
- :root (:id t2)
- :ref (:id t2)
- :result nilable?)] (store/list-events store))))
-
- (testing "events clear"
- (store/clear-events store)
- (is (empty? (store/list-events store)))))))
-
- (testing "task store"
- (store/clear-events store)
- (store/clear-tasks store)
- (let [task (internal/create-workflow-task "self" "self" 'clojure.core/+ (var-get #'+) ["invoke" 333]
- "self" nil :new
- nil)]
-
- (testing "enqueue task"
- (is (=
- (assoc task :owner store/default-owner)
- (-> (store/enqueue-task store task)
- (dissoc :order)))))
-
- (testing "list tasks"
- (is (match? [(dissoc task :fvar)]
- (store/list-tasks store))))
-
- (testing "dequeue tasks"
- (is (match? {:args ["invoke" 333],
- :ref "self",
- :root "self",
- :type :workflow,
- :state :pending,
- :result nil,
- :id string?,
- :sym 'clojure.core/+,
- :fvar #(or (fn? %) (var? %))
- :lease-end nil}
- (store/dequeue-task store))))
-
- (testing "matching task"
- (is (nil? (store/find-task store "")))
- (is (match? (-> task
- (assoc :state :pending)
- (dissoc :fvar))
- (store/find-task store (:id task)))))
-
- (testing "reenqueue pending"
- (let [args (atom nil)
- cb (fn [t] (reset! args t))]
- (store/reenqueue-pending-tasks store cb)
-
- (testing "callback"
- (is (match? (-> task
- (assoc :state :pending)
- (dissoc :fvar))
- @args)))
-
- (testing "result"
- (is (match? [{:args ["invoke" 333],
- :ref "self",
- :root "self",
- :type :workflow,
- :state :new,
- :result nil,
- :id string?,
- :sym 'clojure.core/+}]
- (store/list-tasks store))))))
-
- (testing "task event handling"
- ;; move to pending
- (store/dequeue-task store)
-
- (let [[db-task] (store/list-tasks store)]
-
- (testing "invoke"
- (let [ev-descr {:ref "self" :root "self" :type :intemporal.workflow/invoke :sym 'clojure.core/+ :args ["invoke" 333]}
- ev (store/task<-event store (:id db-task) ev-descr)
- [task] (store/list-tasks store)]
- (is (match? {:ref "self"
- :root "self"
- :type :intemporal.workflow/invoke
- :sym 'clojure.core/+
- :args ["invoke" 333]}
- ev))
- (is (match? {:args ["invoke" 333],
- :ref "self",
- :root "self",
- :type :workflow,
- :state :pending,
- :sym 'clojure.core/+,
- :result nil,
- :id string?,}
- task)))
-
- (testing "ok"
- (let [ev-descr {:ref "self" :root "self" :type :intemporal.workflow/success :sym 'clojure.core/+ :result ["result"]}
- ev (store/task<-event store (:id db-task) ev-descr)
- [task] (store/list-tasks store)]
- (is (match? {:ref "self"
- :root "self"
- :type :intemporal.workflow/success
- :sym 'clojure.core/+
- :result ["result"]}
- ev))
- (is (match? {:args ["invoke" 333],
- :ref "self",
- :root "self",
- :type :workflow,
- :state :success,
- :sym 'clojure.core/+,
- :result ["result"],
- :id string?}
- task)))))
-
- ;; TODO need to reenqueue another task
- #_(testing "error"
- (let [ex {:some "exception" :data false}
- ev-descr {:ref "self" :root "self" :type :intemporal.workflow/failure :sym 'clojure.core/+ :error ex}
- ev (store/task<-event store (:id db-task) ev-descr)
- [task] (store/list-tasks store)]
-
- (is (match? {:ref "self"
- :root "self"
- :type :intemporal.workflow/failure
- :sym 'clojure.core/+
- :error ex}
- ev))
-
- (is (match? {:args ["invoke" 333],
- :ref "self",
- :root "self",
- :type :workflow,
- :state :failure,
- :sym 'clojure.core/+,
- :result ex
- :id string?}
- task))))))))
-
- (testing "task await+watch"
- (let [task (internal/create-workflow-task "self" "self" 'clojure.core/- (var-get #'-) ["invoke" 333] "4")
- task-id (:id task)
- storage (atom nil)]
-
- (store/enqueue-task store task)
- (store/watch-task store task-id (fn [t] (reset! storage t)))
- (store/dequeue-task store)
-
- ;; wait a bit so watchers can fire
- (Thread/sleep 3000)
- (is (match? {:args ["invoke" 333]
- :ref "self"
- :root "self"
- :type :workflow
- :state :pending
- :sym 'clojure.core/-
- :result nil
- :id string?}
- @storage))))))))
-
-(comment
- (clojure.test/run-tests *ns*))
-
diff --git a/test2/intemporal/stores/basic_workflow_test.clj b/test2/intemporal/stores/basic_workflow_test.clj
deleted file mode 100644
index f5e2615..0000000
--- a/test2/intemporal/stores/basic_workflow_test.clj
+++ /dev/null
@@ -1,47 +0,0 @@
-(ns ^:integration ^:fdb ^:sql intemporal.stores.basic-workflow-test
- (:require [clojure.test :as t :refer [deftest is testing]]
- [intemporal.store :as store]
- [intemporal.store.foundationdb :as fdb]
- [intemporal.store.jdbc :as jdbc]
- [intemporal.workflow :as w]
- [intemporal.macros :as macros :refer [stub-protocol defn-workflow]]
- [intemporal.test-utils :as tu]))
-
-(t/use-fixtures :once tu/with-trace-logging)
-
-(defprotocol ProtocolActivity
- (some-io [this val]))
-
-(def example-impl
- (reify
- ProtocolActivity
- (some-io [_ val] val)))
-
-;;;;
-;; workflow registration
-
-(defn-workflow run-workflow []
- (let [stub (stub-protocol ProtocolActivity)]
- (some-io stub :ok)))
-
-(def stores (delay {:memory (store/make-store)
- :fdb (fdb/make-store {:cluster-file-path "docker/fdb.cluster"})
- :postgres (jdbc/make-store {:jdbcUrl "jdbc:postgresql://localhost:5432/root?user=root&password=root"
- :migration-dir "migrations/postgres"
- :polling-ms 10})}))
-
-(deftest basic-workflow-test
- (doseq [[label store] @stores]
- (testing (format "store: %s" label)
-
- (testing "running a workflow"
- (store/clear-events store)
- (store/clear-tasks store)
-
- (let [ex (w/start-poller! store {:protocols {`ProtocolActivity example-impl}
- :polling-ms 10})]
- (try
- (is (= :ok (w/with-env {:store store}
- (run-workflow))))
- (finally
- (w/shutdown ex 0))))))))
diff --git a/test2/intemporal/stores/lots_of_workflows_test.clj b/test2/intemporal/stores/lots_of_workflows_test.clj
deleted file mode 100644
index bb1f98d..0000000
--- a/test2/intemporal/stores/lots_of_workflows_test.clj
+++ /dev/null
@@ -1,71 +0,0 @@
-(ns ^:integration ^:fdb ^:sql intemporal.stores.lots-of-workflows-test
- (:require [clojure.test :refer [deftest is testing]]
- [intemporal.store :as store]
- [intemporal.store.foundationdb :as fdb]
- [intemporal.store.jdbc :as jdbc]
- [intemporal.workflow :as w]
- [intemporal.macros :refer [stub-protocol defn-workflow]]
- [intemporal.test-utils :as tu :refer [wait]]
- [matcher-combinators.test :refer [match?]]
- [promesa.core :as p]))
-
-(defprotocol MyActivities
- (foo [this a]))
-
-(defrecord MyActivitiesImpl []
- MyActivities
- (foo [this a]
- (Thread/sleep (long (rand-int 100)))
- [:proto a]))
-
-(defn-workflow my-workflow []
- (let [pr (stub-protocol MyActivities {})
- prr (foo pr :pr)]
- prr))
-
-(def iterations 100)
-
-(def stores (delay {:memory (store/make-store)
- :fdb (fdb/make-store {:cluster-file-path "docker/fdb.cluster"})
- :postgres (jdbc/make-store {:jdbcUrl "jdbc:postgresql://localhost:5432/root?user=root&password=root"
- :migration-dir "migrations/postgres"
- :maximum-pool-size 20})}))
-
-
-(deftest lots-of-workflows-test
- (with-redefs [tu/wait-default-timeout 10000]
- (doseq [[label store] @stores]
- (testing (format "store: %s" label)
-
- (store/clear-events store)
- (store/clear-tasks store)
-
- (testing "multiple iterations"
- (dotimes [_ iterations]
- (p/vthread
- (w/with-env {:store store}
- ;; workflows are blocking, we wrap in a virtual thread
- (my-workflow))))
-
- ;; check that all tasks are enqueued
- (wait (= iterations (count (store/list-tasks store)))
- (let [wflows (store/list-tasks store)]
- (testing "workflows are all new"
- (is (= iterations (count wflows)))
- (is (= #{:new} (set (map :state wflows))))))))
-
- (testing "enqueue all jobs"
- (let [ex (w/start-poller! store {:protocols {`MyActivities (->MyActivitiesImpl)}
- :polling-ms 100})]
- ;; lets wait for all pending
- (try
- (wait (not (contains? (into #{} (map :state (store/list-tasks store))) :new))
- (w/shutdown ex 10000))
-
- (testing "workflows are all completed"
- (let [tasks (store/list-tasks store)]
- (is (= (* 2 iterations) (count tasks)))
- (is (match? {:success (* 2 iterations)}
- (frequencies (map :state tasks))))))
- (finally
- (w/shutdown ex 0)))))))))
diff --git a/test2/intemporal/stores/multiple_shutdown_test.clj b/test2/intemporal/stores/multiple_shutdown_test.clj
deleted file mode 100644
index 03144f6..0000000
--- a/test2/intemporal/stores/multiple_shutdown_test.clj
+++ /dev/null
@@ -1,71 +0,0 @@
-(ns intemporal.stores.multiple-shutdown-test
- (:require [clojure.test :as t :refer [deftest is testing]]
- [intemporal.store :as store]
- [intemporal.store.foundationdb :as fdb]
- [intemporal.store.jdbc :as jdbc]
- [intemporal.workflow :as w]
- [intemporal.macros :refer [stub-protocol defn-workflow]]
- [intemporal.test-utils :as tu]
- [matcher-combinators.test :refer [match?]]
- [matcher-combinators.matchers :as m]))
-
-(t/use-fixtures :once tu/with-trace-logging)
-
-(def stores (delay {:memory (store/make-store)
- :fdb (fdb/make-store {:cluster-file-path "docker/fdb.cluster"})
- :postgres (jdbc/make-store {:jdbcUrl "jdbc:postgresql://localhost:5432/root?user=root&password=root"
- :migration-dir "migrations/postgres"})}))
-
-(def activity-counter (atom 0))
-
-(defprotocol MyActivities
- (sleep [this time]))
-
-(defrecord MyActivitiesImpl []
- MyActivities
- (sleep [this a]
- (swap! activity-counter inc)
- (Thread/sleep (long a))))
-
-(defn-workflow my-workflow [steps max-sleep]
- (let [stub (stub-protocol MyActivities {})]
- (dotimes [i steps]
- (sleep stub max-sleep))
- :done))
-
-;;;; test proper
-
-(deftest executor-shutdown-test
- (testing "workflow eventually finishes"
- (let [store (store/make-store {})
- polling-ms 500
- make-poller (fn [] (w/start-poller! store {:protocols {`MyActivities (->MyActivitiesImpl)}
- :polling-ms polling-ms}))
- executor (atom (make-poller))
-
- steps 2
- max-timeout 500
-
- workflow-id "f100ded0-0000-4000-a000-000000000000"
- future-res (future
- (w/with-env {:store store
- :id workflow-id}
- (my-workflow steps max-timeout)))
- reenqueued (atom [])]
-
- (testing "shutdown of ongoing workflow"
- (add-watch activity-counter ::watch (fn [_ _ _ v]
- (when (and (zero? (mod v 2))
- (empty? @reenqueued))
- (w/shutdown @executor max-timeout)
- (store/reenqueue-pending-tasks store #(swap! reenqueued conj %))
- (reset! executor (make-poller)))))
- (try
- (is (= :done (deref future-res 10000 ::error)))
-
- (finally
- (testing "workflow was re-enqueued"
- (is (match? (m/embeds [{:type :workflow :sym 'intemporal.stores.multiple-shutdown-test/my-workflow-}])
- @reenqueued)))
- (w/shutdown @executor 0)
- (tu/print-tables store)))))))
diff --git a/test2/intemporal/stores/release_reenqueue_test.clj b/test2/intemporal/stores/release_reenqueue_test.clj
deleted file mode 100644
index a41c32d..0000000
--- a/test2/intemporal/stores/release_reenqueue_test.clj
+++ /dev/null
@@ -1,79 +0,0 @@
-(ns ^:integration ^:fdb ^:sql intemporal.stores.release-reenqueue-test
- (:require [clojure.test :as t :refer [deftest is testing]]
- [intemporal.store :as store]
- [intemporal.store.foundationdb :as fdb]
- [intemporal.store.jdbc :as jdbc]
- [intemporal.workflow :as w]
- [intemporal.macros :refer [stub-protocol defn-workflow]]
- [intemporal.test-utils :as tu :refer [with-result]])
- (:import (java.util.concurrent Phaser)))
-
-(t/use-fixtures :once tu/with-trace-logging)
-
-(def stores (delay {:memory (store/make-store)
- :fdb (fdb/make-store {:cluster-file-path "docker/fdb.cluster"})
- :postgres (jdbc/make-store {:jdbcUrl "jdbc:postgresql://localhost:5432/root?user=root&password=root"
- :migration-dir "migrations/postgres"})}))
-
-(def activity-invoked? (Phaser. 1))
-(def executor-shutdown? (Phaser. 1))
-
-(defprotocol MyActivities
- (foo [this a]))
-
-(defrecord MyActivitiesImpl []
- MyActivities
- (foo [this a]
- (.arrive activity-invoked?)
- (.awaitAdvance executor-shutdown? (.getPhase executor-shutdown?))
- :foo))
-
-(defn-workflow my-workflow [k]
- (let [stub (stub-protocol MyActivities {})
- prr (foo stub :pr)]
- prr))
-
-;;;; test proper
-
-(deftest release-reenqueue-test
- (doseq [[label store] @stores]
-
- (store/clear-events store)
- (store/clear-tasks store)
-
- (testing (format "store: %s" label)
- (let [executor (w/start-poller! store {:protocols {`MyActivities (->MyActivitiesImpl)}
- :polling-ms 100})]
-
- (testing "shutdown of ongoing workflow"
- ;; give it some time so the poller can pick it up but just once
- (let [fut (future
- ;; ensure activity is inflight
- (.awaitAdvance activity-invoked? (.getPhase activity-invoked?))
- (w/shutdown executor 0)
- ;; proceed activity, it will fail
- (.arrive executor-shutdown?)
- :done)]
-
- (with-result [res (w/with-env {:store store}
- (my-workflow :ok))]
-
- (is (instance? Exception res)))
-
- (is (= :done (deref fut 1000 ::error)))))
-
- (testing "Tasks are pending"
- (let [[task] (store/list-tasks store)]
- (tu/print-tables store)
- (is (= :pending (:state task)))))
-
- (testing "Tasks are released"
- (store/release-pending-tasks store)
- (let [[task] (store/list-tasks store)]
- (is (nil? (:owner task)))))
-
- (testing "Tasks are reenqueued"
- (store/reenqueue-pending-tasks store (constantly nil))
- (let [[task] (store/list-tasks store)]
- (is (= store/default-owner (:owner task)))
- (is (= :new (:state task)))))))))
diff --git a/test2/intemporal/stores/saga_test.clj b/test2/intemporal/stores/saga_test.clj
deleted file mode 100644
index 03b95d0..0000000
--- a/test2/intemporal/stores/saga_test.clj
+++ /dev/null
@@ -1,77 +0,0 @@
-(ns ^:integration ^:fdb ^:sql intemporal.stores.saga-test
- (:require [clojure.test :as t :refer [deftest is testing]]
- [intemporal.store :as store]
- [intemporal.store.foundationdb :as fdb]
- [intemporal.store.jdbc :as jdbc]
- [intemporal.workflow :as w]
- [intemporal.macros :as macros :refer [stub-protocol defn-workflow with-failure]]
- [intemporal.test-utils :as tu]
- [spy.core :as spy]
- [spy.protocol :as pspy]))
-
-(t/use-fixtures :once tu/with-trace-logging)
-
-(defprotocol ProtocolActivity
- (some-io [this val])
- (some-failing-io [this val])
- (finalize [this val])
- (compensate [this val]))
-
-(def example-impl
- (reify
- ProtocolActivity
- (some-io [_ val] val)
- (some-failing-io [_ val] (throw (RuntimeException. "forced")))
- (finalize [_ val] val)
- (compensate [_ val] {:compensate val})))
-
-;;;;
-;; workflow registration
-
-(defn-workflow run-workflow []
- (let [stub (stub-protocol ProtocolActivity)]
- (try
- (let [v (with-failure [r (some-io stub "initial")]
- (compensate stub r))
- v2 (with-failure [r (some-failing-io stub v)]
- (compensate stub r))
- v3 (finalize stub v2)]
- v3)
- (catch Exception e
- (w/compensate)
- ::failed))))
-
-(def stores (delay {:memory (store/make-store)
- :fdb (fdb/make-store {:cluster-file-path "docker/fdb.cluster"})
- :postgres (jdbc/make-store {:jdbcUrl "jdbc:postgresql://localhost:5432/root?user=root&password=root"
- :migration-dir "migrations/postgres"})}))
-
-(deftest saga-test
- (doseq [[label store] @stores]
- (testing (format "store: %s" label)
-
- (testing "running a workflow"
- (store/clear-events store)
- (store/clear-tasks store)
-
- (let [spied-impl (pspy/spy ProtocolActivity example-impl)
- stop-worker (w/start-worker! store {:protocols {`ProtocolActivity spied-impl}})]
- (try
- (testing "workflow run"
- (is (= ::failed (w/with-env {:store store}
- (run-workflow)))))
-
- (testing "protocol calls"
- (let [{:keys [some-io some-failing-io finalize compensate]} (meta spied-impl)]
- (is (spy/called-once-with? some-io spied-impl "initial"))
- (is (spy/called-once-with? some-failing-io spied-impl "initial"))
- (is (not (spy/called-once? finalize)))
-
- (testing "compensation calls in reverse order"
- (let [calls (spy/calls compensate)]
- (is (= [[spied-impl :intemporal.activity/failure]
- [spied-impl "initial"]]
- calls))))))
-
- (finally
- (stop-worker))))))))
diff --git a/test2/intemporal/test_executor.clj b/test2/intemporal/test_executor.clj
deleted file mode 100644
index 31a38c7..0000000
--- a/test2/intemporal/test_executor.clj
+++ /dev/null
@@ -1,42 +0,0 @@
-(ns intemporal.test-executor
- (:require [intemporal.workflow :as w]
- [taoensso.telemere :as t])
- (:import (java.lang AutoCloseable)
- (java.util.concurrent Executors TimeUnit)))
-
-(defn make-test-executor [shutdown-fn? terminated-fn?]
- (let [factory (-> (Thread/ofVirtual)
- (.name "Task Thread")
- (.factory))
- exec (Executors/newThreadPerTaskExecutor factory)
- shutdown? (atom false)
- terminated? (atom false)]
- (reify
- w/ITaskExecutor
- (submit [_ f]
- (.submit exec ^Runnable f))
- (shutdown [_ grace-period-ms]
- (try
- ;; reject tasks
- (.shutdown exec)
- (when (ifn? shutdown-fn?)
- (shutdown-fn?))
- (reset! shutdown? true)
- (t/log! {:level :debug} ["Executor shutdown"])
- ;; await ongoing tasks
- (when-not (.awaitTermination exec grace-period-ms TimeUnit/MILLISECONDS)
- (t/log! {:level :debug} ["Executor shutdown grace period over, shutting down NOW"])
- (.shutdownNow exec))
- ;; in case we got interrupted exception, make sure to set the flag
- ;; so ongoing ops fail
- (finally
- (when (ifn? terminated-fn?)
- (terminated-fn?))
- (reset! terminated? true))))
- (terminated? [_]
- @terminated?)
- (shutting-down? [_]
- @shutdown?)
- AutoCloseable
- (close [this]
- (w/shutdown this 0)))))
\ No newline at end of file
diff --git a/test2/intemporal/test_utils.cljc b/test2/intemporal/test_utils.cljc
deleted file mode 100644
index 3c6f3c4..0000000
--- a/test2/intemporal/test_utils.cljc
+++ /dev/null
@@ -1,179 +0,0 @@
-(ns intemporal.test-utils
- #?(:cljs (:require [intemporal.store :as store]
- [intemporal.workflow.internal :as in]
- [promesa.core :as p]
- [taoensso.telemere :as telemere]
- [taoensso.telemere.utils :as tutils]
- [cljs.test :as t]
- [cljs.pprint :as pprint]))
- #?(:clj (:require [intemporal.store :as store]
- [intemporal.workflow.internal :as in]
- [promesa.core :as p]
- [taoensso.telemere :as telemere]
- [taoensso.telemere.utils :as tutils]
- [taoensso.telemere.open-telemetry :as tot]
- [net.cgrand.macrovich :as macros]
- [clojure.pprint :as pprint]))
- #?(:cljs (:require-macros [net.cgrand.macrovich :as macros]
- [intemporal.test-utils :refer [with-result wait]]))
- #?(:clj (:import [java.util.concurrent TimeoutException])))
-
-;;;;
-;; helpers
-
-(defn now []
- #?(:clj (System/currentTimeMillis)
- :cljs (.getTime (js/Date.))))
-
-(defn- make-task [& {:keys [proto type id ref root sym fvar args result state]
- :or {proto nil
- type :workflow
- id (in/random-id)
- ref "some-ref"
- root "some-root"
- sym 'identity
- fvar #'identity
- args []
- result nil
- state :new}}]
- (cond
- (= type :workflow)
- (in/create-workflow-task ref root sym fvar args id result state nil)
- (= type :activity)
- (in/create-activity-task ref root sym fvar args id result state nil)
- (= type :proto-activity)
- (in/create-proto-activity-task proto ref root sym fvar args id result state nil)
- :else (throw (ex-info (str "Unknown task type:" type) {:type type}))))
-
-(defn make-workflow-task [& {:keys [] :as args}]
- (make-task (assoc args :type :workflow)))
-
-(comment "unused"
- (defn make-activity-task [& {:keys [] :as args}]
- (make-task (assoc args :type :activity)))
-
- (defn make-protocol-task [& {:keys [] :as args}]
- (make-task (assoc args :type :proto-activity))))
-
-;;;;
-;; misc
-
-(defn print-tables
- "Prints the task and events tables to sysout via pprint"
- [store]
- (let [tasks (store/list-tasks store)
- events (->> (store/list-events store)
- (sort-by :id))]
- (locking *out*
- (print "==================== TASKS")
- (pprint/print-table tasks)
- (println "\n==================== EVENTS")
- (pprint/print-table events)
- (flush))))
-
-(defn wait-for-task
- "Waits for the task with given id to reach terminal state"
- ;; only works in clj, should probably take a body and be a macro
- ([store id]
- (wait-for-task store id {:timeout 10000 :sleep-ms 100}))
- ([store id {:keys [timeout sleep-ms]}]
- (let [start (now)]
- #_:clj-kondo/ignore
- @(p/loop [task (store/find-task store id)]
- (when (not (#{:failure :success} (:state task)))
- (let [elapsed (- (now) start)]
- (when (> elapsed timeout)
- (throw (ex-info (str "More than " timeout " ms (" elapsed " ms) elapsed while waiting for task " id " to finish") {:task task})))
- (p/then (p/delay sleep-ms id)
- (fn [_] (p/recur (store/find-task store id))))))))))
-
-;;;;
-;; macros
-
-(def ^:dynamic with-result-default-timeout 10000)
-(def ^:dynamic wait-default-timeout 3000)
-
-(defmacro with-result
- "Promise-aware macro: the result can either be a value or a thrown exception.
- Waits for result for 10 secs, then times out
- Doesn't really work for exceptions returned as values
- ```
- (with-result [r (my-worfklow 1)]
- (is (instance? Exception r))
- (is (= 1 2)))
- ```
- "
- [bindings & body]
- (assert (vector? bindings) "First argument should be a binding of [res resbody]")
- (let [[res resbody] bindings]
- (macros/case
- :clj
- `(let [~res (let [future# (future (do ~resbody))]
- (try
- (deref future# with-result-default-timeout (TimeoutException. "Operation timed out."))
- (catch Exception e# e#)))]
- ~@body)
- :cljs
- `(t/async done#
- (js/setTimeout
- (fn []
- ;; force wrap resbody in a deferred
- (p/finally (-> nil
- (p/then (fn [_#]
- (do ~resbody)))
- (p/timeout with-result-default-timeout))
- (fn [res# err#]
- (try
- (let [~res (or res# err#)]
- (do ~@body))
- (finally
- (done#)))))
- 0))))))
-
-(defmacro wait
- "Waits for 10 secs until the result is true, or throws;
- In `clj` it polls every 100ms
- In `cljs` it continuously loops
- ```
- (wait (db/find id)
- (is (= 1 1))
- ```
- "
- [condition & body]
- (macros/case
- :clj
- `(let [timeout# wait-default-timeout
- start# (System/currentTimeMillis)]
- (loop []
- (if ~condition
- (do ~@body)
- (if (> (- (System/currentTimeMillis) start#) timeout#)
- (throw (ex-info "Timed out" {:timeout timeout#}))
- (do (Thread/sleep 100)
- (recur))))))
-
- :cljs
- `(let [timeout# 3000
- start# (.getTime (js/Date.))]
- (loop []
- (if ~condition
- (do ~@body)
- (if (> (- (.getTime (js/Date.)) start#) timeout#)
- (throw (ex-info "Timed out" {:timeout timeout#}))
- ;; Note: In CLJS this is a "busy wait" loop
- (recur)))))))
-
-(defn setup-telemere []
- ;#?(:clj (clojure.pprint/pprint (telemere/check-interop)))
- (telemere/set-min-level! :info)
- (telemere/remove-handler! ::custom)
- #?(:clj (telemere/add-handler! :default/open-telemetry (tot/handler:open-telemetry)))
- (telemere/add-handler! ::custom
- (telemere/handler:console
- {:output-fn
- (tutils/format-signal-fn
- {:content-fn (taoensso.telemere.utils/signal-content-fn {:incl-keys #{:thread}})})})))
-
-(def with-trace-logging
- #?(:cljs {:before setup-telemere}
- :clj (fn with-trace-logging [f] (setup-telemere) (f))))
diff --git a/test2/intemporal/vthread-recovery.edn b/test2/intemporal/vthread-recovery.edn
deleted file mode 100644
index e66e892..0000000
--- a/test2/intemporal/vthread-recovery.edn
+++ /dev/null
@@ -1,62 +0,0 @@
-{:tasks {"silly-mcclintock" {:args [],
- :ref nil,
- :type :workflow,
- :state :pending,
- :sym intemporal.vthread-recovery-test/my-workflow-,
- :root nil,
- :result nil,
- :id "silly-mcclintock",
- :owner "intemporal",
- :order 1,
- :lease-end nil}},
- :history {"silly-mcclintock" [{:ref "silly-mcclintock",
- :root "silly-mcclintock",
- :type :intemporal.workflow/invoke,
- :sym intemporal.vthread-recovery-test/my-workflow-,
- :args [],
- :error nil,
- :result nil,
- :id 17}
- #_{:ref "silly-mcclintock",
- :root "silly-mcclintock",
- :type :intemporal.workflow/success,
- :sym intemporal.vthread-recovery-test/my-workflow-,
- :args nil,
- :error nil,
- :result [0 1 2 3 4 5 6 7 8 9],
- :id 38}],
- "silly-mcclintock-1" [{:ref "silly-mcclintock-1",
- :root "silly-mcclintock",
- :type :intemporal.protocol/invoke,
- :sym intemporal.vthread-recovery-test/with-thread,
- :args [0],
- :error nil,
- :result nil,
- :id 18}
- {:ref "silly-mcclintock-1",
- :root "silly-mcclintock",
- :type :intemporal.protocol/success,
- :sym intemporal.vthread-recovery-test/with-thread,
- :args nil,
- :error nil,
- :result 0,
- :id 28}],
- "silly-mcclintock-2" [{:ref "silly-mcclintock-2",
- :root "silly-mcclintock",
- :type :intemporal.protocol/invoke,
- :sym intemporal.vthread-recovery-test/with-thread,
- :args [1],
- :error nil,
- :result nil,
- :id 19}
- {:ref "silly-mcclintock-2",
- :root "silly-mcclintock",
- :type :intemporal.protocol/success,
- :sym intemporal.vthread-recovery-test/with-thread,
- :args nil,
- :error nil,
- :result 1,
- :id 29}]}
- :counter 38,
- :pcounter 2,
- :ecounter 0}
diff --git a/test2/intemporal/vthread_recovery_test.clj b/test2/intemporal/vthread_recovery_test.clj
deleted file mode 100644
index bbb6c23..0000000
--- a/test2/intemporal/vthread_recovery_test.clj
+++ /dev/null
@@ -1,57 +0,0 @@
-(ns intemporal.vthread-recovery-test
- (:require [clojure.java.io :as io]
- [clojure.test :as t :refer [deftest is testing]]
- [intemporal.store :as store]
- [intemporal.workflow :as w]
- [intemporal.macros :refer [stub-protocol vthread defn-workflow]]
- [intemporal.test-utils :as tu]
- [promesa.core :as p]))
-
-;;;;
-;; demo - recovery of a crashed process
-
-(t/use-fixtures :once tu/with-trace-logging)
-
-(defprotocol ThreadActivity
- (with-thread [this id]))
-
-(defrecord ThreadActivityImpl []
- ThreadActivity
- (with-thread [this id]
- (Thread/sleep 200)
- id))
-
-(def nthreads 10)
-
-(defn-workflow my-workflow []
- (let [pr (stub-protocol ThreadActivity {})
- proms (for [i (range nthreads)]
- (vthread
- (with-thread pr i)))]
- ;; at this point, all of `with-thread` calls are queued, so
- ;; this code is deterministic up to here
- @(p/all proms)))
-
-(deftest vthread-recovery-test
- ;; make a backup of the db to allow replay
- (io/copy (io/file "./test/intemporal/vthread-recovery.edn")
- (io/file "/tmp/intemporal-vthread-recovery.edn"))
- (let [mstore (store/make-store {:file "/tmp/intemporal-vthread-recovery.edn"})
- ex (w/start-poller! mstore {:protocols {`ThreadActivity (->ThreadActivityImpl)}})]
-
- (store/reenqueue-pending-tasks mstore println)
-
- (let [[task] (store/list-tasks mstore)]
- (tu/wait-for-task mstore (:id task))
- (tu/print-tables mstore))
-
- (testing "linear history"
- (testing "stored events"
- (let [evts (store/list-events mstore)
- evts (sort-by :id evts)]
-
- (testing "workflow has result"
- (is (= (into [] (range nthreads))
- (-> evts last :result)))))))
-
- (w/shutdown ex 1000)))
diff --git a/test2/intemporal/vthread_test.cljc b/test2/intemporal/vthread_test.cljc
deleted file mode 100644
index 7d751c1..0000000
--- a/test2/intemporal/vthread_test.cljc
+++ /dev/null
@@ -1,95 +0,0 @@
-(ns intemporal.vthread-test
- #?(:cljs (:require [cljs.test :as t :refer-macros [deftest is testing]]
- [cljs.pprint :as pprint]
- [intemporal.store :as store]
- [intemporal.workflow :as w]
- [intemporal.test-utils :as tu]
- [promesa.core :as p])
- :clj (:require [clojure.test :as t :refer [deftest is testing]]
- [clojure.pprint :as pprint]
- [intemporal.store :as store]
- [intemporal.workflow :as w]
- [intemporal.test-utils :as tu]
- [promesa.core :as p]))
- #?(:cljs (:require-macros [intemporal.macros :refer [stub-protocol defn-workflow vthread]]
- [intemporal.test-utils :refer [with-result]])
- :clj (:require [intemporal.macros :refer [stub-protocol defn-workflow vthread]]
- [intemporal.test-utils :refer [with-result]])))
-
-(t/use-fixtures :once tu/with-trace-logging)
-
-(defprotocol ThreadActivity
- (sleep [this id ms]))
-
-(defrecord ThreadActivityImpl []
- ThreadActivity
- (sleep [this id ms]
- #?(:clj (do
- (Thread/sleep (long ms))
- id)
- :cljs (p/then (p/delay ms id)
- (fn [_] id)))))
-
-(defn-workflow my-workflow [sleep-time]
- (let [pr (stub-protocol ThreadActivity {})
- proms (->> (for [i (range 10)]
- (vthread
- (sleep pr i sleep-time)))
- (doall))]
- #?(:clj (Thread/sleep (long sleep-time)))
- (p/all proms)))
-
-(deftest workflow-with-vthread-test
- (let [sleep-time (+ 1000 (rand-int 500))]
- (testing "workflow"
- (let [mstore (store/make-store)
- executor (w/start-poller! mstore {:protocols {`ThreadActivity (->ThreadActivityImpl)}
- :polling-ms 10})
-
- start (store/now)]
-
- ;; cljs runtimes return promises
- ;; clj runtime will run synchronously
- (with-result [v (w/with-env {:store mstore}
- (my-workflow sleep-time))]
-
- (testing "result"
- (is (= (range 10)
- v)))
- (testing "ran every activity concurrently"
- (let [elapsed (- (store/now) start)]
- (is (>= elapsed sleep-time) "Should take at least `sleep-time` to run")
- (is (< elapsed (* sleep-time 2)) "Should not take more than 2x sleep time to run")))
-
- (testing "linear history"
- (testing "stored events"
- (let [evts (->> (store/list-events mstore)
- (filterv #(= (:type %) :intemporal.protocol/invoke))
- (sort-by :id))
- aargs (map :args evts)]
-
- (testing "sequential activity invocation args"
- ;; even though each activity runs in a thread, they are started in order
- ;; this ensures determinism
-
- (is (= [[0 sleep-time] [1 sleep-time] [2 sleep-time] [3 sleep-time] [4 sleep-time] [5 sleep-time] [6 sleep-time] [7 sleep-time] [8 sleep-time] [9 sleep-time]]
- aargs))))))
-
- (w/shutdown executor 0)
-
- ;; debugging
- (let [tasks (sort-by :order (store/list-tasks mstore))
- events (->> (store/list-events mstore)
- (sort-by :id))
- pprint-table (fn [table]
- (->> table
- (map (fn [r]
- (cond-> r
- (contains? r :fvar) (assoc :fvar ""))))
- (pprint/print-table)))]
- (pprint-table tasks)
- (pprint-table events)))))))
-
-#_:clj-kondo/ignore
-(comment
- (cljs.test/run-tests *ns*))
diff --git a/test2/intemporal/workflow_test.cljc b/test2/intemporal/workflow_test.cljc
deleted file mode 100644
index 4d64941..0000000
--- a/test2/intemporal/workflow_test.cljc
+++ /dev/null
@@ -1,119 +0,0 @@
-(ns intemporal.workflow-test
- #?(:cljs (:require [cljs.test :as t :refer-macros [deftest is testing]]
- [intemporal.store :as store]
- [intemporal.workflow :as w]
- [intemporal.test-utils :as tu]
- [matcher-combinators.test :refer [match?]]
- [promesa.core :as p])
- :clj (:require [clojure.test :as t :refer [deftest is testing]]
- [intemporal.store :as store]
- [intemporal.workflow :as w]
- [intemporal.test-utils :as tu]
- [matcher-combinators.test :refer [match?]]))
- #?(:cljs (:require-macros [intemporal.macros :refer [env-let stub-function stub-protocol defn-workflow]]
- [intemporal.test-utils :refer [with-result]])
- :clj (:require [intemporal.macros :refer [stub-function stub-protocol defn-workflow]]
- [intemporal.test-utils :refer [with-result]])))
-
-(t/use-fixtures :once tu/with-trace-logging)
-
-(defn nested-fn [a]
- [a :nested])
-
-(defn activity-fn [a]
- #?(:clj
- (let [f (stub-function nested-fn)]
- (f :sub))
-
- :cljs
- (env-let [f (stub-function nested-fn)]
- (f :sub))))
-
-(defprotocol MyActivities
- (foo [this a]))
-
-(defrecord MyActivitiesImpl []
- MyActivities
- (foo [this a] [:proto a]))
-
-(defn-workflow my-workflow [atm]
- (reset! atm (w/workflow-id))
-
- (let [sf (stub-function activity-fn)
- pr (stub-protocol MyActivities {})
- sfr (sf 1)
- prr (foo pr :pr)]
-
- ;; chain values: ensure tests work under cljs too
- #_:clj-kondo/ignore
- (#?(:clj let :cljs p/let) [v1 sfr
- v2 prr]
-
- [:root v1 v2])))
-
-;;;; test proper
-
-(deftest workflow-happy-path-test
- (testing "workflow"
- (let [mstore (store/make-store)
- ex (w/start-poller! mstore {:protocols {`MyActivities (->MyActivitiesImpl)}})
- uuid-store (atom nil)
- workflow-id (str (random-uuid))]
-
- (with-result [v (w/with-env {:store mstore
- :id workflow-id}
- (my-workflow uuid-store))]
-
- (testing "workflow result"
- (is (= [:root [:sub :nested] [:proto :pr]]
- v)))
-
- (testing "stored events"
- (let [evts (store/list-events mstore)
- evts (sort-by :id evts)
- ;; cljs is promise based, so stubs dont run in lexical order
- ;; due to p/let
- #?(:clj [w1 a1 n1 n2 a2 p1 p2 w2]
- :cljs [w1 a1 p1 p2 n1 n2 a2 w2]) evts]
-
- (tu/print-tables mstore)
-
- (testing "workflow uuid"
- (is (every? #(= @uuid-store %) (map :root evts))))
-
- (testing "workflow events"
- (is (match? {:type :intemporal.workflow/invoke :sym 'intemporal.workflow-test/my-workflow- #_#_:args [uuid-store]} w1))
- (is (match? {:type :intemporal.workflow/success :sym 'intemporal.workflow-test/my-workflow-} w2)))
-
- (testing "activity events"
- (is (match? {:type :intemporal.activity/invoke :sym 'intemporal.workflow-test/activity-fn :args [1]} a1))
- (is (match? {:type :intemporal.activity/success :sym 'intemporal.workflow-test/activity-fn} a2)))
-
- (testing "nested activity events"
- (is (match? {:type :intemporal.activity/invoke :sym 'intemporal.workflow-test/nested-fn :args '(:sub)} n1))
- (is (match? {:type :intemporal.activity/success :sym 'intemporal.workflow-test/nested-fn} n2)))
-
- (testing "protocol activity events"
- (is (match? {:type :intemporal.protocol/invoke :sym 'intemporal.workflow-test/foo :args [:pr]} p1))
- (is (match? {:type :intemporal.protocol/success :sym 'intemporal.workflow-test/foo} p2)))))
-
- (testing "stored tasks"
- (let [tasks (store/list-tasks mstore)
- ;; due to promises,
- ;; the order of execution is not exactly the same between clj/cljs
- #?(:clj [w1]
- :cljs [w1]) tasks]
- (tu/print-tables mstore)
-
- (testing "workflow task"
- (is (match? {:type :workflow :sym 'intemporal.workflow-test/my-workflow- :state :success} w1)))
-
- (testing "workflow uuid"
- (is (some #(= @uuid-store %) (map :id tasks)))
- (is (= @uuid-store workflow-id)))))
-
- (w/shutdown ex 1000)))))
-
-#_:clj-kondo/ignore
-(comment
- (cljs.test/run-tests *ns*))
diff --git a/tests.edn b/tests.edn
index 0877e3c..fbf09e9 100644
--- a/tests.edn
+++ b/tests.edn
@@ -33,11 +33,15 @@
:test-paths ["test"]}
;; jvm based tests
- {:id :test
+ {:id :test
;:kaocha.filter/skip-meta [:crash]
- :ns-patterns ["-test$"]
- :source-paths ["src"]
- :test-paths ["test"]}
+ :ns-patterns ["-test$"]
+ ;; Exclude the jepsen namespace — it is not a kaocha test suite and
+ ;; its ns name ends in .runner / .worker (not -test) anyway, but
+ ;; exclude explicitly to be safe.
+ :kaocha.filter/skip-meta [:jepsen]
+ :source-paths ["src"]
+ :test-paths ["test"]}
{:id :test-cljs
:type :kaocha.type/cljs
:cljs/repl-env cljs.repl.node/repl-env