From 38100f5f224afd981ca621412336170878bdfc7c Mon Sep 17 00:00:00 2001 From: Peter Wielander Date: Fri, 22 May 2026 08:57:18 +0200 Subject: [PATCH 1/2] [docs] Reduce noise in changelog files Signed-off-by: Peter Wielander --- .../docs/v4/changelog/eager-processing.mdx | 573 ------------------ docs/content/docs/v4/changelog/index.mdx | 1 - docs/content/docs/v4/changelog/meta.json | 2 +- .../docs/v4/changelog/resilient-start.mdx | 314 +--------- .../docs/v5/changelog/eager-processing.mdx | 480 ++------------- .../docs/v5/changelog/resilient-start.mdx | 314 +--------- 6 files changed, 130 insertions(+), 1554 deletions(-) delete mode 100644 docs/content/docs/v4/changelog/eager-processing.mdx diff --git a/docs/content/docs/v4/changelog/eager-processing.mdx b/docs/content/docs/v4/changelog/eager-processing.mdx deleted file mode 100644 index 9be4e0cb1b..0000000000 --- a/docs/content/docs/v4/changelog/eager-processing.mdx +++ /dev/null @@ -1,573 +0,0 @@ ---- -title: Eager Processing of Steps & Incremental Event Replay -description: Combine workflow event replay and step bundles to do work inline where possible, only deferring to queue for parallelism -type: overview ---- - -# Eager Processing of Steps & Incremental Event Replay - -**Date**: March 2026 - -This is a major internal architecture change to how Workflow DevKit executes workflows and steps on the Vercel platform. It reduces function invocations and queue overhead by executing steps _inline_ within the same function invocation as the workflow replay, rather than dispatching every step to a separate function via the queue. - -## Previous Architecture - -The previous architecture used two separate routes, -each backed by its own queue trigger: - -``` -Queue: __wkf_workflow_* --> /.well-known/workflow/v1/flow (workflow replay in VM) - | - suspension (step needed) - | - queue step to __wkf_step_* - | -Queue: __wkf_step_* --> /.well-known/workflow/v1/step (step execution in Node.js) - | - step completes - | - queue continuation to __wkf_workflow_* - v - (cycle repeats for each step) -``` - -Each step required **2 queue messages** (step invoke + workflow continuation) and **2 function invocations**, plus cold start overhead for each. A serial workflow with 10 steps needed ~21 function invocations. - -## New Architecture - -The two routes are merged into a single handler at `/.well-known/workflow/v1/flow` using `workflowEntrypoint()`. The step route is no longer generated. - -The handler runs an inline execution loop: - -``` -receive queue message - | - +-- if message has stepId+stepName: execute that step, queue workflow continuation, exit - | - v -replay workflow in VM - | - +-- workflow completed --> create run_completed event, exit - +-- workflow failed --> create run_failed event, exit - | - v -suspension with pending operations - | - +-- process hooks and waits (unchanged) - | - +-- 0 pending steps --> return (waits/hooks only) - +-- 1 pending step --> execute inline, loop back to replay - +-- N pending steps --> queue N-1 to self (with stepId), - | execute 1 inline, loop back to replay - | - +-- timeout check: if wall-clock time >= threshold, - | re-schedule self via queue and exit - | - v -(loop continues until completion, timeout, or non-step suspension) -``` - -A serial workflow with 10 steps now completes in **1 function invocation**. - -## Inline Step Execution - -After the workflow suspends with pending steps, the handler executes one step inline: - -1. Create `step_started` event -2. Hydrate step input from the event log -3. Look up the step function via `getStepFunction(stepName)` -4. Execute the step function -5. Create `step_completed` or `step_failed` event -6. Loop back to workflow replay - -This logic lives in `executeStep()` in `packages/core/src/runtime/step-executor.ts`. - -## Background Steps (Parallel Execution) - -When a workflow suspends with multiple pending steps (e.g., from `Promise.all`), the handler: - -1. Creates `step_created` events for all pending steps -2. Queues N-1 steps back to `__wkf_workflow_*` with `stepId` and `stepName` in the message payload -3. Executes 1 step inline -4. Loops back to replay - -Each background step message is handled by a separate function invocation of the same handler. When a message arrives with `stepId` and `stepName`, the handler executes that specific step, then checks if all parallel steps from the batch are done by loading the event log and comparing `step_created` events against terminal events (`step_completed`/`step_failed`): - -- **All steps done**: The handler replays the workflow inline, continuing the execution loop without a queue roundtrip. Events are loaded with a cursor so subsequent loop iterations can use incremental loading. -- **Steps still pending**: The handler returns without queuing a continuation. The last handler to complete its step will see all steps done and replay inline. -- **Pending ops (stream writes)**: The handler queues a continuation and returns, so `waitUntil` can flush the pending stream data to the server. - -### Convergence After Parallel Steps - -When multiple background steps complete near-simultaneously, multiple handlers may observe "all steps done" and attempt to advance the workflow concurrently. The event-sourced architecture plus the invariants described below ensure safe convergence: - -- **`step_created` idempotency** --- duplicate creates return 409; exactly one handler owns each step -- **`step_completed` / `step_failed` idempotency** --- only the first invocation to record a terminal result wins -- **Queue idempotency keys** --- background step messages use `correlationId` as idempotency key -- **Deterministic replay** --- all invocations produce the same result given the same event log - -### Single Inline Executor Per Step - -Inline step execution combined with background-step dispatch introduces a new coordination requirement that the V1 handler did not face: when multiple handlers reach the same `Promise.all` batch concurrently, we need to guarantee that each step body runs at most once via the inline path. Without that guarantee, the event log accumulates duplicate `step_started` events (including some written *after* `step_completed`, which orphans them on replay) and step bodies run redundantly. - -The design enforces a simple invariant: **exactly one handler owns each step, and only the owner may execute it inline**. Ownership is established by the atomicity of `step_created`: - -1. **Atomic `step_created`** --- the world's `events.create('step_created', correlationId=X)` is serialized per-correlationId. Exactly one concurrent caller succeeds; the rest receive `EntityConflictError`. In production worlds (Postgres, Vercel) this is enforced at the SQL/DB layer. In `world-local`, a per-step in-process async mutex in `packages/world-local/src/storage/events-storage.ts` wraps every step lifecycle event's check-and-write so the same guarantee holds for dev. -2. **Suspension handler reports ownership** --- `handleSuspension()` returns `createdStepCorrelationIds: Set`, populated only for `step_created` writes that actually succeeded (not those that caught 409). -3. **Inline execution is gated on ownership** --- the runtime loop in `packages/core/src/runtime.ts` picks its inline step from `pendingSteps.filter(s => createdStepCorrelationIds.has(s.correlationId))`. A handler that didn't win any `step_created` race performs no inline execution. -4. **Queueing is unconditional** --- for every pending step except the one being inline-executed, the handler enqueues a background step message with `idempotencyKey: correlationId`. This matches V1's enqueue pattern and is what makes crash recovery work: if a prior handler wrote `step_created` but crashed before enqueueing, a later handler (from flow-message redelivery or `reenqueueActiveRuns`) will enqueue the orphaned step. Concurrent handlers' redundant enqueues dedupe on the idempotency key. - -Together these give: every `step_created` event has exactly one inline executor (possibly zero if the inline path was skipped due to crash) **and** at least one queued dispatch (from whichever handler first reaches the suspension path after the `step_created` is visible). Step bodies are never executed concurrently, and `step_started` events never land in the log after `step_completed` for the same step. Event-log replay sees clean subscriber-matched sequences. With this invariant in place, the earlier `onUnconsumedEvent` skip logic for step/hook/wait lifecycle events was removed — any unconsumed event now immediately fatals as a corrupted event log (its original purpose before the V2 work). - -**Retry semantics are preserved**: the per-step mutex in `world-local` only rejects `step_started` when the step is already in a *terminal* state (`completed` / `failed`). A step that is currently running (status=`running`) still accepts a second `step_started` write with an incremented attempt counter — this is how queue redelivery after a SIGKILL mid-execution legitimately re-runs the step. The previously-documented "attempt counter inflation" failure mode is therefore no longer reachable via the concurrent-inline path; see "Concurrent `step_started` Inflating Attempt Counter" below for the complementary executor-side guard that still catches edge cases (e.g., postgres retries under high contention). - -## Incremental Event Loading - -The handler caches the event log in memory across loop iterations. Instead of re-fetching the entire event log on each replay: - -1. **First iteration**: full load via `getAllWorkflowRunEventsWithCursor()`, which returns both the events and the final pagination cursor -2. **Subsequent iterations**: `getNewWorkflowRunEvents(runId, cursor)` fetches only events created after the saved cursor and appends them to the cached array - -For a 10-step serial workflow completing in one invocation, the 10th replay loads ~2 new events instead of re-fetching all ~30. - -### Server-Side Cursor Fix - -The incremental loading depends on the server returning a cursor even on the final page of results (`hasMore: false`). Previously, `workflow-server` returned `cursor: null` when there were no more pages. This was fixed in the `peter/fix-end-cursor` branch to always return an `eid:` cursor when there are events, aligning with `world-local` and `world-postgres` behavior. - -If a World implementation does not return a cursor after the initial load, the handler logs an error and falls back to a full reload. - -## Timeout Handling - -The inline execution loop checks wall-clock time before each replay iteration. If the elapsed time exceeds a configurable threshold (default: 110 seconds, for a 120-second function limit), the handler re-schedules itself via the queue and returns. - -The threshold is configurable via the `WORKFLOW_V2_TIMEOUT_MS` environment variable. - -If a single step takes longer than the timeout threshold, the step runs to completion (or SIGKILL) — there is no interruption mechanism for in-progress step execution. This is the same behavior as the previous architecture. - -## Queue Message Changes - -The `WorkflowInvokePayload` schema has two new optional fields: - -{/*@skip-typecheck - snippet, not runnable code*/} - -```typescript -stepId: z.string().optional() -stepName: z.string().optional() -``` - -When `stepId` is present, the handler executes that specific step before (or instead of) replaying the workflow. Background steps are queued with both `stepId` and `stepName` set, so the handler knows which step function to call without loading the event log. Previously, `stepName` was resolved by loading all events and searching for the `step_created` event matching the `stepId` — an O(N) operation on the full event history for every background step arrival. - -The queue trigger configuration uses `WORKFLOW_QUEUE_TRIGGER` on the `__wkf_workflow_*` topic. The `__wkf_step_*` topic and its separate trigger are no longer generated. - -## Builder Changes - -### Base Builder - -New method `createCombinedBundle()` in `packages/builders/src/base-builder.ts`: - -1. Builds the step registrations bundle (same esbuild + SWC step mode as before) -2. Builds the workflow VM code string (same esbuild + SWC workflow mode as before) -3. Generates a combined route file that imports the step registrations and uses `workflowEntrypoint(workflowCode)` - -No changes to the SWC plugin were needed. The two-pass build approach (separate step and workflow SWC modes) still applies. - -### Framework Builders - -All framework builders were updated to use `createCombinedBundle()`: - -- **Next.js** (eager and deferred/lazyDiscovery): replaces separate step + flow route generation -- **NestJS, Nitro, Standalone**: replaces separate `createStepsBundle()` + `createWorkflowsBundle()` calls -- **SvelteKit, Astro**: same, plus post-processing regex updated to match `workflowEntrypoint` -- **Vercel Build Output API** (used by Nitro/Astro production): single `flow.func/` with `WORKFLOW_QUEUE_TRIGGER` - -### Generated File Layout - -``` -.well-known/workflow/v1/ - flow/ - route.js # Handler (workflowEntrypoint) - __step_registrations.js # Step function registrations (side effects) - webhook/ - [token]/ - route.js # Webhook handler (unchanged) - manifest.json # Workflow/step/class manifest (unchanged) - config.json # Functions config (single trigger) -``` - -The `step/` directory is no longer generated. - -## Suspension Handler - -`handleSuspension()` in `packages/core/src/runtime/suspension-handler.ts` creates events for all pending operations (hooks, step events, wait events) but does **not** queue step messages. It returns the pending step items so the handler can decide which to execute inline vs. queue to background. - -## Concerns and Edge Cases - -### Parent→Child Polling Holds Worker Slots - -`Run#returnValue` is implemented as a polling step: the workflow awaits the child run's terminal status inside a step body. In worker-based worlds (notably `world-postgres`), each such poll occupies a queue worker slot until the child run finishes. Parent workflows that fan out to many child runs — recursive workflows like `fibonacciWorkflow` are the obvious case — can therefore consume a large fraction of available workers just holding positions in `Promise.all([...children.map(c => c.returnValue)])`. - -If `queueConcurrency` is smaller than the peak number of concurrent parent polls plus the workers needed for any in-flight children, the system deadlocks: every slot is held by a parent waiting on a child, but no child can acquire a slot to start. An earlier iteration of this work mitigated the deadlock runtime-side by detecting step context (via `contextStorage.getStore()`) and throwing `TooEarlyError` to re-enqueue the polling step, freeing the worker. That runtime guard has been removed — the responsibility now lies with worker-pool sizing. - -For `world-postgres`, the default `queueConcurrency` is set to **50**, which is comfortably above the ~24 concurrent polls `fibonacciWorkflow(6)` produces at peak. Workflows that fan out more aggressively must raise this ceiling. `packages/core/src/runtime/run.ts`, the `queueConcurrency` option on `createWorld()` for `world-postgres`, and the `fibonacciWorkflow` fixture in `workbench/example/workflows/99_e2e.ts` all carry pointers to this caveat. - -**Follow-up**: Replace the worker-pool sizing requirement with a polling design that does not occupy a worker slot. Options under consideration: (a) restore a runtime-side `TooEarlyError` re-enqueue path but make it visible to the user (rather than the silent guard the earlier iteration shipped), (b) move child-completion polling out of the step body into the suspension layer so a parent waiting on a child does not consume queue capacity at all, or (c) emit a `run_completed` notification on the parent's stream/queue so the parent only resumes when the child actually finishes. The current `queueConcurrency=50` default is a workaround, not a long-term answer — workflows with deep recursion or large fan-out can still exhaust workers regardless of how high we set the ceiling. - -### VM Sandboxing - -Workflow code still runs in a Node.js VM for determinism and sandboxing. Step code runs in the Node.js host context. The only change is that both happen within the same function invocation. - -### Bundle Size and Cold Start - -The combined bundle is larger (contains both step code and workflow VM code). Cold start time increases slightly. The reduction in total function invocations more than compensates. - -### Step Retries - -When an inline step fails with retries remaining: - -- `RetryableError` with explicit `retryAfter` delay: re-queue to self with `stepId` and delay -- Transient errors with immediate retry: re-queue to self with `stepId` (delay = 1s) -- `FatalError`: fail immediately - -### Mixed Suspensions - -A suspension may contain steps, hooks, and waits simultaneously. The handler creates events for all, then chooses between inline execution and queue dispatch: - -- **Steps only** (no waits): one owned step is executed inline; the rest are queued. The loop continues after the inline step completes. -- **Steps + at least one wait**: every step is queued (no inline execution). The handler returns with the wait timeout. Whichever lands first — a step's continuation or the wait timer — drives the next replay. -- **Hooks / waits only**: handler returns with the wait timeout (or no timeout, for hook-only suspensions). The next continuation is driven by external resume or the wait timer. - -The "no inline when there's a wait" carve-out is necessary to preserve `Promise.race(step, sleep)` semantics. Inline `await executeStep(...)` blocks the handler for the full step duration, and `wait_completed` events are only created on the *next* loop iteration's "complete elapsed waits" pass — so a longer-running step would always swallow the shorter sleep and `Promise.race` would resolve incorrectly. Queueing the step in this case lets the wait timer drive a continuation in parallel, matching V1's behavior where each step ran in a separate function invocation. - -Pure step suspensions (without waits) still benefit from inline execution; the carve-out only costs an extra queue roundtrip when a step and a sleep coexist. - -### Hook Conflicts - -If a hook conflict is detected during suspension handling, the handler breaks the loop and returns `{ timeoutSeconds: 0 }` for immediate re-invocation, same as the previous behavior. - -### Encryption Key Resolution - -Encryption keys are resolved once before the inline execution loop starts (after the run status is confirmed as `running`) and reused across all iterations. Background step executions resolve the key independently. The key does not change within a run. - -## Framework Support - -All framework integrations have been updated: Next.js (eager and deferred/lazyDiscovery), NestJS, SvelteKit, Astro, Nitro/Nuxt/Hono/Express/Vite, and CLI standalone. The Vercel Build Output API builder (used by Nitro and Astro for production deploys) also uses the combined bundle with `WORKFLOW_QUEUE_TRIGGER`. - -## Non-Next.js Integration Challenges - -### Module Scope Duplication in Re-Bundled Output - -Builders that use `bundleFinalOutput: true` (standalone CLI, Vercel Build Output API, NestJS) produce a single file where esbuild re-bundles the step registrations and the workflow runtime together. esbuild creates isolated module scopes for each source module, even within the same output file. This meant `registerStepFunction` and `getStepFunction` operated on different `Map` instances — steps were registered into one Map but looked up from another. - -**Fix**: The step function registry (`registeredSteps` Map in `@workflow/core/private`) and the step context storage (`contextStorage` AsyncLocalStorage in `@workflow/core/step/context-storage`) were changed from module-scoped variables to `globalThis` singletons using `Symbol.for`. This ensures all esbuild module scopes share the same instances. The pattern was already used in the codebase for the World singleton and the class serialization registry. - -### Workflow Package CJS Export Condition - -The `workflow` package's root export has `"require": "./dist/typescript-plugin.cjs"` for TypeScript editor plugin loading. When esbuild bundles with CJS format, it resolves `import { defineHook } from 'workflow'` via the `require` condition, getting the TS plugin instead of the API. - -**Fix**: Added a `"node"` condition (`"node": "./dist/index.js"`) before the `"require"` condition in the workflow package's exports. esbuild with `conditions: ['node']` matches `"node"` first and uses the correct API entry. TypeScript's plugin loader doesn't use `conditions: ['node']`, so it still falls through to `"require"` for the TS plugin. - -### Local World Concurrent Replay Interference - -The local development world (`world-local`) processes queue messages with high concurrency (default: 1000). With the V2 combined handler, parallel steps generate multiple workflow continuation messages. When these are processed concurrently, each triggers a replay that sees in-flight events from other concurrent replays. This causes "unconsumed event" errors because the event consumer encounters events that don't match any subscriber in the current replay state. - -In production (Vercel), this doesn't happen — each function invocation is isolated with its own event loading. - -**Fix**: The `EventsConsumer`'s `onUnconsumedEvent` callback (see "Concurrent Replay Interference with Multi-Batch Workflows" below) handles the concurrent event visibility issue. The V2 inline replay optimization (where the last background step to complete replays inline instead of queuing) further reduces concurrent replays. Redundant step executions from concurrent handlers are harmless due to `step_completed` idempotency — only the first completion wins. - -### ESM `bundleFinalOutput` and Dynamic Require Errors - -When `bundleFinalOutput: true` is used with ESM format, esbuild bundles CJS dependencies (like `debug`) into the output. CJS `require()` calls are wrapped in esbuild's `__require` polyfill, which throws "Dynamic require of X is not supported" in ESM contexts where `require` is undefined. This affected all ESM-based framework builders (Nitro, NestJS, SvelteKit, Astro) that were switched to `bundleFinalOutput: true` during the V2 migration. - -**Fix**: ESM builders use `bundleFinalOutput: false` with `externalizeNonSteps: true`, matching the pre-V2 behavior. The framework's own bundler (Vite, Rollup, Turbopack) handles dependency resolution. The standalone CLI and Vercel Build Output API builders use `bundleFinalOutput: true` with ESM output plus a `createRequire(import.meta.url)` banner (see "V2 Combined Bundle Switched from CJS to ESM" below) so CJS dependencies can still call `require()` for Node.js builtins. - -### Rollup Tree-Shaking of Step Registrations - -When `bundleFinalOutput: false` is used with Nitro's rollup pipeline, the step registrations bundle (`steps.mjs`) only contains side-effect code (`registerStepFunction` calls) with no exports. Rollup tree-shakes the entire module because it has no used exports, removing all step registrations from the production bundle. This causes "Step not found" errors at runtime. - -**Fix**: The steps bundle now exports a sentinel value (`export const __steps_registered = true`), and the combined route file imports it (`import { __steps_registered } from './steps.mjs'`). This gives rollup a used binding to track, preventing it from dropping the module and its side effects. - -### Concurrent Replay Interference with Multi-Batch Workflows (historic) - -An earlier iteration of the V2 work hit "Unconsumed event in event log" errors when multiple concurrent handlers raced into the same batch boundary. The diagnosis at the time was that concurrent handlers could see events the current replay hadn't reached yet, and the mitigation was a skip path in `onUnconsumedEvent` that tolerated step/hook/wait lifecycle events whose correlationId had a matching `step_created` / `hook_created` / earlier `wait_completed` in the log. - -Later work on the "Single Inline Executor Per Step" invariant (described above) identified the actual root cause: duplicate `step_started` events were being written *after* `step_completed` on the same step, because the local world's `step_started` was not atomic w.r.t. terminal state and the main loop was re-picking already-queued steps for inline execution. Fixing those at the source (per-step mutex in `world-local` + ownership-gated inline dispatch + unconditional queueing with idempotency keys) eliminated the unconsumed-step-event path entirely, and fixing the `wait_completed` cursor bug (the main loop manually pushed `wait_completed` events without advancing `eventsCursor`, so the next incremental fetch re-returned them as local-array duplicates) eliminated the wait case. - -**Current state**: the `onUnconsumedEvent` skip logic has been removed. Any unconsumed event now fatals the run with `CORRUPTED_EVENT_LOG`, matching the original contract from PR #1055. Incremental event loading in `runtime.ts` dedupes by `eventId` to tolerate any residual manual pushes. - -### Stale V1 Artifacts in Build Caches - -SvelteKit and Astro's build caches (including Vercel's) may preserve the old V1 `step/` route directory from previous builds. When the V2 builder runs, it no longer generates step routes, but the stale files remain and cause build failures (e.g., importing the removed `stepEntrypoint`). Additionally, SvelteKit's `beforeExit` hook that patches `.vc-config.json` files for Vercel deployments was still trying to configure the non-existent `step.func/` directory. - -**Fix**: SvelteKit and Astro builders now clean up stale V1 step route directories during build. SvelteKit's Vercel deployment hook was updated to only configure the combined `flow.func/` directory. - -### Next.js Canary Turbopack and Temp Files - -The deferred (lazyDiscovery) Next.js builder writes build artifacts with a `.temp` extension to avoid HMR churn, then copies them to their final names. The V2 migration created `__step_registrations.route.js.temp` in the `app/` directory. Canary Turbopack rejects this file as an "Unknown module type" because the `.temp` extension has no associated loader. - -**Fix**: The step registrations file is written directly to its final name (`__step_registrations.js`) since it doesn't need the temp-file HMR mechanism. Only the route file uses temp naming. - -### Concurrent `step_started` Inflating Attempt Counter - -When the V2 handler dispatches N parallel steps as background messages, each background step completion queues a workflow continuation. Up to N continuations may replay concurrently, and each may attempt to start the same not-yet-completed step (since `step_started` succeeds for already-running steps). Each call atomically increments the `attempt` counter. With N=5 parallel steps, the attempt counter can reach 5 on the first genuine execution — exceeding the default `maxRetries + 1 = 4` threshold and prematurely failing the step with "exceeded max retries". - -This is the same known limitation described in "Convergence After Parallel Steps" above, but with a concrete failure mode: `promiseRaceStressTestWorkflow` (which uses 5 parallel steps with `Promise.race`) consistently failed on Postgres tests. - -**Fix**: The max retries check in `executeStep()` now only enforces when `step.error` exists — distinguishing actual retries (failed → retry with error) from concurrent first-attempt races (multiple handlers start the same step simultaneously without any prior failure). Concurrent starts are harmless since `step_completed` idempotency ensures only the first completion wins. - -### Inline Step Execution with Pending Stream Operations - -When a step's arguments or return value include serialized streams (e.g., `WritableStream` from `getWritable()`, or AI SDK streaming steps), the serialization layer creates background `flushablePipe` operations that pipe data to S3. These ops are tracked in an `ops` array and need to complete before the stream data is readable by external consumers. - -In V1, each step ran in a separate function invocation. After the step completed, `waitUntil(ops)` kept the function alive to flush the ops. The function then returned, giving `waitUntil` exclusive event loop time. - -**Current state**: `executeStep()` attempts a 500ms `Promise.race` between the ops settling and a timeout. If ops settle in time (data confirmed on server), it returns `hasPendingOps: false` and the V2 handler continues the inline loop. If ops don't settle in 500ms (e.g., `WritableStream` kept open across steps), it returns `hasPendingOps: true` and the V2 handler breaks the loop and queues a continuation so `waitUntil` can flush them. - -**Earlier attempts that failed** (before the flush waiter fix below): - -1. **500ms inline ops await without flush waiters** — The same 500ms race, but `WorkflowServerWritableStream` used a buffered 10ms flush timer: the `flushablePipe`'s `pendingOps` reached 0 when the buffered `write()` returned (instant), but the actual S3 HTTP write hadn't started yet. The ops appeared settled but data wasn't on S3. Multiple approaches to fix the timing (delaying `pollWritableLock`, closing the writable to trigger flush, adding a post-settle delay) all failed or caused other issues (deadlocks, premature stream closure). - -2. **Root cause of the buffered write issue**: `WorkflowServerWritableStream.write()` buffers chunks and schedules a flush via `setTimeout(flush, 10ms)`. The `flushablePipe` calls `await writer.write(chunk)` which returns immediately (data buffered). `pendingOps--` fires before the 10ms timer. The `pollWritableLock` sees `pendingOps === 0` and resolves `state.promise`. The ops appear settled, but data is still in the buffer. - -3. **Why this only affects Vercel Prod**: On local (world-local), stream writes go to the filesystem — effectively instant. On Vercel (world-vercel), writes go through HTTP to workflow-server → S3, adding 50-100ms latency. The buffered write returns instantly but the HTTP round-trip is deferred. When the V2 loop continues and the function eventually returns, `waitUntil` may not have enough time to flush. - -**Follow-up**: The flush-waiter design described under "Buffered Stream Flush with Waiter Promises" below is the landed fix and resolves the buffered-write race. The remaining work is to shrink the 500ms inline-ops budget once we have confidence that the flush-waiter path settles deterministically across all worlds (today the budget is a defensive ceiling, not a tuned latency target), and to surface a stronger contract for "ops settled" — currently a 500ms timeout means "probably settled, give up and queue a continuation", which is correct but coarse. A signaled "ops drained" event from the world layer would let `executeStep()` proceed without the timeout in the common case, removing latency for streaming workflows whose ops settle in well under 500ms. - -### CJS `module.exports` Collision in BOA Bundles (RESOLVED) - -The Vercel Build Output API (BOA) builder creates a single CJS bundle via `createCombinedBundle` with `bundleFinalOutput: true`. The combined route file imports the steps bundle: - -```js -import { __steps_registered } from './__step_registrations.js'; -import { workflowEntrypoint } from 'workflow/runtime'; -export const POST = workflowEntrypoint(workflowCode); -``` - -When esbuild re-bundles this into CJS, the steps bundle's code is inlined. If the steps bundle is also CJS format, it contains its own `module.exports = __toCommonJS(...)` at the top level. esbuild sometimes inlines CJS modules **without** a `__commonJS()` wrapper (the heuristic depends on the module's detected format). When unwrapped, the steps bundle's `module.exports` assignment executes at the top level and **overwrites** the combined route's `module.exports`, removing the `POST` handler export. - -**Symptoms**: The Vercel deployment builds and starts successfully, but the `POST` handler is missing from the function's exports. Queue messages are delivered to the function but nothing processes them. All e2e tests hang indefinitely. - -**Debugging steps that led to the root cause**: - -1. Tested the CJS bundle locally with `node -e "require('./index.js')"` — confirmed 92 steps registered, but `module.exports` only contained `{ __steps_registered }`, not `{ POST }`. -2. Found two `module.exports` assignments in the bundle: line ~45K (from the combined route, exporting `POST`) and line ~95K (from the inlined steps bundle, exporting `__steps_registered`). The second overwrites the first. -3. Compared with the standalone builder's bundle which had the same steps code wrapped in `__commonJS()` — esbuild's wrapper prevents the inner `module.exports` from leaking. - -**Fix**: When `bundleFinalOutput` is true, build the steps bundle in **ESM format** regardless of the final output format. The final esbuild pass converts everything to CJS correctly. ESM steps don't have `module.exports`, so there's no collision. The combined route's `export const POST` becomes the sole `module.exports` entry. - -### Step Error Source Maps on BOA Deployments - -The V2 combined CJS bundle (`bundleFinalOutput: true`) loses original source file names during re-bundling. Error stack traces show `/var/task/index.js` instead of `99_e2e.ts`. The `hasStepSourceMaps()` utility was updated to return `false` for BOA-builder frameworks (Express, Fastify, Hono, Nitro, Nuxt, Vite, Astro, Example) on Vercel preview, aligning test expectations with the actual bundle behavior. - -### CLI Health Check Port Mismatch - -The CLI `health` command defaults to `http://localhost:3000` when `WORKFLOW_LOCAL_BASE_URL` is not set. Different frameworks use different ports (Astro: 4321, SvelteKit: 5173). The e2e test passed `WORKFLOW_LOCAL_BASE_URL` via the spawn env, but the CLI's `getEnvVars()` function had a fixed list of env vars that didn't include `WORKFLOW_LOCAL_BASE_URL`. The env var was set but never read. - -**Fix**: Added `WORKFLOW_LOCAL_BASE_URL` to the CLI's `getEnvVars()` return object. - -### Buffered Stream Flush with Waiter Promises - -`WorkflowServerWritableStream` buffers writes and flushes via a 10ms `setTimeout` for batching. Previously, `write()` returned immediately after buffering, causing the `flushablePipe`'s `pendingOps` counter to reach 0 before data actually reached the server. The V2 inline loop saw ops as settled prematurely and broke on every step with `WritableStream` serialization. - -**Fix**: `write()` now returns a promise that resolves only after the scheduled flush completes. Multiple writes within the 10ms window still share a single batched HTTP request (the batching optimization is preserved). Each write registers a `{resolve, reject}` pair in a `flushWaiters` array. When the `setTimeout` fires and `flush()` completes the HTTP round-trip, all waiters are resolved (or rejected on error). This makes `pendingOps` accurately reflect server-side data state while keeping network-efficient batching. - -The 500ms inline ops await in the step executor can now distinguish between: - -- **Steps where ops settle** (data on server, ~200ms after lock release + flush) → continue loop inline -- **Steps where ops don't settle** (WritableStream kept open across steps) → break loop - -### Lock-Release Polling Interval Lowered to 10ms - -`flushablePipe`'s `pollWritableLock` / `pollReadableLock` use `setInterval` to detect when a user releases their stream lock without closing the stream — the Web Streams API has no event for that state. The V2 step executor's `opsSettled` race waits for this poll to resolve after each writable-bearing step body returns, so the polling interval sits on the critical path of every streaming step. - -The interval was originally 100ms. Measuring a synthetic workflow with 5 sequential streaming steps (each step receives a shared `WritableStream` argument, writes a few chunks, releases the writer with the stream still open — the same pattern `doStreamStep` / `writeToolOutputToUI` / `writeFinishChunk` use in `DurableAgent.chat`) produced a per-step wait distribution clustered between 22–100ms with a mean of ~58ms. That matches the analytical prediction for a periodic poll with uniformly random offset relative to step return: ~half the interval. Across the 5 steps, polling alone added ~290ms of latency to the workflow even though no step actually had pending I/O — the writes were already flushed, the writer lock was already released, and we were just waiting for the next tick to notice. - -**Fix**: dropped the polling interval from 100ms to 10ms in `packages/core/src/flushable-stream.ts`. Per-step wait drops from ~50ms average to ~5ms (a 10× improvement, expected to scale linearly with the number of writable-bearing steps in a workflow). For `DurableAgent.chat` with one tool call (4 writable-bearing steps), this removes ~180ms from the streaming chat response's critical path. Per-tick work is just `writable.locked` plus a `getWriter()`/`releaseLock()` probe — both microsecond-scale, so 10× more ticks during a stream's lifetime is not measurable in practice. - -**Follow-up**: Replace the polling entirely with an event-driven release signal — wrap the writable returned from the `WritableStream` reviver with a writer that fires on `releaseLock()` — bringing the wait to ~0ms. The 10ms polling interval is the cheap path that captures most of the available win without the structural change, but every writable-bearing step still pays a ~5ms tax that the event-driven design would eliminate. The structural change is also worth pursuing because it removes a source of timing drift between `world-local` (filesystem-instant) and `world-vercel` (HTTP-deferred) — both would see truly synchronous lock-release detection rather than periodic-poll detection. - -### Event Consumer Skip Logic Was Too Broad For Wait Replays - -The V2 handler needs some tolerance for out-of-order replay, especially around step events created by concurrent continuations. An early follow-up broadened that fallback to all wait lifecycle events too, so `onUnconsumedEvent` would skip `wait_created` and the first `wait_completed` whenever they matched a known wait. In the BOA-backed previews that broke `hookDisposeTestWorkflow`: once the first run disposed its hook and went into `sleep('5s')`, a replay could skip the live wait event before `sleep()` registered its subscriber, leaving the run stuck forever at `wait_created`. - -**Fix**: Keep the step/hook replay tolerance, but narrow the wait fallback to the one case we actually need: duplicate `wait_completed` events that appear *after* an earlier completion for the same wait. The `hookDispose` e2e was also updated to poll for hook registration/disposal instead of relying on fixed 3-5 second sleeps, which made the Vercel preview timing less brittle. - -### TooEarlyError Retry Delay in Step Executor - -The `executeStep()` function handles `TooEarlyError` (thrown when a step's `retryAfter` timestamp hasn't been reached yet) by returning a `retry` result with a timeout. The original implementation used a stale access pattern `(err as any).meta?.retryAfter` copied from an older error shape. The `TooEarlyError` class (from `@workflow/errors`) has `retryAfter` as a direct property (number of seconds), not nested under `.meta`. The stale pattern always evaluated to `undefined`, falling back to a 1-second delay regardless of the server's actual retry-after value. - -**Fix**: Changed to `err.retryAfter ?? 1`, matching the correct pattern used in `step-handler.ts`. - -### Health Check Endpoint JSON Response - -The `withHealthCheck()` wrapper in `helpers.ts` was updated (on main) to return a JSON response with `{ healthy, endpoint, specVersion, workflowCoreVersion }` instead of a plain text string. The V2 branch's e2e test still expected `Content-Type: text/plain` and a text body after merging main, causing the "health check endpoint (HTTP)" test to fail across all frameworks and environments. - -**Fix**: Updated the e2e test to expect `Content-Type: application/json` and validate the JSON body structure, including a `specVersion >= SPEC_VERSION_CURRENT` range assertion. - -### V2 Combined Bundle Switched from CJS to ESM - -The V2 combined bundle was initially emitted as CJS by the standalone CLI and Vercel Build Output API builders, while `main` had already moved those outputs to ESM in [#1562](https://github.com/vercel/workflow/pull/1562). Staying on CJS meant `import.meta.url` was polyfilled (often producing the wrong path in re-bundled contexts), and the `world-testing` server had to import from `flow.js` via `createRequire` to force CJS semantics on what was really a CJS bundle. - -**Fix**: Align V2 with `main`'s ESM defaults: - -1. The BOA builder emits `__step_registrations.mjs` and `index.mjs`, writes `"type": "module"` in `package.json`, and sets `handler: "index.mjs"` in `.vc-config.json`. -2. The standalone builder no longer overrides `format`; it inherits the base builder's `'esm'` default. -3. The standalone config outputs `step.mjs` / `flow.mjs` instead of `.js`. -4. The `world-testing` server uses a native `import { POST } from '../.well-known/workflow/v1/flow.mjs'` instead of `createRequire`. -5. `createCombinedBundle`'s final esbuild pass (for `bundleFinalOutput: true`) now prepends the same `createRequire(import.meta.url)` banner used by the workflow/webhook bundles so CJS dependencies that call `require()` for Node.js builtins (for example the `events` module referenced by bundled libraries) still resolve at runtime. -6. To avoid a duplicate `__createRequire` declaration, the inner steps bundle that gets inlined by the final pass skips the banner — only the outer bundle emits it. This is threaded through via a new `skipEsmRequireBanner` option on `createStepsBundle`. - -### World specVersion in Health Check Responses - -The `getWorldHandlers()` return value was updated on main to include `specVersion` (the World's declared spec version). The V2 handler destructures this as `worldSpecVersion` and passes it to `handleHealthCheckMessage()` for inclusion in queue-based health check responses. This was merged alongside the V2 timeout configuration. - -### Async World Singleton Drift After Merge - -The later `main` merge changed `getWorld()` and `getWorldHandlers()` to be asynchronous promise-backed singletons, but the eager-processing branch still had synchronous call sites in the V2 runtime path. That left `packages/core/src/runtime.ts` and `packages/core/src/runtime/helpers.ts` trying to access `.events` on a `Promise`, which failed typecheck immediately after the merge. - -**Fix**: Rebases the V2 workflow entrypoint onto the async world API by lazily awaiting `getWorldHandlers()` when wiring the queue handler and awaiting `getWorld()` at the remaining runtime/helper call sites. This preserves the inline replay loop while matching `main`'s new world initialization contract. - -### Lazy World Loading for Next.js Production Builds - -After the async world merge, `packages/core/src/runtime/world.ts` still eagerly imported both `@workflow/world-local` and `@workflow/world-vercel`, and it initialized `createRequire()` from `process.cwd() + '/package.json'` at module load time. In the Next.js production build jobs that caused the generated flow route to pull `@workflow/world-vercel` and its `debug` dependency into local builds, then fail during page-data collection with `module.createRequire failed parsing argument` and `Dynamic require of "tty" is not supported`. - -**Fix**: Switched the runtime world loader to use `createRequire(import.meta.url)` and moved the local/Vercel world imports behind the existing async `createWorld()` branches. Local Next.js builds now only load the selected world implementation at runtime instead of bundling both worlds eagerly into the route module. - -### Deferred Next.js Builds Re-Ran Eager Discovery - -The later merge also pulled `BaseBuilder.createCombinedBundle()` into the deferred Next.js path without a way to pass the already-discovered workflow/step/serde entry sets. As a result, `packages/next/src/builder-deferred.ts` quietly fell back to `discoverEntries()` during production builds, re-emitting `Discovering workflow directives ...` and failing the local build tests that assert deferred mode avoids eager input-graph scans. - -**Fix**: Threaded explicit `discoveredEntries` through `createCombinedBundle()` and passed the deferred builder's tracked workflow/step/serde file sets into that call. Deferred Next.js builds now reuse the socket/cache-driven discovery state instead of re-running the base eager discovery pass. - -### Deferred Package Steps Fell Back to Compiled `dist/` Files - -Once deferred discovery stopped re-running the base eager scan, some package-provided steps were only being rediscovered from built artifacts such as `packages/ai/dist/agent/durable-agent.js`. Those compiled files no longer carried every nested `'use step'` directive, so local production Next.js builds could miss registrations like `@workflow/ai/agent`'s `closeStream` helper and fail at runtime with "step is not registered in the current deployment". - -**Fix**: The deferred Next.js builder now rewrites discovered workspace package paths from `dist/` back to their matching `src/` files when those sources exist. That keeps deferred bundling pointed at the directive-bearing source modules instead of their compiled output. - -### Workspace Source Step IDs Lost Export Subpaths - -Switching deferred builds over to workspace `src/` files fixed the missing nested directives, but it exposed a second mismatch in the SWC manifest path logic. `packages/builders/src/module-specifier.ts` only matched package exports against the on-disk file being transformed, so `packages/ai/src/agent/durable-agent.ts` was assigned `@workflow/ai@...` while the runtime still referenced the exported subpath id `@workflow/ai/agent@...`. Local Next.js agent runs then failed with "Step `step//@workflow/ai/agent@...//closeStream` is not registered" even though the source file was finally back in the bundle. - -**Fix**: `resolveModuleSpecifier()` now treats workspace source files as the source-backed form of their exported `dist/` targets when deriving step ids. That preserves package export subpaths like `@workflow/ai/agent` for id generation while still bundling the directive-bearing `src/` modules. - -### Tarball-Staged Next.js Builds Still Lost Package Step Sources - -The local production and Postgres Next.js jobs stage the workbenches by packing workspace packages into tarballs and installing those tarballs into a temporary `node_modules` tree. Deferred discovery was already willing to rewrite workspace `packages/*/dist/*` files back to `src/*`, but the tarballed `@workflow/ai` package did not publish its `src/` tree and the base builder still treated `node_modules/@workflow/*/src/*` as ordinary package imports. That meant the staged CI path fell back to `dist/` again and dropped nested steps like `@workflow/ai/agent`'s `closeStream`, even after the workspace build path had been fixed. - -**Fix**: Publish `packages/ai/src` in the tarball, treat source-backed `node_modules/@workflow/*/src/*` files like external workspace source files when generating bundle imports, and extend deferred transitive step discovery to follow bare `workflow` / `@workflow/*` package imports during non-watch builds. - -### Vercel Step Source Map Expectations Were Too Optimistic - -Merging `main` also pulled in a newer `hasStepSourceMaps()` expectation for Vercel preview deployments. On this branch, the non-Next workbench previews still emit step stacks without source filenames like `99_e2e.ts` or `helpers.ts`, so the Vercel step-error assertions regressed across the BOA-backed workbench matrix even though runtime behavior was otherwise unchanged. - -**Fix**: Revert the Vercel step source map expectation to the conservative branch behavior so preview e2e only asserts source filenames where this branch actually preserves them. Concretely, `hasStepSourceMaps()` returns `false` for *every* framework on Vercel deployments. Re-applying the blanket Vercel carve-out is what keeps the 11-framework `e2e-vercel-prod` matrix green while V2 source-map coverage catches up. The same pipeline regression also affects `nextjs-webpack` in local dev: pre-V2, webpack dev mode imported step sources directly so error stacks named `99_e2e.ts` / `helpers.ts`; under V2 the step bundle is inlined into the combined flow route and webpack's re-bundling collapses those filenames out of the dev-mode source maps. The helper now returns `false` for `nextjs-webpack` regardless of `DEV_TEST_CONFIG`. - -**Follow-up**: Wire up consumable inline source maps in the V2 step bundle across the framework integrations — the BOA-backed ones (Astro, Express, Fastify, Hono, Nitro, Nuxt, Vite, plus the standalone `example`), `nextjs-turbopack`, and `nextjs-webpack` (both dev and prod). The plan is to let each builder's `createCombinedBundle()` call carry an esbuild source-map pipeline that survives the framework's downstream re-bundling step, and then re-introduce the per-framework matrix in `hasStepSourceMaps()` so error stacks correctly point at `99_e2e.ts` / `helpers.ts` everywhere. Tracking this as a deferred follow-up rather than blocking the V2 cutover, since the runtime behavior is unaffected — only the surfaced filenames in step error stacks differ. - -### Community Worlds Still Used The Pre-`world.streams` API - -The Redis community-world benchmark still loads an external world package that has not adopted the newer `world.streams.*` interface yet. Once the eager-processing changes exercised stream writes through the modern namespace consistently, that adapter started failing with `Cannot read properties of undefined (reading 'writeMulti')` before the benchmark could even start. - -**Decision**: Community world adapters must implement the `world.streams.*` interface. The runtime legacy stream normalization (`normalizeLegacyWorld`) was removed. Community world e2e tests are skipped until the adapters are updated. - -### Build Output API Flow Handler Drift - -The Vercel Build Output API builder still emitted the combined flow function as `index.js`, but the surrounding metadata kept pointing at `index.mjs`. That mismatch meant BOA-based preview deployments published neither `/.well-known/workflow/v1/flow` nor the public manifest, so the Vercel production e2e suite collapsed into manifest `404` errors immediately after deployment. - -**Fix**: Updated `packages/builders/src/vercel-build-output-api.ts` to point both `.vc-config.json` and manifest extraction at `flow.func/index.js`, which matches the CommonJS file the builder actually writes. - -### Async World Loading Broke Custom Target Worlds - -The first lazy-world-loading fix switched package resolution over to `createRequire(import.meta.url)` globally. That solved the Next.js bundling problem for built-in worlds, but it also made custom targets like `@workflow/world-postgres` resolve relative to `@workflow/core` instead of the consuming app. Local Postgres tests then failed at startup with `Cannot find module '@workflow/world-postgres'`. - -**Fix**: The runtime now creates the package resolver lazily from `process.cwd()/package.json` when possible, falling back to `import.meta.url` only when the app root cannot be resolved. That keeps custom world modules app-relative without reintroducing the eager module-load failure in Next.js builds. - -### Core Logger Still Pulled `debug` Into Webpack Flow Routes - -Even after lazy world loading stopped eagerly importing `@workflow/world-vercel`, the generated Next.js webpack flow route still evaluated `packages/core/src/logger.ts` at module load. That file had a top-level `import debug from 'debug'`, which in turn pulled `debug/src/node` and its `tty` dynamic require into `/.well-known/workflow/v1/flow`. Webpack then failed during page-data collection with `Dynamic require of "tty" is not supported`. - -**Fix**: Replace the static `debug` dependency in the core logger with lightweight `process.env.DEBUG` matching plus `console.debug`. That keeps verbose opt-in logging for local debugging without forcing webpack to bundle `debug` and its Node-only terminal helpers into the flow route. - -### Deferred Next.js Builder Helper Drift After Merge - -Merging `main` into the eager-processing branch pulled in a set of helper methods for copied-step import rewriting in `packages/next/src/builder-deferred.ts`, but the corresponding call sites were not present on this branch yet. That left `getRelativeImportSpecifier`, `getStepCopyFileName`, and `rewriteRelativeImportsForCopiedStep` orphaned, and `@workflow/next` failed to build with `TS6133` unused-private-member errors immediately after the merge. - -**Fix**: Removed the orphaned helper methods during merge resolution and kept the existing deferred-builder behavior unchanged. The copied-step import-rewrite work should land as a complete change set rather than a partial backport from `main`. - -### Next.js React Step Fixture and `eval('require(...)')` - -The `nextjs-webpack` e2e suite still failed after the merge in `workflows/8_react_render.tsx`, where the step intentionally did `eval('require("react-dom/server")')` to avoid Next.js linting rules around importing `react-dom/server` directly. That pattern was brittle under webpack rebundling: even though the intermediate step bundle had a `createRequire(import.meta.url)` banner, the rebundled route still failed at runtime with `TypeError: require is not a function`. - -**Fix**: Updated the React-rendering step fixture in both Next.js workbenches to use `await import('react-dom/server')` instead. The test still exercises server-side React rendering inside a step, but no longer depends on bundler-specific `eval('require(...)')` behavior. - -## Inline Execution Verification Tests - -The `@workflow/world-testing` package includes invocation-counting tests that verify the V2 inline loop behavior for each workflow pattern: - -| Workflow Pattern | Expected Invocations | Why | -|-----------------|---------------------|-----| -| Sequential steps (3 adds) | **1** | All steps execute inline | -| Sequential steps + WritableStream | **1** | Ops settle via flush waiter promises (500ms race) | -| Sleep (1s) + step | **2** | Sleep requires queue round-trip | -| Promise.all (2 steps) | **2-3** | Background step + inline replay after all steps done | - -The test server tracks flow handler invocations per `runId` via an internal counter. Each test asserts the exact invocation count after the workflow completes. - -### world-testing Flow Invocation Counting Missed Wrapped Queue Payloads - -The inline-execution assertions in `packages/world-testing` count how many times the flow handler runs by inspecting the queue callback body and extracting `runId`. After the queue callback shape drifted, some worlds were only exposing the workflow payload under `body.payload.runId`, so the helper recorded `0` invocations even when the workflow completed correctly. That showed up in CI as the Postgres inline-execution spec failing its "single flow invocation" assertion. - -**Fix**: Accept both top-level `runId` and nested `payload.runId` when tracking flow invocations in the embedded test server. - -### Turbopack NFT Tracing Errors in V2 Combined Flow Route - -The V2 combined flow route imports the step registrations bundle (`__step_registrations.js`), which esbuild produces as a monolithic file. On `main`, step registrations live in a separate route (`step/route.js`), so Turbopack traces them independently. In V2, Turbopack traces the step registrations through the flow route's import graph, encountering `world.ts` code with `process.cwd()`, dynamic `import()` calls to `@workflow/world-local`/`@workflow/world-vercel`, and `createRequire()` patterns — all of which trigger fatal NFT (Node File Trace) errors. - -**Fix**: Introduced `get-world-lazy.ts`, a globalThis `Symbol.for`-based accessor that replaces the static `import { getWorld } from './runtime/world.js'` in all step-side modules (`serialization.ts`, `run.ts`, `helpers.ts`, `start.ts`, `resume-hook.ts`). This breaks the static import chain from step code to `world.ts`, preventing esbuild from bundling `world.ts` (and its transitive deps) into the step registrations. The step registrations bundle dropped from ~37k lines to ~6.6k lines (matching `main`), with zero `process.cwd()` or world package references. - -The `getWorldLazy()` function reads from the globalThis world singleton cache (populated by the runtime's `getWorld()` on first call). When the cache is empty (e.g., `start()` called from application code before any workflow runs), it falls back to a dynamic `import()` of `world.js` to initialize the world. - -Additional changes for Turbopack compatibility: -- Removed `stepEntrypoint` re-export from `runtime.ts` (V2 doesn't use separate step routes) -- Lazy-loaded `getPort` via `createRequire` with opaque specifier to prevent `@workflow/utils/get-port` filesystem operations from being traced -- `getRuntimeRequire()` uses `process.cwd()` as primary resolution base (for custom world packages like `@workflow/world-postgres` that are app-level deps, not `@workflow/core` deps), with `import.meta.url` fallback - -### Run#returnValue Worker Deadlock in V2 Inline Execution - -When a workflow uses a step-wrapped `start()` helper to spawn child workflows (e.g., `fibonacciWorkflow`), the parent's `Run#returnValue` step polls the child's completion status in a blocking loop (`while (true) { ... sleep(1000) ... }`). In V2, this step is executed inline by the step executor, holding a worker thread slot. If the child workflow's queue message is waiting for the same worker pool, the parent blocks the child from starting — a classic deadlock. - -**Fix**: `Run#pollReturnValue()` detects whether it's running inside a step executor (via `contextStorage.getStore()`) and, if so, throws `TooEarlyError` instead of polling in a blocking loop. `TooEarlyError` is handled specially by the step executor — it returns `{ type: 'retry', timeoutSeconds }` which re-queues the step with a 1-second delay, freeing the worker to process child workflows. Unlike `RetryableError`, `TooEarlyError` does NOT count against `maxRetries`, so polling steps can retry indefinitely until the child completes. - -When called from outside a step (e.g., test code, API routes), `pollReturnValue()` retains the original blocking loop behavior for backward compatibility. - -### Unconsumed Event Check Two-Phase Drain - -After merging `main`, the `EventsConsumer`'s unconsumed event check was updated with a two-phase promise queue drain: yield once after the first drain (via `setTimeout(0)`) so cross-VM promise chains can append follow-up async work, then re-drain before checking. This improves timing for scenarios like `step_completed` → for-await loop resume → next hook hydration. The V2 `onUnconsumedEvent` skip logic (returning `true` to advance past known-safe events) was preserved through the merge. - -The check additionally arms a `DEFERRED_CHECK_DELAY_MS = 100` `setTimeout` after the second drain, since Node.js does not guarantee that `setTimeout(0)` fires after all cross-context microtasks settle. Any `subscribe()` call arriving during that 100ms window cancels the check via version invalidation + `clearTimeout`, so the delay only adds latency to genuine corruption — never to the happy path. - -**Follow-up**: 100ms is a heuristic chosen empirically to cover cross-VM microtask propagation under the workflow runtime's worst-case scheduling. A deterministic settlement signal — for example, a "VM idle" callback exposed by the workflow VM bridge that fires only after all pending cross-context promise chains have resolved — would let the consumer fire the unconsumed-event check immediately on quiescence instead of waiting for a wall-clock timeout. That would tighten corruption detection (no spurious 100ms wait) and remove the only remaining wall-clock heuristic from the V2 inline loop's correctness path. - -### Nitro Builder Atomic File Writes - -After merging `main`, the Nitro builder now uses atomic temporary files (UUID-suffixed `.tmp` files) for build output, renaming them into place only after all builds succeed. This prevents partial/inconsistent output during dev HMR when a build fails mid-way. The V2 `createCombinedBundle` call was adapted to use this pattern. - -## Final Status - -All framework integrations pass across all test environments: - -| Test Suite | Frameworks | Status | -|-----------|-----------|--------| -| Unit Tests | core | 581/581 | -| Embedded Tests | world-testing | 9/9 (including inline execution) | -| Local Dev | 14 frameworks | All pass | -| Local Prod | 14 configurations | All pass | -| Postgres | 14 frameworks | All pass | -| Vercel Prod | 11 frameworks | All pass | -| Vercel Deployments | 15 projects | All succeed | -| Community Worlds | Turso, MongoDB, Redis | All pass | -| Windows | e2e | Pass | - -Known remaining flakes (same as main): - -- `webhookWorkflow` / `hookWorkflow` — timing-sensitive hook delivery diff --git a/docs/content/docs/v4/changelog/index.mdx b/docs/content/docs/v4/changelog/index.mdx index d3822cc9d2..a944b35ec7 100644 --- a/docs/content/docs/v4/changelog/index.mdx +++ b/docs/content/docs/v4/changelog/index.mdx @@ -12,5 +12,4 @@ Stay up to date with the latest changes to Workflow SDK. ## 2026 -- [Eager processing of steps and incremental event replay](/docs/changelog/eager-processing) - March 2026 - [Serializable AbortController and AbortSignal](/docs/changelog/serializable-abort-controller) — March 12, 2026 diff --git a/docs/content/docs/v4/changelog/meta.json b/docs/content/docs/v4/changelog/meta.json index 0c01dff133..ba12d32964 100644 --- a/docs/content/docs/v4/changelog/meta.json +++ b/docs/content/docs/v4/changelog/meta.json @@ -1,5 +1,5 @@ { "title": "Changelog", - "pages": ["index", "eager-processing", "resilient-start"], + "pages": ["index", "resilient-start"], "defaultOpen": false } diff --git a/docs/content/docs/v4/changelog/resilient-start.mdx b/docs/content/docs/v4/changelog/resilient-start.mdx index 653c439e0f..8560762807 100644 --- a/docs/content/docs/v4/changelog/resilient-start.mdx +++ b/docs/content/docs/v4/changelog/resilient-start.mdx @@ -7,321 +7,69 @@ description: Overhaul run start logic to tolerate world storage unavailability, ## Motivation -When `world` storage is unavailable but the queue is up, -`start()` previously failed entirely because `world.events.create(run_created)` -is called before `world.queue()`. This change decouples run creation from queue -dispatch so that runs can still be accepted when storage is degraded. +When `world` storage is unavailable but the queue is up, `start()` previously failed entirely because `world.events.create(run_created)` is called before `world.queue()`. This change decouples run creation from queue dispatch so that runs can still be accepted when storage is degraded. -Additionally, the runtime previously called `world.runs.get(runId)` before -`run_started`, adding an extra round-trip. By always calling `run_started` -directly, we save that round-trip and can return pre-loaded events in the -response to skip the initial `events.list` call, reducing TTFB. +Additionally, the runtime previously called `world.runs.get(runId)` before `run_started`, adding an extra round-trip. By always calling `run_started` directly, we save that round-trip and can return pre-loaded events in the response to skip the initial `events.list` call, reducing TTFB. ## Design -### `start()` changes (packages/core) +### `start()` changes -- `world.events.create` (run_created) and `world.queue` are now called **in parallel** - via `Promise.allSettled`. -- If `events.create` errors with **429 or 5xx**, we log a warning saying that run - creation failed but the run was accepted — creation will be re-tried async by the - runtime when it processes the queue message. The returned `Run` instance is marked - with `resilientStart = true`. -- If `events.create` errors with **409** (EntityConflictError), the run already exists - (e.g., the queue handler's resilient start path created it first due to a cold-start - race). This is treated as success. +- `world.events.create` (run_created) and `world.queue` are now called **in parallel** via `Promise.allSettled`. +- If `events.create` errors with **429 or 5xx**, we log a warning saying that run creation failed but the run was accepted — creation will be re-tried async by the runtime when it processes the queue message. The returned `Run` instance is marked with `resilientStart = true`. +- If `events.create` errors with **409** (EntityConflictError), the run already exists (e.g., the queue handler's resilient start path created it first due to a cold-start race). This is treated as success. - If `world.queue` fails, we still throw — the run truly failed and was not enqueued. -- The queue invocation now receives all the run inputs (`input`, `deploymentId`, - `workflowName`, `specVersion`, `executionContext`) via `runInput` so the runtime can - create the run later if needed. -- When the runtime re-enqueues itself, it does **not** pass these inputs — only the - first queue cycle carries them. +- The queue invocation now receives all the run inputs (`input`, `deploymentId`, `workflowName`, `specVersion`, `executionContext`) via `runInput` so the runtime can create the run later if needed. +- When the runtime re-enqueues itself, it does **not** pass these inputs — only the first queue cycle carries them. -### `workflowEntrypoint` changes (packages/core) +### `workflowEntrypoint` changes -- When calling `world.events.create` with `run_started`, we now also always pass the - run input that was sent through the queue, if available. The response will still be on off: - - **200 with event (now running)**: As usual, but the server could have used the run input to create the run if it didn't exist yet. The response will be opaque to the runtime. - - **200 without event (already running)**: As usual - - **409 or 410 (already finished)**: As usual +- When calling `world.events.create` with `run_started`, we now also always pass the run input that was sent through the queue, if available. The world is responsible for creating the run if it doesn't already exist. -### `Run.returnValue` polling (packages/core) +### `Run.returnValue` polling -- When `resilientStart` is true on the Run instance (run_created failed), the - `pollReturnValue` loop retries on `WorkflowRunNotFoundError` up to 3 times - (1s + 3s + 6s = 10s total) to give the queue time to deliver and the runtime - to create the run via `run_started`. -- When `resilientStart` is false (normal path), 404 fails immediately — no delay - for the common case of a wrong run ID. +- When `resilientStart` is true on the Run instance (run_created failed), the `pollReturnValue` loop retries on `WorkflowRunNotFoundError` up to 3 times (1s + 3s + 6s = 10s total) to give the queue time to deliver and the runtime to create the run via `run_started`. +- When `resilientStart` is false (normal path), 404 fails immediately — no delay for the common case of a wrong run ID. -### World / workflow-server changes +### World contract changes -- Posting `run_started` to a **non-existent** run is now allowed when the run input is - sent along with the payload. The server: - 1. Creates a `run_created` event first (so the event log is consistent). - 2. Strips the input from the `run_started` event data (it lives on `run_created`). - 3. Then creates the `run_started` event normally. - 4. Emits a log and a Datadog metric (`workflow_server.resilient_start.run_created_via_run_started`) - to track when this fallback path is hit. -- When `run_started` encounters an **already-running** run, all worlds return `{ run }` - with `event: undefined` instead of throwing. No duplicate event is created. +- Posting `run_started` to a **non-existent** run is now allowed when the run input is sent along with the payload. The world creates a `run_created` event first (so the event log is consistent), then creates the `run_started` event normally. +- When `run_started` encounters an **already-running** run, all worlds return `{ run }` with `event: undefined` instead of throwing. No duplicate event is created. ### Queue transport changes -`Uint8Array` values (the serialized workflow input in `runInput`) don't survive plain -JSON serialization. Each world uses a transport that preserves binary data: +`Uint8Array` values (the serialized workflow input in `runInput`) don't survive plain JSON serialization. Each world uses a transport that preserves binary data: -- **world-vercel**: CBOR transport — CBOR-encodes the entire queue payload into a - `Buffer` and uses `BufferTransport` from `@vercel/queue`. Uint8Array survives natively. -- **world-local**: `TypedJsonTransport` — uses the existing `jsonReplacer`/`jsonReviver` - from `fs.ts` that encode Uint8Array as `{ __type: 'Uint8Array', data: '' }`. -- **world-postgres**: Inline typed JSON transport — same tagged-envelope approach as - world-local, inlined since world-postgres doesn't import from world-local. +- **world-vercel**: CBOR transport — CBOR-encodes the entire queue payload into a `Buffer` and uses `BufferTransport` from `@vercel/queue`. Uint8Array survives natively. +- **world-local**: `TypedJsonTransport` — encodes Uint8Array as `{ __type: 'Uint8Array', data: '' }`. +- **world-postgres**: Inline typed JSON transport — same tagged-envelope approach as world-local. ## Decisions -1. **Parallel not sequential**: We chose `Promise.allSettled` over sequential calls to - minimize latency in the happy path. +1. **Parallel not sequential**: We chose `Promise.allSettled` over sequential calls to minimize latency in the happy path. -2. **Already-running returns run without event**: When `run_started` encounters an - already-running run, all worlds return `{ run }` with `event: undefined` (no - `events` array) instead of throwing. The runtime detects this by checking for - `result.event === undefined`. This avoids an extra `world.runs.get` round-trip. +2. **Already-running returns run without event**: When `run_started` encounters an already-running run, all worlds return `{ run }` with `event: undefined` (no `events` array) instead of throwing. The runtime detects this by checking for `result.event === undefined`. This avoids an extra `world.runs.get` round-trip. -3. **Events in 200 response**: We only return events on the 200 path (first caller). - On the already-running path, we fall back to the normal `events.list` call. This is - correct because only on 200 can we be certain we know the full event history. +3. **Events in 200 response**: We only return events on the 200 path (first caller). On the already-running path, we fall back to the normal `events.list` call. This is correct because only on 200 can we be certain we know the full event history. -4. **Conditional 404 retry on Run.returnValue**: Only when `resilientStart = true` - (run_created failed). Normal runs fail fast on 404. +4. **Conditional 404 retry on Run.returnValue**: Only when `resilientStart = true` (run_created failed). Normal runs fail fast on 404. ## Known concerns -### Cold-start race on Vercel (observed in CI) +### Cold-start race on Vercel -On Vercel, the parallel dispatch can cause the queue message to be processed before -`run_created` completes, if `run_created` hits a cold-start lambda. Confirmed via -Datadog: the `run_started` request hit a warm lambda (23ms) while `run_created` hit -a cold lambda (727ms), even though `run_created` arrived at the edge 116ms earlier. -When this happens: +On Vercel, the parallel dispatch can cause the queue message to be processed before `run_created` completes, if `run_created` hits a cold-start lambda. When this happens: 1. The runtime's resilient start path creates the run from `run_started`. 2. The original `run_created` arrives and gets 409 (EntityConflictError). 3. `start()` treats the 409 as success (the run exists). -This is handled correctly. The `resilientStart` flag is NOT set on the Run instance -in this case (409 is not a retryable error), so `returnValue` fails fast on 404. +The `resilientStart` flag is NOT set on the Run instance in this case (409 is not a retryable error), so `returnValue` fails fast on 404. -### Local Prod test flakiness (resolved) +### Atomicity of run entity creation -On world-local, the queue's async IIFE can deliver the message before -`events.create(run_created)` finishes writing to the shared filesystem. The -resilient start path should handle this, but Local Prod tests showed occasional -runs stuck at `pending` (no `run_started` event), and Windows CI showed -"Unconsumed event in event log" errors from duplicate `run_created` events. +The normal `run_created` path and the resilient start path can race on creating the run entity. In `world-local`, both paths use `writeExclusive` (O_CREAT|O_EXCL) — atomic at the OS level, so exactly one writer wins and the other gets EEXIST. The normal path throws `EntityConflictError` on conflict (handled by `start()` as 409); the resilient start path re-reads the run from disk on conflict. -**Root cause:** A TOCTOU race between the normal `run_created` path and the -resilient start path. Both used `writeJSON` which checks existence with -`fs.access()` (non-atomic), so both could pass the check and write separate -`run_created` events with different event IDs. Fixed by switching both paths to -`writeExclusive` (O_CREAT|O_EXCL) — see retrospective items 12 and 16. +In `world-postgres`, the resilient start path uses `onConflictDoNothing` plus a re-read on conflict for the same effect, with the same outcome on either side of the race. -## Follow-up work - -- [x] ~~Investigate Local Prod test flakiness~~ — resolved via `writeExclusive` - for run entity creation (retrospective items 12, 16). -- [ ] Monitor the Datadog metric in production to understand how often the fallback is hit. -- [x] ~~Events optimization for re-enqueue cycles~~ — decided against. The - already-running path returns early without writing an event, so preloading - events there would require an extra filesystem/DB query on every re-enqueue. - More importantly, on Vercel with at-least-once delivery, multiple lambdas can - process the same run concurrently — the event snapshot could be stale or - incomplete. The runtime's fallback to `events.list` is the correct behavior - for re-enqueue cycles. -- [x] ~~CborTransport pass-through~~ — refactored. `encode()`/`decode()` now - live inside `CborTransport.serialize()`/`deserialize()`, matching the pattern - used by TypedJsonTransport (world-local) and the inline transport - (world-postgres). Call sites pass plain objects instead of pre-encoded buffers. - -## Development retrospective - -Chronological log of mistakes, misunderstandings, and reverted approaches during -development. Included for future reference when working on similar cross-cutting -runtime changes. - -### 1. Uint8Array corruption through JSON queue transport - -The initial implementation passed `runInput.input` (a `Uint8Array`) directly through -the queue payload. `Uint8Array` doesn't survive `JSON.stringify` — it becomes -`{"0":72,"1":101,...}`. This corrupted the workflow input when the resilient start -path tried to recreate the run from the queue-delivered data. - -Caught by the `spawnWorkflowFromStepWorkflow` e2e test and the `world-testing` -embedded tests, which failed with "Invalid input" from devalue's `unflatten()`. - -Three approaches were tried before landing on the final solution: - -1. **Base64 encoding** (`btoa`/`atob`) — worked but fragile. The decode side used - `typeof runInput.input === 'string'` as a discriminant, which was flagged as - dangerous since non-binary inputs could also be strings. -2. **`Array.from()`/`new Uint8Array()`** — replaced base64 with a plain number array. - Two problems: (a) 3x JSON size regression vs base64, and (b) `Array.isArray()` - false-positives on v1Compat runs where `dehydrateWorkflowArguments` returns - devalue's flat Array format. -3. **CBOR + BufferTransport** (final) — world-vercel CBOR-encodes the queue payload; - world-local and world-postgres use a `TypedJsonTransport` with a tagged envelope. - -### 2. Forgot to commit world-postgres transport fix (twice) - -After fixing world-local and world-vercel queue transports, the same `JsonTransport` -corruption bug existed in world-postgres. The fix was written during a session but -never committed — lost when the working directory was reset via stash/checkout. This -happened twice. The fix only landed on the third attempt when it was committed and -pushed immediately. All 14 Postgres e2e jobs failed each time. - -### 3. Incorrect diagnosis of Vercel Prod 409 errors - -Multiple Vercel Prod e2e tests failed with `EntityConflictError: Workflow run with -ID wrun_... already exists` on `run_created`. The initial assumption was that VQS -couldn't deliver the queue message fast enough to beat the `run_created` call. - -Datadog logs showed otherwise: the `run_created` request arrived at Vercel's edge -116ms before `run_started`, but `run_created` hit a cold-start lambda (727ms) while -`run_started` hit a warm one (23ms). Cold starts can invert expected execution order. - -### 4. Removed EntityConflictError catch, then had to restore it - -The `workflowEntrypoint` error handler originally caught both `EntityConflictError` -and `RunExpiredError`. When adding the "already-running returns run without event" -behavior, `EntityConflictError` was removed from the catch since the new worlds -wouldn't throw it. Reviewer flagged this: old worlds or world-vercel hitting an -older workflow-server could still throw it. The catch was restored. - -### 5. Duplicate `startedAt` check - -After refactoring the `run_started` flow, a `workflowRun.startedAt` null check -existed both inside the `try` block and after the `catch` block. The second was -unreachable. Removed after review. - -### 6. WORKFLOW_SERVER_URL_OVERRIDE left set - -During development, `WORKFLOW_SERVER_URL_OVERRIDE` was set to a test URL pointing -at the workflow-server preview deployment and accidentally committed. The Vercel -bot flagged this. Reset to empty string. - -### 7. e2e test assertion was wrong - -The resilient start e2e test stubbed `world.events.create` and asserted -`createCallCount >= 2`. But the stub only intercepts calls from the test runner -process — the server uses its own world. `createCallCount` was always 1. Changed -to `expect(createCallCount).toBe(1)`. - -### 8. Misattributed Local Prod timeouts as "pre-existing" - -Local Prod tests showed 60-second timeouts across various tests. Initially dismissed -as CI flakes. Checking main's CI showed all Local Prod tests pass on main — the -timeouts are caused by our changes. Should have compared against main immediately. - -### 9. Attempted to revert parallel dispatch - -After identifying Local Prod timeouts, `start()` was partially reverted back to -sequential dispatch. The user pointed out that parallel dispatch is the core value -proposition of the PR. The revert was undone. - -### 10. WorkflowRunNotFoundError retry was unconditional - -The initial `pollReturnValue` retry on `WorkflowRunNotFoundError` applied to all -`Run` instances. A user calling `getRun()` with a wrong ID would wait 10 seconds -before getting a 404. Fixed by adding a `resilientStart` flag: only retries when -`run_created` actually failed. - -### 11. Changeset `minor` vs `patch` - -The changeset was created with `"@workflow/core": minor`. Reviewer flagged this as -violating repo rules ("all changes should be patch"). Changed after discussion. - -### 12. world-local TOCTOU race causing duplicate `run_created` events (Windows CI) - -The resilient start path AND the normal `run_created` path in `world-local/events-storage.ts` -both used `writeJSON` to create the run entity. `writeJSON` checks file existence with -`fs.access()` then writes via temp+rename — a classic TOCTOU race. On the local world, -the queue delivers via an async IIFE in the same event loop, so `events.create(run_created)` -and `events.create(run_started)` (with resilient start) run concurrently: - -1. Both paths call `fs.access(runPath)` → ENOENT (file doesn't exist yet) -2. Both proceed to write → the last `fs.rename` wins -3. Both succeed → both write their own `run_created` event with different event IDs -4. During replay, the consumer sees two `run_created` events → "Unconsumed event" error - -This caused consistent failures in `world-testing` embedded tests on Windows CI (`hooks`, -`supports null bytes in step results`, `retriable and fatal errors` — all timing out at -60s with "Unconsumed event in event log" errors). Linux CI was not affected because the -timing was different enough that the race window was rarely hit. - -Fixed by switching BOTH paths to `writeExclusive` (O_CREAT|O_EXCL), which is atomic at -the OS level — exactly one writer wins, the other gets EEXIST. The normal `run_created` -path throws `EntityConflictError` on conflict (handled by `start()` as 409). The resilient -start path re-reads the run from disk on conflict. Either way, only one `run_created` -event is written. - -### 13. Non-atomic run + run_created event in world-postgres resilient path - -The resilient start path in `world-postgres/storage.ts` did two separate writes (run -insert, then event insert) without a transaction. If the process crashed between them, -the run would exist without a `run_created` event — an inconsistent event log. - -A `drizzle.transaction()` wrapper was attempted but dropped due to TypeScript inference -issues with drizzle's transaction callback and the insert builder's overloads. The current -fix keeps the two writes sequential but adds the same conflict-aware re-read pattern as -world-local: when `onConflictDoNothing` produces no result (run already existed), the run -is re-read so downstream logic sees the real state. The narrow crash window between the -two writes is acceptable — if the run insert succeeds but the event insert crashes, the -run exists and `run_started` will still proceed normally (the event log will be missing a -`run_created` entry, but the run itself is functional). - -### 14. Missing `WorkflowRunStatus` span attribute after parallel refactor - -The `start()` span previously set `Attribute.WorkflowRunStatus(result.run.status)`, but -this was dropped in the parallel refactor because `result.run` is only available when -`runCreatedResult` fulfilled. The attribute is now conditionally set when the result is -available. In the resilient start case (run_created failed), the attribute is omitted -rather than erroring. - -### 15. `run_started` eventData leak in world-postgres result - -The `...data` spread in the result construction leaked `eventData` from `run_started` -into the returned event object. Storage was already correct (`storedEventData` is -`undefined` for `run_started`), but the returned result carried the input data. While -harmless (the runtime doesn't use `result.event.eventData`), it was restored to match -the pre-refactor behavior where eventData was explicitly stripped from the result. - -### 16. Normal `run_created` path also needed `writeExclusive` (Windows CI) - -The initial TOCTOU fix (item 12) only changed the resilient start path to use -`writeExclusive`. The normal `run_created` entity write still used `writeJSON` which -checks existence with `fs.access()` then writes via temp+rename — not atomic. On -Windows CI, the local queue's async IIFE delivered fast enough for both paths to pass -their existence checks simultaneously, producing two `run_created` events with different -event IDs. The events consumer saw the duplicate as "Unconsumed event in event log," -causing `hooks`, `supports null bytes in step results`, and `retriable and fatal errors` -tests to time out at 60s. Fixed by also switching the normal `run_created` entity write to -`writeExclusive`, making both paths use the same atomic gate. - -### 17. CborTransport was a pass-through wrapper - -`world-vercel/queue.ts` had `CborTransport` implementing `Transport` with a -no-op `serialize` (identity function) and a `deserialize` that reassembled chunks into -a Buffer without decoding. The actual CBOR `encode()`/`decode()` calls happened at the -call sites — `queue()` pre-encoded before calling `client.send()`, and the handler -post-decoded after receiving from `client.handleCallback()`. This violated the transport -abstraction (every other transport does its encoding inside serialize/deserialize) and -meant the call site had to remember to pre-encode. Refactored to move `encode()`/`decode()` -into the transport methods and changed the type from `Transport` to -`Transport`. - -## Follow-up work (additional) - -- [x] ~~**CborTransport is a pass-through**~~ — Resolved. Moved `encode()`/`decode()` - into `CborTransport.serialize()`/`CborTransport.deserialize()`. The transport is now - self-contained: call sites pass plain objects, and the handler receives decoded objects. - See retrospective item 17. +The narrow crash window in `world-postgres` between the run insert and the event insert is acceptable — if the run insert succeeds but the event insert crashes, the run exists and `run_started` will still proceed normally (the event log will be missing a `run_created` entry, but the run itself is functional). diff --git a/docs/content/docs/v5/changelog/eager-processing.mdx b/docs/content/docs/v5/changelog/eager-processing.mdx index 141e656c81..c854f1e65f 100644 --- a/docs/content/docs/v5/changelog/eager-processing.mdx +++ b/docs/content/docs/v5/changelog/eager-processing.mdx @@ -8,12 +8,11 @@ type: overview **Date**: March 2026 -This is a major internal architecture change to how Workflow DevKit executes workflows and steps on the Vercel platform. It reduces function invocations and queue overhead by executing steps _inline_ within the same function invocation as the workflow replay, rather than dispatching every step to a separate function via the queue. +This is a major internal architecture change to how Workflow DevKit executes workflows and steps. It reduces function invocations and queue overhead by executing steps _inline_ within the same function invocation as the workflow replay, rather than dispatching every step to a separate function via the queue. ## Previous Architecture -The previous architecture used two separate routes, -each backed by its own queue trigger: +The previous architecture used two separate routes, each backed by its own queue trigger: ``` Queue: __wkf_workflow_* --> /.well-known/workflow/v1/flow (workflow replay in VM) @@ -69,118 +68,64 @@ suspension with pending operations A serial workflow with 10 steps now completes in **1 function invocation**. -## Inline Step Execution - -After the workflow suspends with pending steps, the handler executes one step inline: - -1. Create `step_started` event -2. Hydrate step input from the event log -3. Look up the step function via `getStepFunction(stepName)` -4. Execute the step function -5. Create `step_completed` or `step_failed` event -6. Loop back to workflow replay - -This logic lives in `executeStep()` in `packages/core/src/runtime/step-executor.ts`. - ## Background Steps (Parallel Execution) -When a workflow suspends with multiple pending steps (e.g., from `Promise.all`), the handler: - -1. Creates `step_created` events for all pending steps -2. Queues N-1 steps back to `__wkf_workflow_*` with `stepId` and `stepName` in the message payload -3. Executes 1 step inline -4. Loops back to replay +When a workflow suspends with multiple pending steps (e.g., from `Promise.all`), the handler creates `step_created` events for all of them, queues N-1 back to `__wkf_workflow_*` with `stepId` and `stepName` in the message payload, executes 1 step inline, and loops back to replay. -Each background step message is handled by a separate function invocation of the same handler. When a message arrives with `stepId` and `stepName`, the handler executes that specific step, then checks if all parallel steps from the batch are done by loading the event log and comparing `step_created` events against terminal events (`step_completed`/`step_failed`): +Each background step message is handled by a separate function invocation of the same handler. When a message arrives with `stepId` and `stepName`, the handler executes that specific step, then checks if all parallel steps from the batch are done by comparing `step_created` events against terminal events (`step_completed`/`step_failed`): -- **All steps done**: The handler replays the workflow inline, continuing the execution loop without a queue roundtrip. Events are loaded with a cursor so subsequent loop iterations can use incremental loading. +- **All steps done**: The handler replays the workflow inline, continuing the execution loop without a queue roundtrip. - **Steps still pending**: The handler returns without queuing a continuation. The last handler to complete its step will see all steps done and replay inline. -- **Pending ops (stream writes)**: The handler queues a continuation and returns, so `waitUntil` can flush the pending stream data to the server. +- **Pending ops (stream writes)**: The handler queues a continuation and returns, so `waitUntil` can flush the pending stream data. ### Convergence After Parallel Steps -When multiple background steps complete near-simultaneously, multiple handlers may observe "all steps done" and attempt to advance the workflow concurrently. The event-sourced architecture plus the invariants described below ensure safe convergence: +When multiple background steps complete near-simultaneously, multiple handlers may observe "all steps done" and attempt to advance the workflow concurrently. The event-sourced architecture plus the invariants below ensure safe convergence: -- **`step_created` idempotency** --- duplicate creates return 409; exactly one handler owns each step -- **`step_completed` / `step_failed` idempotency** --- only the first invocation to record a terminal result wins -- **Queue idempotency keys** --- background step messages use `correlationId` as idempotency key -- **Deterministic replay** --- all invocations produce the same result given the same event log +- **`step_created` idempotency** — duplicate creates return 409; exactly one handler owns each step +- **`step_completed` / `step_failed` idempotency** — only the first invocation to record a terminal result wins +- **Queue idempotency keys** — background step messages use `correlationId` as idempotency key +- **Deterministic replay** — all invocations produce the same result given the same event log ### Single Inline Executor Per Step -Inline step execution combined with background-step dispatch introduces a new coordination requirement that the V1 handler did not face: when multiple handlers reach the same `Promise.all` batch concurrently, we need to guarantee that each step body runs at most once via the inline path. Without that guarantee, the event log accumulates duplicate `step_started` events (including some written *after* `step_completed`, which orphans them on replay) and step bodies run redundantly. +Inline step execution combined with background-step dispatch introduces a new coordination requirement: when multiple handlers reach the same `Promise.all` batch concurrently, we need to guarantee that each step body runs at most once via the inline path. Without that guarantee, the event log accumulates duplicate `step_started` events (including some written *after* `step_completed`, which orphans them on replay) and step bodies run redundantly. The design enforces a simple invariant: **exactly one handler owns each step, and only the owner may execute it inline**. Ownership is established by the atomicity of `step_created`: -1. **Atomic `step_created`** --- the world's `events.create('step_created', correlationId=X)` is serialized per-correlationId. Exactly one concurrent caller succeeds; the rest receive `EntityConflictError`. In production worlds (Postgres, Vercel) this is enforced at the SQL/DB layer. In `world-local`, a per-step in-process async mutex in `packages/world-local/src/storage/events-storage.ts` wraps every step lifecycle event's check-and-write so the same guarantee holds for dev. -2. **Suspension handler reports ownership** --- `handleSuspension()` returns `createdStepCorrelationIds: Set`, populated only for `step_created` writes that actually succeeded (not those that caught 409). -3. **Inline execution is gated on ownership** --- the runtime loop in `packages/core/src/runtime.ts` picks its inline step from `pendingSteps.filter(s => createdStepCorrelationIds.has(s.correlationId))`. A handler that didn't win any `step_created` race performs no inline execution. -4. **Queueing is unconditional** --- for every pending step except the one being inline-executed, the handler enqueues a background step message with `idempotencyKey: correlationId`. This matches V1's enqueue pattern and is what makes crash recovery work: if a prior handler wrote `step_created` but crashed before enqueueing, a later handler (from flow-message redelivery or `reenqueueActiveRuns`) will enqueue the orphaned step. Concurrent handlers' redundant enqueues dedupe on the idempotency key. +1. **Atomic `step_created`** — `events.create('step_created', correlationId=X)` is serialized per-correlationId in every world. Exactly one concurrent caller succeeds; the rest receive `EntityConflictError`. +2. **Suspension handler reports ownership** — only `step_created` writes that actually succeeded (not those that caught 409) count toward ownership. +3. **Inline execution is gated on ownership** — a handler that didn't win any `step_created` race performs no inline execution. +4. **Queueing is unconditional** — for every pending step except the one being inline-executed, the handler enqueues a background step message with `idempotencyKey: correlationId`. This is what makes crash recovery work: if a prior handler wrote `step_created` but crashed before enqueueing, a later handler will enqueue the orphaned step. Concurrent handlers' redundant enqueues dedupe on the idempotency key. -Together these give: every `step_created` event has exactly one inline executor (possibly zero if the inline path was skipped due to crash) **and** at least one queued dispatch (from whichever handler first reaches the suspension path after the `step_created` is visible). Step bodies are never executed concurrently, and `step_started` events never land in the log after `step_completed` for the same step. Event-log replay sees clean subscriber-matched sequences. With this invariant in place, the earlier `onUnconsumedEvent` skip logic for step/hook/wait lifecycle events was removed — any unconsumed event now immediately fatals as a corrupted event log (its original purpose before the V2 work). +Together these give: every `step_created` event has exactly one inline executor **and** at least one queued dispatch. Step bodies are never executed concurrently, and `step_started` events never land in the log after `step_completed` for the same step. -**Retry semantics are preserved**: the per-step mutex in `world-local` only rejects `step_started` when the step is already in a *terminal* state (`completed` / `failed`). A step that is currently running (status=`running`) still accepts a second `step_started` write with an incremented attempt counter — this is how queue redelivery after a SIGKILL mid-execution legitimately re-runs the step. The previously-documented "attempt counter inflation" failure mode is therefore no longer reachable via the concurrent-inline path; see "Concurrent `step_started` Inflating Attempt Counter" below for the complementary executor-side guard that still catches edge cases (e.g., postgres retries under high contention). +**Retry semantics are preserved**: a step that is currently running (status=`running`) still accepts a second `step_started` write with an incremented attempt counter — this is how queue redelivery after a SIGKILL mid-execution legitimately re-runs the step. ## Incremental Event Loading The handler caches the event log in memory across loop iterations. Instead of re-fetching the entire event log on each replay: -1. **First iteration**: full load via `getAllWorkflowRunEventsWithCursor()`, which returns both the events and the final pagination cursor -2. **Subsequent iterations**: `getNewWorkflowRunEvents(runId, cursor)` fetches only events created after the saved cursor and appends them to the cached array +1. **First iteration**: full load, returning both the events and the final pagination cursor +2. **Subsequent iterations**: fetch only events created after the saved cursor and append them to the cached array For a 10-step serial workflow completing in one invocation, the 10th replay loads ~2 new events instead of re-fetching all ~30. -### Server-Side Cursor Fix - -The incremental loading depends on the server returning a cursor even on the final page of results (`hasMore: false`). Previously, `workflow-server` returned `cursor: null` when there were no more pages. This was fixed in the `peter/fix-end-cursor` branch to always return an `eid:` cursor when there are events, aligning with `world-local` and `world-postgres` behavior. - -If a World implementation does not return a cursor after the initial load, the handler logs an error and falls back to a full reload. +Incremental loading depends on the World returning a cursor even on the final page of results. If a World implementation does not return a cursor after the initial load, the handler logs an error and falls back to a full reload. ## Timeout Handling -The inline execution loop checks wall-clock time before each replay iteration. If the elapsed time exceeds a configurable threshold (default: 110 seconds, for a 120-second function limit), the handler re-schedules itself via the queue and returns. - -The threshold is configurable via the `WORKFLOW_V2_TIMEOUT_MS` environment variable. +The inline execution loop checks wall-clock time before each replay iteration. If the elapsed time exceeds a configurable threshold (default: 110 seconds, for a 120-second function limit), the handler re-schedules itself via the queue and returns. Configurable via `WORKFLOW_V2_TIMEOUT_MS`. If a single step takes longer than the timeout threshold, the step runs to completion (or SIGKILL) — there is no interruption mechanism for in-progress step execution. This is the same behavior as the previous architecture. ## Queue Message Changes -The `WorkflowInvokePayload` schema has two new optional fields: - -{/*@skip-typecheck - snippet, not runnable code*/} - -```typescript -stepId: z.string().optional() -stepName: z.string().optional() -``` - -When `stepId` is present, the handler executes that specific step before (or instead of) replaying the workflow. Background steps are queued with both `stepId` and `stepName` set, so the handler knows which step function to call without loading the event log. Previously, `stepName` was resolved by loading all events and searching for the `step_created` event matching the `stepId` — an O(N) operation on the full event history for every background step arrival. +The `WorkflowInvokePayload` schema has two new optional fields: `stepId` and `stepName`. When `stepId` is present, the handler executes that specific step before (or instead of) replaying the workflow. Background steps are queued with both set, so the handler knows which step function to call without loading the event log. Previously, `stepName` was resolved by loading all events and searching for the `step_created` event matching the `stepId` — an O(N) operation on the full event history for every background step arrival. The queue trigger configuration uses `WORKFLOW_QUEUE_TRIGGER` on the `__wkf_workflow_*` topic. The `__wkf_step_*` topic and its separate trigger are no longer generated. -## Builder Changes - -### Base Builder - -New method `createCombinedBundle()` in `packages/builders/src/base-builder.ts`: - -1. Builds the step registrations bundle (same esbuild + SWC step mode as before) -2. Builds the workflow VM code string (same esbuild + SWC workflow mode as before) -3. Generates a combined route file that imports the step registrations and uses `workflowEntrypoint(workflowCode)` - -No changes to the SWC plugin were needed. The two-pass build approach (separate step and workflow SWC modes) still applies. - -### Framework Builders - -All framework builders were updated to use `createCombinedBundle()`: - -- **Next.js** (eager and deferred/lazyDiscovery): replaces separate step + flow route generation -- **NestJS, Nitro, Standalone**: replaces separate `createStepsBundle()` + `createWorkflowsBundle()` calls -- **SvelteKit, Astro**: same, plus post-processing regex updated to match `workflowEntrypoint` -- **Vercel Build Output API** (used by Nitro/Astro production): single `flow.func/` with `WORKFLOW_QUEUE_TRIGGER` - -### Generated File Layout +## Generated File Layout ``` .well-known/workflow/v1/ @@ -196,37 +141,19 @@ All framework builders were updated to use `createCombinedBundle()`: The `step/` directory is no longer generated. -## Suspension Handler - -`handleSuspension()` in `packages/core/src/runtime/suspension-handler.ts` creates events for all pending operations (hooks, step events, wait events) but does **not** queue step messages. It returns the pending step items so the handler can decide which to execute inline vs. queue to background. - -## Concerns and Edge Cases +## Design Notes and Tradeoffs ### Parent→Child Polling Holds Worker Slots `Run#returnValue` is implemented as a polling step: the workflow awaits the child run's terminal status inside a step body. In worker-based worlds (notably `world-postgres`), each such poll occupies a queue worker slot until the child run finishes. Parent workflows that fan out to many child runs — recursive workflows like `fibonacciWorkflow` are the obvious case — can therefore consume a large fraction of available workers just holding positions in `Promise.all([...children.map(c => c.returnValue)])`. -If `queueConcurrency` is smaller than the peak number of concurrent parent polls plus the workers needed for any in-flight children, the system deadlocks: every slot is held by a parent waiting on a child, but no child can acquire a slot to start. An earlier iteration of this work mitigated the deadlock runtime-side by detecting step context (via `contextStorage.getStore()`) and throwing `TooEarlyError` to re-enqueue the polling step, freeing the worker. That runtime guard has been removed — the responsibility now lies with worker-pool sizing. +If `queueConcurrency` is smaller than the peak number of concurrent parent polls plus the workers needed for any in-flight children, the system deadlocks: every slot is held by a parent waiting on a child, but no child can acquire a slot to start. -For `world-postgres`, the default `queueConcurrency` is set to **50**, which is comfortably above the ~24 concurrent polls `fibonacciWorkflow(6)` produces at peak. Workflows that fan out more aggressively must raise this ceiling. `packages/core/src/runtime/run.ts`, the `queueConcurrency` option on `createWorld()` for `world-postgres`, and the `fibonacciWorkflow` fixture in `workbench/example/workflows/99_e2e.ts` all carry pointers to this caveat. +For `world-postgres`, the default `queueConcurrency` is set to **50**. Workflows that fan out more aggressively must raise this ceiling. -**Follow-up**: Replace the worker-pool sizing requirement with a polling design that does not occupy a worker slot. Options under consideration: (a) restore a runtime-side `TooEarlyError` re-enqueue path but make it visible to the user (rather than the silent guard the earlier iteration shipped), (b) move child-completion polling out of the step body into the suspension layer so a parent waiting on a child does not consume queue capacity at all, or (c) emit a `run_completed` notification on the parent's stream/queue so the parent only resumes when the child actually finishes. The current `queueConcurrency=50` default is a workaround, not a long-term answer — workflows with deep recursion or large fan-out can still exhaust workers regardless of how high we set the ceiling. +To prevent deadlock when polling is executed inline by the step executor, `Run#pollReturnValue()` detects when it's running inside a step executor and throws `TooEarlyError` instead of polling in a blocking loop. The step executor handles `TooEarlyError` by re-queueing the step with a 1-second delay, freeing the worker. Unlike `RetryableError`, `TooEarlyError` does NOT count against `maxRetries`, so polling steps can retry indefinitely until the child completes. -### VM Sandboxing - -Workflow code still runs in a Node.js VM for determinism and sandboxing. Step code runs in the Node.js host context. The only change is that both happen within the same function invocation. - -### Bundle Size and Cold Start - -The combined bundle is larger (contains both step code and workflow VM code). Cold start time increases slightly. The reduction in total function invocations more than compensates. - -### Step Retries - -When an inline step fails with retries remaining: - -- `RetryableError` with explicit `retryAfter` delay: re-queue to self with `stepId` and delay -- Transient errors with immediate retry: re-queue to self with `stepId` (delay = 1s) -- `FatalError`: fail immediately +**Follow-up**: Replace the worker-pool sizing requirement with a polling design that does not occupy a worker slot. Options under consideration: moving child-completion polling out of the step body into the suspension layer, or emitting a `run_completed` notification on the parent's stream/queue so the parent only resumes when the child actually finishes. ### Mixed Suspensions @@ -236,360 +163,87 @@ A suspension may contain steps, hooks, and waits simultaneously. The handler cre - **Steps + at least one wait**: every step is queued (no inline execution). The handler returns with the wait timeout. Whichever lands first — a step's continuation or the wait timer — drives the next replay. - **Hooks / waits only**: handler returns with the wait timeout (or no timeout, for hook-only suspensions). The next continuation is driven by external resume or the wait timer. -The "no inline when there's a wait" carve-out is necessary to preserve `Promise.race(step, sleep)` semantics. Inline `await executeStep(...)` blocks the handler for the full step duration, and `wait_completed` events are only created on the *next* loop iteration's "complete elapsed waits" pass — so a longer-running step would always swallow the shorter sleep and `Promise.race` would resolve incorrectly. Queueing the step in this case lets the wait timer drive a continuation in parallel, matching V1's behavior where each step ran in a separate function invocation. +The "no inline when there's a wait" carve-out is necessary to preserve `Promise.race(step, sleep)` semantics. Inline `await executeStep(...)` blocks the handler for the full step duration, and `wait_completed` events are only created on the *next* loop iteration's "complete elapsed waits" pass — so a longer-running step would always swallow the shorter sleep and `Promise.race` would resolve incorrectly. Queueing the step in this case lets the wait timer drive a continuation in parallel. Pure step suspensions (without waits) still benefit from inline execution; the carve-out only costs an extra queue roundtrip when a step and a sleep coexist. -### Hook Conflicts - -If a hook conflict is detected during suspension handling, the handler breaks the loop and returns `{ timeoutSeconds: 0 }` for immediate re-invocation, same as the previous behavior. - -### Encryption Key Resolution - -Encryption keys are resolved once before the inline execution loop starts (after the run status is confirmed as `running`) and reused across all iterations. Background step executions resolve the key independently. The key does not change within a run. - -## Framework Support - -All framework integrations have been updated: Next.js (eager and deferred/lazyDiscovery), NestJS, SvelteKit, Astro, Nitro/Nuxt/Hono/Express/Vite, and CLI standalone. The Vercel Build Output API builder (used by Nitro and Astro for production deploys) also uses the combined bundle with `WORKFLOW_QUEUE_TRIGGER`. - -## Non-Next.js Integration Challenges - -### Module Scope Duplication in Re-Bundled Output - -Builders that use `bundleFinalOutput: true` (standalone CLI, Vercel Build Output API, NestJS) produce a single file where esbuild re-bundles the step registrations and the workflow runtime together. esbuild creates isolated module scopes for each source module, even within the same output file. This meant `registerStepFunction` and `getStepFunction` operated on different `Map` instances — steps were registered into one Map but looked up from another. - -**Fix**: The step function registry (`registeredSteps` Map in `@workflow/core/private`) and the step context storage (`contextStorage` AsyncLocalStorage in `@workflow/core/step/context-storage`) were changed from module-scoped variables to `globalThis` singletons using `Symbol.for`. This ensures all esbuild module scopes share the same instances. The pattern was already used in the codebase for the World singleton and the class serialization registry. - -### Workflow Package CJS Export Condition - -The `workflow` package's root export has `"require": "./dist/typescript-plugin.cjs"` for TypeScript editor plugin loading. When esbuild bundles with CJS format, it resolves `import { defineHook } from 'workflow'` via the `require` condition, getting the TS plugin instead of the API. - -**Fix**: Added a `"node"` condition (`"node": "./dist/index.js"`) before the `"require"` condition in the workflow package's exports. esbuild with `conditions: ['node']` matches `"node"` first and uses the correct API entry. TypeScript's plugin loader doesn't use `conditions: ['node']`, so it still falls through to `"require"` for the TS plugin. - -### Local World Concurrent Replay Interference - -The local development world (`world-local`) processes queue messages with high concurrency (default: 1000). With the V2 combined handler, parallel steps generate multiple workflow continuation messages. When these are processed concurrently, each triggers a replay that sees in-flight events from other concurrent replays. This causes "unconsumed event" errors because the event consumer encounters events that don't match any subscriber in the current replay state. - -In production (Vercel), this doesn't happen — each function invocation is isolated with its own event loading. - -**Fix**: The `EventsConsumer`'s `onUnconsumedEvent` callback (see "Concurrent Replay Interference with Multi-Batch Workflows" below) handles the concurrent event visibility issue. The V2 inline replay optimization (where the last background step to complete replays inline instead of queuing) further reduces concurrent replays. Redundant step executions from concurrent handlers are harmless due to `step_completed` idempotency — only the first completion wins. - -### ESM `bundleFinalOutput` and Dynamic Require Errors - -When `bundleFinalOutput: true` is used with ESM format, esbuild bundles CJS dependencies (like `debug`) into the output. CJS `require()` calls are wrapped in esbuild's `__require` polyfill, which throws "Dynamic require of X is not supported" in ESM contexts where `require` is undefined. This affected all ESM-based framework builders (Nitro, NestJS, SvelteKit, Astro) that were switched to `bundleFinalOutput: true` during the V2 migration. - -**Fix**: ESM builders use `bundleFinalOutput: false` with `externalizeNonSteps: true`, matching the pre-V2 behavior. The framework's own bundler (Vite, Rollup, Turbopack) handles dependency resolution. The standalone CLI and Vercel Build Output API builders use `bundleFinalOutput: true` with ESM output plus a `createRequire(import.meta.url)` banner (see "V2 Combined Bundle Switched from CJS to ESM" below) so CJS dependencies can still call `require()` for Node.js builtins. - -### Rollup Tree-Shaking of Step Registrations - -When `bundleFinalOutput: false` is used with Nitro's rollup pipeline, the step registrations bundle (`steps.mjs`) only contains side-effect code (`registerStepFunction` calls) with no exports. Rollup tree-shakes the entire module because it has no used exports, removing all step registrations from the production bundle. This causes "Step not found" errors at runtime. - -**Fix**: The steps bundle now exports a sentinel value (`export const __steps_registered = true`), and the combined route file imports it (`import { __steps_registered } from './steps.mjs'`). This gives rollup a used binding to track, preventing it from dropping the module and its side effects. - -### Concurrent Replay Interference with Multi-Batch Workflows (historic) - -An earlier iteration of the V2 work hit "Unconsumed event in event log" errors when multiple concurrent handlers raced into the same batch boundary. The diagnosis at the time was that concurrent handlers could see events the current replay hadn't reached yet, and the mitigation was a skip path in `onUnconsumedEvent` that tolerated step/hook/wait lifecycle events whose correlationId had a matching `step_created` / `hook_created` / earlier `wait_completed` in the log. - -Later work on the "Single Inline Executor Per Step" invariant (described above) identified the actual root cause: duplicate `step_started` events were being written *after* `step_completed` on the same step, because the local world's `step_started` was not atomic w.r.t. terminal state and the main loop was re-picking already-queued steps for inline execution. Fixing those at the source (per-step mutex in `world-local` + ownership-gated inline dispatch + unconditional queueing with idempotency keys) eliminated the unconsumed-step-event path entirely, and fixing the `wait_completed` cursor bug (the main loop manually pushed `wait_completed` events without advancing `eventsCursor`, so the next incremental fetch re-returned them as local-array duplicates) eliminated the wait case. +### VM Sandboxing -**Current state**: the `onUnconsumedEvent` skip logic has been removed. Any unconsumed event now fatals the run with `CORRUPTED_EVENT_LOG`, matching the original contract from PR #1055. Incremental event loading in `runtime.ts` dedupes by `eventId` to tolerate any residual manual pushes. +Workflow code still runs in a Node.js VM for determinism and sandboxing. Step code runs in the Node.js host context. The only change is that both happen within the same function invocation. -### Stale V1 Artifacts in Build Caches +### Bundle Size and Cold Start -SvelteKit and Astro's build caches (including Vercel's) may preserve the old V1 `step/` route directory from previous builds. When the V2 builder runs, it no longer generates step routes, but the stale files remain and cause build failures (e.g., importing the removed `stepEntrypoint`). Additionally, SvelteKit's `beforeExit` hook that patches `.vc-config.json` files for Vercel deployments was still trying to configure the non-existent `step.func/` directory. +The combined bundle is larger (contains both step code and workflow VM code). Cold start time increases slightly. The reduction in total function invocations more than compensates. -**Fix**: SvelteKit and Astro builders now clean up stale V1 step route directories during build. SvelteKit's Vercel deployment hook was updated to only configure the combined `flow.func/` directory. +### Step Retries -### Next.js Canary Turbopack and Temp Files +When an inline step fails with retries remaining: -The deferred (lazyDiscovery) Next.js builder writes build artifacts with a `.temp` extension to avoid HMR churn, then copies them to their final names. The V2 migration created `__step_registrations.route.js.temp` in the `app/` directory. Canary Turbopack rejects this file as an "Unknown module type" because the `.temp` extension has no associated loader. +- `RetryableError` with explicit `retryAfter` delay: re-queue to self with `stepId` and delay +- Transient errors with immediate retry: re-queue to self with `stepId` (delay = 1s) +- `FatalError`: fail immediately -**Fix**: The step registrations file is written directly to its final name (`__step_registrations.js`) since it doesn't need the temp-file HMR mechanism. Only the route file uses temp naming. +### Encryption Key Resolution -### Concurrent `step_started` Inflating Attempt Counter +Encryption keys are resolved once before the inline execution loop starts (after the run status is confirmed as `running`) and reused across all iterations. Background step executions resolve the key independently. The key does not change within a run. -When the V2 handler dispatches N parallel steps as background messages, each background step completion queues a workflow continuation. Up to N continuations may replay concurrently, and each may attempt to start the same not-yet-completed step (since `step_started` succeeds for already-running steps). Each call atomically increments the `attempt` counter. With N=5 parallel steps, the attempt counter can reach 5 on the first genuine execution — exceeding the default `maxRetries + 1 = 4` threshold and prematurely failing the step with "exceeded max retries". +### Module Scope Duplication in Re-Bundled Output -This is the same known limitation described in "Convergence After Parallel Steps" above, but with a concrete failure mode: `promiseRaceStressTestWorkflow` (which uses 5 parallel steps with `Promise.race`) consistently failed on Postgres tests. +Builders that re-bundle the combined output into a single file (standalone CLI, Vercel Build Output API, NestJS) produce a layout where esbuild creates isolated module scopes for each source module, even within the same output file. Without intervention this means `registerStepFunction` and `getStepFunction` operate on different `Map` instances — steps are registered into one Map but looked up from another. -**Fix**: The max retries check in `executeStep()` now only enforces when `step.error` exists — distinguishing actual retries (failed → retry with error) from concurrent first-attempt races (multiple handlers start the same step simultaneously without any prior failure). Concurrent starts are harmless since `step_completed` idempotency ensures only the first completion wins. +The step function registry and the step context storage are `globalThis` singletons (via `Symbol.for`) to ensure all module scopes share the same instances. The same pattern is used for the World singleton and the class serialization registry. ### Inline Step Execution with Pending Stream Operations When a step's arguments or return value include serialized streams (e.g., `WritableStream` from `getWritable()`, or AI SDK streaming steps), the serialization layer creates background `flushablePipe` operations that pipe data to S3. These ops are tracked in an `ops` array and need to complete before the stream data is readable by external consumers. -In V1, each step ran in a separate function invocation. After the step completed, `waitUntil(ops)` kept the function alive to flush the ops. The function then returned, giving `waitUntil` exclusive event loop time. - -**Current state**: `executeStep()` attempts a 500ms `Promise.race` between the ops settling and a timeout. If ops settle in time (data confirmed on server), it returns `hasPendingOps: false` and the V2 handler continues the inline loop. If ops don't settle in 500ms (e.g., `WritableStream` kept open across steps), it returns `hasPendingOps: true` and the V2 handler breaks the loop and queues a continuation so `waitUntil` can flush them. - -**Earlier attempts that failed** (before the flush waiter fix below): - -1. **500ms inline ops await without flush waiters** — The same 500ms race, but `WorkflowServerWritableStream` used a buffered 10ms flush timer: the `flushablePipe`'s `pendingOps` reached 0 when the buffered `write()` returned (instant), but the actual S3 HTTP write hadn't started yet. The ops appeared settled but data wasn't on S3. Multiple approaches to fix the timing (delaying `pollWritableLock`, closing the writable to trigger flush, adding a post-settle delay) all failed or caused other issues (deadlocks, premature stream closure). - -2. **Root cause of the buffered write issue**: `WorkflowServerWritableStream.write()` buffers chunks and schedules a flush via `setTimeout(flush, 10ms)`. The `flushablePipe` calls `await writer.write(chunk)` which returns immediately (data buffered). `pendingOps--` fires before the 10ms timer. The `pollWritableLock` sees `pendingOps === 0` and resolves `state.promise`. The ops appear settled, but data is still in the buffer. - -3. **Why this only affects Vercel Prod**: On local (world-local), stream writes go to the filesystem — effectively instant. On Vercel (world-vercel), writes go through HTTP to workflow-server → S3, adding 50-100ms latency. The buffered write returns instantly but the HTTP round-trip is deferred. When the V2 loop continues and the function eventually returns, `waitUntil` may not have enough time to flush. - -**Follow-up**: The flush-waiter design described under "Buffered Stream Flush with Waiter Promises" below is the landed fix and resolves the buffered-write race. The remaining work is to shrink the 500ms inline-ops budget once we have confidence that the flush-waiter path settles deterministically across all worlds (today the budget is a defensive ceiling, not a tuned latency target), and to surface a stronger contract for "ops settled" — currently a 500ms timeout means "probably settled, give up and queue a continuation", which is correct but coarse. A signaled "ops drained" event from the world layer would let `executeStep()` proceed without the timeout in the common case, removing latency for streaming workflows whose ops settle in well under 500ms. - -### CJS `module.exports` Collision in BOA Bundles (RESOLVED) - -The Vercel Build Output API (BOA) builder creates a single CJS bundle via `createCombinedBundle` with `bundleFinalOutput: true`. The combined route file imports the steps bundle: - -```js -import { __steps_registered } from './__step_registrations.js'; -import { workflowEntrypoint } from 'workflow/runtime'; -export const POST = workflowEntrypoint(workflowCode); -``` - -When esbuild re-bundles this into CJS, the steps bundle's code is inlined. If the steps bundle is also CJS format, it contains its own `module.exports = __toCommonJS(...)` at the top level. esbuild sometimes inlines CJS modules **without** a `__commonJS()` wrapper (the heuristic depends on the module's detected format). When unwrapped, the steps bundle's `module.exports` assignment executes at the top level and **overwrites** the combined route's `module.exports`, removing the `POST` handler export. - -**Symptoms**: The Vercel deployment builds and starts successfully, but the `POST` handler is missing from the function's exports. Queue messages are delivered to the function but nothing processes them. All e2e tests hang indefinitely. +In V1, each step ran in a separate function invocation. After the step completed, `waitUntil(ops)` kept the function alive to flush the ops. In V2, the inline execution loop continues immediately after the step body returns — so we need to know whether to keep looping or break out and let `waitUntil` flush. -**Debugging steps that led to the root cause**: +`executeStep()` attempts a 500ms `Promise.race` between the ops settling and a timeout. If ops settle in time (data confirmed on server), it returns `hasPendingOps: false` and the V2 handler continues the inline loop. If ops don't settle in 500ms (e.g., `WritableStream` kept open across steps), it returns `hasPendingOps: true` and the V2 handler breaks the loop and queues a continuation so `waitUntil` can flush them. -1. Tested the CJS bundle locally with `node -e "require('./index.js')"` — confirmed 92 steps registered, but `module.exports` only contained `{ __steps_registered }`, not `{ POST }`. -2. Found two `module.exports` assignments in the bundle: line ~45K (from the combined route, exporting `POST`) and line ~95K (from the inlined steps bundle, exporting `__steps_registered`). The second overwrites the first. -3. Compared with the standalone builder's bundle which had the same steps code wrapped in `__commonJS()` — esbuild's wrapper prevents the inner `module.exports` from leaking. - -**Fix**: When `bundleFinalOutput` is true, build the steps bundle in **ESM format** regardless of the final output format. The final esbuild pass converts everything to CJS correctly. ESM steps don't have `module.exports`, so there's no collision. The combined route's `export const POST` becomes the sole `module.exports` entry. - -### Step Error Source Maps on BOA Deployments - -The V2 combined CJS bundle (`bundleFinalOutput: true`) loses original source file names during re-bundling. Error stack traces show `/var/task/index.js` instead of `99_e2e.ts`. The `hasStepSourceMaps()` utility was updated to return `false` for BOA-builder frameworks (Express, Fastify, Hono, Nitro, Nuxt, Vite, Astro, Example) on Vercel preview, aligning test expectations with the actual bundle behavior. - -### CLI Health Check Port Mismatch - -The CLI `health` command defaults to `http://localhost:3000` when `WORKFLOW_LOCAL_BASE_URL` is not set. Different frameworks use different ports (Astro: 4321, SvelteKit: 5173). The e2e test passed `WORKFLOW_LOCAL_BASE_URL` via the spawn env, but the CLI's `getEnvVars()` function had a fixed list of env vars that didn't include `WORKFLOW_LOCAL_BASE_URL`. The env var was set but never read. - -**Fix**: Added `WORKFLOW_LOCAL_BASE_URL` to the CLI's `getEnvVars()` return object. +**Follow-up**: Shrink the 500ms inline-ops budget once we have confidence that the flush-waiter path settles deterministically across all worlds. A signaled "ops drained" event from the world layer would let `executeStep()` proceed without the timeout in the common case. ### Buffered Stream Flush with Waiter Promises -`WorkflowServerWritableStream` buffers writes and flushes via a 10ms `setTimeout` for batching. Previously, `write()` returned immediately after buffering, causing the `flushablePipe`'s `pendingOps` counter to reach 0 before data actually reached the server. The V2 inline loop saw ops as settled prematurely and broke on every step with `WritableStream` serialization. +`WorkflowServerWritableStream` buffers writes and flushes via a 10ms `setTimeout` for batching. Naively, `write()` could return immediately after buffering, but that would cause the `flushablePipe`'s `pendingOps` counter to reach 0 before data actually reached the server — the V2 inline loop would see ops as settled prematurely and produce data-loss races on every step with `WritableStream` serialization. -**Fix**: `write()` now returns a promise that resolves only after the scheduled flush completes. Multiple writes within the 10ms window still share a single batched HTTP request (the batching optimization is preserved). Each write registers a `{resolve, reject}` pair in a `flushWaiters` array. When the `setTimeout` fires and `flush()` completes the HTTP round-trip, all waiters are resolved (or rejected on error). This makes `pendingOps` accurately reflect server-side data state while keeping network-efficient batching. +`write()` returns a promise that resolves only after the scheduled flush completes. Multiple writes within the 10ms window still share a single batched HTTP request (the batching optimization is preserved). Each write registers a `{resolve, reject}` pair in a `flushWaiters` array. When the `setTimeout` fires and `flush()` completes the HTTP round-trip, all waiters are resolved (or rejected on error). This makes `pendingOps` accurately reflect server-side data state while keeping network-efficient batching. -The 500ms inline ops await in the step executor can now distinguish between: - -- **Steps where ops settle** (data on server, ~200ms after lock release + flush) → continue loop inline -- **Steps where ops don't settle** (WritableStream kept open across steps) → break loop - -### Lock-Release Polling Interval Lowered to 10ms +### Lock-Release Polling Interval `flushablePipe`'s `pollWritableLock` / `pollReadableLock` use `setInterval` to detect when a user releases their stream lock without closing the stream — the Web Streams API has no event for that state. The V2 step executor's `opsSettled` race waits for this poll to resolve after each writable-bearing step body returns, so the polling interval sits on the critical path of every streaming step. -The interval was originally 100ms. Measuring a synthetic workflow with 5 sequential streaming steps (each step receives a shared `WritableStream` argument, writes a few chunks, releases the writer with the stream still open — the same pattern `doStreamStep` / `writeToolOutputToUI` / `writeFinishChunk` use in `DurableAgent.chat`) produced a per-step wait distribution clustered between 22–100ms with a mean of ~58ms. That matches the analytical prediction for a periodic poll with uniformly random offset relative to step return: ~half the interval. Across the 5 steps, polling alone added ~290ms of latency to the workflow even though no step actually had pending I/O — the writes were already flushed, the writer lock was already released, and we were just waiting for the next tick to notice. - -**Fix**: dropped the polling interval from 100ms to 10ms in `packages/core/src/flushable-stream.ts`. Per-step wait drops from ~50ms average to ~5ms (a 10× improvement, expected to scale linearly with the number of writable-bearing steps in a workflow). For `DurableAgent.chat` with one tool call (4 writable-bearing steps), this removes ~180ms from the streaming chat response's critical path. Per-tick work is just `writable.locked` plus a `getWriter()`/`releaseLock()` probe — both microsecond-scale, so 10× more ticks during a stream's lifetime is not measurable in practice. - -**Follow-up**: Replace the polling entirely with an event-driven release signal — wrap the writable returned from the `WritableStream` reviver with a writer that fires on `releaseLock()` — bringing the wait to ~0ms. The 10ms polling interval is the cheap path that captures most of the available win without the structural change, but every writable-bearing step still pays a ~5ms tax that the event-driven design would eliminate. The structural change is also worth pursuing because it removes a source of timing drift between `world-local` (filesystem-instant) and `world-vercel` (HTTP-deferred) — both would see truly synchronous lock-release detection rather than periodic-poll detection. - -### Event Consumer Skip Logic Was Too Broad For Wait Replays - -The V2 handler needs some tolerance for out-of-order replay, especially around step events created by concurrent continuations. An early follow-up broadened that fallback to all wait lifecycle events too, so `onUnconsumedEvent` would skip `wait_created` and the first `wait_completed` whenever they matched a known wait. In the BOA-backed previews that broke `hookDisposeTestWorkflow`: once the first run disposed its hook and went into `sleep('5s')`, a replay could skip the live wait event before `sleep()` registered its subscriber, leaving the run stuck forever at `wait_created`. - -**Fix**: Keep the step/hook replay tolerance, but narrow the wait fallback to the one case we actually need: duplicate `wait_completed` events that appear *after* an earlier completion for the same wait. The `hookDispose` e2e was also updated to poll for hook registration/disposal instead of relying on fixed 3-5 second sleeps, which made the Vercel preview timing less brittle. - -### TooEarlyError Retry Delay in Step Executor - -The `executeStep()` function handles `TooEarlyError` (thrown when a step's `retryAfter` timestamp hasn't been reached yet) by returning a `retry` result with a timeout. The original implementation used a stale access pattern `(err as any).meta?.retryAfter` copied from an older error shape. The `TooEarlyError` class (from `@workflow/errors`) has `retryAfter` as a direct property (number of seconds), not nested under `.meta`. The stale pattern always evaluated to `undefined`, falling back to a 1-second delay regardless of the server's actual retry-after value. - -**Fix**: Changed to `err.retryAfter ?? 1`, matching the correct pattern used in `step-handler.ts`. - -### Health Check Endpoint JSON Response - -The `withHealthCheck()` wrapper in `helpers.ts` was updated (on main) to return a JSON response with `{ healthy, endpoint, specVersion, workflowCoreVersion }` instead of a plain text string. The V2 branch's e2e test still expected `Content-Type: text/plain` and a text body after merging main, causing the "health check endpoint (HTTP)" test to fail across all frameworks and environments. - -**Fix**: Updated the e2e test to expect `Content-Type: application/json` and validate the JSON body structure, including a `specVersion >= SPEC_VERSION_CURRENT` range assertion. - -### V2 Combined Bundle Switched from CJS to ESM - -The V2 combined bundle was initially emitted as CJS by the standalone CLI and Vercel Build Output API builders, while `main` had already moved those outputs to ESM in [#1562](https://github.com/vercel/workflow/pull/1562). Staying on CJS meant `import.meta.url` was polyfilled (often producing the wrong path in re-bundled contexts), and the `world-testing` server had to import from `flow.js` via `createRequire` to force CJS semantics on what was really a CJS bundle. - -**Fix**: Align V2 with `main`'s ESM defaults: - -1. The BOA builder emits `__step_registrations.mjs` and `index.mjs`, writes `"type": "module"` in `package.json`, and sets `handler: "index.mjs"` in `.vc-config.json`. -2. The standalone builder no longer overrides `format`; it inherits the base builder's `'esm'` default. -3. The standalone config outputs `step.mjs` / `flow.mjs` instead of `.js`. -4. The `world-testing` server uses a native `import { POST } from '../.well-known/workflow/v1/flow.mjs'` instead of `createRequire`. -5. `createCombinedBundle`'s final esbuild pass (for `bundleFinalOutput: true`) now prepends the same `createRequire(import.meta.url)` banner used by the workflow/webhook bundles so CJS dependencies that call `require()` for Node.js builtins (for example the `events` module referenced by bundled libraries) still resolve at runtime. -6. To avoid a duplicate `__createRequire` declaration, the inner steps bundle that gets inlined by the final pass skips the banner — only the outer bundle emits it. This is threaded through via a new `skipEsmRequireBanner` option on `createStepsBundle`. - -### World specVersion in Health Check Responses - -The `getWorldHandlers()` return value was updated on main to include `specVersion` (the World's declared spec version). The V2 handler destructures this as `worldSpecVersion` and passes it to `handleHealthCheckMessage()` for inclusion in queue-based health check responses. This was merged alongside the V2 timeout configuration. - -### Async World Singleton Drift After Merge - -The later `main` merge changed `getWorld()` and `getWorldHandlers()` to be asynchronous promise-backed singletons, but the eager-processing branch still had synchronous call sites in the V2 runtime path. That left `packages/core/src/runtime.ts` and `packages/core/src/runtime/helpers.ts` trying to access `.events` on a `Promise`, which failed typecheck immediately after the merge. - -**Fix**: Rebases the V2 workflow entrypoint onto the async world API by lazily awaiting `getWorldHandlers()` when wiring the queue handler and awaiting `getWorld()` at the remaining runtime/helper call sites. This preserves the inline replay loop while matching `main`'s new world initialization contract. - -### Lazy World Loading for Next.js Production Builds - -After the async world merge, `packages/core/src/runtime/world.ts` still eagerly imported both `@workflow/world-local` and `@workflow/world-vercel`, and it initialized `createRequire()` from `process.cwd() + '/package.json'` at module load time. In the Next.js production build jobs that caused the generated flow route to pull `@workflow/world-vercel` and its `debug` dependency into local builds, then fail during page-data collection with `module.createRequire failed parsing argument` and `Dynamic require of "tty" is not supported`. - -**Fix**: Switched the runtime world loader to use `createRequire(import.meta.url)` and moved the local/Vercel world imports behind the existing async `createWorld()` branches. Local Next.js builds now only load the selected world implementation at runtime instead of bundling both worlds eagerly into the route module. - -### Deferred Next.js Builds Re-Ran Eager Discovery - -The later merge also pulled `BaseBuilder.createCombinedBundle()` into the deferred Next.js path without a way to pass the already-discovered workflow/step/serde entry sets. As a result, `packages/next/src/builder-deferred.ts` quietly fell back to `discoverEntries()` during production builds, re-emitting `Discovering workflow directives ...` and failing the local build tests that assert deferred mode avoids eager input-graph scans. - -**Fix**: Threaded explicit `discoveredEntries` through `createCombinedBundle()` and passed the deferred builder's tracked workflow/step/serde file sets into that call. Deferred Next.js builds now reuse the socket/cache-driven discovery state instead of re-running the base eager discovery pass. - -### Deferred Package Steps Fell Back to Compiled `dist/` Files - -Once deferred discovery stopped re-running the base eager scan, some package-provided steps were only being rediscovered from built artifacts such as `packages/ai/dist/agent/durable-agent.js`. Those compiled files no longer carried every nested `'use step'` directive, so local production Next.js builds could miss registrations like `@workflow/ai/agent`'s `closeStream` helper and fail at runtime with "step is not registered in the current deployment". - -**Fix**: The deferred Next.js builder now rewrites discovered workspace package paths from `dist/` back to their matching `src/` files when those sources exist. That keeps deferred bundling pointed at the directive-bearing source modules instead of their compiled output. - -### Workspace Source Step IDs Lost Export Subpaths - -Switching deferred builds over to workspace `src/` files fixed the missing nested directives, but it exposed a second mismatch in the SWC manifest path logic. `packages/builders/src/module-specifier.ts` only matched package exports against the on-disk file being transformed, so `packages/ai/src/agent/durable-agent.ts` was assigned `@workflow/ai@...` while the runtime still referenced the exported subpath id `@workflow/ai/agent@...`. Local Next.js agent runs then failed with "Step `step//@workflow/ai/agent@...//closeStream` is not registered" even though the source file was finally back in the bundle. - -**Fix**: `resolveModuleSpecifier()` now treats workspace source files as the source-backed form of their exported `dist/` targets when deriving step ids. That preserves package export subpaths like `@workflow/ai/agent` for id generation while still bundling the directive-bearing `src/` modules. - -### Tarball-Staged Next.js Builds Still Lost Package Step Sources - -The local production and Postgres Next.js jobs stage the workbenches by packing workspace packages into tarballs and installing those tarballs into a temporary `node_modules` tree. Deferred discovery was already willing to rewrite workspace `packages/*/dist/*` files back to `src/*`, but the tarballed `@workflow/ai` package did not publish its `src/` tree and the base builder still treated `node_modules/@workflow/*/src/*` as ordinary package imports. That meant the staged CI path fell back to `dist/` again and dropped nested steps like `@workflow/ai/agent`'s `closeStream`, even after the workspace build path had been fixed. - -**Fix**: Publish `packages/ai/src` in the tarball, treat source-backed `node_modules/@workflow/*/src/*` files like external workspace source files when generating bundle imports, and extend deferred transitive step discovery to follow bare `workflow` / `@workflow/*` package imports during non-watch builds. - -### Vercel Step Source Map Expectations Were Too Optimistic +The interval was lowered from 100ms to 10ms. Per-step wait drops from ~50ms average to ~5ms (scaling linearly with the number of writable-bearing steps in a workflow). Per-tick work is just `writable.locked` plus a `getWriter()`/`releaseLock()` probe — microsecond-scale, so 10× more ticks is not measurable in practice. -Merging `main` also pulled in a newer `hasStepSourceMaps()` expectation for Vercel preview deployments. On this branch, the non-Next workbench previews still emit step stacks without source filenames like `99_e2e.ts` or `helpers.ts`, so the Vercel step-error assertions regressed across the BOA-backed workbench matrix even though runtime behavior was otherwise unchanged. +**Follow-up**: Replace polling entirely with an event-driven release signal — wrap the writable returned from the `WritableStream` reviver with a writer that fires on `releaseLock()` — bringing the wait to ~0ms. This would also remove a source of timing drift between worlds with synchronous storage (`world-local`) and worlds with HTTP-deferred storage (`world-vercel`). -**Fix**: Revert the Vercel step source map expectation to the conservative branch behavior so preview e2e only asserts source filenames where this branch actually preserves them. Concretely, `hasStepSourceMaps()` returns `false` for *every* framework on Vercel deployments. Re-applying the blanket Vercel carve-out is what keeps the 11-framework `e2e-vercel-prod` matrix green while V2 source-map coverage catches up. The same pipeline regression also affects `nextjs-webpack` in local dev: pre-V2, webpack dev mode imported step sources directly so error stacks named `99_e2e.ts` / `helpers.ts`; under V2 the step bundle is inlined into the combined flow route and webpack's re-bundling collapses those filenames out of the dev-mode source maps. The helper now returns `false` for `nextjs-webpack` regardless of `DEV_TEST_CONFIG`. +### Concurrent `step_started` and Attempt Counter -**Follow-up**: Wire up consumable inline source maps in the V2 step bundle across the framework integrations — the BOA-backed ones (Astro, Express, Fastify, Hono, Nitro, Nuxt, Vite, plus the standalone `example`), `nextjs-turbopack`, and `nextjs-webpack` (both dev and prod). The plan is to let each builder's `createCombinedBundle()` call carry an esbuild source-map pipeline that survives the framework's downstream re-bundling step, and then re-introduce the per-framework matrix in `hasStepSourceMaps()` so error stacks correctly point at `99_e2e.ts` / `helpers.ts` everywhere. Tracking this as a deferred follow-up rather than blocking the V2 cutover, since the runtime behavior is unaffected — only the surfaced filenames in step error stacks differ. +When the V2 handler dispatches N parallel steps as background messages, each background step completion queues a workflow continuation. Up to N continuations may replay concurrently, and each may attempt to start the same not-yet-completed step (since `step_started` succeeds for already-running steps). Each call atomically increments the `attempt` counter — so with N=5 parallel steps, the counter can reach 5 on the first genuine execution. -### Community Worlds Still Used The Pre-`world.streams` API - -The Redis community-world benchmark still loads an external world package that has not adopted the newer `world.streams.*` interface yet. Once the eager-processing changes exercised stream writes through the modern namespace consistently, that adapter started failing with `Cannot read properties of undefined (reading 'writeMulti')` before the benchmark could even start. - -**Decision**: Community world adapters must implement the `world.streams.*` interface. The runtime legacy stream normalization (`normalizeLegacyWorld`) was removed. Community world e2e tests are skipped until the adapters are updated. - -### Build Output API Flow Handler Drift - -The Vercel Build Output API builder still emitted the combined flow function as `index.js`, but the surrounding metadata kept pointing at `index.mjs`. That mismatch meant BOA-based preview deployments published neither `/.well-known/workflow/v1/flow` nor the public manifest, so the Vercel production e2e suite collapsed into manifest `404` errors immediately after deployment. - -**Fix**: Updated `packages/builders/src/vercel-build-output-api.ts` to point both `.vc-config.json` and manifest extraction at `flow.func/index.js`, which matches the CommonJS file the builder actually writes. - -### Async World Loading Broke Custom Target Worlds - -The first lazy-world-loading fix switched package resolution over to `createRequire(import.meta.url)` globally. That solved the Next.js bundling problem for built-in worlds, but it also made custom targets like `@workflow/world-postgres` resolve relative to `@workflow/core` instead of the consuming app. Local Postgres tests then failed at startup with `Cannot find module '@workflow/world-postgres'`. - -**Fix**: The runtime now creates the package resolver lazily from `process.cwd()/package.json` when possible, falling back to `import.meta.url` only when the app root cannot be resolved. That keeps custom world modules app-relative without reintroducing the eager module-load failure in Next.js builds. - -### Core Logger Still Pulled `debug` Into Webpack Flow Routes - -Even after lazy world loading stopped eagerly importing `@workflow/world-vercel`, the generated Next.js webpack flow route still evaluated `packages/core/src/logger.ts` at module load. That file had a top-level `import debug from 'debug'`, which in turn pulled `debug/src/node` and its `tty` dynamic require into `/.well-known/workflow/v1/flow`. Webpack then failed during page-data collection with `Dynamic require of "tty" is not supported`. - -**Fix**: Replace the static `debug` dependency in the core logger with lightweight `process.env.DEBUG` matching plus `console.debug`. That keeps verbose opt-in logging for local debugging without forcing webpack to bundle `debug` and its Node-only terminal helpers into the flow route. - -### Deferred Next.js Builder Helper Drift After Merge - -Merging `main` into the eager-processing branch pulled in a set of helper methods for copied-step import rewriting in `packages/next/src/builder-deferred.ts`, but the corresponding call sites were not present on this branch yet. That left `getRelativeImportSpecifier`, `getStepCopyFileName`, and `rewriteRelativeImportsForCopiedStep` orphaned, and `@workflow/next` failed to build with `TS6133` unused-private-member errors immediately after the merge. - -**Fix**: Removed the orphaned helper methods during merge resolution and kept the existing deferred-builder behavior unchanged. The copied-step import-rewrite work should land as a complete change set rather than a partial backport from `main`. - -### Next.js React Step Fixture and `eval('require(...)')` - -The `nextjs-webpack` e2e suite still failed after the merge in `workflows/8_react_render.tsx`, where the step intentionally did `eval('require("react-dom/server")')` to avoid Next.js linting rules around importing `react-dom/server` directly. That pattern was brittle under webpack rebundling: even though the intermediate step bundle had a `createRequire(import.meta.url)` banner, the rebundled route still failed at runtime with `TypeError: require is not a function`. - -**Fix**: Updated the React-rendering step fixture in both Next.js workbenches to use `await import('react-dom/server')` instead. The test still exercises server-side React rendering inside a step, but no longer depends on bundler-specific `eval('require(...)')` behavior. - -## Inline Execution Verification Tests - -The `@workflow/world-testing` package includes invocation-counting tests that verify the V2 inline loop behavior for each workflow pattern: - -| Workflow Pattern | Expected Invocations | Why | -|-----------------|---------------------|-----| -| Sequential steps (3 adds) | **1** | All steps execute inline | -| Sequential steps + WritableStream | **1** | Ops settle via flush waiter promises (500ms race) | -| Sleep (1s) + step | **2** | Sleep requires queue round-trip | -| Promise.all (2 steps) | **2-3** | Background step + inline replay after all steps done | - -The test server tracks flow handler invocations per `runId` via an internal counter. Each test asserts the exact invocation count after the workflow completes. - -### world-testing Flow Invocation Counting Missed Wrapped Queue Payloads - -The inline-execution assertions in `packages/world-testing` count how many times the flow handler runs by inspecting the queue callback body and extracting `runId`. After the queue callback shape drifted, some worlds were only exposing the workflow payload under `body.payload.runId`, so the helper recorded `0` invocations even when the workflow completed correctly. That showed up in CI as the Postgres inline-execution spec failing its "single flow invocation" assertion. - -**Fix**: Accept both top-level `runId` and nested `payload.runId` when tracking flow invocations in the embedded test server. - -### Turbopack NFT Tracing Errors in V2 Combined Flow Route - -The V2 combined flow route imports the step registrations bundle (`__step_registrations.js`), which esbuild produces as a monolithic file. On `main`, step registrations live in a separate route (`step/route.js`), so Turbopack traces them independently. In V2, Turbopack traces the step registrations through the flow route's import graph, encountering `world.ts` code with `process.cwd()`, dynamic `import()` calls to `@workflow/world-local`/`@workflow/world-vercel`, and `createRequire()` patterns — all of which trigger fatal NFT (Node File Trace) errors. - -**Fix**: Introduced `get-world-lazy.ts`, a globalThis `Symbol.for`-based accessor that replaces the static `import { getWorld } from './runtime/world.js'` in all step-side modules (`serialization.ts`, `run.ts`, `helpers.ts`, `start.ts`, `resume-hook.ts`). This breaks the static import chain from step code to `world.ts`, preventing esbuild from bundling `world.ts` (and its transitive deps) into the step registrations. The step registrations bundle dropped from ~37k lines to ~6.6k lines (matching `main`), with zero `process.cwd()` or world package references. - -The `getWorldLazy()` function reads from the globalThis world singleton cache (populated by the runtime's `getWorld()` on first call). When the cache is empty (e.g., `start()` called from application code before any workflow runs), it falls back to a dynamic `import()` of `world.js` to initialize the world. - -Additional changes for Turbopack compatibility: -- Removed `stepEntrypoint` re-export from `runtime.ts` (V2 doesn't use separate step routes) -- Lazy-loaded `getPort` via `createRequire` with opaque specifier to prevent `@workflow/utils/get-port` filesystem operations from being traced -- `getRuntimeRequire()` uses `process.cwd()` as primary resolution base (for custom world packages like `@workflow/world-postgres` that are app-level deps, not `@workflow/core` deps), with `import.meta.url` fallback - -### Cold-Start `MODULE_NOT_FOUND: './world.js'` From `getWorldLazy` Fallback - -The `getWorldLazy()` design assumed one of two paths would always succeed: either `globalThis[GetWorldFnKey]` is populated (because some prior code reached `world.ts`'s module body), or the dynamic `import('./world.js')` fallback resolves at runtime. - -Both assumptions break for routes that consume `start` (or any other `getWorldLazy` consumer) without going through the queue-driven flow handler first: - -1. Webpack and Turbopack tree-shake the named import `{ getWorld } from './runtime/world.js'` out of `runtime.ts` once a consumer only uses `start`. `world.ts` is dropped from the bundle entirely, so its module-load `globalThis[GetWorldFnKey] ??= getWorld` registration never fires. -2. The dynamic-import fallback inside `get-world-lazy.ts` builds the specifier `./world.js` at runtime to evade bundler tracing — but webpack inlines `get-world-lazy.js` into the route bundle, so the relative specifier resolves against `/var/task//.next/server/app//route.js` where no sibling `world.js` exists. Node throws `MODULE_NOT_FOUND`. - -The symptom: the very first request that goes through `start()` on a cold serverless invocation fails. Once any other code path (typically the queue-driven `/.well-known/workflow/v1/flow` route, which uses `getWorld` directly via `workflowEntrypoint`) has loaded `world.ts`, subsequent `start()` calls succeed for the rest of the process lifetime — making the failure flake-shaped: hard to reproduce in dev where everything tends to be warmed, but reliable on first user traffic into a fresh function instance. - -**Fix**: Added `@workflow/core/runtime/world-init`, a server-only side-effect module that imports `./world.js` purely for its module-load side effect (the globalThis registration). It's exported via package conditions: - -- `default` → `./dist/runtime/world-init.js` (real, loads `world.ts`) -- `workflow` → `./dist/workflow/world-init-stub.js` (empty, used by VM/step bundles) - -`packages/workflow/src/api.ts` (the host file behind `workflow/api`'s `default` condition) imports it for its side effect. The matching VM/step entry `api-workflow.ts` does not, so `world.ts` and its server-only deps (`@workflow/world-local`, `@workflow/world-vercel`, `cbor-x`, …) stay out of the workflow sandbox bundle. - -Reverification: built bundles for `vade-review` (Next.js webpack) show `createLocalWorld`/`createVercelWorld`/`GetWorldFnKey` present in the route's vendor chunk for `@workflow/core` (zero before the fix), and the workflow VM bundle's `flow/route.js` and `__step_registrations.js` continue to have zero references to either the world-init module or `world.ts`. Cold-start `POST /api/review/submit` succeeds on the first request after a fresh server boot — the regression case. - -The dynamic-import fallback in `get-world-lazy.ts` is preserved as defense-in-depth for environments outside the documented configurations (CJS test runners, scripts that import deeply into `@workflow/core` without going through `workflow/api`). - -### Run#returnValue Worker Deadlock in V2 Inline Execution - -When a workflow calls `start()` to spawn child workflows (e.g., `fibonacciWorkflow`), the parent's `Run#returnValue` step polls the child's completion status in a blocking loop (`while (true) { ... sleep(1000) ... }`). In V2, this step is executed inline by the step executor, holding a worker thread slot. If the child workflow's queue message is waiting for the same worker pool, the parent blocks the child from starting — a classic deadlock. - -**Fix**: `Run#pollReturnValue()` detects whether it's running inside a step executor (via `contextStorage.getStore()`) and, if so, throws `TooEarlyError` instead of polling in a blocking loop. `TooEarlyError` is handled specially by the step executor — it returns `{ type: 'retry', timeoutSeconds }` which re-queues the step with a 1-second delay, freeing the worker to process child workflows. Unlike `RetryableError`, `TooEarlyError` does NOT count against `maxRetries`, so polling steps can retry indefinitely until the child completes. - -When called from outside a step (e.g., test code, API routes), `pollReturnValue()` retains the original blocking loop behavior for backward compatibility. +The max retries check in `executeStep()` only enforces when `step.error` exists — distinguishing actual retries (failed → retry with error) from concurrent first-attempt races (multiple handlers start the same step simultaneously without any prior failure). Concurrent starts are harmless since `step_completed` idempotency ensures only the first completion wins. ### Unconsumed Event Check Two-Phase Drain -After merging `main`, the `EventsConsumer`'s unconsumed event check was updated with a two-phase promise queue drain: yield once after the first drain (via `setTimeout(0)`) so cross-VM promise chains can append follow-up async work, then re-drain before checking. This improves timing for scenarios like `step_completed` → for-await loop resume → next hook hydration. The V2 `onUnconsumedEvent` skip logic (returning `true` to advance past known-safe events) was preserved through the merge. +The `EventsConsumer`'s unconsumed event check uses a two-phase promise queue drain: yield once after the first drain (via `setTimeout(0)`) so cross-VM promise chains can append follow-up async work, then re-drain before checking. This improves timing for scenarios like `step_completed` → for-await loop resume → next hook hydration. The check additionally arms a `DEFERRED_CHECK_DELAY_MS = 100` `setTimeout` after the second drain, since Node.js does not guarantee that `setTimeout(0)` fires after all cross-context microtasks settle. Any `subscribe()` call arriving during that 100ms window cancels the check via version invalidation + `clearTimeout`, so the delay only adds latency to genuine corruption — never to the happy path. -**Follow-up**: 100ms is a heuristic chosen empirically to cover cross-VM microtask propagation under the workflow runtime's worst-case scheduling. A deterministic settlement signal — for example, a "VM idle" callback exposed by the workflow VM bridge that fires only after all pending cross-context promise chains have resolved — would let the consumer fire the unconsumed-event check immediately on quiescence instead of waiting for a wall-clock timeout. That would tighten corruption detection (no spurious 100ms wait) and remove the only remaining wall-clock heuristic from the V2 inline loop's correctness path. +**Follow-up**: 100ms is a heuristic chosen empirically. A deterministic settlement signal — for example, a "VM idle" callback exposed by the workflow VM bridge that fires only after all pending cross-context promise chains have resolved — would let the consumer fire the unconsumed-event check immediately on quiescence instead of waiting for a wall-clock timeout. + +### Lazy World Loading -### Nitro Builder Atomic File Writes +Static imports of `world-local` and `world-vercel` from the runtime caused two distinct build-time issues: Next.js production builds pulled both worlds (including their Node-only deps like `debug`'s `tty` requires) into the route module, and Turbopack's NFT (Node File Trace) errored on `process.cwd()` and dynamic `import()` patterns it couldn't statically analyze. -After merging `main`, the Nitro builder now uses atomic temporary files (UUID-suffixed `.tmp` files) for build output, renaming them into place only after all builds succeed. This prevents partial/inconsistent output during dev HMR when a build fails mid-way. The V2 `createCombinedBundle` call was adapted to use this pattern. +A `getWorldLazy()` accessor (backed by a `globalThis` `Symbol.for` cache) replaces the static import in step-side modules. This breaks the static import chain from step code to `world.ts`, preventing both worlds from being bundled into the step registrations. -## Final Status +Because tree-shaking can otherwise drop `world.ts`'s module-load registration entirely, a server-only side-effect module (`@workflow/core/runtime/world-init`) imports `./world.js` purely for its module-load side effect. It's wired via package conditions: -All framework integrations pass across all test environments: +- `default` → real, loads `world.ts` +- `workflow` → empty stub, used by VM/step bundles -| Test Suite | Frameworks | Status | -|-----------|-----------|--------| -| Unit Tests | core | 581/581 | -| Embedded Tests | world-testing | 9/9 (including inline execution) | -| Local Dev | 14 frameworks | All pass | -| Local Prod | 14 configurations | All pass | -| Postgres | 14 frameworks | All pass | -| Vercel Prod | 11 frameworks | All pass | -| Vercel Deployments | 15 projects | All succeed | -| Community Worlds | Turso, MongoDB, Redis | All pass | -| Windows | e2e | Pass | +This guarantees the world is loaded for routes that consume `start()` without going through the queue-driven flow handler first, while keeping `world.ts` and its server-only deps out of the workflow sandbox bundle. -Known remaining flakes (same as main): +### Community Worlds and the `world.streams` API -- `webhookWorkflow` / `hookWorkflow` — timing-sensitive hook delivery +Community world adapters must implement the `world.streams.*` interface. The runtime legacy stream normalization was removed as part of this work. diff --git a/docs/content/docs/v5/changelog/resilient-start.mdx b/docs/content/docs/v5/changelog/resilient-start.mdx index 653c439e0f..8560762807 100644 --- a/docs/content/docs/v5/changelog/resilient-start.mdx +++ b/docs/content/docs/v5/changelog/resilient-start.mdx @@ -7,321 +7,69 @@ description: Overhaul run start logic to tolerate world storage unavailability, ## Motivation -When `world` storage is unavailable but the queue is up, -`start()` previously failed entirely because `world.events.create(run_created)` -is called before `world.queue()`. This change decouples run creation from queue -dispatch so that runs can still be accepted when storage is degraded. +When `world` storage is unavailable but the queue is up, `start()` previously failed entirely because `world.events.create(run_created)` is called before `world.queue()`. This change decouples run creation from queue dispatch so that runs can still be accepted when storage is degraded. -Additionally, the runtime previously called `world.runs.get(runId)` before -`run_started`, adding an extra round-trip. By always calling `run_started` -directly, we save that round-trip and can return pre-loaded events in the -response to skip the initial `events.list` call, reducing TTFB. +Additionally, the runtime previously called `world.runs.get(runId)` before `run_started`, adding an extra round-trip. By always calling `run_started` directly, we save that round-trip and can return pre-loaded events in the response to skip the initial `events.list` call, reducing TTFB. ## Design -### `start()` changes (packages/core) +### `start()` changes -- `world.events.create` (run_created) and `world.queue` are now called **in parallel** - via `Promise.allSettled`. -- If `events.create` errors with **429 or 5xx**, we log a warning saying that run - creation failed but the run was accepted — creation will be re-tried async by the - runtime when it processes the queue message. The returned `Run` instance is marked - with `resilientStart = true`. -- If `events.create` errors with **409** (EntityConflictError), the run already exists - (e.g., the queue handler's resilient start path created it first due to a cold-start - race). This is treated as success. +- `world.events.create` (run_created) and `world.queue` are now called **in parallel** via `Promise.allSettled`. +- If `events.create` errors with **429 or 5xx**, we log a warning saying that run creation failed but the run was accepted — creation will be re-tried async by the runtime when it processes the queue message. The returned `Run` instance is marked with `resilientStart = true`. +- If `events.create` errors with **409** (EntityConflictError), the run already exists (e.g., the queue handler's resilient start path created it first due to a cold-start race). This is treated as success. - If `world.queue` fails, we still throw — the run truly failed and was not enqueued. -- The queue invocation now receives all the run inputs (`input`, `deploymentId`, - `workflowName`, `specVersion`, `executionContext`) via `runInput` so the runtime can - create the run later if needed. -- When the runtime re-enqueues itself, it does **not** pass these inputs — only the - first queue cycle carries them. +- The queue invocation now receives all the run inputs (`input`, `deploymentId`, `workflowName`, `specVersion`, `executionContext`) via `runInput` so the runtime can create the run later if needed. +- When the runtime re-enqueues itself, it does **not** pass these inputs — only the first queue cycle carries them. -### `workflowEntrypoint` changes (packages/core) +### `workflowEntrypoint` changes -- When calling `world.events.create` with `run_started`, we now also always pass the - run input that was sent through the queue, if available. The response will still be on off: - - **200 with event (now running)**: As usual, but the server could have used the run input to create the run if it didn't exist yet. The response will be opaque to the runtime. - - **200 without event (already running)**: As usual - - **409 or 410 (already finished)**: As usual +- When calling `world.events.create` with `run_started`, we now also always pass the run input that was sent through the queue, if available. The world is responsible for creating the run if it doesn't already exist. -### `Run.returnValue` polling (packages/core) +### `Run.returnValue` polling -- When `resilientStart` is true on the Run instance (run_created failed), the - `pollReturnValue` loop retries on `WorkflowRunNotFoundError` up to 3 times - (1s + 3s + 6s = 10s total) to give the queue time to deliver and the runtime - to create the run via `run_started`. -- When `resilientStart` is false (normal path), 404 fails immediately — no delay - for the common case of a wrong run ID. +- When `resilientStart` is true on the Run instance (run_created failed), the `pollReturnValue` loop retries on `WorkflowRunNotFoundError` up to 3 times (1s + 3s + 6s = 10s total) to give the queue time to deliver and the runtime to create the run via `run_started`. +- When `resilientStart` is false (normal path), 404 fails immediately — no delay for the common case of a wrong run ID. -### World / workflow-server changes +### World contract changes -- Posting `run_started` to a **non-existent** run is now allowed when the run input is - sent along with the payload. The server: - 1. Creates a `run_created` event first (so the event log is consistent). - 2. Strips the input from the `run_started` event data (it lives on `run_created`). - 3. Then creates the `run_started` event normally. - 4. Emits a log and a Datadog metric (`workflow_server.resilient_start.run_created_via_run_started`) - to track when this fallback path is hit. -- When `run_started` encounters an **already-running** run, all worlds return `{ run }` - with `event: undefined` instead of throwing. No duplicate event is created. +- Posting `run_started` to a **non-existent** run is now allowed when the run input is sent along with the payload. The world creates a `run_created` event first (so the event log is consistent), then creates the `run_started` event normally. +- When `run_started` encounters an **already-running** run, all worlds return `{ run }` with `event: undefined` instead of throwing. No duplicate event is created. ### Queue transport changes -`Uint8Array` values (the serialized workflow input in `runInput`) don't survive plain -JSON serialization. Each world uses a transport that preserves binary data: +`Uint8Array` values (the serialized workflow input in `runInput`) don't survive plain JSON serialization. Each world uses a transport that preserves binary data: -- **world-vercel**: CBOR transport — CBOR-encodes the entire queue payload into a - `Buffer` and uses `BufferTransport` from `@vercel/queue`. Uint8Array survives natively. -- **world-local**: `TypedJsonTransport` — uses the existing `jsonReplacer`/`jsonReviver` - from `fs.ts` that encode Uint8Array as `{ __type: 'Uint8Array', data: '' }`. -- **world-postgres**: Inline typed JSON transport — same tagged-envelope approach as - world-local, inlined since world-postgres doesn't import from world-local. +- **world-vercel**: CBOR transport — CBOR-encodes the entire queue payload into a `Buffer` and uses `BufferTransport` from `@vercel/queue`. Uint8Array survives natively. +- **world-local**: `TypedJsonTransport` — encodes Uint8Array as `{ __type: 'Uint8Array', data: '' }`. +- **world-postgres**: Inline typed JSON transport — same tagged-envelope approach as world-local. ## Decisions -1. **Parallel not sequential**: We chose `Promise.allSettled` over sequential calls to - minimize latency in the happy path. +1. **Parallel not sequential**: We chose `Promise.allSettled` over sequential calls to minimize latency in the happy path. -2. **Already-running returns run without event**: When `run_started` encounters an - already-running run, all worlds return `{ run }` with `event: undefined` (no - `events` array) instead of throwing. The runtime detects this by checking for - `result.event === undefined`. This avoids an extra `world.runs.get` round-trip. +2. **Already-running returns run without event**: When `run_started` encounters an already-running run, all worlds return `{ run }` with `event: undefined` (no `events` array) instead of throwing. The runtime detects this by checking for `result.event === undefined`. This avoids an extra `world.runs.get` round-trip. -3. **Events in 200 response**: We only return events on the 200 path (first caller). - On the already-running path, we fall back to the normal `events.list` call. This is - correct because only on 200 can we be certain we know the full event history. +3. **Events in 200 response**: We only return events on the 200 path (first caller). On the already-running path, we fall back to the normal `events.list` call. This is correct because only on 200 can we be certain we know the full event history. -4. **Conditional 404 retry on Run.returnValue**: Only when `resilientStart = true` - (run_created failed). Normal runs fail fast on 404. +4. **Conditional 404 retry on Run.returnValue**: Only when `resilientStart = true` (run_created failed). Normal runs fail fast on 404. ## Known concerns -### Cold-start race on Vercel (observed in CI) +### Cold-start race on Vercel -On Vercel, the parallel dispatch can cause the queue message to be processed before -`run_created` completes, if `run_created` hits a cold-start lambda. Confirmed via -Datadog: the `run_started` request hit a warm lambda (23ms) while `run_created` hit -a cold lambda (727ms), even though `run_created` arrived at the edge 116ms earlier. -When this happens: +On Vercel, the parallel dispatch can cause the queue message to be processed before `run_created` completes, if `run_created` hits a cold-start lambda. When this happens: 1. The runtime's resilient start path creates the run from `run_started`. 2. The original `run_created` arrives and gets 409 (EntityConflictError). 3. `start()` treats the 409 as success (the run exists). -This is handled correctly. The `resilientStart` flag is NOT set on the Run instance -in this case (409 is not a retryable error), so `returnValue` fails fast on 404. +The `resilientStart` flag is NOT set on the Run instance in this case (409 is not a retryable error), so `returnValue` fails fast on 404. -### Local Prod test flakiness (resolved) +### Atomicity of run entity creation -On world-local, the queue's async IIFE can deliver the message before -`events.create(run_created)` finishes writing to the shared filesystem. The -resilient start path should handle this, but Local Prod tests showed occasional -runs stuck at `pending` (no `run_started` event), and Windows CI showed -"Unconsumed event in event log" errors from duplicate `run_created` events. +The normal `run_created` path and the resilient start path can race on creating the run entity. In `world-local`, both paths use `writeExclusive` (O_CREAT|O_EXCL) — atomic at the OS level, so exactly one writer wins and the other gets EEXIST. The normal path throws `EntityConflictError` on conflict (handled by `start()` as 409); the resilient start path re-reads the run from disk on conflict. -**Root cause:** A TOCTOU race between the normal `run_created` path and the -resilient start path. Both used `writeJSON` which checks existence with -`fs.access()` (non-atomic), so both could pass the check and write separate -`run_created` events with different event IDs. Fixed by switching both paths to -`writeExclusive` (O_CREAT|O_EXCL) — see retrospective items 12 and 16. +In `world-postgres`, the resilient start path uses `onConflictDoNothing` plus a re-read on conflict for the same effect, with the same outcome on either side of the race. -## Follow-up work - -- [x] ~~Investigate Local Prod test flakiness~~ — resolved via `writeExclusive` - for run entity creation (retrospective items 12, 16). -- [ ] Monitor the Datadog metric in production to understand how often the fallback is hit. -- [x] ~~Events optimization for re-enqueue cycles~~ — decided against. The - already-running path returns early without writing an event, so preloading - events there would require an extra filesystem/DB query on every re-enqueue. - More importantly, on Vercel with at-least-once delivery, multiple lambdas can - process the same run concurrently — the event snapshot could be stale or - incomplete. The runtime's fallback to `events.list` is the correct behavior - for re-enqueue cycles. -- [x] ~~CborTransport pass-through~~ — refactored. `encode()`/`decode()` now - live inside `CborTransport.serialize()`/`deserialize()`, matching the pattern - used by TypedJsonTransport (world-local) and the inline transport - (world-postgres). Call sites pass plain objects instead of pre-encoded buffers. - -## Development retrospective - -Chronological log of mistakes, misunderstandings, and reverted approaches during -development. Included for future reference when working on similar cross-cutting -runtime changes. - -### 1. Uint8Array corruption through JSON queue transport - -The initial implementation passed `runInput.input` (a `Uint8Array`) directly through -the queue payload. `Uint8Array` doesn't survive `JSON.stringify` — it becomes -`{"0":72,"1":101,...}`. This corrupted the workflow input when the resilient start -path tried to recreate the run from the queue-delivered data. - -Caught by the `spawnWorkflowFromStepWorkflow` e2e test and the `world-testing` -embedded tests, which failed with "Invalid input" from devalue's `unflatten()`. - -Three approaches were tried before landing on the final solution: - -1. **Base64 encoding** (`btoa`/`atob`) — worked but fragile. The decode side used - `typeof runInput.input === 'string'` as a discriminant, which was flagged as - dangerous since non-binary inputs could also be strings. -2. **`Array.from()`/`new Uint8Array()`** — replaced base64 with a plain number array. - Two problems: (a) 3x JSON size regression vs base64, and (b) `Array.isArray()` - false-positives on v1Compat runs where `dehydrateWorkflowArguments` returns - devalue's flat Array format. -3. **CBOR + BufferTransport** (final) — world-vercel CBOR-encodes the queue payload; - world-local and world-postgres use a `TypedJsonTransport` with a tagged envelope. - -### 2. Forgot to commit world-postgres transport fix (twice) - -After fixing world-local and world-vercel queue transports, the same `JsonTransport` -corruption bug existed in world-postgres. The fix was written during a session but -never committed — lost when the working directory was reset via stash/checkout. This -happened twice. The fix only landed on the third attempt when it was committed and -pushed immediately. All 14 Postgres e2e jobs failed each time. - -### 3. Incorrect diagnosis of Vercel Prod 409 errors - -Multiple Vercel Prod e2e tests failed with `EntityConflictError: Workflow run with -ID wrun_... already exists` on `run_created`. The initial assumption was that VQS -couldn't deliver the queue message fast enough to beat the `run_created` call. - -Datadog logs showed otherwise: the `run_created` request arrived at Vercel's edge -116ms before `run_started`, but `run_created` hit a cold-start lambda (727ms) while -`run_started` hit a warm one (23ms). Cold starts can invert expected execution order. - -### 4. Removed EntityConflictError catch, then had to restore it - -The `workflowEntrypoint` error handler originally caught both `EntityConflictError` -and `RunExpiredError`. When adding the "already-running returns run without event" -behavior, `EntityConflictError` was removed from the catch since the new worlds -wouldn't throw it. Reviewer flagged this: old worlds or world-vercel hitting an -older workflow-server could still throw it. The catch was restored. - -### 5. Duplicate `startedAt` check - -After refactoring the `run_started` flow, a `workflowRun.startedAt` null check -existed both inside the `try` block and after the `catch` block. The second was -unreachable. Removed after review. - -### 6. WORKFLOW_SERVER_URL_OVERRIDE left set - -During development, `WORKFLOW_SERVER_URL_OVERRIDE` was set to a test URL pointing -at the workflow-server preview deployment and accidentally committed. The Vercel -bot flagged this. Reset to empty string. - -### 7. e2e test assertion was wrong - -The resilient start e2e test stubbed `world.events.create` and asserted -`createCallCount >= 2`. But the stub only intercepts calls from the test runner -process — the server uses its own world. `createCallCount` was always 1. Changed -to `expect(createCallCount).toBe(1)`. - -### 8. Misattributed Local Prod timeouts as "pre-existing" - -Local Prod tests showed 60-second timeouts across various tests. Initially dismissed -as CI flakes. Checking main's CI showed all Local Prod tests pass on main — the -timeouts are caused by our changes. Should have compared against main immediately. - -### 9. Attempted to revert parallel dispatch - -After identifying Local Prod timeouts, `start()` was partially reverted back to -sequential dispatch. The user pointed out that parallel dispatch is the core value -proposition of the PR. The revert was undone. - -### 10. WorkflowRunNotFoundError retry was unconditional - -The initial `pollReturnValue` retry on `WorkflowRunNotFoundError` applied to all -`Run` instances. A user calling `getRun()` with a wrong ID would wait 10 seconds -before getting a 404. Fixed by adding a `resilientStart` flag: only retries when -`run_created` actually failed. - -### 11. Changeset `minor` vs `patch` - -The changeset was created with `"@workflow/core": minor`. Reviewer flagged this as -violating repo rules ("all changes should be patch"). Changed after discussion. - -### 12. world-local TOCTOU race causing duplicate `run_created` events (Windows CI) - -The resilient start path AND the normal `run_created` path in `world-local/events-storage.ts` -both used `writeJSON` to create the run entity. `writeJSON` checks file existence with -`fs.access()` then writes via temp+rename — a classic TOCTOU race. On the local world, -the queue delivers via an async IIFE in the same event loop, so `events.create(run_created)` -and `events.create(run_started)` (with resilient start) run concurrently: - -1. Both paths call `fs.access(runPath)` → ENOENT (file doesn't exist yet) -2. Both proceed to write → the last `fs.rename` wins -3. Both succeed → both write their own `run_created` event with different event IDs -4. During replay, the consumer sees two `run_created` events → "Unconsumed event" error - -This caused consistent failures in `world-testing` embedded tests on Windows CI (`hooks`, -`supports null bytes in step results`, `retriable and fatal errors` — all timing out at -60s with "Unconsumed event in event log" errors). Linux CI was not affected because the -timing was different enough that the race window was rarely hit. - -Fixed by switching BOTH paths to `writeExclusive` (O_CREAT|O_EXCL), which is atomic at -the OS level — exactly one writer wins, the other gets EEXIST. The normal `run_created` -path throws `EntityConflictError` on conflict (handled by `start()` as 409). The resilient -start path re-reads the run from disk on conflict. Either way, only one `run_created` -event is written. - -### 13. Non-atomic run + run_created event in world-postgres resilient path - -The resilient start path in `world-postgres/storage.ts` did two separate writes (run -insert, then event insert) without a transaction. If the process crashed between them, -the run would exist without a `run_created` event — an inconsistent event log. - -A `drizzle.transaction()` wrapper was attempted but dropped due to TypeScript inference -issues with drizzle's transaction callback and the insert builder's overloads. The current -fix keeps the two writes sequential but adds the same conflict-aware re-read pattern as -world-local: when `onConflictDoNothing` produces no result (run already existed), the run -is re-read so downstream logic sees the real state. The narrow crash window between the -two writes is acceptable — if the run insert succeeds but the event insert crashes, the -run exists and `run_started` will still proceed normally (the event log will be missing a -`run_created` entry, but the run itself is functional). - -### 14. Missing `WorkflowRunStatus` span attribute after parallel refactor - -The `start()` span previously set `Attribute.WorkflowRunStatus(result.run.status)`, but -this was dropped in the parallel refactor because `result.run` is only available when -`runCreatedResult` fulfilled. The attribute is now conditionally set when the result is -available. In the resilient start case (run_created failed), the attribute is omitted -rather than erroring. - -### 15. `run_started` eventData leak in world-postgres result - -The `...data` spread in the result construction leaked `eventData` from `run_started` -into the returned event object. Storage was already correct (`storedEventData` is -`undefined` for `run_started`), but the returned result carried the input data. While -harmless (the runtime doesn't use `result.event.eventData`), it was restored to match -the pre-refactor behavior where eventData was explicitly stripped from the result. - -### 16. Normal `run_created` path also needed `writeExclusive` (Windows CI) - -The initial TOCTOU fix (item 12) only changed the resilient start path to use -`writeExclusive`. The normal `run_created` entity write still used `writeJSON` which -checks existence with `fs.access()` then writes via temp+rename — not atomic. On -Windows CI, the local queue's async IIFE delivered fast enough for both paths to pass -their existence checks simultaneously, producing two `run_created` events with different -event IDs. The events consumer saw the duplicate as "Unconsumed event in event log," -causing `hooks`, `supports null bytes in step results`, and `retriable and fatal errors` -tests to time out at 60s. Fixed by also switching the normal `run_created` entity write to -`writeExclusive`, making both paths use the same atomic gate. - -### 17. CborTransport was a pass-through wrapper - -`world-vercel/queue.ts` had `CborTransport` implementing `Transport` with a -no-op `serialize` (identity function) and a `deserialize` that reassembled chunks into -a Buffer without decoding. The actual CBOR `encode()`/`decode()` calls happened at the -call sites — `queue()` pre-encoded before calling `client.send()`, and the handler -post-decoded after receiving from `client.handleCallback()`. This violated the transport -abstraction (every other transport does its encoding inside serialize/deserialize) and -meant the call site had to remember to pre-encode. Refactored to move `encode()`/`decode()` -into the transport methods and changed the type from `Transport` to -`Transport`. - -## Follow-up work (additional) - -- [x] ~~**CborTransport is a pass-through**~~ — Resolved. Moved `encode()`/`decode()` - into `CborTransport.serialize()`/`CborTransport.deserialize()`. The transport is now - self-contained: call sites pass plain objects, and the handler receives decoded objects. - See retrospective item 17. +The narrow crash window in `world-postgres` between the run insert and the event insert is acceptable — if the run insert succeeds but the event insert crashes, the run exists and `run_started` will still proceed normally (the event log will be missing a `run_created` entry, but the run itself is functional). From 1fb6c0c9240e8d3f879da10c0f2ea8cf69a098a6 Mon Sep 17 00:00:00 2001 From: Peter Wielander Date: Fri, 22 May 2026 09:21:01 +0200 Subject: [PATCH 2/2] wording Signed-off-by: Peter Wielander --- packages/core/src/runtime/helpers.ts | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/packages/core/src/runtime/helpers.ts b/packages/core/src/runtime/helpers.ts index 582975af57..761df62bfb 100644 --- a/packages/core/src/runtime/helpers.ts +++ b/packages/core/src/runtime/helpers.ts @@ -354,7 +354,7 @@ export async function loadWorkflowRunEvents( // Preserve the last non-null cursor across pages. A World may // legitimately return `{ data: [], cursor: null, hasMore: false }` // on a trailing empty page — for example when the previous page's - // underlying DynamoDB query hit `Limit` exactly and returned a + // underlying DB query hit the limit exactly and returned a // `LastEvaluatedKey` "just in case". Overwriting with that null // would lose the position past the last real event we loaded and // force the runtime into the "no cursor after initial load" full-