Conversation
🦋 Changeset detectedLatest commit: 188fa43 The changes in this PR will be included in the next version bump. This PR includes changesets to release 29 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughIntroduces a durable Session primitive end-to-end: a new Prisma Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
2210fe2 to
4cadc19
Compare
Durable, typed, bidirectional I/O primitive that outlives a single run.
Ship target is agent/chat use cases; run-scoped streams.pipe/streams.input
are untouched and do not create Session rows.
Postgres
- New Session table: id, friendlyId, externalId, type (plain string),
denormalised project/environment/organization scalar columns (no FKs),
taskIdentifier, tags String[], metadata Json, closedAt, closedReason,
expiresAt, timestamps
- Point-lookup indexes only (friendlyId unique, (env, externalId) unique,
expiresAt). List queries are served from ClickHouse so Postgres stays
minimal and insert-heavy.
Control-plane API
- POST /api/v1/sessions create (idempotent via externalId)
- GET /api/v1/sessions list with filters (type, tag,
taskIdentifier, externalId, status
ACTIVE|CLOSED|EXPIRED, period/from/to)
and cursor pagination, ClickHouse-backed
- GET /api/v1/sessions/:session retrieve — polymorphic: `session_` prefix
hits friendlyId, otherwise externalId
- PATCH /api/v1/sessions/:session update tags/metadata/externalId
- POST /api/v1/sessions/:session/close terminal close (idempotent)
Realtime (S2-backed)
- PUT /realtime/v1/sessions/:session/:io returns S2 creds
- GET /realtime/v1/sessions/:session/:io SSE subscribe
- POST /realtime/v1/sessions/:session/:io/append server-side append
- S2 key format: sessions/{friendlyId}/{out|in}
Auth
- sessions added to ResourceTypes. read:sessions:{id},
write:sessions:{id}, admin:sessions:{id} scopes work via existing JWT
validation.
ClickHouse
- sessions_v1 ReplacingMergeTree table
- SessionsReplicationService mirrors RunsReplicationService exactly:
logical replication with leader-locked consumer, ConcurrentFlushScheduler,
retry with exponential backoff + jitter, identical metric shape.
Dedicated slot + publication (sessions_to_clickhouse_v1[_publication]).
- SessionsRepository + ClickHouseSessionsRepository expose list, count,
tags with cursor pagination keyed by (created_at DESC, session_id DESC).
- Derived status (ACTIVE/CLOSED/EXPIRED) computed from closed_at + expires_at;
in-memory fallback on list results to catch pre-replication writes.
Verification
- Webapp typecheck 10/10
- Core + SDK build 3/3
- sessionsReplicationService.test.ts integration tests 2/2 (insert + update
round-trip via testcontainers)
- Live round-trip against local dev: create -> retrieve (friendlyId and
externalId) -> out.initialize -> out.append x2 -> in.send -> out.subscribe
(receives records) -> close -> ClickHouse sessions_v1 shows the replicated
row with closed_reason
- Live list smoke: tag, type, status CLOSED, externalId, and cursor pagination
…te/update The session_ prefix identifies internal friendlyIds. Allowing it in a user-supplied externalId would misroute subsequent GET/PATCH/close requests through resolveSessionByIdOrExternalId to a friendlyId lookup, returning null or the wrong session. Reject at the schema boundary so both routes surface a clean 422.
Without allowJWT/corsStrategy, frontend clients holding public access tokens hit 401 on GET /api/v1/sessions and browser preflights fail. Matches the single-session GET/PATCH/close routes and the runs list endpoint.
- Derive isCached from the upsert result (id mismatch = pre-existing row) instead of doing a separate findFirst first. The pre-check was racy — two concurrent first-time POSTs could both return 201 with isCached: false. Using the returned row's id is atomic and saves a round-trip. - Scope the list endpoint's authorization to the standard action/resource pattern (matches api.v1.runs.ts): task-scoped JWTs can list sessions filtered by their task, and broader super-scopes (read:sessions, read:all, admin) authorize unfiltered listing. - Log and swallow unexpected errors on POST rather than returning the raw error.message. Prisma/internal messages can leak column names and query fragments.
Give Session channels run-engine waitpoint semantics so a task can
suspend while idle on a session channel and resume when an external
client sends a record — parallel to what streams.input offers
run-scoped streams.
Webapp
- POST /api/v1/runs/:runFriendlyId/session-streams/wait — creates a
manual waitpoint attached to {sessionId, io} and race-checks the S2
stream starting at lastSeqNum so pre-arrived data fires it
immediately. Mirrors the existing input-stream waitpoint route.
- sessionStreamWaitpointCache.server.ts — Redis set keyed on
{sessionFriendlyId, io}, drained atomically on each append so
concurrent multi-tab waiters all wake together.
- realtime.v1.sessions.$session.$io.append now drains pending
waitpoints after every record lands and completes each with the
appended body.
- S2RealtimeStreams.readSessionStreamRecords — session-channel
parallel of readRecords, feeds the race-check path.
Core
- CreateSessionStreamWaitpoint request/response schemas alongside
the existing Session CRUD schemas. Server API contract only —
the client ApiClient + SDK wrapper ship on the AI-chat branch.
Two fixes needed by browser clients hitting the public session API
(TriggerChatTransport's direct accessToken path, WebSocket-less
session drivers, anything origin'd off the dashboard):
- POST /api/v1/sessions: allowJWT: true + corsStrategy: "all" on
the action. Pre-fix, the create endpoint only accepted secret-key
auth, so any browser-originated sessions.create(...) 401'd. The
loader (list) already had these; matches that shape.
- POST /realtime/v1/sessions/:session/:io/append: export both
{ action, loader } so Remix routes the OPTIONS preflight to the
route builder's CORS handler. With only { action } exported, the
preflight returns 400 'No loader for route' and Chrome surfaces
the follow-up POST as net::ERR_FAILED. Same pattern as
/api/v1/tasks/:id/trigger (which already exports both).
Validated by an end-to-end UI smoke on references/ai-chat:
new chat → send → streamed assistant reply in ~4s → second turn
reuses the same session + run, lastEventId advances 10 → 21.
f4406d7 to
4f2c0e7
Compare
Nine fixes from CodeRabbit + Devin review:
- api.v1.sessions.$session.close.ts:
- Export { action, loader } so CORS preflight reaches the route
builder's OPTIONS handler. Same fix already applied to the
append route — Devin caught that I'd missed this one. Without
the loader, browser clients hitting POST /close fail preflight.
- Switch to `prisma.session.updateMany({ where: { id, closedAt:
null }, ... })` so concurrent closes can't overwrite the
original `closedAt` / `closedReason`. Loser hits count === 0 and
re-reads the winning row — closedness is write-once at the DB
level. (CodeRabbit: TOCTOU.)
- entry.server.tsx:
Wrap the async `sessionsReplicationInstance.shutdown` in a sync
handler with `.catch(...)`. SIGTERM/SIGINT fire during process
teardown and a rejection from `_replicationClient.stop()` would
become an unhandled promise rejection. Matches the pattern in
`dynamicFlushScheduler.server.ts`. (CodeRabbit: unhandled rejection
risk.)
- api.v1.runs.$runFriendlyId.session-streams.wait.ts:
- Swallowed race-check catch now logs `warn` with
sessionFriendlyId / io / waitpointId / error. Silent failures in
the S2-read / engine-complete / cache-remove path were
indistinguishable from the expected cache-drain-on-append fast
path.
- Outer 500 path no longer forwards `error.message` (Prisma /
engine / S2 internals could leak). Logs server-side and returns
a generic "Something went wrong"; 422 ServiceValidationError
path unchanged. (CodeRabbit: info-leak + logging gap.)
- realtime.v1.sessions.$session.$io.ts:
Add `method: "PUT"` to the route config so the route builder
enforces method validation before the handler runs. Removed the
now-redundant `request.method !== "PUT"` check inside the handler.
(CodeRabbit: defense-in-depth.)
- services/sessionsRepository/sessionsRepository.server.ts:
`ISessionsRepository` is now a `type` alias, per repo coding
guideline ("use types over interfaces"). Structural-typing means
implementing classes don't need source changes. (CodeRabbit.)
- services/sessionStreamWaitpointCache.server.ts:
Replace separate SADD + PEXPIRE with a single atomic Lua script.
Solves two distinct concerns at once:
1. Partial-failure window (CodeRabbit): if SADD succeeded and
PEXPIRE failed, the key would persist with no TTL. The Lua
script fails both or succeeds both.
2. TTL-race (Devin, twice): each waitpoint registers with its own
`ttlMs` derived from the caller's timeout. The old code called
PEXPIRE unconditionally, so a short-TTL registration would
shrink the shared key's TTL below a longer-TTL sibling —
evicting the sibling from Redis and degrading the append-path
fast drain to engine-timeout-only. The script only PEXPIREs if
the new TTL is greater than the current PTTL (or the key has
no TTL yet), so the key lives as long as the longest-TTL
member.
Outstanding: one unresolved thread asking to rename
`CloseSessionRequestBody.reason` → `closedReason` for symmetry with
the DB column. Holding that for an API-taste call — will follow up.
Validated: `pnpm run typecheck --filter webapp` clean.
Devin catch on #3417 — the ClickHouse sessions list was slicing `sessionIds.slice(1, size + 1)` on the backward path, which skipped the item closest to the cursor and surfaced the sentinel (the `size+1`th item that proves hasMore=true) to the user. Trace, with items c01…c11 and cursor=c07 (page size 3): - Backward query: `session_id > c07 ORDER BY ASC LIMIT 4` → `[c08, c09, c10, c11]`. Legitimate content is the first three (`[c08, c09, c10]`); `c11` is the sentinel. - Previous slice: `[c09, c10, c11]` → displayed DESC `[c11, c10, c09]` — user never sees c08, sees sentinel c11 instead. Fix: collapse both directions to `sessionIds.slice(0, size)`. The sentinel is always the last item regardless of direction, so the two branches had no reason to diverge. Cursor computations (`previousCursor = reversed.at(1)`, `nextCursor = reversed.at(size)`) already line up with the corrected slice — no change needed there. Verified: webapp typecheck clean.
/realtime/v1/sessions/:session/:io=out now peeks the tail record in S2
at connection time. When the tail chunk is trigger:turn-complete, the
agent has finished a turn and is either idle-waiting on .in or has
exited — either way no more chunks will arrive without further user
action. In that case the downstream S2 read switches to wait=0 so the
SSE drains and closes in ~1s instead of long-polling for 60s, and the
response carries X-Session-Settled: true so the client can tell the
close is terminal rather than a normal 60s cycle.
Mid-turn tails (streaming UIMessageChunks in flight) fall through to
the existing wait=60 long-poll. Crashed-mid-turn is indistinguishable
from live-streaming at this point and gets the same 60s retry loop as
today — that's a separate hardening, not in scope here.
The peek uses GET /records?tail_offset=1&count=1&wait=0 (single-digit
ms on S2), then unwraps the agent-side envelope written by
StreamsWriterV2: record.body parses to {data: <chunk>, id}, where
<chunk> is the raw UIMessageChunk object. No double-parse on data.
404 / 416 from the peek (stream never written / empty stream) short-
circuit to settled=false so first-connect on a freshly-created session
keeps the long-poll semantics the agent's first chunks depend on.
Verified end-to-end against an idle chat-agent-smoke session: caught-
up reconnect (Last-Event-ID = tail) closes in 1.08s with the header;
behind reconnect (Last-Event-ID < tail) drains remaining records then
closes in 0.94s with the header; empty-stream reconnect keeps the 60s
long-poll behavior unchanged.
Session is now the run manager for chat.agent and any future task-bound
session. Atomically creates the row + triggers the first run + tracks
the current run via optimistic claim, with a SessionRun audit log for
provenance.
Schema:
- Session gains `taskIdentifier`, `triggerConfig` (JSON), `currentRunId`
(non-FK), `currentRunVersion` (monotonic int for optimistic claim).
- New SessionRun audit table — one row per run a session triggers,
with `reason: "initial" | "continuation" | "upgrade" | "manual"`.
Lifecycle:
- `POST /api/v1/sessions`: idempotent on `(env, externalId)`, refreshes
triggerConfig on cache hit, runs `ensureRunForSession` (probe +
optimistic claim), returns a session-scoped PAT. JWT auth path
dropped — secret-key only. The customer's server is the only entry
point for session creation.
- `POST /api/v1/sessions/:s/end-and-continue`: server-orchestrated
handoff (cancels current run, triggers a fresh one, swaps
currentRunId via `updateMany where currentRunVersion`). Powers
`chat.requestUpgrade()` from inside the agent runtime.
- `POST /realtime/v1/sessions/:s/:io/append`: probe + ensureRunForSession
before append so messages arriving while no run is alive boot one
transparently.
Cross-form addressing on write paths:
- `createActionApiRoute` now runs `findResource` before `authorization`,
matching `createLoaderApiRoute`. Action routes get an optional
`resource` argument on `authorization.resource()` —
backwards-compatible (existing 4-arg callbacks unchanged).
- Append + end-and-continue use the new ordering to authorize against
`{paramSession, friendlyId, externalId}` so a JWT minted for either
form authorizes either URL form.
Helpers:
- `mintSessionToken.server.ts`: server-side session-PAT factory
(`read:sessions:{key} + write:sessions:{key}`, 1h TTL).
- `sessionRunManager.server.ts`: `ensureRunForSession` (probe + claim)
and `swapSessionRun` (force handoff with optimistic claim +
cancel-on-loss).
Pre-mutation existence reads switched to `$replica` (close, end-and-
continue, PATCH).
Three fixes after pushing the Sessions-as-run-manager commit:
- `api.v1.sessions.$session.end-and-continue.ts` was destructuring only
`{ action }` from `createActionApiRoute`, which means Remix had no
handler for OPTIONS preflight on this route. Browser CORS would 405.
Sibling routes (`close.ts`) already export `{ action, loader }`. Fix:
destructure and export both.
- `ensureRunForSession`'s pathological "lost the claim race AND the
winner's run was already terminal" branch recursed without bound. In
practice progress through the run engine bounds it, but a misconfigured
task that crashes before being dequeued could blow the stack. Add a
hidden `_attempt` counter, throw `SessionRunManagerError` once it
exceeds 3.
- `sessionsReplicationService.test.ts` was failing in CI because the
sessions-as-run-manager schema migration made `taskIdentifier` and
`triggerConfig` required on `Session`. The two `prisma.session.create`
calls in the test predate the migration. Add the now-required fields
to both fixtures.
Two fixes from Devin review on the sessions-as-run-manager commit: - `SessionItem.currentRunId`'s contract is the `run_*` friendlyId, but `serializeSession` returns the raw Prisma cuid. The `POST /sessions` create path overrides correctly via a TaskRun lookup, but GET, PATCH, and the three return paths in close.ts were passing the cuid through. A consumer using `currentRunId` from those endpoints in a downstream `GET /api/v1/runs/:runId` call would 404. Add a `serializeSessionWithFriendlyRunId` helper next to `serializeSession` that resolves via `$replica.taskRun.findFirst` (TaskRun friendlyIds are immutable, so replica lag is harmless), and switch the five affected return sites to use it. List endpoints stay on `serializeSession` to avoid N+1 lookups when paginating. The create endpoint keeps its existing manual lookup because it also needs the friendlyId for the response's `runId` field, and `session.currentRunId` is stale relative to the post-`ensureRunForSession` claim outcome. - Drop dead `lastChunkType` recomputation in `streamResponseFromSessionStream`. The variable was bound but never used; the conditional below it re-evaluated the same expression. Use the bound value in the condition.
Collapse `session-out-settled-signal.md` and `sessions-public-api-cors.md` into the single `session-primitive.md`, and rewrite that one to a high- level two-sentence summary that covers everything actually shipping in this PR (sessions-as-run-manager, end-and-continue, waitpoints, etc.). The CORS/JWT-on-create story is also out of date now that POST /api/v1/sessions is secret-key only.
…friendlyId Switch the two read-after-write taskRun lookups (POST /api/v1/sessions and POST /api/v1/sessions/:s/end-and-continue) from $replica back to prisma. Both reads happen immediately after triggering a run on the writer; replica lag would null the result and turn a successful create into a 500, or fall back to leaking the internal cuid in the end-and-continue response.
…n sessionRunManager The lost-race re-read in ensureRunForSession and swapSessionRun reads the Session row that the winner just wrote on the writer. Reading from $replica could return pre-race state and either (1) cause ensureRunForSession to recurse with a stale currentRunVersion, fail the next claim, and waste runs until max-attempts; or (2) cause swapSessionRun to return swapped: false with the calling run's own id, misleading the caller into thinking it is still authoritative.
The S2 record envelope wraps the agent-written chunk as
{data: <chunkAsString>, id: partId} because StreamsWriterV2 hands
appendPart an already-stringified chunk. The peek-settled check
treated envelope.data as an object, so typeof === 'object' always
returned false and the trigger:turn-complete sentinel was never
matched. Reconnect-on-reload silently degraded to the full long-poll
path. Parse envelope.data once more so the type discriminator is
surfaced.
… run lookup Same read-after-write pattern as the other lost-race re-reads: the run was just triggered on the writer milliseconds before, so a $replica.findFirst can return null due to replication lag. The null silently no-ops the cancellation and leaks an orphan run that no session will ever claim.
When the upsert path returns a previously-closed row, return 409 before ensureRunForSession fires. Otherwise we'd trigger a fresh run on a closed session that can't receive .in input (append handler rejects writes to closed sessions), wasting compute on a run that exits the moment it tries to read. close is one-way; callers must use a different externalId to start a new session.
The race-check in api.v1.runs.$runFriendlyId.session-streams.wait was selecting the realtime stream instance via run.realtimeStreamsVersion, but session streams are always v2 (S2) — the writer (appendPartToSessionStream) and the SSE subscribe both hardcode v2. For a v1 run the race-check silently fell back to a non-S2 instance, the instanceof check missed, and the optimization was skipped. Hardcode v2 for parity with the rest of the session surface.
…ized routes createActionApiRoute now runs findResource before authorization so the auth scope check can expand to alternate identifiers of the resolved resource (Sessions are addressable by both friendlyId and externalId). Side-effect: an authenticated-but-underscoped caller could probe resource existence by observing 404 vs 403. Mask the 404 as 403 with the same response shape as the auth-failed branch when the route declares authorization, so the two cases are indistinguishable to callers without scopes. Routes without authorization keep returning 404.
Previous fix unconditionally returned 403 when findResource was null on
a route with authorization, breaking PRIVATE-key callers (e.g. server
SDK) hitting the existing api.v2.runs.cancel route — they always pass
authorization but the new code returned 403 with a factually wrong
message ('Unauthorized: missing required scopes') even though they had
full permissions.
New ordering: run authorization first (with the resolved resource as
the 5th arg, so cross-form session auth still works), then check
resource-null → 404. This gives:
- PRIVATE key + missing resource: auth passes → 404 (correct)
- Underscoped JWT + missing resource: auth fails (resource not in
scope) → 403 (no info leak vs existing resource)
- Underscoped JWT + existing resource: auth fails → 403 (unchanged)
Only auth callbacks that destructure the resource (loader for
realtime.v1.sessions.$session.$io) need to handle null — they all
already do, since findResource was already nullable in pre-PR
loaders.
## Summary 8 new features, 18 improvements, 11 bug fixes. ## Breaking changes - Add server-side deprecation gate for deploys from v3 CLI versions (gated by `DEPRECATE_V3_CLI_DEPLOYS_ENABLED`). v4 CLI deploys are unaffected. ([#3415](#3415)) ## Improvements - Add `--no-browser` flag to `init` and `login` to skip auto-opening the browser during authentication. Also error loudly when `init` is run without `--yes` under non-TTY stdin (previously default-and-exited silently, leaving the project half-initialized). Both commands now show an `Examples` section in `--help`. ([#3483](#3483)) - Add `isReplay` boolean to the run context (`ctx.run.isReplay`), derived from the existing `replayedFromTaskRunFriendlyId` database field. Defaults to `false` for backwards compatibility. ([#3454](#3454)) - Redact the `resolveWaitpoint` runtime log so it only emits `id` and `type` instead of the full completed waitpoint. Previously the log printed the entire waitpoint (including `output`) to stdout in production runs, which could leak sensitive payloads. The value returned by `wait.forToken()` is unchanged. ([#3490](#3490)) - Add `SessionId` friendly ID generator and schemas for the new durable Session primitive. Exported from `@trigger.dev/core/v3/isomorphic` alongside `RunId`, `BatchId`, etc. Ships the `CreateSessionStreamWaitpoint` request/response schemas alongside the main Session CRUD. ([#3417](#3417)) - Truncate large error stacks and messages to prevent OOM crashes. Stack traces are capped at 50 frames (keeping top 5 + bottom 45 with an omission notice), individual stack lines at 1024 chars, and error messages at 1000 chars. Applied in parseError, sanitizeError, and OTel span recording. ([#3405](#3405)) ## Server changes These changes affect the self-hosted Docker image and Trigger.dev Cloud: - Add a "Back office" tab to `/admin` and a per-organization detail page at `/admin/back-office/orgs/:orgId`. The first action available on that page is editing the org's API rate limit: admins can save a `tokenBucket` override (refill rate, interval, max tokens) and see a plain-English preview of the resulting sustained rate and burst allowance. Writes are audit-logged via the server logger. ([#3434](#3434)) - Optional `DEPLOY_REGISTRY_ECR_DEFAULT_REPOSITORY_POLICY` env var to apply a default repository policy when the webapp creates new ECR repos ([#3467](#3467)) - Ship the Errors page to all users, with a polish + bug-fix pass: pinned "No channel" item in the Slack alert channel picker, viewer-timezone alert timestamps via Slack's `<!date^>` token, Activity sparkline peak tooltip, centered loading spinner and bug-icon empty state on the error detail page, ellipsis on the Configure alerts trigger. ([#3477](#3477)) - Configure the set of machine presets to build boot snapshots for at deploy time via `COMPUTE_TEMPLATE_MACHINE_PRESETS` (CSV of preset names, default `small-1x`). Use `COMPUTE_TEMPLATE_MACHINE_PRESETS_REQUIRED` (CSV, default = full PRESETS list) to scope which preset failures fail a required-mode deploy. Optional preset failures are logged and don't block the deploy. ([#3492](#3492)) - Regenerating a RuntimeEnvironment API key no longer invalidates the previous key immediately. The old key is recorded in a new `RevokedApiKey` table with a 24 hour grace window, and `findEnvironmentByApiKey` falls back to it when the submitted key doesn't match any live environment. The grace window can be ended early (or extended) by updating `expiresAt` on the row. ([#3420](#3420)) - Add the `Session` primitive — a durable, task-bound, bidirectional I/O channel that outlives a single run and acts as the run manager for `chat.agent`. Ships the Postgres `Session` + `SessionRun` tables, ClickHouse `sessions_v1` + replication service, the `sessions` JWT scope, and the public CRUD + realtime routes (`/api/v1/sessions`, `/realtime/v1/sessions/:session/:io`) including `end-and-continue` for server-orchestrated run handoffs and session-stream waitpoints. ([#3417](#3417)) - Add `KUBERNETES_POD_DNS_NDOTS_OVERRIDE_ENABLED` flag (off by default) that overrides the cluster default and sets `dnsConfig.options.ndots` on runner pods (defaulting to 2, configurable via `KUBERNETES_POD_DNS_NDOTS`). Kubernetes defaults pods to `ndots: 5`, so any name with fewer than 5 dots — including typical external domains like `api.example.com` — is first walked through every entry in the cluster search list (`<ns>.svc.cluster.local`, `svc.cluster.local`, `cluster.local`) before being tried as-is, turning one resolution into 4+ CoreDNS queries (×2 with A+AAAA). Using a lower `ndots` value reduces DNS query amplification in the `cluster.local` zone. Note: before enabling, make sure no code path relies on search-list expansion for names with dots ≥ the configured value — those names will hit their as-is form first and could resolve externally before falling back to the cluster search path. ([#3441](#3441)) - Vercel integration option to disable auto promotions ([#3376](#3376)) - Make it clear in the admin that feature flags are global and should rarely be changed. ([#3408](#3408)) - Admin worker groups API: add GET loader and expose more fields on POST. ([#3390](#3390)) - Add 60s fresh / 60s stale SWR cache to `getEntitlement` in `platform.v3.server.ts`. Eliminates a synchronous billing-service HTTP round trip on every trigger. Reuses the existing `platformCache` (LRU memory + Redis) pattern already used for `limits` and `usage`. Cache key is `${orgId}`. Errors return a permissive `{ hasAccess: true }` fallback (existing behavior) and are also cached to prevent thundering-herd on billing outages. ([#3388](#3388)) - Show a `MicroVM` badge next to the region name on the regions page. ([#3407](#3407)) - Increase default maximum project count per organization from 10 to 25 ([#3409](#3409)) - Merge execution snapshot creation into the dequeue taskRun.update transaction, reducing 2 DB commits to 1 per dequeue operation ([#3395](#3395)) - Add per-worker Node.js heap metrics to the OTel meter — `nodejs.memory.heap.used`, `nodejs.memory.heap.total`, `nodejs.memory.heap.limit`, `nodejs.memory.external`, `nodejs.memory.array_buffers`, `nodejs.memory.rss`. Host-metrics only publishes RSS, which overstates V8 heap by the external + native footprint; these give direct heap visibility per cluster worker so `NODE_MAX_OLD_SPACE_SIZE` can be sized against observed heap peaks rather than RSS. ([#3437](#3437)) - Tag Prisma spans with `db.datasource: "writer" | "replica"` so monitors and trace queries can distinguish the writer pool from the replica pool. Applies to all `prisma:engine:*` spans (including `prisma:engine:connection` used by the connection-pool monitors) and the outer `prisma:client:operation` span. ([#3422](#3422)) - Clarify the cross-region intent in the Terraform and AI-prompt helpers on the Add Private Connection page. Both already default `supported_regions` to `["us-east-1", "eu-central-1"]`; added an inline comment / parenthetical so the user understands why both regions are listed (Trigger.dev runs in both, so the service must be consumable from either). ([#3465](#3465)) - Add `RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED` flag (default off) to route the Prisma reads inside `RunEngine.getSnapshotsSince` through the read-only replica client. Offloads the snapshot polling queries (fired by every running task runner) from the primary. When disabled, behavior is unchanged. ([#3423](#3423)) - Stop creating TaskRunTag records and _TaskRunToTaskRunTag join table entries during task triggering. The denormalized runTags string array on TaskRun already stores tag names, making the M2M relation redundant write overhead. ([#3369](#3369)) - Stop writing per-tick state (`lastScheduledTimestamp`, `nextScheduledTimestamp`, `lastRunTriggeredAt`) on `TaskSchedule` and `TaskScheduleInstance`. The schedule engine now carries the previous fire time forward via the worker queue payload, eliminating ~270K dead-tuple-driven autovacuums per year on these hot tables and the associated `IO:XactSync` mini-spikes on the writer. Customer-facing `payload.lastTimestamp` semantics are unchanged. ([#3476](#3476)) - Replace the expensive DISTINCT query for task filter dropdowns with a dedicated TaskIdentifier registry table backed by Redis. Environments migrate automatically on their next deploy, with a transparent fallback to the legacy query for unmigrated environments. Also fixes duplicate dropdown entries when a task changes trigger source, and adds active/archived grouping for removed tasks. Moves BackgroundWorkerTask reads in the trigger hot path to the read replica. ([#3368](#3368)) - Public Access Tokens (PATs) minted before an API key rotation now keep working during the 24h grace window. `validatePublicJwtKey` falls back to any non-expired `RevokedApiKey` rows for the signing environment when the primary signature check against the env's current `apiKey` fails. The fallback query only runs on the failure path, so the hot success path is unchanged. ([#3464](#3464)) - Batch items that hit the environment queue size limit now fast-fail without retries and without creating pre-failed TaskRuns. ([#3352](#3352)) - Show the cancel button in the runs list for runs in `DEQUEUED` status. `DEQUEUED` was missing from `NON_FINAL_RUN_STATUSES` so the list hid the button even though the single run page allowed it. ([#3421](#3421)) - Reduce 5xx feedback loops on hot debounce keys by quantizing `delayUntil`, adding an unlocked fast-path skip, and gracefully handling redlock contention in `handleDebounce` so the SDK no longer retries into a herd. ([#3453](#3453)) - Fix RSS memory leak in the realtime proxy routes. `/realtime/v1/runs`, `/realtime/v1/runs/:id`, and `/realtime/v1/batches/:id` called `fetch()` into Electric with no abort signal, so when a client disconnected mid long-poll, undici kept the upstream socket open and buffered response chunks that would never be consumed — retained only in RSS, invisible to V8 heap tooling. Thread `getRequestAbortSignal()` through `RealtimeClient.streamRun/streamRuns/streamBatch` to `longPollingFetch` and cancel the upstream body in the error path. Isolated reproducer showed ~44 KB retained per leaked request; signal propagation releases it cleanly. ([#3442](#3442)) - Fix memory leak where every aborted SSE connection pinned the full request/response graph on Node 20, caused by `AbortSignal.any()` in `sse.ts` retaining its source signals indefinitely (see nodejs/node#54614, nodejs/node#55351). Also clear the `setTimeout(abort)` timer in `entry.server.tsx` so successful HTML renders don't pin the React tree for 30s per request. ([#3430](#3430)) - Preserve filters on the queues page when submitting modal actions. ([#3471](#3471)) - Fix Redis connection leak in realtime streams and broken abort signal propagation. **Redis connections**: Non-blocking methods (ingestData, appendPart, getLastChunkIndex) now share a single Redis connection instead of creating one per request. streamResponse still uses dedicated connections (required for XREAD BLOCK) but now tears them down immediately via disconnect() instead of graceful quit(), with a 15s inactivity fallback. **Abort signal**: request.signal is broken in Remix/Express due to a Node.js undici GC bug (nodejs/node#55428) that severs the signal chain when Remix clones the Request internally. Added getRequestAbortSignal() wired to Express res.on("close") via httpAsyncStorage, which fires reliably on client disconnect. All SSE/streaming routes updated to use it. ([#3399](#3399)) - Prevent dashboard crash (React error #31) when span accessory item text is not a string. Filters out malformed accessory items in SpanCodePathAccessory instead of passing objects to React as children. ([#3400](#3400)) - Upgrade Remix packages from 2.1.0 to 2.17.4 to address security vulnerabilities in React Router ([#3372](#3372)) - Fix Vercel integration settings page (remove redundant section toggles) and improve the Vercel onboarding flow so the modal closes after connecting a GitHub repo and the marketplace `next` URL is preserved across the GitHub app install redirect. ([#3424](#3424)) <details> <summary>Raw changeset output</summary> # Releases ## @trigger.dev/build@4.4.5 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.4.5` ## trigger.dev@4.4.5 ### Patch Changes - Add `--no-browser` flag to `init` and `login` to skip auto-opening the browser during authentication. Also error loudly when `init` is run without `--yes` under non-TTY stdin (previously default-and-exited silently, leaving the project half-initialized). Both commands now show an `Examples` section in `--help`. ([#3483](#3483)) - Updated dependencies: - `@trigger.dev/core@4.4.5` - `@trigger.dev/build@4.4.5` - `@trigger.dev/schema-to-json@4.4.5` ## @trigger.dev/core@4.4.5 ### Patch Changes - Add `isReplay` boolean to the run context (`ctx.run.isReplay`), derived from the existing `replayedFromTaskRunFriendlyId` database field. Defaults to `false` for backwards compatibility. ([#3454](#3454)) - Redact the `resolveWaitpoint` runtime log so it only emits `id` and `type` instead of the full completed waitpoint. Previously the log printed the entire waitpoint (including `output`) to stdout in production runs, which could leak sensitive payloads. The value returned by `wait.forToken()` is unchanged. ([#3490](#3490)) - Add `SessionId` friendly ID generator and schemas for the new durable Session primitive. Exported from `@trigger.dev/core/v3/isomorphic` alongside `RunId`, `BatchId`, etc. Ships the `CreateSessionStreamWaitpoint` request/response schemas alongside the main Session CRUD. ([#3417](#3417)) - Truncate large error stacks and messages to prevent OOM crashes. Stack traces are capped at 50 frames (keeping top 5 + bottom 45 with an omission notice), individual stack lines at 1024 chars, and error messages at 1000 chars. Applied in parseError, sanitizeError, and OTel span recording. ([#3405](#3405)) ## @trigger.dev/python@4.4.5 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.4.5` - `@trigger.dev/build@4.4.5` - `@trigger.dev/sdk@4.4.5` ## @trigger.dev/react-hooks@4.4.5 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.4.5` ## @trigger.dev/redis-worker@4.4.5 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.4.5` ## @trigger.dev/rsc@4.4.5 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.4.5` ## @trigger.dev/schema-to-json@4.4.5 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.4.5` ## @trigger.dev/sdk@4.4.5 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.4.5` </details> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
What this gives Trigger.dev users
A new first-class primitive, Session, for durable, task-bound, bidirectional I/O that outlives any single run. Sessions are the run manager for
chat.agentgoing forward, and they unblock anything else that needs "one identifier, many runs over time" with a stable channel pair the client can write to and subscribe to.Use cases unblocked
chatIdviaexternalId), turns 1..N attach to the same Session, the UI subscribes once and keeps receiving output as new runs take over..in, the client writes to.in, the server enforces no-writes-after-close..outafter the task finishes to replay history.How it works (Session-as-run-manager)
A Session row is task-bound (
taskIdentifier+triggerConfigare required) and owns its current run viacurrentRunId+currentRunVersionfor optimistic claim. Three trigger paths:POST /api/v1/sessionscreates the row and triggers the first run synchronously.POST /realtime/v1/sessions/:session/in/appendchecks if the current run is alive; if it has terminated (idle exit, crash, etc.), the server triggers a new run before processing the append.POST /api/v1/sessions/:session/end-and-continue, called by the running agent, triggers a fresh run and atomically swapscurrentRunId. Used bychat.requestUpgrade()for version handoffs.Every triggered run is recorded in the
SessionRunaudit table with a reason (initial,continuation,upgrade,manual).Public API surface
Control plane
POST /api/v1/sessions— create. Idempotent on(env, externalId). Triggers the first run, returns the session and a session-scoped public access token. Returns 409 if the upserted row is already closed.GET /api/v1/sessions/:session— retrieve by friendlyId (session_abc...) or by your own externalId (server disambiguates by prefix).GET /api/v1/sessions— list with filters (type,tag,taskIdentifier,externalId, derivedstatusACTIVE/CLOSED/EXPIRED, created-at range) and cursor pagination. Backed by ClickHouse.PATCH /api/v1/sessions/:session— update tags / metadata / externalId.POST /api/v1/sessions/:session/close— terminate. Idempotent, hard-blocks new server-brokered writes.POST /api/v1/sessions/:session/end-and-continue— agent-only handoff to a fresh run.Realtime
PUT /realtime/v1/sessions/:session/:io— initialize a channel. Returns S2 credentials in headers so high-throughput clients can write direct to S2.GET /realtime/v1/sessions/:session/:io— SSE subscribe. Supports Last-Event-ID resume and an opt-inX-Peek-Settled: 1header that fast-closes the stream when the upstream is already settled (trigger:turn-complete), eliminating long-poll wait on reconnect-on-reload paths.POST /realtime/v1/sessions/:session/:io/append— server-side appends.POST /api/v1/runs/:runFriendlyId/session-streams/wait— runs wait on a session stream as a waitpoint, with a race-check to avoid suspending if data already landed.Auth scopes
sessionsis a new resource type.read:sessions:{id},write:sessions:{id},admin:sessions:{id}flow through the existing JWT validator. Session-scoped public access tokens minted by the server replace browser-held trigger-task tokens for chat-style flows — the browser never sees a run identifier or a run-scoped token in steady state.What's coming after this PR
@trigger.dev/sdkprerelease alongside this server deploy. Customers using the prereleasechat.agentwill follow the upgrade guide.Implementation notes
Sessiontable: scalar scoping columns (projectId,runtimeEnvironmentId,environmentType,organizationId) without FKs, matching the January TaskRun FK-removal decision. Point-lookup indexes only — list queries go to ClickHouse. Terminal markers (closedAt,expiresAt) are write-once.sessions_v1: ReplacingMergeTree, partitioned by month, ordered by(org_id, project_id, environment_id, created_at, session_id). Tags indexed viatokenbf_v1skip index.SessionsReplicationService: mirrorsRunsReplicationServiceexactly — leader-locked logical replication consumer,ConcurrentFlushScheduler, retry with exponential backoff + jitter, identical metric shape. Dedicated slot + publication so the two consume independently.sessions/{addressingKey}/{out|in}. The existingruns/{runId}/{streamId}key format for run-scoped streams is untouched.ensureRunForSessiontriggers a run upfront (cheap to cancel if it loses the race), then attempts anupdateManykeyed oncurrentRunVersion. Loser cancels its triggered run and reuses the winner's. No DB lock held across the trigger.What did NOT change
Run-scoped
streams.pipe/streams.inputand the existing/realtime/v1/streams/{runId}/...routes are unchanged. Sessions are net-new — not a reshaping of the current streams API.Deploy notes
SESSION_REPLICATION_CLICKHOUSE_URLandSESSION_REPLICATION_ENABLED=1to enable the replication consumer.Sessiontable needsREPLICA IDENTITY FULLset on the prod source DB before the publication is created (same one-time DDL we did forTaskRun). Required for delete events to carry full column values.GET /api/v1/sessions/:sessionloader (a JWT minted for either form authorizes both URL forms). Action routes are URL-form-specific, matching how the SDK mints PATs.Verification
apps/webapp/test/sessionsReplicationService.test.ts— round-trip tests for insert/update/delete through Postgres logical replication into ClickHouse via testcontainers..out.initialize+.out.appendx2 +.in.send+.out.subscribeover SSE, list with all filter combinations + pagination,end-and-continueswap,X-Peek-Settledfast-close (verified in browser via reconnect-on-reload and via curl). Replicated row lands in ClickHouse within ~1s.prismawriter, info-leak on auth-routes masked as 403, peek-settled discriminator parsing fix, etc.).Test plan
pnpm run typecheck --filter webapppnpm run test --filter webapp ./test/sessionsReplicationService.test.ts --runSESSION_REPLICATION_CLICKHOUSE_URLandSESSION_REPLICATION_ENABLED=1. Confirm the slot and publication auto-create on boot.POST /api/v1/sessionsand verify the row replicates totrigger_dev.sessions_v1within a couple of seconds.POST /api/v1/sessions/:id/close, then confirmPOST /realtime/v1/sessions/:id/out/appendreturns 400.externalIdonPOST /api/v1/sessionsand confirm 409.GET /realtime/v1/sessions/:id/outwithX-Peek-Settled: 1after a turn completes and confirmX-Session-Settled: trueresponse header + immediate close.