Add core NATS tunnel transport + optimistic link presence #3854

Open
tlgimenes wants to merge 67 commits into main from tlgimenes/dbos-primitives-overview

tlgimenes commented Jun 12, 2026

Contributor

What is this contribution about?

Two related changes to the Studio ↔ desktop-link path.

1. @decocms/tunnel — fetch-like HTTP over core NATS with streaming request/response bodies. Studio and the link daemon connect through POST /api/links/session using scoped, short-lived NATS credentials; JetStream is kept out of the tunnel transport.

2. Optimistic link presence — replaces the old heartbeat / KV-claim presence (which produced "CLI running but UI says offline" false-negatives) with a live probe:

  • The daemon serves GET /api/links/status over the tunnel ({ hostname, capabilities, cliVersion }).
  • The frontend polls it (via LINK_CURRENT_GET / /api/links/me, ~5s) to drive the desktop indicator, feature-gating, and the sandboxProviderKind it sends.
  • The backend is optimistic — no liveness gate on dispatch. resolveDispatchTarget just normalizes the kind, POST /messages no longer 409s on an offline desktop, and a work-publish that can't reach the daemon fails the run (forceFailIfInProgress) instead of hanging.
  • Deleted: the presence heartbeat loop, TunnelPresenceSubscriber, the studio_links KV claim registry, resolve-default-provider-kind, and the links.presence.* publish + its credential grant. A tunnel inter-frame idle timeout replaces the claim-watch as the in-flight abort.
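
The live probe described above can be sketched as a single check that the frontend would repeat on its ~5s interval. This is only an illustration under stated assumptions: the payload field names come from the bullet list, but their types, the `Probe` callback (a stand-in for whatever performs `GET /api/links/status` over the tunnel), the `checkPresence` name, and the 2s timeout are all hypothetical.

```typescript
// Field names are taken from the PR description; the types are assumptions.
interface LinkStatus {
  hostname: string;
  capabilities: string[];
  cliVersion: string;
}

type Presence =
  | { online: true; status: LinkStatus }
  | { online: false; reason: string };

// Hypothetical stand-in for the call that does GET /api/links/status over the tunnel.
type Probe = (signal: AbortSignal) => Promise<LinkStatus>;

// One probe attempt: any error or timeout is reported as "offline".
async function checkPresence(probe: Probe, timeoutMs = 2000): Promise<Presence> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // let the probe cancel its in-flight request
      reject(new Error("probe_timeout"));
    }, timeoutMs);
  });
  try {
    const status = await Promise.race([probe(controller.signal), timeout]);
    return { online: true, status };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { online: false, reason };
  } finally {
    clearTimeout(timer);
  }
}
```

Because presence is derived from the most recent probe rather than from stored heartbeat state, a running daemon cannot be shown as offline for longer than one polling interval plus the timeout.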

Design + plan: docs/superpowers/specs/2026-06-12-link-presence-tunnel-status-design.md.

How to Test

  1. bun run fmt && bun run check && bun run lint
  2. bun test apps/mesh/src/links apps/mesh/src/link-daemon apps/mesh/src/tools/links apps/mesh/src/sandbox packages/tunnel/src
  3. E2E (apps/mesh/e2e/tests/link-tunnel.spec.ts): indicator flips online/offline via the live probe; an offline user-desktop send is accepted (202) and the run settles failed (no 409).

Migration Notes

  • Public tunnel deploys: set tunnel.nats.publicUrl, tunnel.nats.publicEnabled, tunnel.nats.sessionTtlSeconds, plus NATS_OPERATOR_JWT / NATS_ACCOUNT_JWT / NATS_ACCOUNT_SIGNING_KEY to mint daemon sessions; expose NATS websockets in the NATS subchart.
  • Behavior change: an offline desktop is now surfaced via the frontend probe (compose disabled) + an optimistic fail-fast run error — not a 409 on POST /messages.
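
As a rough illustration of the first migration note, a startup check could validate the settings before any session is minted. The environment variable names are the ones the summary below maps the Helm values to; the function name, the `"true"` flag encoding, and the choice to return `null` when the tunnel is disabled are assumptions, not the PR's actual code.

```typescript
interface TunnelNatsConfig {
  publicUrl: string;
  sessionTtlSeconds: number;
  operatorJwt: string;
  accountJwt: string;
  accountSigningKey: string;
}

type Env = Record<string, string | undefined>;

function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (!value) {
    throw new Error(`${name} is required when the public tunnel is enabled`);
  }
  return value;
}

// Hypothetical helper: returns null when the public tunnel is disabled.
function readTunnelNatsConfig(env: Env): TunnelNatsConfig | null {
  // Assumption: the flag is encoded as the literal string "true".
  if (env.NATS_TUNNEL_PUBLIC_ENABLED !== "true") return null;

  const ttl = Number(requireVar(env, "NATS_TUNNEL_SESSION_TTL_SECONDS"));
  if (!Number.isInteger(ttl) || ttl <= 0) {
    throw new Error("NATS_TUNNEL_SESSION_TTL_SECONDS must be a positive integer");
  }
  return {
    publicUrl: requireVar(env, "NATS_PUBLIC_URL"),
    sessionTtlSeconds: ttl,
    operatorJwt: requireVar(env, "NATS_OPERATOR_JWT"),
    accountJwt: requireVar(env, "NATS_ACCOUNT_JWT"),
    accountSigningKey: requireVar(env, "NATS_ACCOUNT_SIGNING_KEY"),
  };
}
```

Failing at startup on a missing credential is one possible design; the alternative is to fail only when a session is requested.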

Review Checklist

  • PR title is clear and descriptive
  • Changes are tested and working
  • Documentation is updated (if needed)
  • No breaking changes

Summary by cubic

Moves Studio↔desktop-link traffic to an HTTP-over-NATS tunnel with streaming bodies and a live status probe, and routes run IO through a durable, seq-deduped stream processed by a leader‑elected projector. Also fixes a getPodId() import/scoping regression in projector/heartbeat wiring.

  • New Features

    • Added @decocms/tunnel (fetch over NATS) and @nats-io/jwt.
    • New POST /api/links/session mints host‑scoped tunnel credentials or a token; returns connection.urls, credentials/token, expiresAt, and tunnelHostname (503 when disabled).
    • Live presence via daemon GET /api/links/status over the tunnel; backs /api/links/me and LINK_CURRENT_GET; the web app polls ~5s.
    • Optimistic desktop dispatch: /messages no longer 409‑gates on liveness; if a tunnel publish can’t reach the daemon, the thread gate fails the run. LINK_DISCONNECT just sends a shutdown frame.
    • Shared ingest pipeline: ingestRun publishes raw chunks to DECOPILOT_STREAMS with Nats-Msg-Id = ${runId}:${fenceToken}:${seq}, seq‑dedups replays, and drives hooks only; mintRunFenceToken() isolates turns; the projector keys accumulators by (runId, fenceToken).
    • Stream + projector hardening: DECOPILOT_STREAMS is file‑backed with a 30‑min retention SLA and a 2‑min dedup window; a single‑active projector (pod‑heartbeat leader election) writes threads.title and terminal status, marks zero‑part runs failed, and exports lag and poison‑run metrics.
  • Migration

    • Helm: set tunnel.nats.publicUrl, tunnel.nats.publicEnabled, and tunnel.nats.sessionTtlSeconds (populates NATS_PUBLIC_URL, NATS_TUNNEL_PUBLIC_ENABLED, NATS_TUNNEL_SESSION_TTL_SECONDS).
    • Provide NATS credentials for daemon sessions: NATS_OPERATOR_JWT, NATS_ACCOUNT_JWT, NATS_ACCOUNT_SIGNING_KEY.
    • Expose NATS WebSockets in the NATS subchart to match publicUrl.
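
The ingest bullets above combine two ideas: a deterministic `Nats-Msg-Id` per chunk and seq-based dedup keyed by `(runId, fenceToken)`. The sketch below shows both in memory. Only the ID format is taken from the summary; the `RunAccumulators` class, the string type for `fenceToken`, and the assumption that `seq` increases monotonically within a turn are illustrative and may not match the projector's real implementation.

```typescript
// Builds the message ID in the format quoted in the summary.
function natsMsgId(runId: string, fenceToken: string, seq: number): string {
  return `${runId}:${fenceToken}:${seq}`;
}

// Hypothetical in-memory accumulator keyed by (runId, fenceToken).
class RunAccumulators {
  private readonly lastSeq = new Map<string, number>();
  private readonly chunks = new Map<string, string[]>();

  // Returns true if the chunk was applied, false if it was a replay.
  // Assumes seq increases monotonically within one (runId, fenceToken) turn.
  apply(runId: string, fenceToken: string, seq: number, chunk: string): boolean {
    const key = `${runId}:${fenceToken}`;
    const last = this.lastSeq.get(key) ?? -1;
    if (seq <= last) return false; // duplicate or replayed delivery
    this.lastSeq.set(key, seq);
    const list = this.chunks.get(key) ?? [];
    list.push(chunk);
    this.chunks.set(key, list);
    return true;
  }

  parts(runId: string, fenceToken: string): string[] {
    return this.chunks.get(`${runId}:${fenceToken}`) ?? [];
  }
}

// Usage: a replayed seq is dropped, and a new fence token starts a fresh turn.
const acc = new RunAccumulators();
acc.apply("run-1", "fence-a", 0, "hello");
acc.apply("run-1", "fence-a", 0, "hello"); // replay, ignored
acc.apply("run-1", "fence-b", 0, "retry"); // separate accumulator
```

Keying by the fence token as well as the run ID means chunks from a stale turn cannot be mixed into a newer one, even if both use the same sequence numbers.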

Written for commit 508740b. Summary will update on new commits.

tlgimenes force-pushed the tlgimenes/dbos-primitives-overview branch 2 times, most recently from 80bfca6 to 7161b4d on June 12, 2026 19:06
tlgimenes force-pushed the tlgimenes/dbos-primitives-overview branch from 9600332 to c05c8cf on June 12, 2026 19:47
tlgimenes and others added 4 commits June 12, 2026 21:17
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… gate)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…esence

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… the daemon

Optimistic dispatch publishes the work item without a pre-flight liveness
gate; when the daemon is offline the publish throws tunnel_no_first_frame.
Previously that propagated and DBOS retried it, leaving the run stuck
in_progress. Now the thread gate self-fails the run (forceFailIfInProgress)
so it settles terminal — surfacing the error the way the e2e expects.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
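
The fail-fast behavior in the commit message above can be sketched as follows. The name `forceFailIfInProgress` comes from the PR, but its signature here, the `Map`-based run store, and the `dispatchOptimistically` wrapper are assumptions made purely for illustration.

```typescript
type RunStatus = "in_progress" | "failed" | "completed";

// Hypothetical stand-in for the thread gate's forceFailIfInProgress:
// only a run that is still in progress is moved to "failed".
function forceFailIfInProgress(store: Map<string, RunStatus>, runId: string): boolean {
  if (store.get(runId) !== "in_progress") return false;
  store.set(runId, "failed");
  return true;
}

// Publishes without a pre-flight liveness check. A publish error is not
// rethrown, so a retrying workflow layer has nothing to retry and the run
// settles in a terminal state instead of staying in_progress.
async function dispatchOptimistically(
  store: Map<string, RunStatus>,
  runId: string,
  publish: () => Promise<void>,
): Promise<RunStatus> {
  store.set(runId, "in_progress");
  try {
    await publish();
  } catch {
    forceFailIfInProgress(store, runId);
  }
  return store.get(runId) ?? "failed";
}
```

Guarding on the current status keeps the force-fail from overwriting a run that has already completed.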
tlgimenes changed the title from "Add core NATS tunnel transport" to "Add core NATS tunnel transport + optimistic link presence" Jun 13, 2026
tlgimenes and others added 25 commits June 14, 2026 09:27
…ives-overview

# Conflicts:
#	apps/mesh/src/link-daemon/cluster-connection-pull.test.ts
…ives-overview

# Conflicts:
#	apps/mesh/src/api/routes/decopilot/dispatch-run.ts
#	apps/mesh/src/settings/resolve-config.ts
…ives-overview

# Conflicts:
#	apps/mesh/src/link-daemon/handle-local-dispatch.ts
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rts)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… + SLA retention

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… + SLA retention

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…umer

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ith status writes

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…only)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…arkers

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ns/epochs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ector flag

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…or flag

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ives-overview

# Conflicts:
#	apps/mesh/src/api/app.ts
#	apps/mesh/src/api/routes/decopilot/orphan-recovery.ts
#	bun.lock
…of scope)

The post-merge re-added pod-heartbeat construction + projector leadership
referenced POD_ID in closures that don't enclose its declaration, causing
TS2304 in a clean build (a stale tsbuildinfo masked it locally). Call the
module-scoped getPodId() directly at those sites.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
origin/main #3917 removed the getPodId import + POD_ID const from app.ts;
the clean auto-merge applied that removal while keeping the projector/
heartbeat getPodId() call sites, leaving the name undeclared (TS2304 in
the CI test-merge). Re-add the module-level import.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>