
feat(dbos): pod dispatch-role split (api/worker) via listenQueues #3937

Open
pedrofrxncx wants to merge 2 commits into main from feat/dbos-queue-role-split

pedrofrxncx commented Jun 16, 2026

What

Pod dispatch-role split so request-serving and agent-loop execution scale as independent pods off one image, one Helm release, one Argo app.

App (MESH_DISPATCH_ROLE): all (default, unchanged) · worker (dequeue only the thread-gate/automations run queues) · api (dequeue nothing; serve HTTP + enqueue). The role is wired via DBOSConfig.listenQueues; the all role omits the field → identical to today.
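The role-to-queue mapping described above could look roughly like the sketch below. Only the three role names and the listenQueues field come from this PR; the queue names, the helper names, and the idea that an empty list means "dequeue nothing" are assumptions for illustration, and the real DBOS field may expect queue objects rather than plain names.

```typescript
// Sketch only: role names are from the PR; everything else here is hypothetical.
type DispatchRole = "all" | "worker" | "api";

// Hypothetical names for the thread-gate/automations run queues.
const RUN_QUEUE_NAMES: readonly string[] = ["thread-gate", "automations"];

// Unset or empty MESH_DISPATCH_ROLE falls back to "all", keeping today's behavior.
function parseDispatchRole(raw: string | undefined): DispatchRole {
  const value = (raw ?? "").trim().toLowerCase();
  if (value === "" || value === "all") return "all";
  if (value === "worker" || value === "api") return value;
  throw new Error(`Unsupported MESH_DISPATCH_ROLE: ${JSON.stringify(raw)}`);
}

// Returns the fragment to spread into the DBOS config object.
function listenQueuesConfig(role: DispatchRole): { listenQueues?: string[] } {
  switch (role) {
    case "all":
      return {}; // field omitted entirely, so the config is unchanged
    case "worker":
      return { listenQueues: [...RUN_QUEUE_NAMES] }; // only the run queues
    case "api":
      return { listenQueues: [] }; // assumed to mean "listen to no queues"
  }
}
```

Returning an empty object for all (instead of an explicit full list) is what keeps the default path identical to a build without this change.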

Chart (worker.enabled): renders a second <fullname>-worker Deployment + CPU HPA, same image/DB/auth. dispatchRole sets the main deployment's role.

How the LB can't hit a busy worker

Worker pods carry a distinct app.kubernetes.io/name (<name>-worker), so neither the main Deployment selector nor the Service selector matches them. Workers are never Service endpoints → the LB has no path to them; they receive work only by pulling from the DBOS queue. /stream is served by the api pods, which tail NATS for the chunks that workers publish. Kubelet health probes hit pods directly, so workers stay live despite being off-Service.
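To make the selector argument concrete, here is a small model of equality-based label matching, which is how Deployment and Service selectors pick pods. The label values (studio, studio-worker, prod) and the function name are hypothetical and not taken from the chart.

```typescript
type Labels = Record<string, string>;

// A pod matches when every key in the selector is present on the pod
// with the same value; extra pod labels are ignored.
function selectorMatches(selector: Labels, podLabels: Labels): boolean {
  return Object.entries(selector).every(([key, value]) => podLabels[key] === value);
}

// Hypothetical selector shared by the main Deployment and the Service.
const serviceSelector: Labels = {
  "app.kubernetes.io/name": "studio",
  "app.kubernetes.io/instance": "prod",
};

// Hypothetical pod label sets for the two Deployments.
const apiPodLabels: Labels = {
  "app.kubernetes.io/name": "studio",
  "app.kubernetes.io/instance": "prod",
  "pod-template-hash": "abc123",
};
const workerPodLabels: Labels = {
  "app.kubernetes.io/name": "studio-worker",
  "app.kubernetes.io/instance": "prod",
  "pod-template-hash": "def456",
};

const apiIsEndpoint = selectorMatches(serviceSelector, apiPodLabels);
const workerIsEndpoint = selectorMatches(serviceSelector, workerPodLabels);
```

Because the name label differs, the worker pod fails the match on a key the Service already selects on, so no Service or selector change is needed to keep workers off the load balancer.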

Safety

  • worker.enabled=false (default) renders byte-identical to the current chart (verified with helm template diff) — zero impact until you opt in.
  • Distinct -worker name means no change to the main Deployment's immutable selector and no Service change → no recreate.
  • dispatchRole is an env-only change (rolling, no recreate).
  • Requires ≥1 worker (or all) pod running; otherwise runs are enqueued but never dispatched.

Rollout

  1. worker.enabled=true → workers come up; main still all, nothing breaks.
  2. Confirm workers dequeue + a real decopilot run executes.
  3. dispatchRole=api → main pods stop running the agent loop.

Rollback: unset both worker.enabled and dispatchRole.
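The ordering of the rollout steps follows from the "≥1 worker (or all) pod" requirement. A rough way to check that each intermediate state still has something dequeuing is sketched below; the ChartValues shape and function name are hypothetical simplifications of the chart values, and the check assumes every enabled Deployment has at least one ready replica.

```typescript
// Hypothetical, simplified view of the two chart values involved.
interface ChartValues {
  workerEnabled: boolean;            // worker.enabled
  dispatchRole: "" | "all" | "api";  // role of the main Deployment ("" means all)
}

// True when at least one Deployment in the release dequeues runs.
function hasDequeuingPods(values: ChartValues): boolean {
  const mainDequeues = values.dispatchRole !== "api";
  return mainDequeues || values.workerEnabled;
}

// The states the release passes through in the rollout and rollback above.
const baseline: ChartValues = { workerEnabled: false, dispatchRole: "" };
const step1: ChartValues = { workerEnabled: true, dispatchRole: "" };
const step3: ChartValues = { workerEnabled: true, dispatchRole: "api" };
const rollback: ChartValues = { workerEnabled: false, dispatchRole: "" };

// The combination to avoid: api-only main pods with no workers.
const broken: ChartValues = { workerEnabled: false, dispatchRole: "api" };

const rolloutIsSafe = [baseline, step1, step3, rollback].every(hasDequeuingPods);
const brokenDispatches = hasDequeuingPods(broken);
```

Enabling workers before switching the main role is what keeps every intermediate state safe; reversing the two steps would pass through the broken combination.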

Validation

tsc clean · bun test dispatch-queue automations: 31 pass / 0 fail · helm lint clean · helm template worker-off == baseline (byte-identical) · worker-on renders main=api / worker=worker with disjoint selectors + worker HPA.

Why

Profiling showed studio's under-load CPU is streaming throughput (NATS/socket I/O + AI-SDK parsing), not GC/idle-polling/Ajv. Scaling studio replicas to absorb it also multiplies the heavy per-pod footprint (DB pool, DBOS executor + queue polling). This lets the agent-loop workers scale on CPU independently of the request tier.

…eues

Lets one image run as api-only or worker-only so request-serving and
agent-loop execution scale as independent Deployments off the same DB/auth.

- MESH_DISPATCH_ROLE=all (default, unchanged) | worker (dequeue only the
  agent/automation run queues) | api (dequeue nothing; serve HTTP + enqueue).
- Wires DBOSConfig.listenQueues from the role. 'all' omits it → identical to
  today, so it's opt-in and safe.
- Queue names moved to a side-effect-free dispatch-queue/queue-names module so
  index.ts can read them before DBOS.setConfig (which must precede workflow
  registration); existing consumers keep their import paths via re-export.

Scheduled (cron) workflows + enqueueing run on every pod and stay exactly-once
via DBOS's row-locked schedule, so an api pod can fire a cron a worker runs.
Requires >=1 worker/all pod or runs never dispatch.
One release / one Argo app: set worker.enabled=true to render a second
Deployment (<fullname>-worker, MESH_DISPATCH_ROLE=worker) that runs only the
agent/automation run queues, with its own CPU HPA.

Workers carry a distinct app.kubernetes.io/name (<name>-worker), so the main
Deployment selector AND the Service selector don't match them — workers stay
OFF the load balancer (no HTTP ever routed to a busy worker) and receive work
only by pulling the DBOS queue. /stream is served by the api pods (NATS tail);
kubelet health probes hit pods directly, so workers stay live off-Service.

dispatchRole sets the MAIN deployment's role (default ""=all, unchanged; set
"api" once workers exist). Verified: worker.enabled=false renders byte-identical
to before (zero impact); helm lint clean.