feat(dbos): pod dispatch-role split (api/worker) via listenQueues#3937
Open
pedrofrxncx wants to merge 2 commits into
…eues

Lets one image run as api-only or worker-only so request-serving and agent-loop execution scale as independent Deployments off the same DB/auth.

- MESH_DISPATCH_ROLE=all (default, unchanged) | worker (dequeue only the agent/automation run queues) | api (dequeue nothing; serve HTTP + enqueue).
- Wires DBOSConfig.listenQueues from the role. 'all' omits it → identical to today, so it's opt-in and safe.
- Queue names moved to a side-effect-free dispatch-queue/queue-names module so index.ts can read them before DBOS.setConfig (which must precede workflow registration); existing consumers keep their import paths via re-export.

Scheduled (cron) workflows + enqueueing run on every pod and stay exactly-once via DBOS's row-locked schedule, so an api pod can fire a cron that a worker runs. Requires >=1 worker/all pod or runs never dispatch.
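The role-to-queues mapping described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function names are hypothetical, the queue names are assumed from the description's "thread-gate/automations run queues", and representing "dequeue nothing" as an empty list (and rejecting unknown values) is an assumption. How the result is handed to DBOSConfig.listenQueues is left to the real implementation.

```typescript
// Hypothetical sketch: map MESH_DISPATCH_ROLE to the queues a pod should listen on.
type DispatchRole = "all" | "worker" | "api";

// Assumed queue names; the real constants live in dispatch-queue/queue-names.
const RUN_QUEUE_NAMES = ["thread-gate", "automations"] as const;

// Unset or empty means "all", matching the chart default of "".
// Throwing on unknown values is a design choice of this sketch.
function parseDispatchRole(raw: string | undefined): DispatchRole {
  const value = (raw ?? "").trim().toLowerCase();
  if (value === "" || value === "all") return "all";
  if (value === "worker" || value === "api") return value;
  throw new Error(`Unknown MESH_DISPATCH_ROLE: ${raw}`);
}

// undefined = omit listenQueues entirely (today's behaviour).
// [] for "api" is an assumption about how "dequeue nothing" is expressed.
function listenQueuesFor(role: DispatchRole): string[] | undefined {
  switch (role) {
    case "all":
      return undefined;
    case "worker":
      return [...RUN_QUEUE_NAMES];
    case "api":
      return [];
  }
}
```

A caller would read the environment variable once at startup, for example `listenQueuesFor(parseDispatchRole(process.env.MESH_DISPATCH_ROLE))`, before calling DBOS.setConfig.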
One release / one Argo app: set worker.enabled=true to render a second Deployment (<fullname>-worker, MESH_DISPATCH_ROLE=worker) that runs only the agent/automation run queues, with its own CPU HPA.

Workers carry a distinct app.kubernetes.io/name (<name>-worker), so the main Deployment selector AND the Service selector don't match them: workers stay OFF the load balancer (no HTTP ever routed to a busy worker) and receive work only by pulling the DBOS queue. /stream is served by the api pods (NATS tail); kubelet health probes hit pods directly, so workers stay live off-Service.

dispatchRole sets the MAIN deployment's role (default ""=all, unchanged; set "api" once workers exist).

Verified: worker.enabled=false renders byte-identical to before (zero impact); helm lint clean.
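To illustrate why a distinct app.kubernetes.io/name keeps workers off the Service, here is a small sketch of equality-based label matching (a selector matches a pod only if every selector key/value pair is present in the pod's labels). The label values "myapp" and "prod", and the presence of an app.kubernetes.io/instance label, are hypothetical; the chart's real selector labels may differ.

```typescript
// Sketch of equality-based label selection, as used by Service and Deployment selectors.
type Labels = Record<string, string>;

// A pod is selected only if it carries every key/value pair in the selector.
function selectorMatches(matchLabels: Labels, podLabels: Labels): boolean {
  return Object.entries(matchLabels).every(([key, value]) => podLabels[key] === value);
}

const NAME = "app.kubernetes.io/name";
const INSTANCE = "app.kubernetes.io/instance";

// Hypothetical selector shared by the main Deployment and the Service.
const mainSelector: Labels = { [NAME]: "myapp", [INSTANCE]: "prod" };

// Hypothetical pod labels: workers share the instance but not the name.
const apiPodLabels: Labels = { [NAME]: "myapp", [INSTANCE]: "prod" };
const workerPodLabels: Labels = { [NAME]: "myapp-worker", [INSTANCE]: "prod" };

console.log(selectorMatches(mainSelector, apiPodLabels)); // true
console.log(selectorMatches(mainSelector, workerPodLabels)); // false
```

Because the worker labels fail the name check, worker pods never become Service endpoints under this model, even though they share the release instance label.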
What
Pod dispatch-role split so request-serving and agent-loop execution scale as independent pods off one image, one Helm release, one Argo app.
- App (MESH_DISPATCH_ROLE): all (default, unchanged) · worker (dequeue only the thread-gate/automations run queues) · api (dequeue nothing; serve HTTP + enqueue). Wired via DBOSConfig.listenQueues; all omits it → identical to today.
- Chart (worker.enabled): renders a second <fullname>-worker Deployment + CPU HPA, same image/DB/auth. dispatchRole sets the main deployment's role.
How the LB can't hit a busy worker
Worker pods carry a distinct app.kubernetes.io/name (<name>-worker), so neither the main Deployment selector nor the Service selector matches them. Workers are never Service endpoints → the LB has no path to them; they receive work only by pulling the DBOS queue. /stream is served by the api pods (which tail NATS for chunks the workers published). Kubelet health probes hit pods directly, so workers stay live despite being off-Service.
Safety
- worker.enabled=false (default) renders byte-identical to the current chart (verified with helm template diff): zero impact until you opt in.
- The distinct -worker name means no change to the main Deployment's immutable selector and no Service change → no recreate.
- dispatchRole is an env-only change (rolling, no recreate).
- Requires at least one worker (or all) pod, or runs never dispatch.
Rollout
1. worker.enabled=true → workers come up; main still all, nothing breaks.
2. dispatchRole=api → main pods stop running the agent loop.
Rollback = unset both.
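The ordering of the rollout follows from the "at least one dequeuing pod" requirement. The sketch below models each stage and checks that invariant; it is only an illustration, with hypothetical type names and replica counts, and it assumes the worker Deployment always runs with the worker role.

```typescript
// Sketch: check that some pod can dequeue runs at every rollout stage.
type DispatchRole = "all" | "worker" | "api";

interface Stage {
  label: string;
  mainRole: DispatchRole; // role of the main Deployment (dispatchRole)
  mainReplicas: number; // hypothetical replica counts
  workerReplicas: number; // 0 when worker.enabled=false
}

// Runs dispatch if a main pod dequeues (any role except api) or a worker pod exists.
function canDispatch(stage: Stage): boolean {
  const mainDequeues = stage.mainRole !== "api" && stage.mainReplicas > 0;
  return mainDequeues || stage.workerReplicas > 0;
}

const stages: Stage[] = [
  { label: "baseline", mainRole: "all", mainReplicas: 2, workerReplicas: 0 },
  { label: "step 1: worker.enabled=true", mainRole: "all", mainReplicas: 2, workerReplicas: 2 },
  { label: "step 2: dispatchRole=api", mainRole: "api", mainReplicas: 2, workerReplicas: 2 },
  // Misordered rollout: api role set before any worker exists.
  { label: "misordered: api without workers", mainRole: "api", mainReplicas: 2, workerReplicas: 0 },
];

for (const stage of stages) {
  console.log(`${stage.label}: ${canDispatch(stage) ? "runs dispatch" : "runs never dispatch"}`);
}
```

Only the misordered stage fails the check, which is why workers are enabled first and the main role is switched to api second.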
Validation
tsc clean · bun test dispatch-queue automations 31/0 · helm lint clean · helm template worker-off == baseline (byte-identical) · worker-on renders main=api / worker=worker with disjoint selectors + worker HPA.
Why
Profiling showed studio's under-load CPU is streaming throughput (NATS/socket I/O + AI-SDK parsing), not GC/idle-polling/Ajv. Scaling studio replicas to absorb it also multiplies the heavy per-pod footprint (DB pool, DBOS executor + queue polling). This lets the agent-loop workers scale on CPU independently of the request tier.