DAR-345: one active model per machine via coordinator model pools #435
Gajesh2007 wants to merge 1 commit into
…model pools
Enforce exactly one active public model per provider, partitioning the fleet by
demand. A Mac has one shared GPU/unified-memory/bandwidth budget; co-resident
models split it and let a slow/hot model (gemma, ~2.6 tps p50) drag a healthy one
(gpt-oss, ~57 tps) under the TTFT bar. Prod telemetry (Jun 17-20): gpt-oss ttft_429
collapsed during gemma co-residency and recovered to 0% once gemma was re-shed,
while 526K gemma requests were turned away by the blunt global model_shed. Pools
replace that with surgical per-machine isolation so gemma can be safely un-shed
into a bounded, concurrency-capped pool without touching gpt-oss.
All enforcement is DEFAULT-OFF (atomic assignmentGateEnabled + WarmPool
PlacementEnabled/PlacementEnforce) — inert until enabled, staged rollout
(shadow -> provider assign -> static pools+gate -> pool_exhausted 429 -> dynamic).
Coordinator (Go):
- protocol: assign_model / assign_model_status messages (epoch + draining/loading/
succeeded/failed); ProviderSwitchingModel transient marker.
- registry: Provider.Assigned{Model,Epoch,State,At}; SendAssignModel;
AssignProviderModel (monotonic epoch); ApplyAssignModelStatus (epoch-guarded;
failed -> isolated + dispatch cooldown); PoolExhausted (mirrors gate eligibility,
counts unmanaged machines so a mixed-fleet rollout never false-429s).
- scheduler: one isolation gate in providerPassesRoutingGatesLockedEx (the shared
selection/queue-drain/preflight/admit chokepoint) -> no-spillover is structural;
self-route owners bypass.
- placement_controller: pure planPlacement allocator (floor-then-priority
normalization, surplus->deficit switching, anti-thrash via MinDwell /
MaxGlobalPendingLoads / MaxLoadsPerTick / cooldown, unmanaged-source preference),
shadow vs enforce, extends the dormant warm-pool tick.
- admission: shedIfPoolExhausted 429 (uptime-neutral, could-have-served) in BOTH
chat + responses handlers; provider read loop applies assign_model_status;
version gate providerSupportsModelAssignment (>= 0.6.18, fail-closed); pool
transition strings classified capacity-class (no false breaker trips).
- observability: ModelPoolReport (assigned pool sizes, per-provider assignment,
co-residency audit) on /v1/admin/utilization.
Provider (Swift, protocol-symmetric):
- assign_model / assign_model_status types + encode/decode + event/codec wiring.
- handleAssignModelRequest: drain (waitForInflightDrain) -> unload every other
model -> load+warm (ensureModelLoaded) -> status, epoch-guarded, run off the
event loop in a cancellable task. Refuse-don't-swap at the top of
ensureModelLoaded (a managed machine refuses any non-assigned model instead of
LRU-swapping). ProviderCore.version -> 0.6.18 (matches the coordinator gate).
Tests: Go allocator (9 cases), assignment/epoch/pool_exhausted (incl. real
httptest no-spillover + mixed-fleet), observability; Swift cross-language wire
round-trip + handler lifecycle/epoch-guard/refuse. Coordinator full registry+api
suites green; provider darkbloom builds + Swift tests pass. Reviewed (Codex +
independent); all blockers/highs fixed.
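The epoch guard described for AssignProviderModel and ApplyAssignModelStatus can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `provider` struct, its field names, and the `assign`/`applyStatus` methods are hypothetical stand-ins, and the real registry also handles locking, cooldowns, and isolation on failure.

```go
package main

import "fmt"

// AssignmentState is a hypothetical type for the lifecycle states named in the PR.
type AssignmentState string

const (
	StateLoading   AssignmentState = "loading"
	StateSucceeded AssignmentState = "succeeded"
	StateFailed    AssignmentState = "failed"
)

// provider holds only the fields this sketch needs (assumed names).
type provider struct {
	assignedModel string
	epoch         uint64
	state         AssignmentState
}

// assign binds a model and bumps the epoch, so statuses for older
// assignments become stale.
func (p *provider) assign(model string) uint64 {
	p.epoch++
	p.assignedModel = model
	p.state = StateLoading
	return p.epoch
}

// applyStatus accepts a status only when it matches the current epoch and model.
func (p *provider) applyStatus(model string, epoch uint64, s AssignmentState) bool {
	if epoch != p.epoch || model != p.assignedModel {
		return false // stale or mismatched status: ignore
	}
	p.state = s
	return true
}

func main() {
	p := &provider{}
	e1 := p.assign("gemma")
	e2 := p.assign("gpt-oss")
	fmt.Println(p.applyStatus("gemma", e1, StateSucceeded))   // false
	fmt.Println(p.applyStatus("gpt-oss", e2, StateSucceeded)) // true
}
```

A late `succeeded` for the superseded gemma assignment is dropped, so it cannot overwrite the state of the newer gpt-oss assignment.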
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f66a26d8ce
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    }
    machines = append(machines, placementMachine{
        id:      id,
        current: p.AssignedModel,
Exclude non-serving assignments from pool counts
When an assign_model push fails (or the send fails after AssignProviderModel has already set AssignmentStateLoading), the provider remains bound to AssignedModel but is not routing-eligible. This line still reports that model as the machine's current pool membership, so planPlacement counts it toward plan.current and sees no deficit to refill or retry after the cooldown; a one-machine pool can remain pool_exhausted indefinitely even though the controller comment says it should reconsider after the cooldown.
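One way to read this suggestion is to count a machine toward its pool only while its assignment is actually serving. The sketch below assumes a simplified `machine` type and string state values; none of these names come from the PR, and whether in-flight loads should count as pending capacity is a separate design decision that this sketch does not settle.

```go
package main

import "fmt"

// machine is a hypothetical, simplified view of a provider's assignment.
type machine struct {
	assignedModel string
	state         string // assumed values: "loading", "succeeded", "failed"
}

// servingPoolSizes counts a machine toward its pool only when its
// assignment has succeeded, so failed or stuck assignments show up
// as a deficit the allocator can refill.
func servingPoolSizes(ms []machine) map[string]int {
	sizes := make(map[string]int)
	for _, m := range ms {
		if m.assignedModel != "" && m.state == "succeeded" {
			sizes[m.assignedModel]++
		}
	}
	return sizes
}

func main() {
	ms := []machine{
		{assignedModel: "gemma", state: "failed"},
		{assignedModel: "gpt-oss", state: "succeeded"},
		{assignedModel: "gpt-oss", state: "loading"},
	}
	fmt.Println(servingPoolSizes(ms)) // map[gpt-oss:1]
}
```

With this counting, the one-machine gemma pool whose assignment failed reports size 0, so a planner comparing current size against target would see a deficit again after the cooldown.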
    if let assigned = assignedModel, modelId != assigned {
        throw InferenceError.invalidModelDirectory(
            "model '\(modelId)' is not this machine's assigned pool model '\(assigned)'"
        )
Honor assignment rollback before provider refusal
In rollback and bypass scenarios, the coordinator deliberately stops enforcing AssignedModel (for example when WARM_POOL_ASSIGNMENT_GATE is disabled, or for owner self-route), but the provider has no unassign message and keeps assignedModel after the first assign_model. This unconditional check then rejects any non-assigned model the coordinator routes in those modes, so the documented reversible gate/self-route bypass still fails until the provider restarts or gets a different assignment.
f66a26d to 99dd762
This PR introduces the DAR-345 model-pool assignment feature (coordinator-driven
Trust Boundaries Touched
Per-Threat Assessment
T-034: Provider runs modified code while advertising a trusted identity
The diff does not show the body of
T-036: Trust level elevated without completing full MDM/MDA chain
However,
T-008: Provider sends plaintext SSE chunks on encryption failure
T-009: Swift provider excluded from private-request routing due to missing Python flags
T-010: Cancellation not propagated to inference engine
T-032 / T-038 / T-041
New Attack Surface Not Covered by an Existing Threat
1. Provider-controlled
As noted above:
2. Epoch integer wrap-around is benign in practice but worth noting
3.
Open Findings Resolved by This PR
None of the tracked SEC-* findings are resolved by this diff. The MDA cert-chain caching addresses an operational pain point (providers re-requesting Apple attestation on every restart) within the accepted DAR-326 trade-off already documented in T-036, but SEC-004 (unauthenticated MDM webhook) remains open and is unaffected.
🔐 Threat model:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 99dd762a9d
        // way — an aggregator fails over on the first 429.
        retryAfter = 10
    }
    s.ddIncr("routing.decisions", []string{"model:" + model, "model_type:" + s.registry.ModelType(model), "outcome:pool_exhausted"})
Record pool-exhausted pressure before returning 429
When this early pool_exhausted path fires, it returns before the normal preflight branch that calls RecordWarmPoolCapacityReject. For a model with zero assigned-and-serving machines and no configured floor, the warm-pool snapshot therefore keeps TargetWarm at 0, so placement receives no demand for that model and the pool can keep returning 429 indefinitely instead of ever assigning a machine.
    machines := make([]placementMachine, 0, len(r.providers))
    for id, p := range r.providers {
        p.mu.Lock()
        eligible := p.Status != StatusOffline && p.Status != StatusUntrusted && !p.PrivateOnly && r.manageableLocked(p)
Gate placement to routable providers
This eligibility check is looser than the scheduler’s routing gates: an online provider that is still below MinTrustLevel, lacks a fresh challenge/SIP proof, or failed runtime/privacy checks can be assigned to a pool. Once it acks assign_model, the placement current counts and PoolExhausted can treat that unroutable machine as serving the pool, while providerPassesRoutingGatesLockedEx later rejects it for actual requests, so a pool can be satisfied by capacity that public traffic cannot use.
    if policy.enabled || s.registry == nil {
        return false
Bypass pool shed for prefer-owner routing
When PreferOwner is set, this preflight still sheds solely on the public pool because only policy.enabled (exclusive self-route) bypasses it. The scheduler would later treat the caller’s owned provider as selfRouteOwner and bypass assignment isolation for that machine, so a request whose public pool is empty but whose owner machine can serve the model is rejected with 429 before it can use the requested owned route.
    if s.registry.ApplyAssignModelStatus(providerID, statusMsg.ModelID, statusMsg.Epoch, statusMsg.Status) {
        switch statusMsg.Status {
        case protocol.AssignModelStatusSucceeded:
            s.registry.MarkModelWarm(providerID, statusMsg.ModelID)
Replace warm state on assignment success
assign_model:succeeded means the provider has unloaded every model except the assigned one, but MarkModelWarm is additive: it leaves old WarmModels entries and existing BackendCapacity.Slots intact until the next heartbeat. During that window, warm-pool planning and the co-residency report can still treat the old pools as warm on this machine, suppressing demand or reporting false pool membership after an exclusive switch.
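The difference between an additive mark and an exclusive replacement can be shown with a small sketch. The `providerState` type and both method names are hypothetical; the real registry also tracks BackendCapacity.Slots, which this example leaves out.

```go
package main

import (
	"fmt"
	"sort"
)

// providerState is a hypothetical, simplified warm-model record.
type providerState struct {
	warmModels map[string]bool
}

// markWarm is additive: existing entries stay until something removes them.
func (p *providerState) markWarm(model string) {
	p.warmModels[model] = true
}

// setExclusiveWarm replaces the whole warm set, matching a provider that
// has unloaded every model except the assigned one.
func (p *providerState) setExclusiveWarm(model string) {
	p.warmModels = map[string]bool{model: true}
}

// warmList returns the warm models in sorted order for stable output.
func warmList(p *providerState) []string {
	out := make([]string, 0, len(p.warmModels))
	for m := range p.warmModels {
		out = append(out, m)
	}
	sort.Strings(out)
	return out
}

func main() {
	p := &providerState{warmModels: map[string]bool{"gemma": true}}
	p.markWarm("gpt-oss")
	fmt.Println(warmList(p)) // [gemma gpt-oss]
	p.setExclusiveWarm("gpt-oss")
	fmt.Println(warmList(p)) // [gpt-oss]
}
```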
    // minProviderVersionForDesiredModels: a pre-feature provider's strict decoder
    // throws on the unknown assign_model type. KEEP THIS IN SYNC with the release
    // that ships Swift assign_model support (ProviderCore.version at that cut).
    const minProviderVersionForModelAssignment = "0.6.18"
Keep fallback release at assignment-capable version
This new assignment gate requires providers to report at least 0.6.18, but the no-release-record fallback LatestProviderVersion just above is still 0.6.11 while ProviderCore.version is now 0.6.18. In in-memory/dev coordinators (or any environment before the release row is registered), /version advertises a build that can never satisfy this gate, so older providers will not update into the assign_model-capable cohort and placement will leave them unmanaged.
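A fail-closed version gate of the kind discussed here might look like the following sketch. It assumes plain `major.minor.patch` version strings and uses made-up helper names (`parseVersion`, `supportsAssignment`); the PR's providerSupportsModelAssignment may parse versions differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion parses "major.minor.patch" (optional leading "v").
// ok is false for any other shape.
func parseVersion(v string) (parts [3]int, ok bool) {
	fields := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(fields) != 3 {
		return parts, false
	}
	for i, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil || n < 0 {
			return parts, false
		}
		parts[i] = n
	}
	return parts, true
}

// supportsAssignment fails closed: an unparsable version on either side
// is treated as "does not support assignment".
func supportsAssignment(reported, minimum string) bool {
	r, okR := parseVersion(reported)
	m, okM := parseVersion(minimum)
	if !okR || !okM {
		return false
	}
	for i := 0; i < 3; i++ {
		if r[i] != m[i] {
			return r[i] > m[i]
		}
	}
	return true // equal versions satisfy the gate
}

func main() {
	fmt.Println(supportsAssignment("0.6.18", "0.6.18")) // true
	fmt.Println(supportsAssignment("0.6.11", "0.6.18")) // false
	fmt.Println(supportsAssignment("0.10.0", "0.6.18")) // true
	fmt.Println(supportsAssignment("dev", "0.6.18"))    // false
}
```

Comparing numerically per component matters: a string comparison would rank "0.10.0" below "0.6.18". The second call shows the situation in the comment above, where a 0.6.11 fallback can never satisfy a 0.6.18 gate.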
    budget := c.config.MaxLoadsPerTick
    if budget <= 0 {
        budget = 1
    }
Honor zero placement switch budget
If operators set EIGENINFERENCE_WARM_POOL_MAX_LOADS_PER_TICK=0 to disable load-issuing/throttled movement, placement enforcement still gets a budget of 1 here and sends assign_model, which drains/unloads/loads a model just like a warm load. This bypasses the existing zero-budget kill switch used by plan() and can unexpectedly move production machines while load issuance was intended to be disabled.
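One way to keep a default of 1 while still honoring an explicit zero is to distinguish "unset" from "set to 0". The sketch below does that with a pointer; this representation and the `switchBudget` name are assumptions for illustration, and the actual config type in the PR may encode "unset" differently.

```go
package main

import "fmt"

// switchBudget returns how many assignments may be issued this tick.
// nil means the knob was never set (default 1); an explicit value <= 0
// acts as a kill switch and yields a budget of 0.
func switchBudget(maxLoadsPerTick *int) int {
	if maxLoadsPerTick == nil {
		return 1
	}
	if *maxLoadsPerTick <= 0 {
		return 0
	}
	return *maxLoadsPerTick
}

func main() {
	zero, three := 0, 3
	fmt.Println(switchBudget(nil))    // 1
	fmt.Println(switchBudget(&zero))  // 0
	fmt.Println(switchBudget(&three)) // 3
}
```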
    _ = await waitForInflightDrain(timeout: Self.assignDrainTimeout)
    guard assignmentEpoch == epoch else { return } // superseded mid-drain
    await unloadModelsExcept(modelId)
Stop canceled assignments before unloading
When a newer assign_model cancels the current assignment task while it is waiting for drain, waitForInflightDrain returns false, but this result is ignored; if the canceled task resumes before the newer task updates assignmentEpoch, the guard still passes and it can unload models for the superseded assignment. In that race, a stale switch can tear down the model the newer assignment is trying to keep before the new task repairs state.
    "provider switching model",
    "assigned pool model",
Classify pool refusals before recording failures
These new strings only feed the later inference-failure classifier, but handleInferenceError records provider job failures using its separate capacityRejection predicate before the dispatch path sees this marker. When a managed provider refuses a non-assigned model with assigned pool model, the request did not run and should be rerouted, yet the provider still receives a reputation/job-failure hit.
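A sketch of classifying the refusal before any failure is recorded, assuming substring matching on the two marker strings quoted in the diff. The `isPoolRefusal` and `recordOutcome` helpers and the failure counter are hypothetical; the PR's handleInferenceError and capacityRejection predicate are not shown in this page.

```go
package main

import (
	"fmt"
	"strings"
)

// poolRefusalMarkers are the two substrings quoted in the diff.
var poolRefusalMarkers = []string{
	"provider switching model",
	"assigned pool model",
}

// isPoolRefusal reports whether an error message is a pool-transition
// refusal, meaning the request never ran on the provider.
func isPoolRefusal(msg string) bool {
	lower := strings.ToLower(msg)
	for _, marker := range poolRefusalMarkers {
		if strings.Contains(lower, marker) {
			return true
		}
	}
	return false
}

// recordOutcome (hypothetical) checks the refusal class first, so a
// refused request is rerouted without a reputation hit.
func recordOutcome(failures map[string]int, providerID, errMsg string) (reroute bool) {
	if isPoolRefusal(errMsg) {
		return true
	}
	failures[providerID]++
	return false
}

func main() {
	failures := map[string]int{}
	refusal := "model 'gemma' is not this machine's assigned pool model 'gpt-oss'"
	fmt.Println(recordOutcome(failures, "p1", refusal), failures["p1"])        // true 0
	fmt.Println(recordOutcome(failures, "p1", "decode error"), failures["p1"]) // false 1
}
```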
Summary
Enforce exactly one active public model per provider, with the coordinator partitioning the fleet into per-model pools. Each managed machine is assigned one model; the scheduler routes only that model to it; pool exhaustion returns an uptime-neutral 429 pool_exhausted instead of spilling into another model's machines. A demand-driven placement controller decides which machine holds which model and switches machines between pools conservatively. Closes DAR-345.
All enforcement is DEFAULT-OFF (atomic assignmentGateEnabled + WARM_POOL_PLACEMENT_ENABLED/_ENFORCE): this PR is inert until flags are flipped, enabling a staged rollout.
Why (grounded in prod telemetry, Jun 17–20)
A Mac has one shared GPU / unified-memory / memory-bandwidth budget; co-resident models split it nonlinearly.
- model_shed of gemma (526K, all could_have_served); #2 is ttft_too_slow (342K, 249K of it on gpt-oss during gemma co-residency).
- ttft_429 collapsed under co-residency, then recovered to 0% once gemma was re-shed, while gpt-oss demand kept growing.
Pools replace the blunt global model_shed with surgical per-machine isolation, so gemma can be safely un-shed into a bounded, concurrency-capped pool without degrading gpt-oss. Switching is rare by design (regime changes + slow growth, not minute-scale flapping); anti-thrash reuses the warm pool's existing MinDwell (5m) / MaxGlobalPendingLoads / MaxLoadsPerTick knobs.
What changed
Coordinator (Go)
- protocol: assign_model / assign_model_status messages (epoch + draining/loading/succeeded/failed).
- registry: per-provider assignment state + epoch; AssignProviderModel / ApplyAssignModelStatus (epoch-guarded, failed→isolated+cooldown); PoolExhausted (mirrors gate eligibility: counts unmanaged machines so a mixed-fleet rollout never false-429s).
- scheduler: one isolation gate in providerPassesRoutingGatesLockedEx (the shared selection/queue-drain/preflight/admit chokepoint) → no-spillover is structural; self-route owners bypass.
- placement_controller: pure planPlacement allocator (floor-then-priority normalization, surplus→deficit switching, anti-thrash, unmanaged-source preference); shadow vs enforce.
- admission: shedIfPoolExhausted 429 in both chat + responses handlers; version gate >= 0.6.18 (fail-closed); pool-transition strings classified capacity-class (no false breaker trips).
- observability: ModelPoolReport (pool sizes, per-provider assignment, co-residency audit) on /v1/admin/utilization.
Provider (Swift, protocol-symmetric)
- assign_model handler: drain → unload-others → load+warm → status, epoch-guarded, off the event loop in a cancellable task; refuse-don't-swap (a managed machine refuses any non-assigned model instead of LRU-swapping).
- ProviderCore.version → 0.6.18.
Testing
- pool_exhausted (incl. real httptest no-spillover + mixed-fleet), observability. Full registry + api suites green.
- darkbloom builds; Swift tests pass.
Notes for reviewers
- 0.6.18 provider version and minProviderVersionForModelAssignment are the two ends of one gate; bump them together at release.
- dar-345-model-pools exists from another effort; worth reconciling before merge.
- PoolExhausted could additionally honor cooldown/breaker/trust gates for more precise pool_exhausted labeling; "public model" wording in some comments is build-id in practice.
🤖 Generated with Claude Code