DAR-345: one active model per machine via coordinator model pools #435

Open
Gajesh2007 wants to merge 1 commit into master from feat/dar-345-model-pools

Gajesh2007 (Member) commented Jun 21, 2026

Summary

Enforce exactly one active public model per provider, with the coordinator partitioning the fleet into per-model pools. Each managed machine is assigned one model; the scheduler routes only that model to it; pool exhaustion returns an uptime-neutral 429 pool_exhausted instead of spilling into another model's machines. A demand-driven placement controller decides which machine holds which model and switches machines between pools conservatively. Closes DAR-345.

All enforcement is DEFAULT-OFF (atomic assignmentGateEnabled + WARM_POOL_PLACEMENT_ENABLED/_ENFORCE) — this PR is inert until flags are flipped, enabling a staged rollout.
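
The default-off behaviour described above can be sketched in Go as follows. This is a minimal illustration, not the PR's code: `placementMode`, `resolveMode`, and the rule that the enforce flag is ignored unless placement is enabled are assumptions made for the example. Only the `assignmentGateEnabled` name and the enabled/enforce flag pair come from the description.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// placementMode is a hypothetical enum used only in this sketch.
type placementMode int

const (
	modeOff     placementMode = iota // controller does nothing
	modeShadow                       // plan and log, never send assign_model
	modeEnforce                      // plan and act on the plan
)

// String renders the mode for log lines.
func (m placementMode) String() string {
	switch m {
	case modeShadow:
		return "shadow"
	case modeEnforce:
		return "enforce"
	default:
		return "off"
	}
}

// resolveMode assumes the enforce flag has no effect unless placement is enabled.
func resolveMode(enabled, enforce bool) placementMode {
	if !enabled {
		return modeOff
	}
	if enforce {
		return modeEnforce
	}
	return modeShadow
}

// assignmentGateEnabled is default-off: the zero value of atomic.Bool is false.
var assignmentGateEnabled atomic.Bool

func main() {
	fmt.Println(resolveMode(false, true), assignmentGateEnabled.Load())
	assignmentGateEnabled.Store(true) // flipped by an operator during rollout
	fmt.Println(resolveMode(true, false), assignmentGateEnabled.Load())
}
```

Because both the gate and the mode start in their zero state, a binary built from this sketch changes no routing behaviour until an operator flips a flag.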

Why (grounded in prod telemetry, Jun 17–20)

A Mac has one shared GPU / unified-memory / memory-bandwidth budget; co-resident models split it nonlinearly.

  • Two models dominate: gpt-oss-20b ~83% of served traffic, gemma-4-26b ~17%.
  • gpt-oss decodes at 57 tps p50; gemma at 2.6 tps (~22× slower, far below the 15-tps quality bar).
  • The #1 rejection is model_shed of gemma (526K, all could_have_served); #2 is ttft_too_slow (342K, 249K of it on gpt-oss during gemma co-residency).
  • Temporal proof of isolation: gpt-oss ttft_429 collapsed under co-residency, then recovered to 0% once gemma was re-shed — while gpt-oss demand kept growing.

Pools replace the blunt global model_shed with surgical per-machine isolation, so gemma can be safely un-shed into a bounded, concurrency-capped pool without degrading gpt-oss. Switching is rare by design (regime changes + slow growth, not minute-scale flapping); anti-thrash reuses the warm pool's existing MinDwell (5m) / MaxGlobalPendingLoads / MaxLoadsPerTick knobs.

What changed

Coordinator (Go)

  • protocol: assign_model / assign_model_status messages (epoch + draining/loading/succeeded/failed).
  • registry: per-provider assignment state + epoch; AssignProviderModel/ApplyAssignModelStatus (epoch-guarded, failed→isolated+cooldown); PoolExhausted (mirrors gate eligibility — counts unmanaged machines so a mixed-fleet rollout never false-429s).
  • scheduler: one isolation gate in providerPassesRoutingGatesLockedEx (the shared selection/queue-drain/preflight/admit chokepoint) → no-spillover is structural; self-route owners bypass.
  • placement_controller: pure planPlacement allocator (floor-then-priority normalization, surplus→deficit switching, anti-thrash, unmanaged-source preference); shadow vs enforce.
  • admission: shedIfPoolExhausted 429 in both chat + responses handlers; version gate >= 0.6.18 (fail-closed); pool-transition strings classified capacity-class (no false breaker trips).
  • observability: ModelPoolReport (pool sizes, per-provider assignment, co-residency audit) on /v1/admin/utilization.
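
The isolation gate and the pool-exhaustion check from the list above can be sketched together. This is a simplified model under stated assumptions: the `provider` struct, `passesIsolationGate`, and `poolExhausted` are hypothetical names, and the real gate also involves trust, load, and cooldown checks that are omitted here.

```go
package main

import "fmt"

// provider is a simplified stand-in for the registry's per-provider state.
type provider struct {
	id            string
	assignedModel string // "" means the machine is unmanaged
	serving       bool   // the assignment reached "succeeded"
}

// passesIsolationGate reports whether a machine may take a request for model.
// Unmanaged machines and self-route owners bypass the gate in this sketch.
func passesIsolationGate(p provider, model string, gateEnabled, selfRouteOwner bool) bool {
	if !gateEnabled || selfRouteOwner || p.assignedModel == "" {
		return true
	}
	return p.serving && p.assignedModel == model
}

// poolExhausted mirrors the gate: the pool is exhausted only when no machine,
// managed or unmanaged, would pass the gate for this model.
func poolExhausted(fleet []provider, model string, gateEnabled bool) bool {
	if !gateEnabled {
		return false // gate off: never shed on pool grounds
	}
	for _, p := range fleet {
		if passesIsolationGate(p, model, true, false) {
			return false
		}
	}
	return true
}

func main() {
	fleet := []provider{{id: "m1", assignedModel: "gpt-oss-20b", serving: true}}
	fmt.Println(poolExhausted(fleet, "gemma-4-26b", true))

	// Adding an unmanaged machine keeps a mixed fleet from false-429ing.
	fleet = append(fleet, provider{id: "m2"})
	fmt.Println(poolExhausted(fleet, "gemma-4-26b", true))
}
```

Deriving the exhaustion check from the same predicate as the routing gate is what keeps the two from drifting apart, which appears to be the intent of "mirrors gate eligibility".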

Provider (Swift, protocol-symmetric)

  • assign_model handler: drain → unload-others → load+warm → status, epoch-guarded, off the event loop in a cancellable task; refuse-don't-swap (a managed machine refuses any non-assigned model instead of LRU-swapping). ProviderCore.version → 0.6.18.

Testing

  • Go: allocator (9 cases), assignment/epoch/pool_exhausted (incl. real httptest no-spillover + mixed-fleet), observability. Full registry + api suites green.
  • Swift: cross-language wire round-trip + handler lifecycle/epoch-guard/refuse. darkbloom builds; Swift tests pass.
  • Dual review (Codex + independent); all blockers/highs fixed (status-handler wiring, version-gate, drain-before-unload, refuse ordering, send-failure cleanup).

Notes for reviewers

  • Coordinator deploy is human-only (EigenCloud prod) — not part of this PR.
  • The 0.6.18 provider version and minProviderVersionForModelAssignment are the two ends of one gate; bump them together at release.
  • A separate parallel branch dar-345-model-pools exists from another effort — worth reconciling before merge.
  • Deferred follow-up (advisory, non-blocking): PoolExhausted could additionally honor cooldown/breaker/trust gates for more precise pool_exhausted labeling; "public model" wording in some comments is build-id in practice.

🤖 Generated with Claude Code



vercel Bot commented Jun 21, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| d-inference | Ready | Preview | Jun 21, 2026 1:44am |
| d-inference-console-ui-dev | Ready | Preview | Jun 21, 2026 1:44am |
| d-inference-landing | Ready | Preview | Jun 21, 2026 1:44am |

…model pools

Enforce exactly one active public model per provider, partitioning the fleet by
demand. A Mac has one shared GPU/unified-memory/bandwidth budget; co-resident
models split it and let a slow/hot model (gemma, ~2.6 tps p50) drag a healthy one
(gpt-oss, ~57 tps) under the TTFT bar. Prod telemetry (Jun 17-20): gpt-oss ttft_429
collapsed during gemma co-residency and recovered to 0% once gemma was re-shed,
while 526K gemma requests were turned away by the blunt global model_shed. Pools
replace that with surgical per-machine isolation so gemma can be safely un-shed
into a bounded, concurrency-capped pool without touching gpt-oss.

All enforcement is DEFAULT-OFF (atomic assignmentGateEnabled + WarmPool
PlacementEnabled/PlacementEnforce) — inert until enabled, staged rollout
(shadow -> provider assign -> static pools+gate -> pool_exhausted 429 -> dynamic).

Coordinator (Go):
- protocol: assign_model / assign_model_status messages (epoch + draining/loading/
  succeeded/failed); ProviderSwitchingModel transient marker.
- registry: Provider.Assigned{Model,Epoch,State,At}; SendAssignModel;
  AssignProviderModel (monotonic epoch); ApplyAssignModelStatus (epoch-guarded;
  failed -> isolated + dispatch cooldown); PoolExhausted (mirrors gate eligibility,
  counts unmanaged machines so a mixed-fleet rollout never false-429s).
- scheduler: one isolation gate in providerPassesRoutingGatesLockedEx (the shared
  selection/queue-drain/preflight/admit chokepoint) -> no-spillover is structural;
  self-route owners bypass.
- placement_controller: pure planPlacement allocator (floor-then-priority
  normalization, surplus->deficit switching, anti-thrash via MinDwell /
  MaxGlobalPendingLoads / MaxLoadsPerTick / cooldown, unmanaged-source preference),
  shadow vs enforce, extends the dormant warm-pool tick.
- admission: shedIfPoolExhausted 429 (uptime-neutral, could-have-served) in BOTH
  chat + responses handlers; provider read loop applies assign_model_status;
  version gate providerSupportsModelAssignment (>= 0.6.18, fail-closed); pool
  transition strings classified capacity-class (no false breaker trips).
- observability: ModelPoolReport (assigned pool sizes, per-provider assignment,
  co-residency audit) on /v1/admin/utilization.
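
The surplus->deficit switching with anti-thrash limits can be illustrated with a small pure allocator. This is a sketch under assumptions, not planPlacement itself: `machine`, `move`, `planMoves`, and the integer-seconds dwell representation are invented for the example, and floor-then-priority normalization of the targets is assumed to have happened already.

```go
package main

import (
	"fmt"
	"sort"
)

// machine is a hypothetical view of one managed or unmanaged provider.
type machine struct {
	id          string
	current     string // assigned model; "" means unmanaged
	heldSeconds int    // how long the current assignment has been held
}

// move describes one planned reassignment.
type move struct{ machineID, from, to string }

// planMoves shifts machines from surplus pools to deficit pools. It prefers
// unmanaged machines as sources, skips machines that have not met the minimum
// dwell, and never emits more than budget moves (a budget of 0 emits none).
func planMoves(machines []machine, target map[string]int, minDwellSeconds, budget int) []move {
	count := map[string]int{}
	for _, m := range machines {
		count[m.current]++
	}

	// Collect deficit models in a deterministic order.
	var deficits []string
	for model, want := range target {
		if count[model] < want {
			deficits = append(deficits, model)
		}
	}
	sort.Strings(deficits)

	// Order candidate sources: unmanaged first, then by id.
	sources := append([]machine(nil), machines...)
	sort.Slice(sources, func(i, j int) bool {
		ui, uj := sources[i].current == "", sources[j].current == ""
		if ui != uj {
			return ui
		}
		return sources[i].id < sources[j].id
	})

	moved := map[string]bool{}
	var moves []move
	for _, model := range deficits {
		for _, m := range sources {
			if len(moves) >= budget || count[model] >= target[model] {
				break
			}
			surplus := count[m.current] > target[m.current]
			dwelled := m.current == "" || m.heldSeconds >= minDwellSeconds
			if moved[m.id] || m.current == model || !surplus || !dwelled {
				continue
			}
			count[m.current]--
			count[model]++
			moved[m.id] = true
			moves = append(moves, move{m.id, m.current, model})
		}
	}
	return moves
}

func main() {
	fleet := []machine{
		{id: "a", current: "gpt-oss-20b", heldSeconds: 600},
		{id: "b", current: "gpt-oss-20b", heldSeconds: 600},
		{id: "c", current: "gpt-oss-20b", heldSeconds: 30},
	}
	target := map[string]int{"gpt-oss-20b": 2, "gemma-4-26b": 1}
	fmt.Println(planMoves(fleet, target, 300, 1)) // 300 s matches the 5m MinDwell
}
```

Keeping the allocator a pure function of its inputs makes shadow mode cheap: the same plan can be logged without being applied.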

Provider (Swift, protocol-symmetric):
- assign_model / assign_model_status types + encode/decode + event/codec wiring.
- handleAssignModelRequest: drain (waitForInflightDrain) -> unload every other
  model -> load+warm (ensureModelLoaded) -> status, epoch-guarded, run off the
  event loop in a cancellable task. Refuse-don't-swap at the top of
  ensureModelLoaded (a managed machine refuses any non-assigned model instead of
  LRU-swapping). ProviderCore.version -> 0.6.18 (matches the coordinator gate).

Tests: Go allocator (9 cases), assignment/epoch/pool_exhausted (incl. real
httptest no-spillover + mixed-fleet), observability; Swift cross-language wire
round-trip + handler lifecycle/epoch-guard/refuse. Coordinator full registry+api
suites green; provider darkbloom builds + Swift tests pass. Reviewed (Codex +
independent); all blockers/highs fixed.

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f66a26d8ce

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

}
machines = append(machines, placementMachine{
id: id,
current: p.AssignedModel,

P1: Exclude non-serving assignments from pool counts

When an assign_model push fails (or the send fails after AssignProviderModel has already set AssignmentStateLoading), the provider remains bound to AssignedModel but is not routing-eligible. This line still reports that model as the machine's current pool membership, so planPlacement counts it toward plan.current and sees no deficit to refill or retry after the cooldown; a one-machine pool can remain pool_exhausted indefinitely even though the controller comment says it should reconsider after the cooldown.
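
One way to address this, sketched below, is to count serving and pending assignments separately and let failed assignments count toward neither, so the deficit reappears in the plan. The type and function names are hypothetical. A machine stuck in loading after a send failure would still need a timeout or explicit cleanup, which this sketch does not model.

```go
package main

import "fmt"

// assignmentState is a hypothetical mirror of the draining/loading/succeeded/failed states.
type assignmentState int

const (
	stateNone assignmentState = iota
	stateDraining
	stateLoading
	stateSucceeded
	stateFailed
)

// provider is a simplified view of one machine's assignment.
type provider struct {
	id    string
	model string
	state assignmentState
}

// poolCounts reports, per model, how many machines are serving and how many
// are mid-switch. Failed or unassigned machines contribute to neither map.
func poolCounts(providers []provider) (serving, pending map[string]int) {
	serving, pending = map[string]int{}, map[string]int{}
	for _, p := range providers {
		switch p.state {
		case stateSucceeded:
			serving[p.model]++
		case stateDraining, stateLoading:
			pending[p.model]++
		}
	}
	return serving, pending
}

func main() {
	serving, pending := poolCounts([]provider{
		{id: "m1", model: "gemma-4-26b", state: stateFailed},
		{id: "m2", model: "gpt-oss-20b", state: stateSucceeded},
		{id: "m3", model: "gpt-oss-20b", state: stateLoading},
	})
	fmt.Println(serving, pending)
}
```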

Comment on lines +2771 to +2774
if let assigned = assignedModel, modelId != assigned {
throw InferenceError.invalidModelDirectory(
"model '\(modelId)' is not this machine's assigned pool model '\(assigned)'"
)

P1: Honor assignment rollback before provider refusal

In rollback and bypass scenarios, the coordinator deliberately stops enforcing AssignedModel (for example when WARM_POOL_ASSIGNMENT_GATE is disabled, or for owner self-route), but the provider has no unassign message and keeps assignedModel after the first assign_model. This unconditional check then rejects any non-assigned model the coordinator routes in those modes, so the documented reversible gate/self-route bypass still fails until the provider restarts or gets a different assignment.

@github-actions

This PR introduces the DAR-345 model-pool assignment feature (coordinator-driven assign_model / assign_model_status round-trip) and MDA cert-chain caching for reconnects; no existing security mitigations are weakened, but three areas deserve attention before merge.


Trust Boundaries Touched

| Boundary | Reason |
| --- | --- |
| TB-002 Coordinator ↔ Provider WebSocket | New assign_model / assign_model_status message types processed in the provider read-loop |
| TB-003 Provider operator vs. process | assignedModel / assignmentEpoch actor state is mutable via coordinator messages; unloadModelsExcept evicts resident models |
| TB-005 Coordinator ↔ Apple MDM/MDA | restoredMDAChain / SetMDAProofIfHardwareBound allow reuse of a persisted MDA proof across reconnects |
| TB-009 Apple attestation chain | attachCachedMDAProof (referenced but not shown in diff) is the new gatekeeping path for trust-reuse |

Per-Threat Assessment

T-034 — Provider runs modified code while advertising a trusted identity
⚠️ Partially touches — needs scrutiny.

coordinator/api/provider.go +479–+507: ApplyAssignModelStatus is called with statusMsg.ModelID and statusMsg.Epoch taken directly from the provider's WebSocket message. A malicious provider can:

  • Send a fabricated assign_model_status with status: succeeded for any modelID and any epoch it chooses.
  • If ApplyAssignModelStatus trusts the provider-supplied modelID without re-checking it matches the coordinator's last assigned model for this provider, it could mark an unintended model warm or clear a pending-load reservation for a model the coordinator never assigned.

The diff does not show the body of ApplyAssignModelStatus (truncated). Before merge, verify that ApplyAssignModelStatus in registry.go cross-checks the incoming (modelID, epoch) pair against the coordinator's own authoritative AssignedModel / AssignmentEpoch for this provider, and rejects any mismatch. If it simply accepts the provider-supplied modelID as ground truth, a malicious provider gains indirect influence over routing state — a trust-boundary violation consistent with T-034.
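
The cross-check requested above could look roughly like the following. Since the body of ApplyAssignModelStatus is not visible in the diff, this is only a sketch of the property to verify: the `assignment` struct, the lowercase function name, and the string-typed status are assumptions.

```go
package main

import "fmt"

// assignment is the coordinator's own authoritative record for one provider.
type assignment struct {
	model string
	epoch uint64
	state string
}

// applyAssignModelStatus accepts a provider-reported status only when both the
// model and the epoch match what the coordinator itself last assigned.
func applyAssignModelStatus(current *assignment, modelID string, epoch uint64, status string) bool {
	if current == nil || current.model == "" {
		return false // nothing was assigned, so nothing can be acknowledged
	}
	if modelID != current.model || epoch != current.epoch {
		return false // fabricated or stale report
	}
	switch status {
	case "draining", "loading", "succeeded", "failed":
		current.state = status
		return true
	default:
		return false // unknown status values are rejected
	}
}

func main() {
	cur := &assignment{model: "gemma-4-26b", epoch: 7, state: "loading"}
	fmt.Println(applyAssignModelStatus(cur, "gpt-oss-20b", 7, "succeeded"), cur.state)
	fmt.Println(applyAssignModelStatus(cur, "gemma-4-26b", 7, "succeeded"), cur.state)
}
```

With this shape, side effects such as marking a model warm only run after the function returns true, so a provider cannot warm or drain a model it was never assigned.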

provider-swift/Sources/ProviderCore/ProviderLoop.swift +1718–+1799: the handleAssignModelRequest epoch guard (if epoch < assignmentEpoch { return }) correctly ignores stale assignments. The assignedModel and assignmentEpoch fields are actor-isolated, which is correct. No regression here from the provider side.

T-036 — Trust level elevated without completing full MDM/MDA chain
ℹ️ Neutral to the existing open finding (SEC-004), but the new fast-path warrants a comment.

coordinator/registry/registry.go +1437–+1454: RestoreProviderState now stages restoredMDAChain from the persisted store record. The comment is explicit that MDAVerified is not set here — the chain is only a candidate until attachCachedMDAProof re-verifies it against Apple's pinned root and re-binds it to the live SE key. This matches the accepted trade-off described for T-036 / DAR-326 in the threat model. The implementation discipline (unexported field, clear comments, no serialisation) is appropriate.

However, attachCachedMDAProof is referenced but not present in this diff. The security guarantee lives entirely in that function. Reviewers should confirm the full attachCachedMDAProof implementation is in this PR (or a parent commit) and that:

  1. x509.Verify is called (not skipped) on the restored chain.
  2. The FreshnessCode OID is re-checked against sha256(currentConnectionSEPublicKey), not the stored value.
  3. The function is called at hardware-grant time, not at reconnect time before attestation completes.
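
Check 2 in the list above can be sketched as a constant-time comparison against a hash of the live key. The helper names and the assumption that the FreshnessCode value is the raw 32-byte SHA-256 digest are illustrative only; chain verification against the pinned root (check 1) is out of scope for this sketch.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
)

// freshnessFor computes the value a fresh attestation is assumed to carry:
// the SHA-256 digest of the Secure Enclave public key for this connection.
func freshnessFor(sePublicKey []byte) []byte {
	sum := sha256.Sum256(sePublicKey)
	return sum[:]
}

// freshnessMatches re-derives the expected code from the CURRENT connection's
// key rather than trusting any stored value, and compares in constant time.
func freshnessMatches(freshnessCode, currentSEPublicKey []byte) bool {
	return subtle.ConstantTimeCompare(freshnessCode, freshnessFor(currentSEPublicKey)) == 1
}

func main() {
	stored := freshnessFor([]byte("old-connection-key"))
	fmt.Println(freshnessMatches(stored, []byte("old-connection-key")))
	fmt.Println(freshnessMatches(stored, []byte("new-connection-key")))
}
```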

T-008 — Provider sends plaintext SSE chunks on encryption failure
ℹ️ Neutral. The new assignModel / assignModelStatus code path does not touch the streaming inference encryption path in ProviderLoop.swift. The existing open finding (SEC-016) is unchanged.

T-009 — Swift provider excluded from private-request routing due to missing Python flags
ℹ️ Neutral. The pool-assignment feature does not modify the providerSupportsPrivateTextLocked / privacy-capabilities gate. SEC-017 remains open.

T-010 — Cancellation not propagated to inference engine
No regression. handleAssignModelRequest correctly cancels the prior assignmentTask before starting a new one (assignmentTask?.cancel()), and CancellationError is caught and returned without a failure ack. The drain is bounded by assignDrainTimeout (30 s), mirroring the existing update-drain pattern.

T-032 / T-038 / T-041 — main.go changes
ℹ️ Neutral. The additions to coordinator/cmd/coordinator/main.go only log feature-flag state (assignment gate, placement mode) and wire SetAssignmentGateEnabled. No secrets are logged; no HTTP server config is changed. T-038 (ReadHeaderTimeout / MaxHeaderBytes) and T-041 (profile-signing key) are unaffected.


New Attack Surface Not Covered by an Existing Threat

1. Provider-controlled assign_model_status can influence coordinator routing state (extends T-034, not a separate new threat but a new code surface)

As noted above: provider.go +499 calls s.registry.MarkModelWarm, ClearPendingModelLoad, and DrainQueuedRequestsForModel based on a modelID the provider supplies. If the provider can choose an arbitrary modelID in its assign_model_status response, it can attempt to warm or drain models it was never assigned. The epoch guard alone does not prevent this if ApplyAssignModelStatus does not validate the model identity. This needs explicit verification.

2. Epoch integer wrap-around is benign in practice but worth noting

assignmentEpoch is uint64 on the Swift side and uint64 in Go. Practical wrap-around requires 2⁶⁴ assignments to the same provider — not a realistic attack. No action needed.

3. unloadModelsExcept evicts concurrently-active inference sessions

provider-swift/Sources/ProviderCore/ProviderLoop.swift +1791–+1798: unloadModelsExcept calls unloadModel for every model that is not the assigned one. The comment says this runs after a bounded drain (waitForInflightDrain), but the guard is !modelsUnloading.contains($0) — models already being unloaded are skipped, but models with active (not yet drained) inference work on them are not explicitly excluded. If waitForInflightDrain times out (the 30 s bound), unloadModelsExcept will still run and may terminate in-flight consumer sessions on those models. This is a consumer-visible availability issue, not a confidentiality or integrity issue, so it falls under T-029 / T-010 operational risk rather than a new security threat. Worth a comment in the PR for the inference team.


Open Findings Resolved by This PR

None of the tracked SEC-* findings are resolved by this diff. The MDA cert-chain caching addresses an operational pain point (providers re-requesting Apple attestation on every restart) within the accepted DAR-326 trade-off already documented in T-036, but SEC-004 (unauthenticated MDM webhook) remains open and is unaffected.


🔐 Threat model: docs/threat-model.yaml · Updates on each push to this PR

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 99dd762a9d

// way — an aggregator fails over on the first 429.
retryAfter = 10
}
s.ddIncr("routing.decisions", []string{"model:" + model, "model_type:" + s.registry.ModelType(model), "outcome:pool_exhausted"})

P1: Record pool-exhausted pressure before returning 429

When this early pool_exhausted path fires, it returns before the normal preflight branch that calls RecordWarmPoolCapacityReject. For a model with zero assigned-and-serving machines and no configured floor, the warm-pool snapshot therefore keeps TargetWarm at 0, so placement receives no demand for that model and the pool can keep returning 429 indefinitely instead of ever assigning a machine.
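
A sketch of the ordering this comment asks for: record the capacity reject as demand first, then return the 429. The `demandRecorder` type and the function signature are hypothetical, and the retry-after value of 10 is taken from the excerpt above purely for illustration.

```go
package main

import (
	"fmt"
	"net/http"
)

// demandRecorder is a hypothetical stand-in for the warm-pool demand signal.
type demandRecorder struct{ rejects map[string]int }

// recordCapacityReject notes that a request for model was turned away.
func (d *demandRecorder) recordCapacityReject(model string) { d.rejects[model]++ }

// shedIfPoolExhausted records demand BEFORE shedding, so a pool with zero
// machines still generates the pressure that lets placement assign one.
func shedIfPoolExhausted(model string, exhausted bool, d *demandRecorder) (status, retryAfter int, shed bool) {
	if !exhausted {
		return 0, 0, false
	}
	d.recordCapacityReject(model)
	return http.StatusTooManyRequests, 10, true
}

func main() {
	d := &demandRecorder{rejects: map[string]int{}}
	status, retryAfter, shed := shedIfPoolExhausted("gemma-4-26b", true, d)
	fmt.Println(status, retryAfter, shed, d.rejects["gemma-4-26b"])
}
```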

machines := make([]placementMachine, 0, len(r.providers))
for id, p := range r.providers {
p.mu.Lock()
eligible := p.Status != StatusOffline && p.Status != StatusUntrusted && !p.PrivateOnly && r.manageableLocked(p)

P2: Gate placement to routable providers

This eligibility check is looser than the scheduler’s routing gates: an online provider that is still below MinTrustLevel, lacks a fresh challenge/SIP proof, or failed runtime/privacy checks can be assigned to a pool. Once it acks assign_model, the placement current counts and PoolExhausted can treat that unroutable machine as serving the pool, while providerPassesRoutingGatesLockedEx later rejects it for actual requests, so a pool can be satisfied by capacity that public traffic cannot use.

Comment on lines +163 to +164
if policy.enabled || s.registry == nil {
return false

P2: Bypass pool shed for prefer-owner routing

When PreferOwner is set, this preflight still sheds solely on the public pool because only policy.enabled (exclusive self-route) bypasses it. The scheduler would later treat the caller’s owned provider as selfRouteOwner and bypass assignment isolation for that machine, so a request whose public pool is empty but whose owner machine can serve the model is rejected with 429 before it can use the requested owned route.

if s.registry.ApplyAssignModelStatus(providerID, statusMsg.ModelID, statusMsg.Epoch, statusMsg.Status) {
switch statusMsg.Status {
case protocol.AssignModelStatusSucceeded:
s.registry.MarkModelWarm(providerID, statusMsg.ModelID)

P2: Replace warm state on assignment success

assign_model:succeeded means the provider has unloaded every model except the assigned one, but MarkModelWarm is additive: it leaves old WarmModels entries and existing BackendCapacity.Slots intact until the next heartbeat. During that window, warm-pool planning and the co-residency report can still treat the old pools as warm on this machine, suppressing demand or reporting false pool membership after an exclusive switch.
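
The difference between the additive update and an exclusive one can be sketched as follows. The `providerState` type and both method names are hypothetical; the real registry also tracks backend capacity slots, which are omitted here.

```go
package main

import "fmt"

// providerState is a hypothetical slice of the registry's per-provider record.
type providerState struct{ warmModels map[string]bool }

// markModelWarm is additive: earlier entries survive.
func (p *providerState) markModelWarm(model string) { p.warmModels[model] = true }

// replaceWarmModels is exclusive: after a successful assignment the machine is
// known to hold only the assigned model, so every other entry is dropped.
func (p *providerState) replaceWarmModels(model string) {
	p.warmModels = map[string]bool{model: true}
}

func main() {
	p := &providerState{warmModels: map[string]bool{"gpt-oss-20b": true}}
	p.markModelWarm("gemma-4-26b")
	fmt.Println(len(p.warmModels)) // both models still appear warm

	p.replaceWarmModels("gemma-4-26b")
	fmt.Println(len(p.warmModels)) // only the assigned model remains
}
```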

Comment thread coordinator/api/server.go
// minProviderVersionForDesiredModels: a pre-feature provider's strict decoder
// throws on the unknown assign_model type. KEEP THIS IN SYNC with the release
// that ships Swift assign_model support (ProviderCore.version at that cut).
const minProviderVersionForModelAssignment = "0.6.18"

P2: Keep fallback release at assignment-capable version

This new assignment gate requires providers to report at least 0.6.18, but the no-release-record fallback LatestProviderVersion just above is still 0.6.11 while ProviderCore.version is now 0.6.18. In in-memory/dev coordinators (or any environment before the release row is registered), /version advertises a build that can never satisfy this gate, so older providers will not update into the assign_model-capable cohort and placement will leave them unmanaged.
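
For reference, a fail-closed version gate of the kind discussed here can be sketched like this. The constant name and value come from the excerpt above; `parseVersion`, the string-based signature of `providerSupportsModelAssignment`, and the strict three-part numeric format are assumptions for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const minProviderVersionForModelAssignment = "0.6.18"

// parseVersion accepts only strict MAJOR.MINOR.PATCH numeric versions.
func parseVersion(v string) ([3]int, bool) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, false
	}
	for i, part := range parts {
		n, err := strconv.Atoi(part)
		if err != nil || n < 0 {
			return out, false
		}
		out[i] = n
	}
	return out, true
}

// providerSupportsModelAssignment fails closed: anything unparseable is
// treated as too old, so assign_model is never sent to a strict decoder.
func providerSupportsModelAssignment(reported string) bool {
	got, ok := parseVersion(reported)
	if !ok {
		return false
	}
	floor, _ := parseVersion(minProviderVersionForModelAssignment)
	for i := range got {
		if got[i] != floor[i] {
			return got[i] > floor[i]
		}
	}
	return true // exactly the minimum version
}

func main() {
	for _, v := range []string{"0.6.11", "0.6.18", "0.10.0", "unknown"} {
		fmt.Println(v, providerSupportsModelAssignment(v))
	}
}
```

Comparing components numerically rather than as strings matters here: a lexical comparison would rank 0.10.0 below 0.6.18.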

Comment on lines +294 to +297
budget := c.config.MaxLoadsPerTick
if budget <= 0 {
budget = 1
}

P2: Honor zero placement switch budget

If operators set EIGENINFERENCE_WARM_POOL_MAX_LOADS_PER_TICK=0 to disable load-issuing/throttled movement, placement enforcement still gets a budget of 1 here and sends assign_model, which drains/unloads/loads a model just like a warm load. This bypasses the existing zero-budget kill switch used by plan() and can unexpectedly move production machines while load issuance was intended to be disabled.

Comment on lines +1771 to +1773
_ = await waitForInflightDrain(timeout: Self.assignDrainTimeout)
guard assignmentEpoch == epoch else { return } // superseded mid-drain
await unloadModelsExcept(modelId)

P2: Stop canceled assignments before unloading

When a newer assign_model cancels the current assignment task while it is waiting for drain, waitForInflightDrain returns false, but this result is ignored; if the canceled task resumes before the newer task updates assignmentEpoch, the guard still passes and it can unload models for the superseded assignment. In that race, a stale switch can tear down the model the newer assignment is trying to keep before the new task repairs state.

Comment on lines +139 to +140
"provider switching model",
"assigned pool model",

P2: Classify pool refusals before recording failures

These new strings only feed the later inference-failure classifier, but handleInferenceError records provider job failures using its separate capacityRejection predicate before the dispatch path sees this marker. When a managed provider refuses a non-assigned model with assigned pool model, the request did not run and should be rerouted, yet the provider still receives a reputation/job-failure hit.
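
The ordering fix suggested here amounts to running the capacity classifier before any failure accounting. In the sketch below the two marker strings come from the diff excerpt above; `isCapacityRejection`, `recordInferenceError`, and the `reputation` type are hypothetical names, and matching is done case-insensitively as an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// capacityMarkers are substrings that identify a refusal where the request
// never ran and should simply be rerouted.
var capacityMarkers = []string{"provider switching model", "assigned pool model"}

// isCapacityRejection reports whether errMsg contains any capacity marker.
func isCapacityRejection(errMsg string) bool {
	lower := strings.ToLower(errMsg)
	for _, marker := range capacityMarkers {
		if strings.Contains(lower, marker) {
			return true
		}
	}
	return false
}

// reputation is a hypothetical per-provider failure counter.
type reputation struct{ jobFailures int }

// recordInferenceError classifies first: capacity refusals ask for a reroute
// and leave reputation untouched; everything else counts as a job failure.
func recordInferenceError(r *reputation, errMsg string) (reroute bool) {
	if isCapacityRejection(errMsg) {
		return true
	}
	r.jobFailures++
	return false
}

func main() {
	r := &reputation{}
	refusal := "model 'gemma-4-26b' is not this machine's assigned pool model 'gpt-oss-20b'"
	fmt.Println(recordInferenceError(r, refusal), r.jobFailures)
	fmt.Println(recordInferenceError(r, "decoder crashed"), r.jobFailures)
}
```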
