Cluster inference: pipeline parallelism + continuous batching across Macs #1
Draft
crypt0fairy wants to merge 2 commits into
Adds multi-device clustering so a co-located, individually-attested Apple
Silicon cluster serves one logical provider, splitting a model layer-wise
across nodes over an encrypted MLX ring — without breaking Darkbloom's
operator-blind guarantee.
Core (provider-swift/Sources/ProviderCore/Cluster/):
- Trust: ClusterRoster (coordinator-signed membership, min-member trust),
ClusterHandshake (SIGMA pairwise mutual auth over the roster),
ClusterLinkCrypto (HKDF directional keys + per-token ChaCha20-Poly1305,
AAD-bound to cluster/request/layer/seq).
- Planning: LayerPartition (memory-weighted largest-remainder split),
ClusterPlan (topology from [cluster] config).
- Transport: MLXDistributed (Swift binding over mlx-c distributed collectives),
MLXRingEnvironment, ActivationCodec (tensor<->sealed bytes).
- Decode: ClusterPipeline (the verified lockstep ring loop), ClusterServer
(control-round request broadcast), DistributedInferenceEngine, the
PipelineModelShard seam + the GPT-OSS shard adapter.
- Continuous batching: ClusterBatch{Scheduler,Server,Pipeline,Control}
(~2.5x aggregate throughput, validated cross-Mac).
- Robustness: peer control-loop backs off and exits on a dead ring instead of
busy-spinning (fixes an orphaned peer pegging a CPU core for hours).
Coordinator (Go): per-member cluster attestation relay + verify
(cluster trust = min member), cluster-provider registers as one mlx-swift
provider, dev local-cluster switches.
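
The "cluster trust = min member" rule can be sketched in Go as below. This is a minimal illustration, not the coordinator's actual code: the `TrustLevel` type, its three levels, and `clusterTrust` are hypothetical, and treating an empty cluster as untrusted is an assumption of the sketch.

```go
package main

import "fmt"

// TrustLevel is a hypothetical ordered trust scale; higher is more trusted.
type TrustLevel int

const (
	TrustNone TrustLevel = iota
	TrustSelfSigned
	TrustHardware
)

// clusterTrust returns the lowest trust level among the members, so a single
// weakly attested node lowers the trust of the whole cluster.
// An empty cluster is treated as untrusted (an assumption of this sketch).
func clusterTrust(members []TrustLevel) TrustLevel {
	if len(members) == 0 {
		return TrustNone
	}
	lowest := members[0]
	for _, t := range members[1:] {
		if t < lowest {
			lowest = t
		}
	}
	return lowest
}

func main() {
	members := []TrustLevel{TrustHardware, TrustSelfSigned, TrustHardware}
	fmt.Println(clusterTrust(members) == TrustSelfSigned) // prints true
}
```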
Executables: cluster-run, cluster-provider (+ --peer), and bench/smoke
harnesses (solo-bench, comms-bench, batch-*).
Docs: clustering.md, cluster-node-handshake.md, cluster-benchmark.md,
cluster-tensor-expert-parallel.md (incl. pipeline/TP/RDMA diagrams),
cluster-perf-backlog.md, jaccl-rdma-readiness.md.
Cluster model support: GPT-OSS (Gemma 4 added in the following commit).
Submodules point at the ring-enabled mlx-swift + pipeline-shard
mlx-swift-lm fork branches.
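
For readers unfamiliar with the memory-weighted largest-remainder split that LayerPartition is described as using, here is a minimal Go sketch of the general method. It is not the PR's Swift code: `splitLayers` and the memory figures are hypothetical, and the real planner may weight or constrain nodes differently.

```go
package main

import (
	"fmt"
	"sort"
)

// splitLayers divides `layers` among nodes in proportion to their memory
// weights. Each node first gets the floor of its exact quota; leftover layers
// go to the nodes with the largest remainders. Weights must be non-negative.
func splitLayers(layers int, memGB []int) []int {
	counts := make([]int, len(memGB))
	total := 0
	for _, m := range memGB {
		total += m
	}
	if total <= 0 {
		return counts
	}
	remainders := make([]int, len(memGB))
	assigned := 0
	for i, m := range memGB {
		counts[i] = layers * m / total     // floor of the exact quota
		remainders[i] = layers * m % total // fractional part, scaled by total
		assigned += counts[i]
	}
	order := make([]int, len(memGB))
	for i := range order {
		order[i] = i
	}
	// Stable sort: ties go to the earlier node in the ring.
	sort.SliceStable(order, func(a, b int) bool {
		return remainders[order[a]] > remainders[order[b]]
	})
	for k := 0; k < layers-assigned; k++ {
		counts[order[k]]++
	}
	return counts
}

func main() {
	// Hypothetical memory sizes, not the machines used in the PR.
	fmt.Println(splitLayers(30, []int{96, 64}))     // prints [18 12]
	fmt.Println(splitLayers(30, []int{64, 36, 24})) // prints [15 9 6]
}
```

Integer arithmetic avoids floating-point rounding when comparing remainders, and the result always sums to the requested layer count.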
Wires the Gemma 4 pipeline shard (mlx-swift-lm) into the cluster so a Gemma 4
text model runs across the ring through the same engine/transport/crypto path
as GPT-OSS.
- Gemma4ShardAdapter: bridges Gemma4PipelineShard to PipelineModelShard.
- Dispatch: ClusterHeadBringup + cluster-run select the gemma4 shard by
model_type; configInt now reads text-tower dims from the nested `text_config`
(Gemma 4's multimodal config layout) with a top-level fallback.
- gemma4-shard-smoke: synthetic monolithic-vs-shard correctness oracle
(0.000000 logit diff; covers MoE experts, k_eq_v full-attention, tied
embeddings, final-logit softcap).
Validated end-to-end: gemma-4-26B-A4B-it-qat-4bit sharded across two Macs over
Thunderbolt (18/12 layer split, encrypted ring, coherent generation).
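
The nested-then-top-level lookup described for configInt can be illustrated with a short Go sketch. This is not the PR's Swift function: the Go `configInt` below, the `num_hidden_layers` key, and the sample values are stand-ins chosen for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// configInt looks up an integer key, preferring the nested "text_config"
// object and falling back to the top level. The bool reports whether the key
// was found as a number in either place.
func configInt(raw []byte, key string) (int, bool) {
	var cfg map[string]any
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return 0, false
	}
	if nested, ok := cfg["text_config"].(map[string]any); ok {
		if v, ok := nested[key].(float64); ok { // JSON numbers decode as float64
			return int(v), true
		}
	}
	if v, ok := cfg[key].(float64); ok {
		return int(v), true
	}
	return 0, false
}

func main() {
	// Illustrative configs; the values are made up for the example.
	nested := []byte(`{"model_type":"gemma4","text_config":{"num_hidden_layers":30}}`)
	flat := []byte(`{"model_type":"gpt_oss","num_hidden_layers":24}`)

	n, ok := configInt(nested, "num_hidden_layers")
	fmt.Println(n, ok) // prints 30 true
	n, ok = configInt(flat, "num_hidden_layers")
	fmt.Println(n, ok) // prints 24 true
}
```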
Review-only PR against the fork's `master` to inspect the diff before targeting `Layr-Labs/d-inference`.

Adds the ability to run a model across two Macs (for capacity: this fits models that fit on neither node alone), with continuous batching for throughput. Depends on the two submodule fork PRs (mlx-swift #1, mlx-swift-lm #1); this branch bumps those pointers.
Core
- `ProviderCore/Cluster/`: ring bring-up + memory-weighted layer split, shard load by `model_type`, ephemeral X25519 sessions, per-hop ChaCha20-Poly1305 encrypted activations, proven serial (B=1) decode loop.
- `ClusterBatch{Control,Pipeline,Server,Scheduler}`: per-step composition control protocol (admit/evict in lockstep), batched `[B,hidden]` ring decode, per-row routing.
- `[cluster].batched` flag, default off (serial path unchanged).
- `cluster-provider` (head serving a coordinator) + `cluster-run` (bring-up harness).

Results
Notes
`[cluster].batched` defaults off → zero behavior change for non-cluster providers.
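
To make the per-step composition idea (admit/evict in lockstep) more concrete, here is a small Go sketch of how a head node might plan a step and how every node could apply it identically. It is an illustration only, not the ClusterBatch protocol itself: `Step`, `planStep`, `applyStep`, and the integer request IDs are hypothetical.

```go
package main

import "fmt"

// Step is a hypothetical control message: which request rows leave the batch
// and which join it before the next decode step.
type Step struct {
	Evict []int
	Admit []int
}

// planStep evicts finished rows, then admits pending requests into the freed
// capacity. It returns the step and the requests still waiting.
func planStep(rows []int, finished map[int]bool, pending []int, maxRows int) (Step, []int) {
	var step Step
	live := 0
	for _, id := range rows {
		if finished[id] {
			step.Evict = append(step.Evict, id)
		} else {
			live++
		}
	}
	free := maxRows - live
	if free < 0 {
		free = 0
	}
	if free > len(pending) {
		free = len(pending)
	}
	step.Admit = append(step.Admit, pending[:free]...)
	return step, pending[free:]
}

// applyStep is run by every node with the same Step, so all nodes hold the
// same row order for the batched [B,hidden] activations.
func applyStep(rows []int, step Step) []int {
	evicted := make(map[int]bool, len(step.Evict))
	for _, id := range step.Evict {
		evicted[id] = true
	}
	next := make([]int, 0, len(rows)+len(step.Admit))
	for _, id := range rows {
		if !evicted[id] {
			next = append(next, id)
		}
	}
	return append(next, step.Admit...)
}

func main() {
	rows := []int{1, 2}
	step, waiting := planStep(rows, map[int]bool{1: true}, []int{3, 4}, 2)
	headRows := applyStep(rows, step)
	peerRows := applyStep(rows, step) // the peer applies the identical step
	fmt.Println(headRows, peerRows, waiting)
}
```

Because only the head plans and every node applies the same message, batch composition cannot drift between ring members in this sketch.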