Cluster inference: pipeline parallelism + continuous batching across Macs#1

Draft
crypt0fairy wants to merge 2 commits into master from feat/cluster-pipeline-parallelism

@crypt0fairy

Owner

Review-only PR against the fork's master to inspect the diff before targeting Layr-Labs/d-inference.

Adds the ability to run a model across two Macs, which provides the capacity to serve models that fit on neither node alone, plus continuous batching for throughput. Depends on the two submodule fork PRs (mlx-swift #1, mlx-swift-lm #1); this branch bumps those pointers.

Core

  • Cluster runtime (ProviderCore/Cluster/): ring bringup + memory-weighted layer split, shard load by model_type, ephemeral X25519 sessions, per-hop ChaCha20-Poly1305 encrypted activations, proven serial (B=1) decode loop.
  • Continuous batching: ClusterBatch{Control,Pipeline,Server,Scheduler} — per-step composition control protocol (admit/evict in lockstep), batched [B,hidden] ring decode, per-row routing. [cluster].batched flag, default off (serial path unchanged).
  • Coordinator: per-member cluster attestation (head relays each member's SE attestation; trust = min of members).
  • cluster-provider (head serving a coordinator) + cluster-run (bring-up harness).
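As a rough illustration of the per-hop key handling named above (and of the "HKDF directional keys" and "AAD-bound to cluster/request/layer/seq" wording in the commit message), here is a minimal Python sketch. It is not the Swift `ClusterLinkCrypto` code: the `info` labels, the use of the cluster id as salt, and the AAD field widths are all assumptions made for this example. The AEAD sealing step is omitted because the Python standard library has no ChaCha20-Poly1305.

```python
import hashlib
import hmac
import struct


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand it."""
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def directional_keys(shared_secret: bytes, cluster_id: bytes):
    """Derive one key per direction from a single shared secret.

    Separate keys mean the two ends of a hop never encrypt under the same
    (key, nonce) pair. The labels below are hypothetical.
    """
    forward = hkdf_sha256(shared_secret, cluster_id, b"hop:a->b")
    reverse = hkdf_sha256(shared_secret, cluster_id, b"hop:b->a")
    return forward, reverse


def activation_aad(cluster_id: bytes, request_id: int, layer: int, seq: int) -> bytes:
    """Assumed AAD layout: length-prefixed cluster id, then request/layer/seq."""
    header = struct.pack(">H", len(cluster_id)) + cluster_id
    return header + struct.pack(">QIQ", request_id, layer, seq)
```

Binding the AAD to the request, layer, and sequence number means a sealed activation replayed at a different position in the stream fails authentication, even though the ciphertext itself is unmodified.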

Results

  • Models too big for one Mac run across the pair (capacity).
  • Continuous batching: ~2.5× aggregate throughput validated across two physical Macs over Thunderbolt, stable across drain/readmit.
  • Honest framing: at batch=1 pipeline is slower than single-node (it's for capacity); batching is the throughput lever.
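The drain/readmit behaviour rests on the per-step composition control described under Core: rows are evicted and admitted between ring steps, and every node applies the same change. A toy Python sketch of that bookkeeping follows. The class name, message shape, and FIFO admission policy are assumptions for illustration, not the `ClusterBatchScheduler` API.

```python
from dataclasses import dataclass, field


@dataclass
class BatchComposer:
    """Toy per-step composition: evict finished rows, then admit waiting requests."""

    max_batch: int
    rows: list = field(default_factory=list)     # request id occupying each batch row
    waiting: list = field(default_factory=list)  # FIFO queue of request ids

    def step(self, finished: set) -> dict:
        """Compute the composition change to apply before the next decode step."""
        evict = [r for r in self.rows if r in finished]
        kept = [r for r in self.rows if r not in finished]
        free = self.max_batch - len(kept)
        admit, self.waiting = self.waiting[:free], self.waiting[free:]
        self.rows = kept + admit
        # In a real cluster the head would broadcast this message so that all
        # nodes change the batch in lockstep; here it is simply returned.
        return {"evict": evict, "admit": admit, "rows": list(self.rows)}
```

For example, with `max_batch=2` and three queued requests, the first step admits two; once one of them finishes, the next step evicts it and admits the third into the freed slot.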

Notes

  • This branch currently also contains benchmark harnesses (solo/loop/batch/comms/spec-bench) and research docs (TP/EP design, JACCL/RDMA readiness, perf backlog). Decision pending: whether to carve these into a follow-up PR so that the upstream submission is core-only. Reviewing here first to decide the split.
  • [cluster].batched defaults off → zero behavior change for non-cluster providers.

crypt0fairy force-pushed the feat/cluster-pipeline-parallelism branch 7 times, most recently from 9a2c769 to 20ef807 (June 26, 2026 18:58)
…ster

Adds multi-device clustering so a co-located, individually-attested Apple
Silicon cluster serves one logical provider, splitting a model layer-wise
across nodes over an encrypted MLX ring — without breaking Darkbloom's
operator-blind guarantee.

Core (provider-swift/Sources/ProviderCore/Cluster/):
- Trust: ClusterRoster (coordinator-signed membership, min-member trust),
  ClusterHandshake (SIGMA pairwise mutual auth over the roster),
  ClusterLinkCrypto (HKDF directional keys + per-token ChaCha20-Poly1305,
  AAD-bound to cluster/request/layer/seq).
- Planning: LayerPartition (memory-weighted largest-remainder split),
  ClusterPlan (topology from [cluster] config).
- Transport: MLXDistributed (Swift binding over mlx-c distributed collectives),
  MLXRingEnvironment, ActivationCodec (tensor<->sealed bytes).
- Decode: ClusterPipeline (the verified lockstep ring loop), ClusterServer
  (control-round request broadcast), DistributedInferenceEngine, the
  PipelineModelShard seam + the GPT-OSS shard adapter.
- Continuous batching: ClusterBatch{Scheduler,Server,Pipeline,Control}
  (~2.5x aggregate throughput, validated cross-Mac).
- Robustness: peer control-loop backs off and exits on a dead ring instead of
  busy-spinning (fixes an orphaned peer pegging a CPU core for hours).
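The "memory-weighted largest-remainder split" named under Planning can be sketched as below. This is a minimal Python illustration of the general largest-remainder method, not the Swift `LayerPartition` code. The function name and the memory weights in the usage note are hypothetical, and a real planner may add constraints (for example, a minimum of one layer per node) that this sketch does not enforce.

```python
def split_layers(num_layers: int, memory_weights: list) -> list:
    """Assign layers to nodes in proportion to memory, largest remainder first."""
    total = sum(memory_weights)
    # Integer arithmetic: floor of each node's proportional quota, plus remainder.
    counts = [num_layers * w // total for w in memory_weights]
    remainders = [num_layers * w % total for w in memory_weights]
    leftover = num_layers - sum(counts)
    # Hand out the leftover layers to the nodes with the largest remainders.
    # Ties go to the earlier node because Python's sort is stable.
    order = sorted(range(len(counts)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts
```

With illustrative weights of 48 and 32 (a 3:2 ratio), `split_layers(30, [48, 32])` returns `[18, 12]`. The counts always sum to `num_layers`, so no layer is left unassigned.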

Coordinator (Go): per-member cluster attestation relay + verify
(cluster trust = min member), cluster-provider registers as one mlx-swift
provider, dev local-cluster switches.
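The "cluster trust = min member" rule is simple enough to state in a few lines. The sketch below is in Python rather than the coordinator's Go, and the trust level names are hypothetical placeholders for whatever ordering the coordinator actually uses.

```python
from enum import IntEnum


class TrustLevel(IntEnum):
    """Hypothetical ordered trust levels; higher is more trusted."""
    NONE = 0
    SELF_SIGNED = 1
    HARDWARE = 2


def cluster_trust(member_levels: list) -> TrustLevel:
    """A cluster is only as trusted as its least-trusted verified member."""
    if not member_levels:
        return TrustLevel.NONE
    return min(member_levels)
```

Taking the minimum means a single weakly attested member lowers the trust of the whole logical provider, which is the conservative choice when every member sees activations.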

Executables: cluster-run, cluster-provider (+ --peer), and bench/smoke
harnesses (solo-bench, comms-bench, batch-*).

Docs: clustering.md, cluster-node-handshake.md, cluster-benchmark.md,
cluster-tensor-expert-parallel.md (incl. pipeline/TP/RDMA diagrams),
cluster-perf-backlog.md, jaccl-rdma-readiness.md.

Cluster model support: GPT-OSS (Gemma 4 added in the following commit).
Submodules point at the ring-enabled mlx-swift + pipeline-shard
mlx-swift-lm fork branches.

Wires the Gemma 4 pipeline shard (mlx-swift-lm) into the cluster so a Gemma 4
text model runs across the ring through the same engine/transport/crypto path
as GPT-OSS.

- Gemma4ShardAdapter: bridges Gemma4PipelineShard to PipelineModelShard.
- Dispatch: ClusterHeadBringup + cluster-run select the gemma4 shard by
  model_type; configInt now reads text-tower dims from the nested `text_config`
  (Gemma 4's multimodal config layout) with a top-level fallback.
- gemma4-shard-smoke: synthetic monolithic-vs-shard correctness oracle
  (0.000000 logit diff; covers MoE experts, k_eq_v full-attention, tied
  embeddings, final-logit softcap).
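The nested `text_config` lookup with a top-level fallback, described in the Dispatch item above, can be sketched as follows. This is a Python illustration of the lookup order only, not the Swift `configInt` implementation. The key name `num_hidden_layers` and the values used in the usage note are assumptions.

```python
def config_int(config: dict, key: str, default=None):
    """Read an integer from config["text_config"], else from the top level."""
    text_config = config.get("text_config")
    # Multimodal layouts nest the text-tower dimensions one level down.
    if isinstance(text_config, dict) and key in text_config:
        return int(text_config[key])
    # Text-only layouts keep the same keys at the top level.
    if key in config:
        return int(config[key])
    return default
```

With this order, `config_int({"text_config": {"num_hidden_layers": 30}}, "num_hidden_layers")` and `config_int({"num_hidden_layers": 30}, "num_hidden_layers")` both return `30`, so the same dispatch code handles nested and flat configs.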

Validated end-to-end: gemma-4-26B-A4B-it-qat-4bit sharded across two Macs over
Thunderbolt (18/12 layer split, encrypted ring, coherent generation).
crypt0fairy force-pushed the feat/cluster-pipeline-parallelism branch from 20ef807 to 900a1db (June 26, 2026 19:21)