Cluster inference: pipeline parallelism + continuous batching across Macs by crypt0fairy · Pull Request #1 · crypt0fairy/d-inference

crypt0fairy · 2026-06-24T17:46:40Z

Review-only PR against the fork's master to inspect the diff before targeting Layr-Labs/d-inference.

Adds the ability to run a model across two Macs. This is a capacity feature: it serves models that fit on neither node alone. Continuous batching is layered on top for throughput. Depends on the two submodule fork PRs (mlx-swift #1, mlx-swift-lm #1); this branch bumps those pointers.

Core

Cluster runtime (ProviderCore/Cluster/): ring bringup + memory-weighted layer split, shard load by model_type, ephemeral X25519 sessions, per-hop ChaCha20-Poly1305 encrypted activations, proven serial (B=1) decode loop.
Continuous batching: ClusterBatch{Control,Pipeline,Server,Scheduler} — per-step composition control protocol (admit/evict in lockstep), batched [B,hidden] ring decode, per-row routing. [cluster].batched flag, default off (serial path unchanged).
Coordinator: per-member cluster attestation (head relays each member's SE attestation; trust = min of members).
Executables: cluster-provider (the head node, serving a coordinator) and cluster-run (bring-up harness).
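The "trust = min of members" rule in the Coordinator item can be illustrated with a minimal Go sketch. The `TrustLevel` scale, the `Member` type, and `ClusterTrust` are hypothetical names chosen for illustration; the PR does not show the coordinator's actual types or trust levels.

```go
package main

import "errors"

// TrustLevel is a hypothetical ordered trust scale: a higher value means
// a stronger attestation. The real coordinator's levels are not shown in the PR.
type TrustLevel int

const (
	TrustNone TrustLevel = iota
	TrustSelfSigned
	TrustHardware
)

// Member is a hypothetical record of one node's verified attestation result.
type Member struct {
	ID    string
	Trust TrustLevel
}

// ErrEmptyCluster is returned when there are no members to evaluate.
var ErrEmptyCluster = errors.New("cluster has no members")

// ClusterTrust returns the lowest trust level among the members, so the
// cluster is never rated higher than its weakest node.
func ClusterTrust(members []Member) (TrustLevel, error) {
	if len(members) == 0 {
		return TrustNone, ErrEmptyCluster
	}
	lowest := members[0].Trust
	for _, m := range members[1:] {
		if m.Trust < lowest {
			lowest = m.Trust
		}
	}
	return lowest, nil
}
```

Under these assumptions, a cluster with one `TrustHardware` member and one `TrustSelfSigned` member is rated `TrustSelfSigned`, and an empty member list is rejected rather than treated as trusted.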

Results

Models too big for one Mac run across the pair (capacity).
Continuous batching: ~2.5× aggregate throughput validated across two physical Macs over Thunderbolt, stable across drain/readmit.
Honest framing: at batch=1 the pipeline is slower than single-node inference (pipeline parallelism here buys capacity, not speed); batching is the throughput lever.
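To make the drain/readmit behavior concrete, here is a minimal Go sketch of per-step batch composition: each decode step evicts finished rows, admits queued requests into the freed rows, then advances every active row by one token. The `Request` and `Scheduler` types are hypothetical and are not the PR's `ClusterBatchScheduler`; the sketch also leaves out the lockstep control protocol that keeps ring members in agreement about the composition.

```go
package main

// Request is a hypothetical generation request: an ID plus the number of
// tokens still to be produced (assumed to be at least 1 on submission).
type Request struct {
	ID        string
	Remaining int
}

// Scheduler holds up to MaxBatch active rows and a FIFO queue of waiting requests.
type Scheduler struct {
	MaxBatch int
	waiting  []*Request
	active   []*Request
}

// Submit queues a request; it joins the batch at a later Step when a row is free.
func (s *Scheduler) Submit(r *Request) {
	s.waiting = append(s.waiting, r)
}

// Step runs one decode step and returns the IDs of the rows in this step's batch.
func (s *Scheduler) Step() []string {
	// Evict rows that finished on an earlier step (in-place filter).
	kept := s.active[:0]
	for _, r := range s.active {
		if r.Remaining > 0 {
			kept = append(kept, r)
		}
	}
	s.active = kept

	// Admit waiting requests into free rows, oldest first.
	for len(s.active) < s.MaxBatch && len(s.waiting) > 0 {
		s.active = append(s.active, s.waiting[0])
		s.waiting = s.waiting[1:]
	}

	// "Decode" one token per active row; a real step would run the batched ring pass.
	ids := make([]string, 0, len(s.active))
	for _, r := range s.active {
		r.Remaining--
		ids = append(ids, r.ID)
	}
	return ids
}
```

With `MaxBatch` set to 2 and three requests queued, the third request joins as soon as the first finishes, so the batch stays full without waiting for the whole batch to drain.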

Notes

This branch currently also contains benchmark harnesses (solo/loop/batch/comms/spec-bench) and research docs (TP/EP design, JACCL/RDMA readiness, perf backlog). Decision pending: carve these into a follow-up PR for the upstream submission (core-only). Reviewing here first to decide the split.
[cluster].batched defaults off → zero behavior change for non-cluster providers.

…ster

Adds multi-device clustering so a co-located, individually-attested Apple Silicon cluster serves one logical provider, splitting a model layer-wise across nodes over an encrypted MLX ring, without breaking Darkbloom's operator-blind guarantee.

Core (provider-swift/Sources/ProviderCore/Cluster/):
- Trust: ClusterRoster (coordinator-signed membership, min-member trust), ClusterHandshake (SIGMA pairwise mutual auth over the roster), ClusterLinkCrypto (HKDF directional keys + per-token ChaCha20-Poly1305, AAD-bound to cluster/request/layer/seq).
- Planning: LayerPartition (memory-weighted largest-remainder split), ClusterPlan (topology from [cluster] config).
- Transport: MLXDistributed (Swift binding over mlx-c distributed collectives), MLXRingEnvironment, ActivationCodec (tensor<->sealed bytes).
- Decode: ClusterPipeline (the verified lockstep ring loop), ClusterServer (control-round request broadcast), DistributedInferenceEngine, the PipelineModelShard seam + the GPT-OSS shard adapter.
- Continuous batching: ClusterBatch{Scheduler,Server,Pipeline,Control} (~2.5x aggregate throughput, validated cross-Mac).
- Robustness: peer control-loop backs off and exits on a dead ring instead of busy-spinning (fixes an orphaned peer pegging a CPU core for hours).

Coordinator (Go): per-member cluster attestation relay + verify (cluster trust = min member), cluster-provider registers as one mlx-swift provider, dev local-cluster switches.

Executables: cluster-run, cluster-provider (+ --peer), and bench/smoke harnesses (solo-bench, comms-bench, batch-*).

Docs: clustering.md, cluster-node-handshake.md, cluster-benchmark.md, cluster-tensor-expert-parallel.md (incl. pipeline/TP/RDMA diagrams), cluster-perf-backlog.md, jaccl-rdma-readiness.md.

Cluster model support: GPT-OSS (Gemma 4 added in the following commit).

Submodules point at the ring-enabled mlx-swift + pipeline-shard mlx-swift-lm fork branches.
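The memory-weighted largest-remainder split named under Planning can be sketched as follows. This is an illustrative Go version, not the Swift `LayerPartition` code: `SplitLayers` and its parameters are hypothetical, and the tie-breaking rule (lower node index wins) is an assumption.

```go
package main

import (
	"errors"
	"sort"
)

// SplitLayers divides totalLayers among nodes in proportion to memWeights
// (for example, usable memory per node in GiB) using the largest-remainder
// method: each node gets the floor of its exact share, then the leftover
// layers go to the nodes with the largest fractional remainders.
// Weights are assumed small enough that totalLayers*weight fits in a uint64.
func SplitLayers(totalLayers int, memWeights []uint64) ([]int, error) {
	if totalLayers < 0 || len(memWeights) == 0 {
		return nil, errors.New("need a non-negative layer count and at least one node")
	}
	var sum uint64
	for _, w := range memWeights {
		sum += w
	}
	if sum == 0 {
		return nil, errors.New("memory weights must not all be zero")
	}

	counts := make([]int, len(memWeights))
	rems := make([]uint64, len(memWeights))
	order := make([]int, len(memWeights))
	assigned := 0
	for i, w := range memWeights {
		prod := uint64(totalLayers) * w
		counts[i] = int(prod / sum) // floor of the exact share
		rems[i] = prod % sum        // remainder, in units of 1/sum
		order[i] = i
		assigned += counts[i]
	}

	// Largest remainder first; the stable sort keeps lower indices ahead on ties.
	sort.SliceStable(order, func(a, b int) bool {
		return rems[order[a]] > rems[order[b]]
	})
	// The leftover is always smaller than the node count, so order[k] is in range.
	for k := 0; k < totalLayers-assigned; k++ {
		counts[order[k]]++
	}
	return counts, nil
}
```

As a usage example, `SplitLayers(30, []uint64{3, 2})` yields `[18 12]`, and `SplitLayers(10, []uint64{1, 1, 1})` yields `[4 3 3]`. The weights here are made up for illustration; the PR does not state the memory figures of the Macs used.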

Wires the Gemma 4 pipeline shard (mlx-swift-lm) into the cluster so a Gemma 4 text model runs across the ring through the same engine/transport/crypto path as GPT-OSS.

- Gemma4ShardAdapter: bridges Gemma4PipelineShard to PipelineModelShard.
- Dispatch: ClusterHeadBringup + cluster-run select the gemma4 shard by model_type; configInt now reads text-tower dims from the nested `text_config` (Gemma 4's multimodal config layout) with a top-level fallback.
- gemma4-shard-smoke: synthetic monolithic-vs-shard correctness oracle (0.000000 logit diff; covers MoE experts, k_eq_v full-attention, tied embeddings, final-logit softcap).

Validated end-to-end: gemma-4-26B-A4B-it-qat-4bit sharded across two Macs over Thunderbolt (18/12 layer split, encrypted ring, coherent generation).
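The nested-then-top-level lookup described for configInt can be sketched like this. It is a hypothetical Go rendering (the PR's helper lives in the Swift provider), and the key `num_hidden_layers` is an assumed example of a text-tower dimension, not a field confirmed by the PR.

```go
package main

import "encoding/json"

// ConfigInt looks up an integer field in a model config JSON document.
// It prefers the value nested under "text_config" (the multimodal layout)
// and falls back to the same key at the top level (the flat layout).
// The second return value is false if the key is absent or not an integer.
func ConfigInt(raw []byte, key string) (int, bool) {
	var top map[string]any
	if err := json.Unmarshal(raw, &top); err != nil {
		return 0, false
	}
	if nested, ok := top["text_config"].(map[string]any); ok {
		if v, ok := asInt(nested[key]); ok {
			return v, true
		}
	}
	return asInt(top[key])
}

// asInt converts a decoded JSON number to int, rejecting non-numbers and
// values with a fractional part (encoding/json decodes numbers as float64).
func asInt(v any) (int, bool) {
	f, ok := v.(float64)
	if !ok || f != float64(int(f)) {
		return 0, false
	}
	return int(f), true
}
```

With this shape, a flat text-only config and a multimodal config that nests its text dimensions both resolve through one call, which is what lets shard dispatch stay independent of the config layout.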

crypt0fairy force-pushed the feat/cluster-pipeline-parallelism branch 7 times, most recently from 9a2c769 to 20ef807, on June 26, 2026 18:58

crypt0fairy added 2 commits June 26, 2026 12:18

crypt0fairy force-pushed the feat/cluster-pipeline-parallelism branch from 20ef807 to 900a1db on June 26, 2026 19:21

crypt0fairy wants to merge 2 commits into master from feat/cluster-pipeline-parallelism
