ring: make cross-machine TCP ring join robust to launch skew by crypt0fairy · Pull Request #1 · crypt0fairy/mlx

crypt0fairy · 2026-06-24T17:57:58Z

Review-only PR against the fork's main to inspect the diff. (This is the Apple mlx C++ core, vendored by mlx-swift. Upstream of this fork is ml-explore/mlx, not Layr-Labs.)

These are the C++ ring-backend fixes that let two independently launched Macs reliably form a ring over TCP. (mlx-swift only bumps a pointer to this; the code lives here.)

ring/ring.cpp: CONN_ATTEMPTS 5 → 60. The 5×1s connect window was far too tight for nodes launched seconds apart with differing model-load times; they would abort with errno 60 (ETIMEDOUT on macOS) before the peer reached accept().
distributed/utils.cpp TCPSocket::connect: (1) close() the socket fd on a failed attempt. Upstream leaked the fd on every retry, piling up SYN_SENT/CLOSED sockets that confused the peer's single accept(). (2) Bound the retry backoff. Upstream doubled the wait unconditionally, so after a few misses a node slept 16s/32s while its peer sat ready, and the two fell permanently out of phase.
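
The two utils.cpp fixes can be sketched as a standalone POSIX retry loop. This is a minimal illustration, not the MLX code: `connect_with_retry` and all of its parameters are hypothetical names, and it opens a fresh socket for each attempt (one way to make the close safe), which may differ from how TCPSocket::connect manages its fd.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper (not part of MLX): try to connect to ip:port up to
// `attempts` times. Returns a connected fd owned by the caller, or -1 if
// every attempt failed.
int connect_with_retry(const char* ip, uint16_t port, int attempts,
                       int initial_wait_ms, int max_wait_ms) {
  assert(attempts > 0 && initial_wait_ms >= 0 && max_wait_ms >= 0);

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
    return -1;  // not a valid IPv4 literal
  }

  int wait_ms = std::min(initial_wait_ms, max_wait_ms);
  for (int attempt = 0; attempt < attempts; ++attempt) {
    // POSIX leaves a socket in an unspecified state after a failed
    // connect(), so this sketch creates a new one for every attempt.
    int fd = ::socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
      return -1;
    }
    if (::connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0) {
      return fd;
    }

    // Fix (1): release the fd before retrying instead of leaking it.
    ::close(fd);

    if (attempt + 1 < attempts) {
      std::this_thread::sleep_for(std::chrono::milliseconds(wait_ms));
      // Fix (2): still double the wait, but never beyond the cap.
      wait_ms = std::min(wait_ms * 2, max_wait_ms);
    }
  }
  return -1;
}
```

A caller mirroring the values in this PR might pass 60 attempts and a 2000 ms cap, for example `connect_with_retry("192.0.2.10", 5000, 60, 1000, 2000)`; the address, port, and 1000 ms initial wait are placeholders.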

Verified: 2-Mac ring join succeeds reliably; end-to-end sharded inference runs on top (see the d-inference and mlx-swift PRs).

Darkbloom runs the MLX ring backend across independently-started Macs (separate processes, separate machines). Three problems made the join flaky/impossible in that setting:

1. ring.cpp: CONN_ATTEMPTS was 5, so the connect retry window was only ~5s. Two nodes launched by hand / over SSH with differing shard-load and attestation startup times routinely missed it and aborted with errno 60. Widened to 60 attempts.
2. utils.cpp TCPSocket::connect leaked the socket fd on every failed attempt, piling up SYN_SENT/CLOSED sockets that confused the peer's single accept(). Now closes the fd before retrying.
3. utils.cpp used unbounded exponential backoff (wait <<= 1), so after a few misses a node slept 16s/32s while its peer sat ready; they fell out of phase and never rendezvoused. Capped the wait at 2s so retries stay frequent and in-phase across the whole window.

Verified on a 2-Mac cluster (32GB + 24GB) running sharded Mistral-24B-8bit end-to-end through the Darkbloom coordinator.
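
To make the backoff arithmetic concrete, the sketch below computes the per-attempt wait schedule with and without a cap. It assumes an initial wait of 1s, which is not stated above; it is chosen because doubling from 1s reaches 16s on the fifth wait. `backoff_schedule`, `total_ms`, and `kNoCap` are hypothetical helpers, not MLX functions.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <numeric>
#include <vector>

// Sentinel meaning "no cap" (assumption for this sketch only).
constexpr int64_t kNoCap = std::numeric_limits<int64_t>::max();

// Wait (in ms) used after each failed attempt: start at initial_ms and
// double every time, saturating at cap_ms.
std::vector<int64_t> backoff_schedule(int attempts, int64_t initial_ms,
                                      int64_t cap_ms) {
  std::vector<int64_t> waits;
  int64_t wait = std::min(initial_ms, cap_ms);
  for (int i = 0; i < attempts; ++i) {
    waits.push_back(wait);
    // Equivalent to min(wait * 2, cap_ms), written to avoid overflow.
    wait = (wait > cap_ms / 2) ? cap_ms : wait * 2;
  }
  return waits;
}

// Sum of all waits in a schedule, in ms.
int64_t total_ms(const std::vector<int64_t>& waits) {
  return std::accumulate(waits.begin(), waits.end(), int64_t{0});
}
```

Under those assumptions, `backoff_schedule(5, 1000, kNoCap)` yields waits of 1s, 2s, 4s, 8s, 16s, while `backoff_schedule(60, 1000, 2000)` yields 1s followed by 59 waits of 2s, so no single sleep exceeds 2s and a ready peer is retried at least every 2s.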

crypt0fairy force-pushed the darkbloom/ring-join-robustness branch from 7416676 to 19d5ac8 on June 24, 2026 18:09

crypt0fairy wants to merge 1 commit into main from darkbloom/ring-join-robustness
