ring: make cross-machine TCP ring join robust to launch skew #1

Draft
crypt0fairy wants to merge 1 commit into main from darkbloom/ring-join-robustness
@crypt0fairy

Owner

Review-only PR against the fork's main to inspect the diff. (This is the Apple mlx C++ core, vendored by mlx-swift. Upstream of this fork is ml-explore/mlx, not Layr-Labs.)

This PR carries the actual C++ ring-backend fixes that let two independently launched Macs reliably form a ring over TCP. (mlx-swift only bumps a pointer to this repository; the code lives here.)

  • ring/ring.cpp: CONN_ATTEMPTS 5 → 60. The 5×1s connect window was far too tight for nodes launched seconds apart with differing model-load times; they aborted with errno 60 (ETIMEDOUT on macOS) before the peer reached accept().
  • distributed/utils.cpp TCPSocket::connect: (1) close() the socket fd on a failed attempt. Upstream leaked the fd on every retry, piling up SYN_SENT/CLOSED sockets that confused the peer's single accept(). (2) Bound the retry backoff. Upstream doubled the wait unconditionally, so after a few misses a node slept 16s/32s while its peer sat ready, and the two fell permanently out of phase.
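
The two utils.cpp changes can be sketched roughly as follows. This is a minimal illustration, not the code in the diff: `connect_with_retry`, `next_wait_ms`, and the 250 ms starting wait are hypothetical names and values chosen for the example; only the 60 attempts and the 2s cap are taken from this PR's description.

```cpp
#include <algorithm>
#include <chrono>
#include <thread>

#include <sys/socket.h>
#include <unistd.h>

// Illustrative constants. kConnAttempts and kMaxWaitMs mirror the numbers in
// the PR text; kInitialWaitMs is an assumption made for this sketch.
constexpr int kConnAttempts = 60;
constexpr int kInitialWaitMs = 250;
constexpr int kMaxWaitMs = 2000;

// Double the wait, but never let it exceed the cap.
inline int next_wait_ms(int wait_ms, int cap_ms = kMaxWaitMs) {
  return std::min(wait_ms * 2, cap_ms);
}

// Try to connect up to `attempts` times.
// Returns a connected fd, or -1 if every attempt failed.
inline int connect_with_retry(const sockaddr* addr, socklen_t len,
                              int attempts = kConnAttempts,
                              int wait_ms = kInitialWaitMs) {
  for (int i = 0; i < attempts; ++i) {
    int fd = ::socket(addr->sa_family, SOCK_STREAM, 0);
    if (fd < 0) {
      return -1;  // could not even create a socket
    }
    if (::connect(fd, addr, len) == 0) {
      return fd;  // success: the caller owns the fd
    }
    ::close(fd);  // release the failed socket so nothing leaks across retries
    if (i + 1 < attempts) {
      std::this_thread::sleep_for(std::chrono::milliseconds(wait_ms));
      wait_ms = next_wait_ms(wait_ms);  // bounded backoff
    }
  }
  return -1;
}
```

A fresh socket is created for each attempt because a socket whose connect() has failed is not portably reusable, which is also why the failed one has to be closed explicitly.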

Verified: 2-Mac ring join succeeds reliably; end-to-end sharded inference runs on top (see the d-inference and mlx-swift PRs).

Darkbloom runs the MLX ring backend across independently-started Macs
(separate processes, separate machines). Three problems made the join
flaky/impossible in that setting:

1. ring.cpp: CONN_ATTEMPTS was 5, so the connect retry window was only ~5s.
   Two nodes launched by hand / over SSH with differing shard-load and
   attestation startup times routinely missed it and aborted with errno 60.
   Widened to 60 attempts.

2. utils.cpp TCPSocket::connect leaked the socket fd on every failed attempt,
   piling up SYN_SENT/CLOSED sockets that confused the peer's single accept().
   Now closes the fd before retrying.

3. utils.cpp used unbounded exponential backoff (wait <<= 1), so after a few
   misses a node slept 16s/32s while its peer sat ready — they fell out of
   phase and never rendezvoused. Capped the wait at 2s so retries stay frequent
   and in-phase across the whole window.
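
To make the phase problem in item 3 concrete, the sketch below compares the sleep schedule under plain doubling with the schedule under a 2s cap. The helper names and the 1s starting wait are assumptions for illustration; the actual initial wait in utils.cpp is not stated here.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Sleep durations (ms) before each of `retries` retries.
// cap_ms <= 0 means "no cap", i.e. plain doubling as in `wait <<= 1`.
inline std::vector<std::int64_t> wait_schedule(int retries,
                                               std::int64_t initial_ms,
                                               std::int64_t cap_ms) {
  std::vector<std::int64_t> waits;
  std::int64_t wait = cap_ms > 0 ? std::min(initial_ms, cap_ms) : initial_ms;
  for (int i = 0; i < retries; ++i) {
    waits.push_back(wait);
    wait <<= 1;  // exponential step
    if (cap_ms > 0) {
      wait = std::min(wait, cap_ms);  // bounded variant
    }
  }
  return waits;
}

// Total time spent sleeping across the schedule, in ms.
inline std::int64_t total_ms(const std::vector<std::int64_t>& waits) {
  return std::accumulate(waits.begin(), waits.end(), std::int64_t{0});
}
```

Under the assumed 1s start, `wait_schedule(6, 1000, 0)` yields waits of 1s, 2s, 4s, 8s, 16s, and 32s, so by the fifth and sixth retries a node is asleep for far longer than its peer needs to become ready. With `wait_schedule(6, 1000, 2000)` every wait after the first is 2s, so a node never goes more than 2s between attempts.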

Verified on a 2-Mac cluster (32GB + 24GB) running sharded Mistral-24B-8bit
end-to-end through the Darkbloom coordinator.

@crypt0fairy force-pushed the darkbloom/ring-join-robustness branch from 7416676 to 19d5ac8 on June 24, 2026 18:09