ring: make cross-machine TCP ring join robust to launch skew #1
Draft
crypt0fairy wants to merge 1 commit into
Darkbloom runs the MLX ring backend across independently started Macs (separate processes, separate machines). Three problems made the join flaky or impossible in that setting:

1. ring.cpp: CONN_ATTEMPTS was 5, so the connect retry window was only ~5s. Two nodes launched by hand or over SSH, with differing shard-load and attestation startup times, routinely missed it and aborted with errno 60. Widened to 60 attempts.
2. utils.cpp: TCPSocket::connect leaked the socket fd on every failed attempt, piling up SYN_SENT/CLOSED sockets that confused the peer's single accept(). It now closes the fd before retrying.
3. utils.cpp: the backoff was unbounded exponential (wait <<= 1), so after a few misses a node slept 16s/32s while its peer sat ready; the two fell out of phase and never rendezvoused. The wait is now capped at 2s so retries stay frequent and in phase across the whole window.

Verified on a 2-Mac cluster (32GB + 24GB) running sharded Mistral-24B-8bit end-to-end through the Darkbloom coordinator.
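The retry behavior described in the commit message can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in utils.cpp: the function name connect_with_retry, its parameters, and the IPv4-only address type are hypothetical, and only standard POSIX socket calls are used.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <algorithm>
#include <chrono>
#include <thread>

// Hypothetical helper (not the MLX API): try to connect to an IPv4 address,
// retrying with a capped exponential backoff.
// Returns a connected socket fd, or -1 if every attempt failed.
int connect_with_retry(const sockaddr_in& addr,
                       int attempts,
                       std::chrono::milliseconds delay,
                       std::chrono::milliseconds max_delay) {
  for (int i = 0; i < attempts; ++i) {
    int fd = ::socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
      return -1;  // could not even create a socket
    }
    if (::connect(fd, reinterpret_cast<const sockaddr*>(&addr),
                  sizeof(addr)) == 0) {
      return fd;  // success: the caller owns the fd
    }
    // Release the fd before retrying so failed attempts do not accumulate.
    ::close(fd);
    if (i + 1 < attempts) {
      std::this_thread::sleep_for(delay);
      // Double the delay but never exceed the cap, so retries stay frequent.
      delay = std::min(delay * 2, max_delay);
    }
  }
  return -1;
}
```

With the values named in the commit message, a caller would pass 60 attempts and a 2s cap; the 1s starting delay is an assumption.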
7416676 to
19d5ac8
Review-only PR against the fork's main to inspect the diff. (This is the Apple mlx C++ core, vendored by mlx-swift. Upstream of this fork is ml-explore/mlx, not Layr-Labs.)

The actual C++ ring-backend fixes that make two independently launched Macs reliably form a ring over TCP. (mlx-swift only bumps a pointer to this; the code lives here.)

ring/ring.cpp: CONN_ATTEMPTS 5 → 60. The 5×1s connect window was far too tight for nodes launched seconds apart with differing model-load times; they'd abort with errno 60 before the peer reached accept().

distributed/utils.cpp, TCPSocket::connect: (1) close() the socket fd on a failed attempt. Upstream leaked the fd on every retry, piling up SYN_SENT/CLOSED sockets that confused the peer's single accept(). (2) Bound the retry backoff. Upstream doubled the wait unconditionally, so after a few misses a node slept 16s/32s while its peer sat ready, falling permanently out of phase.

Verified: 2-Mac ring join succeeds reliably; end-to-end sharded inference runs on top (see the d-inference and mlx-swift PRs).
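To make the backoff arithmetic concrete, here is a small sketch that computes the wait before each retry with and without a cap. The helper name backoff_schedule and the 1000 ms starting wait are assumptions for illustration; only the doubling, the 2s cap, and the 60 attempts come from the description above.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: the wait (in milliseconds) used after each failed
// attempt when the wait doubles every time but is clamped to cap_ms.
std::vector<std::int64_t> backoff_schedule(int attempts,
                                           std::int64_t initial_ms,
                                           std::int64_t cap_ms) {
  std::vector<std::int64_t> waits;
  std::int64_t wait = std::min(initial_ms, cap_ms);
  for (int i = 0; i < attempts; ++i) {
    waits.push_back(wait);
    // Double without overflowing: once past cap/2, stay at the cap.
    wait = (wait > cap_ms / 2) ? cap_ms : wait * 2;
  }
  return waits;
}
```

Under these assumed values, backoff_schedule(6, 1000, 2000) yields 1000, 2000, 2000, 2000, 2000, 2000, whereas an effectively uncapped call such as backoff_schedule(6, 1000, 1 << 30) yields 1000, 2000, 4000, 8000, 16000, 32000, which matches the 16s/32s sleeps mentioned above. Summing the capped schedule over 60 attempts gives 119000 ms, so the retry window is roughly two minutes if the starting wait really is 1s.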