Cluster consensus#69025
Open
dwoz wants to merge 20 commits into saltstack:3008.x from
Conversation
Ports and adapts the Raft algorithm into salt.cluster.consensus, replacing shared-filesystem cluster coordination with a proper consensus protocol for master membership and metadata.

- salt/cluster/consensus/raft/: core Raft state machine (node, log, scheduler, util, chaos) with a synchronous callback surface for testability
- salt/cluster/consensus/rpc.py: pack/unpack helpers for cluster/raft/* event tags multiplexed over the existing cluster_pool_port
- salt/cluster/consensus/peer.py: SaltPeer (Peer interface over cluster channel pushers) and RaftDispatcher (routes inbound RPCs to Node)
- salt/channel/server.py: intercept cluster/raft/* tags in handle_pool_publish and dispatch to RaftDispatcher
- 172 unit and functional tests covering the Raft core, wire layer, election, log replication, fault tolerance, and dispatcher edge cases
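The tag-multiplexing idea behind the rpc.py helpers can be sketched in a few lines. This is an illustrative reimplementation, not Salt's actual code: the function names, the JSON encoding, and the payload shape are assumptions; only the `cluster/raft/*` tag prefix comes from the description above.

```python
import json

# Hypothetical pack/unpack helpers illustrating how Raft RPCs can be
# multiplexed over an existing event channel by encoding the RPC name
# into the event tag under a reserved prefix.
TAG_PREFIX = "cluster/raft/"

def pack_rpc(rpc_name, payload):
    """Build a (tag, data) pair for publication on the cluster channel."""
    return TAG_PREFIX + rpc_name, json.dumps(payload).encode()

def unpack_rpc(tag, data):
    """Return (rpc_name, payload) for cluster/raft/* tags, else None."""
    if not tag.startswith(TAG_PREFIX):
        return None  # not a Raft RPC; normal pool handling proceeds
    return tag[len(TAG_PREFIX):], json.loads(data)

tag, data = pack_rpc("append_entries", {"term": 3, "leader_id": "127.0.0.1"})
print(unpack_rpc(tag, data))
# ('append_entries', {'term': 3, 'leader_id': '127.0.0.1'})
```

A server-side dispatcher only has to check the prefix before routing, which is how the handle_pool_publish interception described above can coexist with ordinary pool traffic.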
Adds RaftService (salt/cluster/consensus/service.py) which owns the Node, AsyncTimeoutScheduler, SaltPeer list, and RaftDispatcher for one master process. Wires into _publish_daemon when cluster_id and cluster_peers are configured: builds peer pushers from the existing PublishServer list, constructs RaftService, attaches the dispatcher to the channel, starts the follower timer, and drives leader heartbeats via call_later. Also adds 17 functional tests covering construction, attach, lifecycle (start/stop), heartbeat behaviour, and an end-to-end three-node election.
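The call_later-driven heartbeat pattern mentioned above can be sketched as follows. This is a minimal stand-in, not RaftService itself: the class name, interval, and callback wiring are illustrative assumptions; the point is that each tick re-arms the next one so the leader keeps beating until stopped.

```python
import asyncio

# Minimal sketch of driving periodic leader heartbeats with
# loop.call_later: each tick sends a heartbeat and schedules itself
# again, so stopping just means cancelling the pending handle.
class HeartbeatDriver:
    def __init__(self, loop, interval, send_heartbeat):
        self.loop = loop
        self.interval = interval
        self.send_heartbeat = send_heartbeat
        self.running = False
        self._handle = None

    def start(self):
        self.running = True
        self._tick()

    def stop(self):
        self.running = False
        if self._handle is not None:
            self._handle.cancel()

    def _tick(self):
        if not self.running:
            return
        self.send_heartbeat()
        # re-arm: each tick schedules the next one
        self._handle = self.loop.call_later(self.interval, self._tick)

async def demo():
    loop = asyncio.get_running_loop()
    beats = []
    driver = HeartbeatDriver(loop, 0.01, lambda: beats.append(1))
    driver.start()
    await asyncio.sleep(0.05)
    driver.stop()
    return len(beats)

print(asyncio.run(demo()))
```

Because the re-arm happens inside the tick, a slow callback naturally delays the next heartbeat rather than letting ticks pile up.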
- tests/pytests/integration/cluster/test_raft_cluster.py: three slow_test integration tests that spin up real masters on 127.0.0.1/2/3 and verify the Raft service starts, elects exactly one leader, and re-elects after the leader is terminated.
- salt/cluster/consensus/service.py: use opts["interface"] as the Raft node-id so it matches cluster_peers keys and peer_pusher addressing; opts["id"] is the hostname, which remote masters cannot correlate.
- salt/channel/server.py: add _raft_dispatcher and _raft_service = None to __setstate__ so the deserialized channel object is always consistent.
- tests/pytests/functional/cluster/consensus/test_raft_service.py: update the _make_opts fixture to include both "id" and "interface" keys, matching the node_id change above.
- salt/cluster/consensus/storage.py (new): SaltStorage implements BaseStorage using salt.cache.Cache so any configured cache_driver (localfs today, mmapcache when it lands) is used automatically. All three durable Raft documents (state, log, snapshot) are stored in the bank cluster/consensus/<node_id>.
- salt/cluster/consensus/raft/log.py: remove JSONStorage entirely. Also drop the now-unused os and salt.utils.files imports.
- salt/cluster/consensus/raft/__init__.py: drop the JSONStorage re-export.
- salt/cluster/consensus/service.py: wire SaltStorage into RaftService.__init__ so the Node has durable persistence from the first election.
- tests: replace every JSONStorage(tmpdir) with a _storage(tmpdir) helper that builds a SaltStorage with a localfs cache; add four new SaltStorage unit tests; remove JSONStorage-specific tests.
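The bank-keyed storage idea can be sketched without pulling in salt.cache. In this hedged stand-in a plain dict plays the role of the cache backend; the class name, method names, and state document shape are assumptions, while the `cluster/consensus/<node_id>` bank naming follows the description above.

```python
import json

# Sketch of durable Raft state keyed into a cache "bank". A dict mapping
# (bank, key) -> str stands in for the real cache driver; the important
# property is that two instances pointed at the same cache and node_id
# see the same persisted state.
class BankStorage:
    def __init__(self, cache, node_id):
        self.cache = cache
        self.bank = f"cluster/consensus/{node_id}"

    def save_state(self, term, voted_for):
        # the "state" document: current term and vote, as JSON
        self.cache[(self.bank, "state")] = json.dumps(
            {"term": term, "voted_for": voted_for}
        )

    def load_state(self):
        raw = self.cache.get((self.bank, "state"))
        return json.loads(raw) if raw else {"term": 0, "voted_for": None}

cache = {}
BankStorage(cache, "127.0.0.1").save_state(5, "127.0.0.2")
# a fresh instance over the same cache recovers the persisted state
print(BankStorage(cache, "127.0.0.1").load_state())
# {'term': 5, 'voted_for': '127.0.0.2'}
```

Keying by node_id keeps several masters sharing one cache directory (as the integration fixtures do) from trampling each other's Raft documents.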
New nodes joining the Salt master cluster via the autoscale join protocol now enter the Raft cluster as non-voting learners rather than immediately competing for quorum. The leader replicates the log to learners and auto-promotes them once they are caught up, closing the split-brain window that existed when a joining master started with voting=True.

Changes
-------
salt/cluster/consensus/raft/node.py
- Node(voting=True): new parameter; voting=False marks a node as a learner that will never start a pre-vote or election.
- follower_timeout_callback: a learner re-arms its timer instead of calling start_pre_vote, guaranteeing it cannot become leader before promotion.
- append_entries_reply: fixed promotion — the leader now proposes the CONFIG entry *without* pre-flipping p.voting; the flip happens on commit/apply.
- on_config_change: updates self.voting when the local node_id appears in the voters/learners list; updates existing peers in place when no peer factory is registered (unit-test path).
- handle_append_entries: fixed dict-format entry parsing — entries arriving as _asdict() dicts over the wire now correctly extract cmd/type/term from dict keys rather than treating the whole dict as cmd.
- info(): exposes voting status.

salt/cluster/consensus/service.py
- RaftService._make_peer(addr, voting): peer factory wired into Node.register_peer_factory so Node.on_config_change can instantiate SaltPeers for new addresses in CONFIG entries.
- RaftService.notify_peer_joined(peer_addr): called when a new master completes the Salt cluster join handshake; adds the peer as a non-voting learner and initialises leader replication tracking.
- _peer_pushers is now a mutable dict copy so dynamically added pushers stay in sync between RaftService and RaftDispatcher.
- NodeState imported at module level; removed the inline import from _heartbeat_tick.
salt/channel/server.py
- handle_pool_publish: after writing the new peer's pub key from a cluster/peer/join-notify event, calls _raft_service.notify_peer_joined() so existing members add the joiner as a learner immediately.
- Restore the gen_token() method dropped during rebase conflict resolution.

Tests
-----
- tests/pytests/unit/cluster/consensus/test_raft_membership.py: 16 tests covering the learner no-election guarantee, quorum exclusion of non-voters, CONFIG-driven promotion, on_config_change self-voting, and all notify_peer_joined scenarios including leader replication tracking init.
- tests/pytests/functional/cluster/consensus/test_raft_learner.py: 4 functional tests with a live in-process Raft cluster: learner blocked from leadership, leader replicates to learner, full learner→voter promotion cycle, quorum commits without waiting for learner ack.
- tests/pytests/functional/cluster/consensus/test_raft_transport.py: updated test_config_entry_type_replicates to assert the now-correct behaviour — CONFIG entries on followers carry the right entry type from the wire rather than burying it in cmd.
Closes three gaps identified in the Raft correctness audit.

Dynamic join — joiner starts as learner (priority #2 & #6)
-----------------------------------------------------------
A master that joins via discover_peers starts _publish_daemon before the join handshake completes, so cluster_peers is empty and RaftService is not started. Once join-reply arrives the MasterPubServerChannel now calls _start_raft_as_learner(), which:
- builds pushers to every peer received in join-reply (fixes gap #6: the dispatcher had no pushers to reply to the leader)
- starts RaftService with voting=False (fixes gap #2: the joiner was starting as a voter, risking split-brain before the leader promoted it)

salt/channel/server.py
- _start_raft_as_learner(known_peers): new helper started from the join-reply handler; builds peer pushers and calls RaftService(voting=False).
- join-reply handler: calls _start_raft_as_learner after event.set().

salt/cluster/consensus/service.py
- RaftService(voting=True): new optional parameter passed to Node so the node starts in learner mode when the master joined dynamically.

Founding CONFIG entry (priority #5)
-------------------------------------
The first elected leader of a fresh cluster now writes a CONFIG entry recording the founding voter set before any application entries. This makes cluster membership durably verifiable from the log alone.

salt/cluster/consensus/service.py
- _maybe_commit_founding_config(): proposes a {voters, learners} CONFIG entry when the leader's log is empty; no-ops if the log already has content.
- _heartbeat_tick: calls _maybe_commit_founding_config() on every tick while leader (the guard makes it idempotent).

Tests
-----
tests/pytests/functional/cluster/consensus/test_raft_service.py
- TestFoundingConfig: leader writes a CONFIG entry on an empty log; no-ops if the log is non-empty.
- TestDynamicJoinerAsLearner: RaftService(voting=False) produces a non-voting node; a learner-mode service does not start elections.
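The learner no-election guarantee described in these commits boils down to one guard in the follower timeout path. A minimal sketch, with illustrative names rather than Salt's actual Node class:

```python
# Sketch of the learner guarantee: a node constructed with voting=False
# re-arms its follower timer instead of starting a pre-vote, so it can
# never campaign for leadership before a committed CONFIG promotes it.
class LearnerAwareNode:
    def __init__(self, voting=True):
        self.voting = voting
        self.pre_votes_started = 0
        self.timer_armed = False

    def arm_follower_timer(self):
        # stand-in for scheduler.call_later(follower_timeout, ...)
        self.timer_armed = True

    def follower_timeout_callback(self):
        if not self.voting:
            self.arm_follower_timer()  # learner: keep waiting, never campaign
            return
        self.start_pre_vote()

    def start_pre_vote(self):
        self.pre_votes_started += 1

learner = LearnerAwareNode(voting=False)
voter = LearnerAwareNode(voting=True)
learner.follower_timeout_callback()
voter.follower_timeout_callback()
print(learner.pre_votes_started, voter.pre_votes_started)
# 0 1
```

Promotion then only requires flipping `voting` to True when a committed CONFIG entry lists the node as a voter; the next timeout behaves like any other follower's.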
The Raft log already replicated CONFIG entries; this commit turns those entries into durable, queryable cluster membership state.

MembershipStateMachine (salt/cluster/consensus/raft/log.py)
-----------------------------------------------------------
New class implementing BaseStateMachine for cluster membership.
- apply(cmd, index): updates internal voter/learner sets from a committed CONFIG entry; fires the optional on_change(voters, learners) callback.
- current_voters() / current_learners(): sorted lists of the committed membership — the authoritative query API.
- is_voter(node_id) / is_learner(node_id): membership predicates.
- membership_version: log index of the most recently committed CONFIG.
- get_snapshot() / restore_snapshot(): full snapshot/restore for log compaction support.

Node wiring (salt/cluster/consensus/raft/node.py)
-------------------------------------------------
- Node.__init__ now accepts a membership_sm parameter and always creates a MembershipStateMachine by default.
- register_membership_sm(sm): replaces the SM after construction.
- apply_entries: routes committed CONFIG entries through membership_sm.apply (updating the SM's committed view), then calls on_config_change to update Node.peers and Node.voting. A monotonic _applied_config_index guard prevents an older committed CONFIG from regressing the peer state when a leader has already eagerly applied a newer CONFIG via log_add.
- handle_append_entries: likewise tracks _applied_config_index when followers apply CONFIG entries on receipt.
- info(): includes a membership snapshot (voters, learners, version).

RaftService (salt/cluster/consensus/service.py)
-----------------------------------------------
- membership property: exposes node.membership_sm so callers can query current_voters() and current_learners() from committed log state.
Tests
-----
- test_raft_log.py: 11 new TestMembershipStateMachine tests covering initial state, apply, overwrite, is_voter/learner, the on_change callback, snapshot roundtrip, bytes restore, invalid restore, and repr.
- test_raft_node.py: 7 new TestNodeMembershipSM tests covering SM presence, the info() membership field, SM updated only on commit, learner recorded, version advance across multiple commits, register_membership_sm, snapshot reflecting committed state only, and the service.membership property.
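The core of a membership state machine like the one described above is small. This is a rough sketch under assumed shapes (a CONFIG payload of `{"voters": [...], "learners": [...]}`), not the actual MembershipStateMachine:

```python
# Sketch of a membership state machine: committed CONFIG entries become
# queryable voter/learner sets, versioned by the log index that last
# changed them.
class MembershipSM:
    def __init__(self):
        self.voters = set()
        self.learners = set()
        self.membership_version = -1  # log index of last applied CONFIG

    def apply(self, cmd, index):
        # cmd is a committed CONFIG payload; the sets are replaced
        # wholesale so the SM always mirrors the latest committed config
        self.voters = set(cmd.get("voters", ()))
        self.learners = set(cmd.get("learners", ()))
        self.membership_version = index

    def current_voters(self):
        return sorted(self.voters)

    def current_learners(self):
        return sorted(self.learners)

    def is_voter(self, node_id):
        return node_id in self.voters

sm = MembershipSM()
sm.apply({"voters": ["127.0.0.1", "127.0.0.2"], "learners": ["127.0.0.3"]}, index=7)
print(sm.current_voters(), sm.is_voter("127.0.0.3"), sm.membership_version)
# ['127.0.0.1', '127.0.0.2'] False 7
```

Because apply only runs for committed entries, queries against the SM reflect durable membership rather than in-flight proposals, which is what makes it safe to gate readiness on it.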
Add 88 new unit and functional tests covering previously untested paths across the full consensus package:

- log.py: LogEntryCommitStatus.set, LogEntry.__eq__ memoryview/str paths, cmd_view property, info() bytes decode, Log.__repr__/last_index alias, Log.add snapshot boundary / conflict truncation / gap-append with storage, Log.snapshot empty guard, Log.clear with storage, Log.has_entry None/-1 and snapshot boundary, Log.truncate_prefix with storage and before-snap no-op, max_log_size auto-snapshot trigger, all BaseStorage and BaseStateMachine abstract-method raise paths, CounterStateMachine restore_snapshot valid bytes / invalid bytes / non-bytes-non-dict paths, MembershipStateMachine LogEntryCommitStatus initial_node, SaltStorage dict-format log entry load and save_snapshot JSON encoding branch
- node.py: Peer.address property and pre_request_vote callback, ManualPeer.install_snapshot via handle_all_requests, storage tuple-format load (guards added to Log.__init__), request_votes dispatch to voting peers, native_engine become_leader/become_follower/send_heartbeat hooks, on_config_change factory learner path (new and existing peer), no-factory voter/learner flip, append() alias, handle_append_entries lca guard, install_snapshot snapshot-already-newer no-op, candidacy_timeout_callback stale-candidacy clear, commit_index setter snapshot trigger, append_entries_reply non-leader early return, start_pre_vote non-follower guard, pre_request_vote_reply high-term step-down
- service.py: _heartbeat_tick per-peer send exception swallowed, _heartbeat_tick outer exception swallowed, _maybe_commit_founding_config log_add exception caught
- storage.py: load_log dict-format branch, save_snapshot JSON encoding
- util.py: is_socket_closed all branches (empty data, BlockingIOError, ConnectionResetError, OSError 107, OSError 9, unknown OSError, generic exception, non-empty data), log_exceptions_async decorator passthrough/reraise/name preservation, log_exceptions decorator passthrough/reraise/name preservation
- scheduler.py: ThreadedTimeoutScheduler exception in callback caught, advance_clock_to_next_timeout empty/advance cases, AsyncTimeoutScheduler stop no-op, TimeoutScheduler.process_timeouts fires past-due callbacks, TimeoutHandle cancel via threaded lock, async coroutine callback, async cancelled wrapper early-return
- peer.py: RaftDispatcher publish exception caught and logged
- chaos.py: ChaosPeer.install_snapshot delegation via _wrap_rpc
New masters joining a cluster now block all non-_auth requests until the
Raft MembershipStateMachine has committed a CONFIG entry listing this node
in the voter set.
- SMaster.populate_secrets creates a multiprocessing.Event under
secrets["cluster_ready"] when cluster_id is set; it starts unset.
- RaftService accepts an on_ready callback; _on_membership_change fires it
exactly once when this node's node_id first appears in committed voters.
- MasterPubServerChannel._signal_cluster_ready sets the event and is
passed as on_ready when constructing RaftService in _publish_daemon.
- _cluster_is_ready() helper checks the event; always True for non-cluster
masters.
- ReqServerChannel.handle_message and PoolRoutingChannel.handle_and_route_message
return {"cluster_retry": True} for non-_auth commands while the gate is
closed, letting _auth through unconditionally so minions can authenticate.
- Drop the never-implemented discover_peers() stub and the _discover_event
parameter from MasterPubServerChannel; require cluster_peers to be
non-empty via a preflight critical error when cluster_id is set.
Tests: unit (TestClusterIsReady, TestReqServerChannelGate,
TestPoolRoutingChannelGate), functional (TestOnReady), and scenario
(TestOnReadyFunctional, TestClusterReadyIntegration,
TestLearnerReadinessScenario) -- 30 new tests, all passing.
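The shape of the readiness gate described above can be sketched compactly. Names here are illustrative stand-ins, not Salt's actual channel classes: non-_auth commands get a retry marker until the cluster_ready event is set, while _auth always passes so minions can keep authenticating.

```python
import multiprocessing

# Sketch of a cluster-readiness gate: a multiprocessing.Event is created
# only for clustered masters; the handler short-circuits non-_auth
# commands with {"cluster_retry": True} until the event is set.
def make_gate(cluster_id):
    ready = multiprocessing.Event() if cluster_id else None

    def cluster_is_ready():
        return True if ready is None else ready.is_set()

    def handle_message(cmd, payload):
        if cmd != "_auth" and not cluster_is_ready():
            return {"cluster_retry": True}
        return {"ok": True, "cmd": cmd}

    return ready, handle_message

ready, handle = make_gate(cluster_id="prod")
print(handle("_pillar", {}))   # gate closed: caller should retry
print(handle("_auth", {}))     # _auth always allowed through
ready.set()                    # fired once this node appears in committed voters
print(handle("_pillar", {}))
```

The event lives in shared secrets precisely so every worker process sees the flip the moment the on_ready callback runs.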
- Remove unused imports (tempfile, NodeState, ManualTimeoutScheduler, _patch_make_peer, patch, MasterPubServerChannel, pytest)
- Replace an unnecessary lambda with a named inner function
Each inline Concrete subclass in TestBaseStorageAndStateMachineAbstractBodies only overrode the one method under test, leaving the others unimplemented. Add stub implementations for all remaining abstract methods so pylint is satisfied while keeping the NotImplementedError behaviour under test intact.
The stub save_snapshot overrides in TestBaseStorageAndStateMachineAbstractBodies used last_index/last_term instead of index/term, mismatching BaseStorage's declared signature. pylint 3.1.1 (CI version) catches this as W0237.
Repair join-notify handling and dynamic learner startup:
- decrypt join-notify with cluster_aes, write peer pub keys only when missing to avoid a PermissionError on 0400 files, and skip RSA wrapping of the full cluster PEM in join-reply
- start dynamic Raft learners with the same peer pushers as the cluster pool (one pusher per host), wire on_ready so promoted voters open cluster_ready, and guard against duplicate learner startup
- use defaultdict for auth_errors
- align master discover/join with a per-master join sentinel and non-blocking bootstrap
- add a cluster_minion_all scenario fixture and fix the fourth-master join test
Use python3.11 for default_language_version so nox lint sessions keep working where stdlib distutils is gone on 3.12+. Ignore the local venv311 tree used for that interpreter.
- salt/cluster/consensus/peer.py: SaltPeer._send now offloads the real PublishServer.publish (async-declared but synchronous via SyncWrapper) to a worker thread. Awaiting it directly stalled the asyncio event loop while a dead peer's TCP retries ran, starving heartbeats to the reachable peers and causing test_raft_election_three_masters and the basic_cluster minion tests to fail with a split-brain leader. Test fakes (Mock/FakePusher) keep the direct-await path.
- salt/channel/server.py: align the dynamic-join learner log line with "Raft consensus service started" so test_raft_service_started_on_all_masters picks it up alongside the founding-voter and normal-startup lines.
- salt/channel/client.py: in crypted_transfer_decode_dictentry, raise a clean AuthenticationError when the master still does not return a session key after the post-reauth retry (e.g. a cluster_retry payload while the gate is closed) instead of a KeyError on ret["key"].
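The worker-thread offload in the first fix is the standard asyncio pattern for a blocking call. A minimal sketch with made-up send and heartbeat functions, assuming only that the real publish blocks the calling thread:

```python
import asyncio
import time

# A publish that blocks (e.g. TCP connect retries to a dead peer) must
# not be awaited directly on the event loop; run_in_executor moves it
# to a worker thread so heartbeats to reachable peers keep flowing.
def blocking_publish(addr, payload):
    time.sleep(0.05)  # stand-in for a synchronous network send
    return f"sent to {addr}"

async def send(loop, addr, payload):
    return await loop.run_in_executor(None, blocking_publish, addr, payload)

async def demo():
    loop = asyncio.get_running_loop()
    ticks = 0

    async def heartbeat():
        nonlocal ticks
        for _ in range(4):
            ticks += 1            # keeps ticking while sends block in threads
            await asyncio.sleep(0.02)

    results, _ = await asyncio.gather(
        asyncio.gather(*(send(loop, a, b"hb") for a in ("127.0.0.2", "127.0.0.3"))),
        heartbeat(),
    )
    return results, ticks

print(asyncio.run(demo()))
```

Had the sends been awaited inline, the heartbeat coroutine would have been starved for the full duration of each blocking call, which is exactly the split-brain trigger the commit describes.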
Two consensus fixes plus one test-isolation fix turned up by CI run
25302783378 on the consensus branch:
- salt/channel/server.py / salt/master.py: deterministic founding
voter. Only the lowest interface address in {self} ∪ cluster_peers
arms ``_start_raft_as_founding_voter``; all other masters wait for
``cluster/peer/join-reply`` and come up as learners. The same
founder also skips ``discover_peers`` so no inbound join-reply can
race the timer and demote it to a learner, which previously left the
cluster with zero voters or — when both timers expired before the
exchange — multiple disjoint single-member leaders. Fixes
``test_raft_election_three_masters``, ``test_raft_re_election_after_leader_restart``
and the fourth-master scenario where the gate stayed closed because
no membership CONFIG ever committed.
- tests/pytests/integration/minion/test_startup_states.py: each fixture
now removes its minion's pubkey from the master's
``pki_dir/{minions,minions_pre,minions_rejected,minions_denied}``
on teardown. Without that the keys persisted across tests, so
later netapi / cherrypy / saltutil.test_wheel tests in the same
shard targeted ``*`` and timed out waiting on the defunct
``-empty-string`` / ``-highstate`` / ``-sls`` / ``-top`` minions.
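The deterministic founding-voter rule above reduces to a pure comparison. A hedged sketch (the function name is invented; note that real code may compare parsed addresses rather than raw strings, since lexicographic order diverges from numeric order for addresses like 127.0.0.10):

```python
# Sketch of the deterministic founder rule: only the lowest interface
# address in {self} ∪ cluster_peers arms the founding-voter path; every
# other master waits for join-reply and comes up as a learner.
def is_founding_voter(my_interface, cluster_peers):
    members = {my_interface} | set(cluster_peers)
    return my_interface == min(members)

peers = ["127.0.0.2", "127.0.0.3"]
print(is_founding_voter("127.0.0.1", peers))                  # founds the cluster
print(is_founding_voter("127.0.0.3", peers + ["127.0.0.1"]))  # waits as learner
```

Because every master evaluates the same pure function over the same configured set, exactly one node can take the founding path regardless of timer ordering, which is what removes the zero-voter and multi-founder races.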
Two unrelated cleanups landing together:

- salt/channel/server.py: PoolRoutingChannel binds the external request socket in pre_fork, but the only post_fork consumer (MWorker) takes the ``pool_name`` branch and delegates to the IPC pool server, leaving the external socket bound but never ``listen``ed. ZeroMQ side-stepped this with ``zmq_device_pooled`` started from its pre_fork; the TCP and WebSocket transports had no equivalent, so cluster masters using those transports silently dropped every minion connection (``StreamClosedError`` on the client, no listener on the server). The fix adds ``_transport_has_builtin_router`` and, when the external transport lacks its own router, spawns a ``PoolRouter`` process from ``PoolRoutingChannel.pre_fork``. The forked process inherits the bound external socket, runs an asyncio loop, and calls ``post_fork`` on the no-``pool_name`` branch — that's the existing routing path which wires the external socket through ``handle_and_route_message`` and creates an IPC RequestClient per pool. Confirmed locally: the scenarios cluster and integration cluster suites now pass on both ``--transport=zeromq`` (10/10) and ``--transport=tcp`` (10/10); 379 unit and functional cluster tests still pass.
- tests/support/raft_chaos.py: ``ChaosController`` and ``ChaosPeer`` move out of ``salt/cluster/consensus/raft/`` (where they were exposed in the package's public API alongside ``Node``, ``Log``, etc.) into ``tests/support`` so production code does not ship with fault injection. ``salt.cluster.consensus.raft.__init__`` drops the imports and ``__all__`` entries; the lone consumer (``tests/pytests/unit/cluster/consensus/test_raft_chaos.py``) imports from ``tests.support.raft_chaos`` instead.
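The routing decision in the first cleanup is essentially a per-transport capability check. A tiny hedged sketch (the set contents and function name are illustrative, reflecting only the description above, where zeromq ships its own pre_fork device and tcp/ws do not):

```python
# Sketch of the router decision: transports with a builtin pre_fork
# router (zeromq's zmq_device_pooled) need nothing extra; the rest get
# a dedicated router process spawned from pre_fork so the bound
# external socket is actually serviced.
TRANSPORTS_WITH_BUILTIN_ROUTER = {"zeromq"}

def needs_pool_router(opts):
    return opts.get("transport", "zeromq") not in TRANSPORTS_WITH_BUILTIN_ROUTER

print(needs_pool_router({"transport": "zeromq"}))  # device handles routing
print(needs_pool_router({"transport": "tcp"}))     # spawn a router process
```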
Two cluster-bootstrap fixes for the ARM64 split-brain failures seen on Ubuntu 22.04 in run 25342696586:

- ``MasterPubServerChannel._join_sentinel_path``: namespace the sentinel by the master's interface (``.cluster_joined.<interface>``). Cluster deployments and the integration fixtures that point every master at the same ``cachedir`` were sharing a single ``.cluster_joined`` file — the first master to join wrote it and every later master saw it on startup, took the "rejoining-voter" path, and bypassed the deterministic founding-voter check. The result was two masters bootstrapping their own clusters and ``test_raft_election_three_masters`` failing with ``got 2 leaders ['127.0.0.1', '127.0.0.2']``.
- ``handle_pool_publish`` ``cluster/peer/join`` branch: also call ``self._raft_service.notify_peer_joined(payload['peer_id'])`` on the receiver. The leader was broadcasting ``join-notify`` to every other peer so they would add the joiner as a Raft learner — but the receiver of the join itself never sees its own broadcast and so never added the new peer to its own RaftService.peers. Replication to a freshly joined master stalled, the promotion CONFIG entry never committed, the joiner's ``_signal_cluster_ready`` never ran, and downstream tests (``test_fourth_master_joins_existing_cluster``, the ``test_cluster_key_rotation`` flake) hit gate-closed timeouts.

Verified locally: 379/379 cluster unit + functional; 10/10 cluster integration + scenarios on each of zeromq and tcp; 3 stress reruns of basic_cluster + raft_cluster on tcp = 8/8 each.
When ``Node.start_pre_vote`` sent its RPCs, no further timer was scheduled. If no peer replied (peers still booting in CI, peers unreachable, etc.) the node sat in ``FOLLOWER`` state with ``_pre_candidacy`` set forever — nothing on the success or failure path ever re-fired ``follower_timeout_callback``. Re-arming the follower timeout at the end of ``start_pre_vote`` makes the pre-vote retry until at least one voting peer is reachable. On the success path the candidate transition immediately moves the node out of FOLLOWER state, so ``follower_timeout_callback`` short-circuits and the re-arm is harmless.

Surfaced as the Ubuntu 22.04 ARM64 ``scenarios tcp`` flake in run 25352778561: master_1 (the deterministic founder) ran its first pre-vote before master_2's pool puller was listening, then sat there forever. The cluster_ready event never fired and the minion fixture hit "Master did not return a session key for pillar request" timeouts.

Updated ``test_node_follower_timeout_reschedule`` to assert the new correct behaviour (timer not None after pre-vote).
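The re-arm fix can be sketched as follows. This is an illustrative stand-in for the real Node (class and attribute names invented): the key line is the unconditional re-arm at the end of the pre-vote, which guarantees a retry when no peer ever answers.

```python
# Sketch of the wedge and the fix: without the final arm_follower_timer()
# call, a pre-vote with zero reachable peers leaves the node in FOLLOWER
# with no timer pending -- nothing ever fires follower_timeout_callback
# again.
class PreVoteNode:
    def __init__(self, peers):
        self.state = "FOLLOWER"
        self.peers = peers
        self._pre_candidacy = None
        self.follower_timer = None

    def arm_follower_timer(self):
        self.follower_timer = object()  # stand-in for a scheduler handle

    def start_pre_vote(self):
        self._pre_candidacy = {"grants": 0}
        for peer in self.peers:
            peer.pre_request_vote(self)  # may never be answered
        self.arm_follower_timer()        # the fix: always re-arm

    def follower_timeout_callback(self):
        if self.state != "FOLLOWER":
            return                       # success path: already a candidate
        self.start_pre_vote()

node = PreVoteNode(peers=[])             # nobody reachable yet
node.follower_timeout_callback()
print(node.follower_timer is not None)   # timer re-armed: will retry
# True
```

On the success path the node has already left FOLLOWER by the time the re-armed timer fires, so the callback's state guard makes the extra timer a no-op.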
Three independent stability fixes for CI failures observed in run 25358451150 on the consensus branch:

- ``salt/cluster/consensus/service.py``: bump the cluster Raft election window from 150–300 ms (the ``Node`` defaults, sized for in-process unit tests) to 750–1500 ms — about 10× the 50 ms heartbeat interval, matching standard Raft tuning. CI runners regularly delay a heartbeat enough to fire the election timer mid-flight; under the old window a follower would step up, win the term, and produce a second ``BECOMING LEADER`` log entry while the original leader was still reachable. New ``cluster_election_min`` / ``cluster_election_max`` opts let deployments tune the window.
- ``tests/pytests/integration/cluster/test_raft_cluster.py``: replace the historical ``"BECOMING LEADER" in log`` count with a current-leader check that parses the most recent ``BECOMING <state> for term N`` line per master. The old grep conflated Raft's safety property (at most one leader at any moment) with liveness churn (leadership can legitimately move between terms), turning every successful election that ever occurred during the test window into a counted leader.
- ``tests/pytests/functional/cluster/consensus/test_raft_service.py``: ``_make_opts`` keeps the legacy 150–300 ms window so the in-process functional suite stays fast. Real cluster masters override via opts.
- ``tests/pytests/scenarios/dns/multimaster/conftest.py``: ``mm-master-2`` now grabs freshly-unused ports from ``pytestshellutils.utils.ports.get_unused_localhost_port`` instead of ``mm-master-1.port + 1``. On busy CI runners the adjacent port was regularly already in use, so the second master aborted in 4.66 s with ``[Errno 98] Address already in use`` from ``salt.utils.verify.verify_socket``. The test only uses master-1 in its body; master-2 sits in the fixture chain, so a fresh port doesn't change observable behaviour.
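The election-window tuning above is just a randomized draw from a configurable range. A hedged sketch — the opt names follow the commit message, but the defaults and function shape are illustrative:

```python
import random

# Sketch of the tuned election timeout: drawn uniformly from a window
# roughly 10x the heartbeat interval so a single delayed heartbeat on a
# slow CI runner does not trigger a spurious election.
HEARTBEAT_INTERVAL = 0.050  # 50 ms, per the commit message

def election_timeout(opts, rng=random):
    lo = opts.get("cluster_election_min", 0.750)
    hi = opts.get("cluster_election_max", 1.500)
    return rng.uniform(lo, hi)

t = election_timeout({})
print(0.750 <= t <= 1.500, t >= 10 * HEARTBEAT_INTERVAL)
# True True
```

The randomization is what breaks ties between followers (standard Raft), while the 10x-heartbeat lower bound is what absorbs scheduler jitter; the in-process functional tests keep a much smaller window purely for speed.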
twangboy
approved these changes
May 6, 2026
This PR lays the groundwork for a true Salt master cluster, adding Raft on top of the existing cluster event bus. The Raft log tracks "true" cluster membership. Masters will not start serving requests to clients/minions until they are part of the cluster in the Raft log.
In a follow-up PR we will sync the existing state of the cluster to new peers before they join the Raft log.