
Cluster consensus #69025

Open

dwoz wants to merge 20 commits into saltstack:3008.x from dwoz:consensus

Conversation


dwoz commented May 2, 2026

This PR lays the groundwork for a true Salt master cluster by adding Raft on top of the existing cluster event bus. The Raft log tracks "true" cluster membership: a master will not start serving requests to clients/minions until it is part of the cluster in the Raft log.

In a follow-up PR we will sync the existing state of the cluster to new peers before they join the Raft log.

dwoz requested a review from a team as a code owner May 2, 2026 07:57
dwoz added the test:full (Run the full test suite) label May 2, 2026
dwoz added 20 commits May 5, 2026 10:18

Ports and adapts the Raft algorithm into salt.cluster.consensus,
replacing shared-filesystem cluster coordination with a proper
consensus protocol for master membership and metadata.

- salt/cluster/consensus/raft/: core Raft state machine (node, log,
  scheduler, util, chaos) with synchronous callback surface for
  testability
- salt/cluster/consensus/rpc.py: pack/unpack helpers for cluster/raft/*
  event tags multiplexed over the existing cluster_pool_port
- salt/cluster/consensus/peer.py: SaltPeer (Peer interface over cluster
  channel pushers) and RaftDispatcher (routes inbound RPCs to Node)
- salt/channel/server.py: intercept cluster/raft/* tags in
  handle_pool_publish and dispatch to RaftDispatcher
- 172 unit and functional tests covering the Raft core, wire layer,
  election, log replication, fault tolerance, and dispatcher edge cases
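
For orientation, the cluster/raft/* multiplexing might be pictured roughly like this; the helper names, the serialization (json here rather than Salt's usual msgpack), and the payload shape are assumptions for illustration, not the actual rpc.py API:

```python
# Illustrative sketch only: wrap/unwrap a Raft RPC as a (tag, data) pair so it
# can ride the existing cluster event bus on cluster_pool_port.
import json


def pack_rpc(rpc_name, payload):
    """Build a cluster/raft/<rpc> event from an RPC name and payload dict."""
    tag = f"cluster/raft/{rpc_name}"      # e.g. cluster/raft/append_entries
    data = json.dumps(payload).encode()   # real code would use Salt's serializer
    return tag, data


def unpack_rpc(tag, data):
    """Reverse of pack_rpc: recover (rpc_name, payload) from an event."""
    if not tag.startswith("cluster/raft/"):
        raise ValueError(f"not a raft tag: {tag}")
    return tag.split("/", 2)[2], json.loads(data)


tag, data = pack_rpc("request_vote", {"term": 3, "candidate_id": "127.0.0.2"})
assert unpack_rpc(tag, data) == ("request_vote", {"term": 3, "candidate_id": "127.0.0.2"})
```
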
Adds RaftService (salt/cluster/consensus/service.py) which owns the Node,
AsyncTimeoutScheduler, SaltPeer list, and RaftDispatcher for one master
process. Wires into _publish_daemon when cluster_id and cluster_peers are
configured: builds peer pushers from the existing PublishServer list,
constructs RaftService, attaches the dispatcher to the channel, starts the
follower timer, and drives leader heartbeats via call_later.

Also adds 17 functional tests covering construction, attach, lifecycle
(start/stop), heartbeat behaviour, and an end-to-end three-node election.
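
As a rough picture of the call_later-driven heartbeat loop, under the assumption of a self-rescheduling tick (class and method names below are illustrative, not the PR's RaftService):

```python
# Minimal self-rescheduling heartbeat sketch. The 50 ms interval matches the
# heartbeat interval mentioned in a later commit; node.is_leader() and
# node.send_heartbeats() are assumed stand-ins for the real Node API.
import asyncio

HEARTBEAT_INTERVAL = 0.05  # seconds


class HeartbeatDriver:
    def __init__(self, node, loop=None):
        self.node = node
        self.loop = loop or asyncio.get_event_loop()
        self._handle = None

    def start(self):
        self._handle = self.loop.call_later(HEARTBEAT_INTERVAL, self._tick)

    def _tick(self):
        # Only a leader sends AppendEntries heartbeats; everyone keeps the
        # timer armed so a newly elected leader starts heartbeating promptly.
        if self.node.is_leader():
            self.node.send_heartbeats()
        self._handle = self.loop.call_later(HEARTBEAT_INTERVAL, self._tick)

    def stop(self):
        if self._handle is not None:
            self._handle.cancel()
```
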
- tests/pytests/integration/cluster/test_raft_cluster.py: three slow_test
  integration tests that spin up real masters on 127.0.0.1/2/3 and verify
  Raft service starts, elects exactly one leader, and re-elects after the
  leader is terminated.

- salt/cluster/consensus/service.py: use opts["interface"] as the Raft
  node-id so it matches cluster_peers keys and peer_pusher addressing;
  opts["id"] is the hostname which remote masters cannot correlate.

- salt/channel/server.py: add _raft_dispatcher and _raft_service = None
  to __setstate__ so the deserialized channel object is always consistent.

- tests/pytests/functional/cluster/consensus/test_raft_service.py: update
  _make_opts fixture to include both "id" and "interface" keys, matching
  the node_id change above.

- salt/cluster/consensus/storage.py (new): SaltStorage implements
  BaseStorage using salt.cache.Cache so any configured cache_driver
  (localfs today, mmapcache when it lands) is used automatically.
  All three durable Raft documents (state, log, snapshot) are stored
  in the bank cluster/consensus/<node_id>.

- salt/cluster/consensus/raft/log.py: remove JSONStorage entirely.
  Also drop the now-unused os and salt.utils.files imports.

- salt/cluster/consensus/raft/__init__.py: drop JSONStorage re-export.

- salt/cluster/consensus/service.py: wire SaltStorage into
  RaftService.__init__ so the Node has durable persistence from the
  first election.

- tests: replace every JSONStorage(tmpdir) with a _storage(tmpdir)
  helper that builds a SaltStorage with a localfs cache; add four
  new SaltStorage unit tests; remove JSONStorage-specific tests.
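
A minimal sketch of the storage approach, assuming the stock salt.cache factory/store/fetch calls; the class below is illustrative and far smaller than the PR's SaltStorage:

```python
# Persist Raft state through salt.cache so whatever cache_driver is configured
# (localfs, mmapcache, ...) is reused. Only the "state" document is shown;
# the real SaltStorage also handles the log and snapshot documents.
import salt.cache


class SketchStorage:
    def __init__(self, opts, node_id):
        self.cache = salt.cache.factory(opts)
        self.bank = f"cluster/consensus/{node_id}"

    def save_state(self, term, voted_for):
        self.cache.store(self.bank, "state", {"term": term, "voted_for": voted_for})

    def load_state(self):
        return self.cache.fetch(self.bank, "state") or {}
```
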
New nodes joining the Salt master cluster via the autoscale join protocol
now enter the Raft cluster as non-voting learners rather than immediately
competing for quorum.  The leader replicates the log to learners and
auto-promotes them once they are caught up, closing the split-brain window
that existed when a joining master started with voting=True.

Changes
-------
salt/cluster/consensus/raft/node.py
- Node(voting=True): new parameter; voting=False marks a node as a learner
  that will never start a pre-vote or election.
- follower_timeout_callback: learner re-arms timer instead of calling
  start_pre_vote, guaranteeing it cannot become leader before promotion.
- append_entries_reply: fixed promotion — leader now proposes CONFIG entry
  *without* pre-flipping p.voting; the flip happens on commit/apply.
- on_config_change: updates self.voting when the local node_id appears in
  the voters/learners list; updates existing peers in-place when no peer
  factory is registered (unit-test path).
- handle_append_entries: fixed dict-format entry parsing — entries arriving
  as _asdict() dicts over the wire now correctly extract cmd/type/term from
  dict keys rather than treating the whole dict as cmd.
- info(): exposes voting status.

salt/cluster/consensus/service.py
- RaftService._make_peer(addr, voting): peer factory wired into
  Node.register_peer_factory so Node.on_config_change can instantiate
  SaltPeers for new addresses in CONFIG entries.
- RaftService.notify_peer_joined(peer_addr): called when a new master
  completes the Salt cluster join handshake; adds the peer as a non-voting
  learner and initialises leader replication tracking.
- _peer_pushers is now a mutable dict copy so dynamically added pushers
  stay in sync between RaftService and RaftDispatcher.
- NodeState imported at module level; removed inline import from
  _heartbeat_tick.

salt/channel/server.py
- handle_pool_publish: after writing the new peer's pub key from a
  cluster/peer/join-notify event, calls _raft_service.notify_peer_joined()
  so existing members add the joiner as a learner immediately.
- Restore gen_token() method dropped during rebase conflict resolution.

Tests
-----
- tests/pytests/unit/cluster/consensus/test_raft_membership.py: 16 tests
  covering learner no-election guarantee, quorum exclusion of non-voters,
  CONFIG-driven promotion, on_config_change self-voting, and all
  notify_peer_joined scenarios including leader replication tracking init.
- tests/pytests/functional/cluster/consensus/test_raft_learner.py: 4
  functional tests with a live in-process Raft cluster: learner blocked
  from leadership, leader replicates to learner, full learner→voter
  promotion cycle, quorum commits without waiting for learner ack.
- tests/pytests/functional/cluster/consensus/test_raft_transport.py:
  updated test_config_entry_type_replicates to assert the now-correct
  behaviour — CONFIG entries on followers carry the right entry type from
  the wire rather than burying it in cmd.
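
The learner rules above reduce to roughly this shape (attribute and method names are assumed for illustration; the real Node is considerably larger):

```python
# Sketch: a learner (voting=False) never starts an election and never counts
# toward quorum; it only re-arms its timer until a committed CONFIG promotes it.
class LearnerAwareNode:
    def __init__(self, node_id, voting=True):
        self.node_id = node_id
        self.voting = voting  # False == non-voting learner

    def follower_timeout_callback(self):
        if not self.voting:
            self.arm_follower_timer()  # wait for the leader to promote us
            return
        self.start_pre_vote()

    def quorum_size(self, peers):
        # Quorum is computed over voting members only, so a joiner cannot
        # widen or split the majority before promotion.
        voters = [p for p in peers if p.voting] + ([self] if self.voting else [])
        return len(voters) // 2 + 1

    def arm_follower_timer(self):
        pass  # placeholder for the scheduler call

    def start_pre_vote(self):
        pass  # placeholder for the real election path
```
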
Closes three gaps identified in the Raft correctness audit.

Dynamic join — joiner starts as learner (priority #2 & #6)
-----------------------------------------------------------
A master that joins via discover_peers starts _publish_daemon before the
join handshake completes, so cluster_peers is empty and RaftService is not
started.  Once join-reply arrives the MasterPubServerChannel now calls
_start_raft_as_learner(), which:
  - builds pushers to every peer received in join-reply (fixes gap #6:
    dispatcher had no pushers to reply to the leader)
  - starts RaftService with voting=False (fixes gap #2: joiner was starting
    as a voter, risking split-brain before the leader promoted it)

salt/channel/server.py
- _start_raft_as_learner(known_peers): new helper started from the
  join-reply handler; builds peer pushers and calls
  RaftService(voting=False).
- join-reply handler: calls _start_raft_as_learner after event.set().

salt/cluster/consensus/service.py
- RaftService(voting=True): new optional parameter passed to Node so the
  node starts in learner mode when the master joined dynamically.

Founding CONFIG entry (priority #5)
-------------------------------------
The first elected leader of a fresh cluster now writes a CONFIG entry
recording the founding voter set before any application entries.  This
makes cluster membership durably verifiable from the log alone.

salt/cluster/consensus/service.py
- _maybe_commit_founding_config(): proposes {voters, learners} CONFIG entry
  when the leader's log is empty; no-ops if the log already has content.
- _heartbeat_tick: calls _maybe_commit_founding_config() on every tick
  while leader (the guard makes it idempotent).

Tests
-----
tests/pytests/functional/cluster/consensus/test_raft_service.py
- TestFoundingConfig: leader writes CONFIG entry on empty log; no-ops if
  log is non-empty.
- TestDynamicJoinerAsLearner: RaftService(voting=False) produces a
  non-voting node; learner-mode service does not start elections.
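
The founding-CONFIG guard amounts to the following sketch; log_add is named elsewhere in this PR, but its signature and the helper shown here are assumptions:

```python
# Sketch: only a leader with an empty log writes the founding CONFIG entry.
# Called from every heartbeat tick, the empty-log check is what makes it a
# no-op (idempotent) once the entry exists.
def maybe_commit_founding_config(node, voters, learners):
    if not node.is_leader() or len(node.log) > 0:
        return
    node.log_add(
        {"voters": sorted(voters), "learners": sorted(learners)},
        entry_type="CONFIG",
    )
```
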
The Raft log already replicated CONFIG entries; this commit turns those
entries into durable, queryable cluster membership state.

MembershipStateMachine (salt/cluster/consensus/raft/log.py)
-----------------------------------------------------------
New class implementing BaseStateMachine for cluster membership.

- apply(cmd, index): updates internal voter/learner sets from a committed
  CONFIG entry; fires optional on_change(voters, learners) callback.
- current_voters() / current_learners(): sorted lists of the committed
  membership — the authoritative query API.
- is_voter(node_id) / is_learner(node_id): membership predicates.
- membership_version: log index of the most recently committed CONFIG.
- get_snapshot() / restore_snapshot(): full snapshot/restore for log
  compaction support.

Node wiring (salt/cluster/consensus/raft/node.py)
-------------------------------------------------
- Node.__init__ now accepts membership_sm parameter and always creates a
  MembershipStateMachine by default.
- register_membership_sm(sm): replaces the SM after construction.
- apply_entries: routes committed CONFIG entries through membership_sm.apply
  (updates the SM's committed view) then calls on_config_change to update
  Node.peers and Node.voting.  A monotonic _applied_config_index guard
  prevents an older committed CONFIG from regressing the peer state
  when a leader has already eagerly applied a newer CONFIG via log_add.
- handle_append_entries: likewise tracks _applied_config_index when
  followers apply CONFIG entries on receipt.
- info(): includes membership snapshot (voters, learners, version).

RaftService (salt/cluster/consensus/service.py)
-----------------------------------------------
- membership property: exposes node.membership_sm so callers can query
  current_voters() and current_learners() from committed log state.

Tests
-----
- test_raft_log.py: 11 new TestMembershipStateMachine tests covering
  initial state, apply, overwrite, is_voter/learner, on_change callback,
  snapshot roundtrip, bytes restore, invalid restore, and repr.
- test_raft_node.py: 7 new TestNodeMembershipSM tests covering SM presence,
  info() membership field, SM updated only on commit, learner recorded,
  version advance across multiple commits, register_membership_sm, snapshot
  reflects committed state only, and service.membership property.
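
Condensed into a sketch, the committed-membership view behaves roughly as below; the real class also implements the BaseStateMachine snapshot/restore API:

```python
# Sketch of MembershipStateMachine's query surface: apply() is only ever fed
# committed CONFIG entries, so voters/learners always reflect committed state.
class MembershipSketch:
    def __init__(self, on_change=None):
        self.voters = set()
        self.learners = set()
        self.membership_version = 0  # log index of the last committed CONFIG
        self.on_change = on_change

    def apply(self, cmd, index):
        # cmd is the CONFIG payload, e.g. {"voters": [...], "learners": [...]}
        self.voters = set(cmd.get("voters", []))
        self.learners = set(cmd.get("learners", []))
        self.membership_version = index
        if self.on_change:
            self.on_change(self.current_voters(), self.current_learners())

    def current_voters(self):
        return sorted(self.voters)

    def current_learners(self):
        return sorted(self.learners)

    def is_voter(self, node_id):
        return node_id in self.voters

    def is_learner(self, node_id):
        return node_id in self.learners
```
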
Add 88 new unit and functional tests covering previously untested paths
across the full consensus package:

- log.py: LogEntryCommitStatus.set, LogEntry.__eq__ memoryview/str paths,
  cmd_view property, info() bytes decode, Log.__repr__/last_index alias,
  Log.add snapshot boundary/conflict truncation/gap-append with storage,
  Log.snapshot empty guard, Log.clear with storage, Log.has_entry None/-1
  and snapshot boundary, Log.truncate_prefix with storage and before-snap
  no-op, max_log_size auto-snapshot trigger, all BaseStorage and
  BaseStateMachine abstract-method raise paths, CounterStateMachine
  restore_snapshot valid bytes / invalid bytes / non-bytes-non-dict paths,
  MembershipStateMachine LogEntryCommitStatus initial_node, SaltStorage
  dict-format log entry load and save_snapshot JSON encoding branch

- node.py: Peer.address property and pre_request_vote callback,
  ManualPeer.install_snapshot via handle_all_requests, storage tuple-format
  load (guards added to Log.__init__), request_votes dispatch to voting
  peers, native_engine become_leader/become_follower/send_heartbeat hooks,
  on_config_change factory learner path (new and existing peer), no-factory
  voter/learner flip, append() alias, handle_append_entries lca guard,
  install_snapshot snapshot-already-newer no-op, candidacy_timeout_callback
  stale-candidacy clear, commit_index setter snapshot trigger,
  append_entries_reply non-leader early return, start_pre_vote non-follower
  guard, pre_request_vote_reply high-term step-down

- service.py: _heartbeat_tick per-peer send exception swallowed,
  _heartbeat_tick outer exception swallowed, _maybe_commit_founding_config
  log_add exception caught

- storage.py: load_log dict-format branch, save_snapshot JSON encoding

- util.py: is_socket_closed all branches (empty data, BlockingIOError,
  ConnectionResetError, OSError 107, OSError 9, unknown OSError, generic
  exception, non-empty data), log_exceptions_async decorator passthrough/
  reraise/name preservation, log_exceptions decorator passthrough/reraise/
  name preservation

- scheduler.py: ThreadedTimeoutScheduler exception in callback caught,
  advance_clock_to_next_timeout empty/advance cases, AsyncTimeoutScheduler
  stop no-op, TimeoutScheduler.process_timeouts fires past-due callbacks,
  TimeoutHandle cancel via threaded lock, async coroutine callback,
  async cancelled wrapper early-return

- peer.py: RaftDispatcher publish exception caught and logged

- chaos.py: ChaosPeer.install_snapshot delegation via _wrap_rpc

New masters joining a cluster now block all non-_auth requests until the
Raft MembershipStateMachine has committed a CONFIG entry listing this node
in the voter set.

- SMaster.populate_secrets creates a multiprocessing.Event under
  secrets["cluster_ready"] when cluster_id is set; it starts unset.
- RaftService accepts an on_ready callback; _on_membership_change fires it
  exactly once when this node's node_id first appears in committed voters.
- MasterPubServerChannel._signal_cluster_ready sets the event and is
  passed as on_ready when constructing RaftService in _publish_daemon.
- _cluster_is_ready() helper checks the event; always True for non-cluster
  masters.
- ReqServerChannel.handle_message and PoolRoutingChannel.handle_and_route_message
  return {"cluster_retry": True} for non-_auth commands while the gate is
  closed, letting _auth through unconditionally so minions can authenticate.
- Drop the never-implemented discover_peers() stub and the _discover_event
  parameter from MasterPubServerChannel; require cluster_peers to be
  non-empty via a preflight critical error when cluster_id is set.

Tests: unit (TestClusterIsReady, TestReqServerChannelGate,
TestPoolRoutingChannelGate), functional (TestOnReady), and scenario
(TestOnReadyFunctional, TestClusterReadyIntegration,
TestLearnerReadinessScenario) -- 30 new tests, all passing.
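
The request gate boils down to a check like the sketch below; the function and payload shapes here are assumptions, and the real logic lives in ReqServerChannel.handle_message / PoolRoutingChannel.handle_and_route_message with the event coming from SMaster.populate_secrets:

```python
# Sketch of the readiness gate: _auth always passes so minions can
# authenticate; everything else gets {"cluster_retry": True} until the
# on_ready callback (fired once this node appears in committed voters)
# sets the event.
import multiprocessing

cluster_ready = multiprocessing.Event()  # starts unset on a fresh cluster master


def handle_message(payload):
    if payload.get("cmd") != "_auth" and not cluster_ready.is_set():
        return {"cluster_retry": True}
    return dispatch(payload)  # hypothetical normal request path


def dispatch(payload):
    return {"ok": True, "cmd": payload.get("cmd")}
```
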
- Remove unused imports (tempfile, NodeState, ManualTimeoutScheduler,
  _patch_make_peer, patch, MasterPubServerChannel, pytest)
- Replace unnecessary lambda with a named inner function

Each inline Concrete subclass in TestBaseStorageAndStateMachineAbstractBodies
only overrode the one method under test, leaving the others unimplemented.
Add stub implementations for all remaining abstract methods so pylint is
satisfied while keeping the NotImplementedError behaviour under test intact.

The stub save_snapshot overrides in TestBaseStorageAndStateMachineAbstractBodies
used last_index/last_term instead of index/term, mismatching BaseStorage's
declared signature. pylint 3.1.1 (CI version) catches this as W0237.

Repair join-notify handling: decrypt with cluster_aes, write peer pub keys
only when missing to avoid PermissionError on 0400 files, and skip RSA
wrapping of the full cluster PEM in join-reply.

Start dynamic Raft learners with the same peer pushers as the cluster pool
(one pusher per host), wire on_ready so promoted voters open cluster_ready,
and guard against duplicate learner startup. Use defaultdict for
auth_errors. Align master discover/join with per-master join sentinel and
non-blocking bootstrap.

Add cluster_minion_all scenario fixture and fix fourth-master join test.

Use python3.11 for default_language_version so nox lint sessions keep
working where stdlib distutils is gone on 3.12+. Ignore the local venv311
tree used for that interpreter.

- salt/cluster/consensus/peer.py: SaltPeer._send now offloads real
  PublishServer.publish (async-declared but synchronous via SyncWrapper)
  to a worker thread. Awaiting it directly stalled the asyncio event
  loop while a dead peer's TCP retries ran, starving heartbeats to the
  reachable peers and causing test_raft_election_three_masters and the
  basic_cluster minion tests to fail with a split-brain leader.
  Test fakes (Mock/FakePusher) keep the direct-await path.
- salt/channel/server.py: align the dynamic-join learner log line with
  "Raft consensus service started" so test_raft_service_started_on_all_masters
  picks it up alongside the founding-voter and normal-startup lines.
- salt/channel/client.py: in crypted_transfer_decode_dictentry, raise a
  clean AuthenticationError when the master still does not return a
  session key after the post-reauth retry (e.g. a cluster_retry payload
  while the gate is closed) instead of the KeyError on ret["key"].
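
The thread-offload described in the first item can be pictured with this minimal sketch; how the real SaltPeer._send tells real pushers apart from test fakes is an assumption here:

```python
# Sketch: run the (effectively synchronous) publish on a worker thread so a
# dead peer's TCP retries cannot block the event loop and starve heartbeats
# to reachable peers. Coroutine publishers (test fakes) are awaited directly.
import asyncio
import inspect


async def send_to_peer(pusher, payload):
    publish = pusher.publish
    if inspect.iscoroutinefunction(publish):
        await publish(payload)                     # Mock/FakePusher path
    else:
        await asyncio.to_thread(publish, payload)  # real PublishServer path
```
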
Two consensus fixes plus one test-isolation fix turned up by CI run
25302783378 on the consensus branch:

- salt/channel/server.py / salt/master.py: deterministic founding
  voter.  Only the lowest interface address in {self} ∪ cluster_peers
  arms ``_start_raft_as_founding_voter``; all other masters wait for
  ``cluster/peer/join-reply`` and come up as learners.  The same
  founder also skips ``discover_peers`` so no inbound join-reply can
  race the timer and demote it to a learner, which previously left the
  cluster with zero voters or — when both timers expired before the
  exchange — multiple disjoint single-member leaders.  Fixes
  ``test_raft_election_three_masters``, ``test_raft_re_election_after_leader_restart``
  and the fourth-master scenario where the gate stayed closed because
  no membership CONFIG ever committed.

- tests/pytests/integration/minion/test_startup_states.py: each fixture
  now removes its minion's pubkey from the master's
  ``pki_dir/{minions,minions_pre,minions_rejected,minions_denied}``
  on teardown.  Without that the keys persisted across tests, so
  later netapi / cherrypy / saltutil.test_wheel tests in the same
  shard targeted ``*`` and timed out waiting on the defunct
  ``-empty-string`` / ``-highstate`` / ``-sls`` / ``-top`` minions.
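
The deterministic-founder rule from the first fix is essentially a min() over addresses; a sketch, assuming plain string comparison of the interface addresses is good enough for the deployment:

```python
# Sketch: the lowest interface address in {self} ∪ cluster_peers bootstraps as
# the founding voter; every other master waits for join-reply and starts as a
# learner. The real check may normalize addresses before comparing.
def is_founding_voter(opts):
    candidates = {opts["interface"], *opts.get("cluster_peers", [])}
    return opts["interface"] == min(candidates)


assert is_founding_voter(
    {"interface": "127.0.0.1", "cluster_peers": ["127.0.0.2", "127.0.0.3"]}
)
assert not is_founding_voter(
    {"interface": "127.0.0.3", "cluster_peers": ["127.0.0.1", "127.0.0.2"]}
)
```
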
Two unrelated cleanups landing together:

- salt/channel/server.py: PoolRoutingChannel binds the external request
  socket in pre_fork but the only post_fork consumer (MWorker) takes the
  ``pool_name`` branch and delegates to the IPC pool server, leaving the
  external socket bound but never ``listen``ed.  ZeroMQ side-stepped this
  with ``zmq_device_pooled`` started from its pre_fork; TCP and WebSocket
  transports had no equivalent, so cluster masters using those transports
  silently dropped every minion connection (``StreamClosedError`` on the
  client, no listener on the server).

  The fix adds ``_transport_has_builtin_router`` and, when the external
  transport lacks its own router, spawns a ``PoolRouter`` process from
  ``PoolRoutingChannel.pre_fork``.  The forked process inherits the bound
  external socket, runs an asyncio loop, and calls ``post_fork`` on the
  no-``pool_name`` branch — that's the existing routing path which wires
  the external socket through ``handle_and_route_message`` and creates an
  IPC RequestClient per pool.

  Confirmed locally: scenarios cluster + integration cluster suites now
  pass on both ``--transport=zeromq`` (10/10) and ``--transport=tcp``
  (10/10); 379 unit + functional cluster tests still pass.

- tests/support/raft_chaos.py: ``ChaosController`` and ``ChaosPeer`` move
  out of ``salt/cluster/consensus/raft/`` (where they were exposed in
  the package's public API alongside ``Node``, ``Log`` etc.) into
  ``tests/support`` so production code does not ship with fault
  injection.  ``salt.cluster.consensus.raft.__init__`` drops the imports
  and ``__all__`` entries; the lone consumer
  (``tests/pytests/unit/cluster/consensus/test_raft_chaos.py``) imports
  from ``tests.support.raft_chaos`` instead.
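
A very rough sketch of the PoolRouter idea from the first item, with hypothetical names and a placeholder routing loop; the real code forks from PoolRoutingChannel.pre_fork and reuses the existing no-pool_name post_fork routing path:

```python
# Sketch, assuming the 'fork' start method so the child process inherits the
# already-bound external socket. Only ZeroMQ brings its own router
# (zmq_device_pooled); other transports get a dedicated routing process.
import multiprocessing


def _transport_has_builtin_router(opts):
    return opts.get("transport", "zeromq") == "zeromq"


def pre_fork(opts, bound_socket, route_message):
    if _transport_has_builtin_router(opts):
        return None
    proc = multiprocessing.Process(
        target=_pool_router_main, args=(bound_socket, route_message), daemon=True
    )
    proc.start()
    return proc


def _pool_router_main(bound_socket, route_message):
    # Stand-in for post_fork on the no-pool_name branch: an asyncio loop that
    # reads external requests and feeds them to handle_and_route_message and
    # the per-pool IPC RequestClients.
    import asyncio

    async def serve():
        ...  # transport-specific accept/read loop omitted in this sketch

    asyncio.run(serve())
```
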
Two cluster-bootstrap fixes for the ARM64 split-brain failures seen on
Ubuntu 22.04 in run 25342696586:

- ``MasterPubServerChannel._join_sentinel_path``: namespace the sentinel
  by the master's interface (``.cluster_joined.<interface>``).  Cluster
  deployments and the integration fixtures that point every master at
  the same ``cachedir`` were sharing a single ``.cluster_joined`` file —
  the first master to join wrote it and every later master saw it on
  startup, took the "rejoining-voter" path, and bypassed the
  deterministic founding-voter check.  Result was two masters
  bootstrapping their own clusters and ``test_raft_election_three_masters``
  failing with ``got 2 leaders ['127.0.0.1', '127.0.0.2']``.

- ``handle_pool_publish`` ``cluster/peer/join`` branch: also call
  ``self._raft_service.notify_peer_joined(payload['peer_id'])`` on the
  receiver.  The leader was broadcasting ``join-notify`` to every other
  peer so they'd add the joiner as a Raft learner — but the receiver of
  the join itself never sees its own broadcast and never added the new
  peer to its own RaftService.peers.  Replication to a freshly joined
  master stalled, the promotion CONFIG entry never committed, the
  joiner's ``_signal_cluster_ready`` never ran, and downstream tests
  (``test_fourth_master_joins_existing_cluster``, the
  ``test_cluster_key_rotation`` flake) hit gate-closed timeouts.

Verified locally:
  379/379 cluster unit + functional;
  10/10 cluster integration + scenarios on each of zeromq and tcp;
  3 stress reruns of basic_cluster + raft_cluster on tcp = 8/8 each.

When ``Node.start_pre_vote`` sent its RPCs, no further timer was
scheduled.  If no peer replied (peers still booting in CI, peers
unreachable, etc.) the node sat in ``FOLLOWER`` state with
``_pre_candidacy`` set forever — nothing on the success or failure path
ever re-fired ``follower_timeout_callback``.

Re-arming the follower timeout at the end of ``start_pre_vote`` makes
the pre-vote retry until at least one voting peer is reachable.  On the
success path the candidate transition immediately moves us out of
FOLLOWER state, so ``follower_timeout_callback`` short-circuits and the
re-arm is harmless.

Surfaced as the Ubuntu 22.04 ARM64 ``scenarios tcp`` flake in run
25352778561: master_1 (deterministic founder) ran its first pre-vote
before master_2's pool puller was listening, then sat there forever.
The cluster_ready event never fired and the minion fixture hit
"Master did not return a session key for pillar request" timeouts.

Updated ``test_node_follower_timeout_reschedule`` to assert the new
correct behaviour (timer not None after pre-vote).
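
The fix amounts to one extra re-arm at the end of the pre-vote path; a sketch with assumed names:

```python
# Sketch: re-arm the follower timeout after sending pre-vote RPCs so an
# unanswered pre-vote retries instead of parking the node in FOLLOWER forever.
def start_pre_vote(node):
    if node.state != "FOLLOWER":
        return
    node.pre_candidacy = True
    for peer in node.voting_peers():
        peer.send_pre_request_vote(term=node.term + 1, candidate=node.node_id)
    # If the pre-vote succeeds we leave FOLLOWER before this fires, so the
    # follower_timeout_callback short-circuits and the re-arm is harmless.
    node.arm_follower_timer()
```
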
Three independent stability fixes for CI failures observed in run
25358451150 on the consensus branch:

- ``salt/cluster/consensus/service.py``: bump the cluster Raft election
  window from 150–300 ms (the ``Node`` defaults, sized for in-process
  unit tests) to 750–1500 ms — about 10× the 50 ms heartbeat interval,
  matching standard Raft tuning.  CI runners regularly delay a heartbeat
  enough to fire the election timer mid-flight; under the old window a
  follower would step up, win the term, and produce a second
  ``BECOMING LEADER`` log entry while the original leader was still
  reachable.  New ``cluster_election_min`` / ``cluster_election_max``
  opts let deployments tune.

- ``tests/pytests/integration/cluster/test_raft_cluster.py``: replace
  the historical ``"BECOMING LEADER" in log`` count with a current-leader
  check that parses the most recent ``BECOMING <state> for term N`` line
  per master.  The old grep conflated Raft's safety property (≤ 1 leader
  at any moment) with liveness churn (leadership can legitimately move
  between terms), turning every successful election that ever occurred
  during the test window into a counted leader.

- ``tests/pytests/functional/cluster/consensus/test_raft_service.py``:
  ``_make_opts`` keeps the legacy 150–300 ms window so the in-process
  functional suite stays fast.  Real cluster masters override via opts.

- ``tests/pytests/scenarios/dns/multimaster/conftest.py``:
  ``mm-master-2`` now grabs freshly-unused ports from
  ``pytestshellutils.utils.ports.get_unused_localhost_port`` instead of
  ``mm-master-1.port + 1``.  On busy CI runners the adjacent port was
  regularly already in use, so the second master aborted in 4.66 s with
  ``[Errno 98] Address already in use`` from
  ``salt.utils.verify.verify_socket``.  Test only uses master-1 in its
  body; master-2 sits in the fixture chain, so a fresh port doesn't
  change observable behaviour.
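
The current-leader check from the second item can be sketched as a parse of the most recent BECOMING line per master (the regex and helpers here are illustrative, not the test's exact code):

```python
# Sketch: derive each master's *current* Raft state from its newest
# "BECOMING <state> for term N" log line, then count current leaders only.
import re

BECOMING = re.compile(r"BECOMING (?P<state>\w+) for term (?P<term>\d+)")


def current_state(log_text):
    """Return (state, term) from the last BECOMING line, or None."""
    last = None
    for match in BECOMING.finditer(log_text):
        last = (match.group("state"), int(match.group("term")))
    return last


def current_leaders(logs_by_master):
    """logs_by_master: {master_id: log_text}; masters whose latest state is LEADER."""
    return [
        master
        for master, text in logs_by_master.items()
        if (state := current_state(text)) and state[0] == "LEADER"
    ]
```
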