Skip to content

fix(replication): use explicit bank reset signal #1294

Merged
bmuddha merged 5 commits into
masterfrom
bmuddha/fix/explicit-reset-signal
Jun 5, 2026
Merged

fix(replication): use explicit bank reset signal #1294
bmuddha merged 5 commits into
masterfrom
bmuddha/fix/explicit-reset-signal

Conversation

@bmuddha
Copy link
Copy Markdown
Collaborator

@bmuddha bmuddha commented Jun 4, 2026

Summary by CodeRabbit

  • New Features

    • Coordinated startup account cleanup that removes stale accounts while preserving protected and critical system accounts
    • Replication-level reset messaging so Primaries can trigger coordinated resets across Replicas; reset publishes are acknowledged and replicas apply/reset before replay proceeds
  • Refactor

    • Simplified disabled chainlink initialization and reduced runtime coupling in validator/replication startup
  • Tests

    • Added unit coverage verifying only stale accounts are removed during reset

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 4, 2026

Review Change Stack

Important

Review skipped

Review was skipped due to path filters

⛔ Files ignored due to path filters (1)
  • test-integration/Cargo.lock is excluded by !**/*.lock

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including **/dist/** will override the default block on the dist directory, by removing the pattern from both the lists.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6419f9c4-60d2-4ac0-9302-bbee9261234a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR implements a startup bank reset mechanism that removes stale accounts while preserving critical system and protocol accounts. The reset is coordinated through the replication layer: the validator initiates reset after ledger replay, communicates it via a new Message::Reset through the replication stream to all nodes, and both primary and replica nodes execute the reset at the appropriate slot. Chainlink's Disabled mode is modernized by removing stored state, and validator reset responsibility shifts from chainlink methods to direct accounts-db calls integrated with replication messaging.

Suggested reviewers

  • thlorenz
  • GabrielePicco
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch bmuddha/fix/explicit-reset-signal

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@magicblock-accounts-db/src/lib.rs`:
- Around line 369-371: The current removal condition only preserves accounts
whose owner() equals feature::ID but not the feature program account itself
(pubkey == feature::ID); change the predicate in the removal logic (the block
using account.owner() and feature::ID) to also preserve the account when the
account's pubkey equals feature::ID (e.g., add || account.pubkey() ==
feature::ID), or alternatively add feature::ID to reset::protected_accounts so
the feature program account is never selected for deletion.

In `@magicblock-accounts-db/src/tests.rs`:
- Around line 646-651: Add an explicit assertion that the feature program
account itself (feature::ID) survives the startup/reset path tested here: after
inserting the feature-owned account (feature_owned via env.insert_account and
AccountSharedData), also ensure env.get_account(&feature::ID) (or the test's
equivalent account lookup) returns Some(...) and matches expected
lamports/owner/state; apply the same additional assertion in the other similar
block around the 680-688 region so the test covers both the feature-owned
account and the feature::ID account survival.

In `@magicblock-api/src/magic_validator.rs`:
- Around line 908-931: The pending-intent recovery is started unconditionally
while ReplicationMode::Replica skips reset_bank(), which can cause recovery to
read stale state; modify the startup flow so pending-intent recovery is deferred
until after the replica reset has actually been applied—either by gating
recovery on the reset path for ReplicationMode::Replica (i.e., wait for
replication_tx.send(Message::Reset(...)) to be handled/applied) or by moving the
recovery start to the point after the replication service is spawned and the
Message::Reset has been consumed/applied; use the existing symbols
ReplicationMode::Replica, reset_bank(), replication_tx, Message::Reset and the
pending-intent recovery entry point to locate and change the code.

In `@magicblock-replicator/src/service/replica.rs`:
- Around line 86-101: The Reset branch for Message::Reset must honor the
slot/index ordering and persist the replica position and ack behavior: change
the fast-path to extract slot_and_index (use the same slot_and_index contract
used elsewhere), compare against self.ctx.position (or self.ctx.slot and index)
so resets are only applied when the incoming (slot,index) > current position,
call self.ctx.reset_bank().await only when appropriate, and on successful reset
persist/update the replica position to the reset's (slot,index) and ack the NATS
message (msg.ack().await); additionally, if a reset is obsolete (older or equal
to current position) ack the message to avoid redelivery loops and log
appropriately; reference Message::Reset, self.ctx.reset_bank(), msg.ack(), and
the slot_and_index/replica position update logic when making the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5a0af93a-a12d-4e5f-970e-e2ef7b2815a2

📥 Commits

Reviewing files that changed from the base of the PR and between 2a31a73 and e69c725.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (14)
  • magicblock-accounts-db/Cargo.toml
  • magicblock-accounts-db/src/lib.rs
  • magicblock-accounts-db/src/reset.rs
  • magicblock-accounts-db/src/tests.rs
  • magicblock-aperture/src/tests.rs
  • magicblock-aperture/tests/setup.rs
  • magicblock-api/src/magic_validator.rs
  • magicblock-chainlink/src/chainlink/mod.rs
  • magicblock-chainlink/tests/utils/test_context.rs
  • magicblock-core/src/link/replication.rs
  • magicblock-replicator/src/nats/mod.rs
  • magicblock-replicator/src/service/context.rs
  • magicblock-replicator/src/service/primary.rs
  • magicblock-replicator/src/service/replica.rs

Comment thread magicblock-accounts-db/src/lib.rs
Comment thread magicblock-accounts-db/src/tests.rs
Comment thread magicblock-api/src/magic_validator.rs
Comment thread magicblock-replicator/src/service/replica.rs
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
magicblock-replicator/src/service/replica.rs (1)

86-100: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Reset handling is incomplete: missing return, position update, and obsolete-reset ack.

The Reset fast-path has several issues:

  1. No return after success (line 96): After acking a successful reset, execution falls through to line 101 and continues processing until eventually hitting Message::Reset(_) => return at line 128.

  2. No position update after successful reset: Without calling self.ctx.update_position(slot, index), the replica's position doesn't advance. If the reset message is redelivered (e.g., ack failure), it will be processed again, potentially wiping post-startup accounts.

  3. Obsolete resets not acked (lines 88-90): Returning without acking causes redelivery loops for stale reset messages.

🔧 Proposed fix
         if let Message::Reset(slot) = &message {
-            let slot = *slot;
-            if self.ctx.slot > slot {
+            let (slot, index) = message.slot_and_index();
+            let obsolete = self.ctx.slot > slot
+                || (self.ctx.slot == slot && self.ctx.index >= index);
+            if obsolete {
+                // Ack obsolete resets to prevent redelivery loops
+                if let Err(error) = msg.ack().await {
+                    warn!(%error, "failed to ack obsolete reset message");
+                }
                 return;
             }
-            let result = self.ctx.reset_bank().await;
-            if result.is_ok() {
-                if let Err(error) = msg.ack().await {
-                    warn!(%error, "failed to ack nats message");
-                }
-            }
-            if let Err(error) = result {
+
+            if let Err(error) = self.ctx.reset_bank().await {
                 error!(slot, %error, "message processing error");
+                return;
             }
+
+            self.ctx.update_position(slot, index);
+
+            if let Err(error) = msg.ack().await {
+                warn!(%error, "failed to ack nats message");
+            }
+            return;
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@magicblock-replicator/src/service/replica.rs` around lines 86 - 100, The
Reset branch in the message handler (matching Message::Reset) must stop
fall-through after a successful reset: after await on self.ctx.reset_bank() and
a successful msg.ack(), return immediately; also advance the replica position by
calling self.ctx.update_position(slot, index) with the appropriate index before
returning so the reset isn’t reprocessed; finally ensure that when the incoming
reset is obsolete (self.ctx.slot > slot) you ack the message (msg.ack().await)
before returning to avoid redelivery loops. Locate the Message::Reset handling
around the reset_bank() call and msg.ack() to apply these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@magicblock-replicator/src/service/replica.rs`:
- Around line 86-100: The Reset branch in the message handler (matching
Message::Reset) must stop fall-through after a successful reset: after await on
self.ctx.reset_bank() and a successful msg.ack(), return immediately; also
advance the replica position by calling self.ctx.update_position(slot, index)
with the appropriate index before returning so the reset isn’t reprocessed;
finally ensure that when the incoming reset is obsolete (self.ctx.slot > slot)
you ack the message (msg.ack().await) before returning to avoid redelivery
loops. Locate the Message::Reset handling around the reset_bank() call and
msg.ack() to apply these changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6bf7714e-06bc-463e-bc05-6d5a8751321b

📥 Commits

Reviewing files that changed from the base of the PR and between e69c725 and 93cf9dd.

⛔ Files ignored due to path filters (1)
  • test-integration/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • magicblock-api/src/magic_validator.rs
  • magicblock-replicator/src/service/replica.rs

@bmuddha bmuddha force-pushed the bmuddha/fix/explicit-reset-signal branch from 93cf9dd to 8558d5f Compare June 4, 2026 13:52
@bmuddha bmuddha self-assigned this Jun 5, 2026
@bmuddha bmuddha requested a review from lucacillario June 5, 2026 08:54
@bmuddha bmuddha enabled auto-merge (squash) June 5, 2026 08:55
@bmuddha bmuddha merged commit 2407d9e into master Jun 5, 2026
35 checks passed
@bmuddha bmuddha deleted the bmuddha/fix/explicit-reset-signal branch June 5, 2026 09:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants