fix(replication): use explicit bank reset signal #1294
Conversation
|
Important Review skippedReview was skipped due to path filters ⛔ Files ignored due to path filters (1)
CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR implements a startup bank reset mechanism that removes stale accounts while preserving critical system and protocol accounts. The reset is coordinated through the replication layer: the validator initiates reset after ledger replay, communicates it via a new Message::Reset through the replication stream to all nodes, and both primary and replica nodes execute the reset at the appropriate slot. Chainlink's Disabled mode is modernized by removing stored state, and validator reset responsibility shifts from chainlink methods to direct accounts-db calls integrated with replication messaging. Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@magicblock-accounts-db/src/lib.rs`:
- Around line 369-371: The current removal condition only preserves accounts
whose owner() equals feature::ID but not the feature program account itself
(pubkey == feature::ID); change the predicate in the removal logic (the block
using account.owner() and feature::ID) to also preserve the account when the
account's pubkey equals feature::ID (e.g., add || account.pubkey() ==
feature::ID), or alternatively add feature::ID to reset::protected_accounts so
the feature program account is never selected for deletion.
In `@magicblock-accounts-db/src/tests.rs`:
- Around line 646-651: Add an explicit assertion that the feature program
account itself (feature::ID) survives the startup/reset path tested here: after
inserting the feature-owned account (feature_owned via env.insert_account and
AccountSharedData), also ensure env.get_account(&feature::ID) (or the test's
equivalent account lookup) returns Some(...) and matches expected
lamports/owner/state; apply the same additional assertion in the other similar
block around the 680-688 region so the test covers both the feature-owned
account and the feature::ID account survival.
In `@magicblock-api/src/magic_validator.rs`:
- Around line 908-931: The pending-intent recovery is started unconditionally
while ReplicationMode::Replica skips reset_bank(), which can cause recovery to
read stale state; modify the startup flow so pending-intent recovery is deferred
until after the replica reset has actually been applied—either by gating
recovery on the reset path for ReplicationMode::Replica (i.e., wait for
replication_tx.send(Message::Reset(...)) to be handled/applied) or by moving the
recovery start to the point after the replication service is spawned and the
Message::Reset has been consumed/applied; use the existing symbols
ReplicationMode::Replica, reset_bank(), replication_tx, Message::Reset and the
pending-intent recovery entry point to locate and change the code.
In `@magicblock-replicator/src/service/replica.rs`:
- Around line 86-101: The Reset branch for Message::Reset must honor the
slot/index ordering and persist the replica position and ack behavior: change
the fast-path to extract slot_and_index (use the same slot_and_index contract
used elsewhere), compare against self.ctx.position (or self.ctx.slot and index)
so resets are only applied when the incoming (slot,index) > current position,
call self.ctx.reset_bank().await only when appropriate, and on successful reset
persist/update the replica position to the reset's (slot,index) and ack the NATS
message (msg.ack().await); additionally, if a reset is obsolete (older or equal
to current position) ack the message to avoid redelivery loops and log
appropriately; reference Message::Reset, self.ctx.reset_bank(), msg.ack(), and
the slot_and_index/replica position update logic when making the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5a0af93a-a12d-4e5f-970e-e2ef7b2815a2
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (14)
magicblock-accounts-db/Cargo.tomlmagicblock-accounts-db/src/lib.rsmagicblock-accounts-db/src/reset.rsmagicblock-accounts-db/src/tests.rsmagicblock-aperture/src/tests.rsmagicblock-aperture/tests/setup.rsmagicblock-api/src/magic_validator.rsmagicblock-chainlink/src/chainlink/mod.rsmagicblock-chainlink/tests/utils/test_context.rsmagicblock-core/src/link/replication.rsmagicblock-replicator/src/nats/mod.rsmagicblock-replicator/src/service/context.rsmagicblock-replicator/src/service/primary.rsmagicblock-replicator/src/service/replica.rs
There was a problem hiding this comment.
♻️ Duplicate comments (1)
magicblock-replicator/src/service/replica.rs (1)
86-100:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winReset handling is incomplete: missing return, position update, and obsolete-reset ack.
The Reset fast-path has several issues:
No
returnafter success (line 96): After acking a successful reset, execution falls through to line 101 and continues processing until eventually hittingMessage::Reset(_) => returnat line 128.No position update after successful reset: Without calling
self.ctx.update_position(slot, index), the replica's position doesn't advance. If the reset message is redelivered (e.g., ack failure), it will be processed again, potentially wiping post-startup accounts.Obsolete resets not acked (lines 88-90): Returning without acking causes redelivery loops for stale reset messages.
🔧 Proposed fix
if let Message::Reset(slot) = &message { - let slot = *slot; - if self.ctx.slot > slot { + let (slot, index) = message.slot_and_index(); + let obsolete = self.ctx.slot > slot + || (self.ctx.slot == slot && self.ctx.index >= index); + if obsolete { + // Ack obsolete resets to prevent redelivery loops + if let Err(error) = msg.ack().await { + warn!(%error, "failed to ack obsolete reset message"); + } return; } - let result = self.ctx.reset_bank().await; - if result.is_ok() { - if let Err(error) = msg.ack().await { - warn!(%error, "failed to ack nats message"); - } - } - if let Err(error) = result { + + if let Err(error) = self.ctx.reset_bank().await { error!(slot, %error, "message processing error"); + return; } + + self.ctx.update_position(slot, index); + + if let Err(error) = msg.ack().await { + warn!(%error, "failed to ack nats message"); + } + return; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@magicblock-replicator/src/service/replica.rs` around lines 86 - 100, The Reset branch in the message handler (matching Message::Reset) must stop fall-through after a successful reset: after await on self.ctx.reset_bank() and a successful msg.ack(), return immediately; also advance the replica position by calling self.ctx.update_position(slot, index) with the appropriate index before returning so the reset isn’t reprocessed; finally ensure that when the incoming reset is obsolete (self.ctx.slot > slot) you ack the message (msg.ack().await) before returning to avoid redelivery loops. Locate the Message::Reset handling around the reset_bank() call and msg.ack() to apply these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@magicblock-replicator/src/service/replica.rs`:
- Around line 86-100: The Reset branch in the message handler (matching
Message::Reset) must stop fall-through after a successful reset: after await on
self.ctx.reset_bank() and a successful msg.ack(), return immediately; also
advance the replica position by calling self.ctx.update_position(slot, index)
with the appropriate index before returning so the reset isn’t reprocessed;
finally ensure that when the incoming reset is obsolete (self.ctx.slot > slot)
you ack the message (msg.ack().await) before returning to avoid redelivery
loops. Locate the Message::Reset handling around the reset_bank() call and
msg.ack() to apply these changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6bf7714e-06bc-463e-bc05-6d5a8751321b
⛔ Files ignored due to path filters (1)
test-integration/Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (2)
magicblock-api/src/magic_validator.rsmagicblock-replicator/src/service/replica.rs
93cf9dd to
8558d5f
Compare
Summary by CodeRabbit
New Features
Refactor
Tests