Skip to content

feat: coordination mode aware crank#1232

Open
Dodecahedr0x wants to merge 14 commits into
masterfrom
dode/mode-aware-crank
Open

feat: coordination mode aware crank#1232
Dodecahedr0x wants to merge 14 commits into
masterfrom
dode/mode-aware-crank

Conversation

@Dodecahedr0x
Copy link
Copy Markdown
Contributor

@Dodecahedr0x Dodecahedr0x commented May 20, 2026

Closes #1231

Summary

Disables the crank service when the node starts in replica mode

Breaking Changes

  • None

Summary by CodeRabbit

  • New Features

    • Scheduler now waits for coordination startup, loads persisted tasks (clearing/rescheduling in-memory state) and only begins scheduling when in primary mode; in standby/replica mode startup returns immediately.
  • Tests

    • Added and updated serialized tests to validate startup behavior across coordination states, including a test ensuring standby mode returns immediately.
  • Chores

    • Added a dev dependency to enable serialized test execution.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

Review Change Stack

Warning

Review limit reached

@Dodecahedr0x, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 59 minutes and 3 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0ff21d1d-5974-40b8-b115-2c56dbf82223

📥 Commits

Reviewing files that changed from the base of the PR and between 897ac4d and 917c101.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • magicblock-api/src/magic_validator.rs
  • magicblock-task-scheduler/src/service.rs
📝 Walkthrough

Walkthrough

TaskScheduler startup is now mode-aware: MagicValidator waits for CoordinationMode to exit StartingUp and only spawns the TaskScheduler when the final mode is Primary. TaskSchedulerService::start now awaits load_persisted_tasks() before spawning the run loop. load_persisted_tasks clears in-memory scheduling state then reschedules persisted tasks and returns Ok(()). Tests were updated to use serial_test, many tests marked #[serial], coordination mode is switched in tests where needed, and a new test verifies start() returns immediately in replica/standby mode. A dev-dependency for serial_test was added.

Assessment against linked issues

Objective Addressed Explanation
Disable the task scheduler service for replicas [#1231]

Suggested reviewers

  • GabrielePicco
  • taco-paco
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dode/mode-aware-crank

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@magicblock-task-scheduler/src/service.rs`:
- Around line 735-749: The test
test_task_scheduler_does_not_start_on_standby_mode calls
switch_to_replica_mode() and never restores coordination state; capture the
current coordination mode before calling switch_to_replica_mode() (e.g., let
prev = get_coordination_mode()), then after the test body restore it (e.g.,
set_coordination_mode(prev)) in a finally/teardown block (or use a RAII guard)
so switch_to_replica_mode()’s mutation does not leak to other tests; ensure you
reference and use the existing helpers (switch_to_replica_mode,
get_coordination_mode, set_coordination_mode or equivalent) when implementing
the restore.
- Around line 501-507: The is_primary_mode loop can hang forever; wrap the
waiting loop in a bounded timeout (e.g. using tokio::time::timeout with a
reasonable Duration like 10s) and return false if the timeout elapses, and also
make the inner wait cancellation-aware by using tokio::select! to await either
the sleep or an existing shutdown/cancellation token (or future) if one exists;
modify async fn is_primary_mode(&self) to use
tokio::time::timeout(Duration::from_secs(...), async { loop { ... } }).await
handling the Err(timeout) case and/or the cancellation branch so startup cannot
deadlock while polling CoordinationMode::current.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ca1d00d1-f89c-463a-a54d-3c0e34e1e4bd

📥 Commits

Reviewing files that changed from the base of the PR and between ac36179 and cb607bd.

📒 Files selected for processing (1)
  • magicblock-task-scheduler/src/service.rs

Comment thread magicblock-task-scheduler/src/service.rs Outdated
Comment thread magicblock-task-scheduler/src/service.rs Outdated
@Dodecahedr0x Dodecahedr0x marked this pull request as ready for review May 21, 2026 10:06
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
magicblock-task-scheduler/src/service.rs (1)

762-775: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Fix is_primary_mode placement and swap the misleading test names

  • In magicblock-task-scheduler/src/service.rs, async fn is_primary_mode(&self) -> bool is currently declared inside fn next_execution_millis(...) (nested item), so it cannot be an associated TaskSchedulerService method called via self.is_primary_mode(). Move is_primary_mode into impl TaskSchedulerService (or a separate impl block) and close next_execution_millis after returning its i64.
  • The test names are swapped around the current test_stale_crank_completion_* / test_failed_records_* definitions: the first test is exercising periodic cleanup, and the second test is exercising “stale crank completion does not mutate replacement”.
♻️ Suggested layout
impl TaskSchedulerService {
    // ... existing methods ...

    /// Waits until the coordination mode is not StartingUp.
    /// Should be fast because task scheduler is started after the ledger replay completes.
    async fn is_primary_mode(&self) -> bool {
        let mut mode = CoordinationMode::current();
        while mode == CoordinationMode::StartingUp {
            tokio::select! {
                _ = self.token.cancelled() => return false,
                _ = tokio::time::sleep(Duration::from_millis(100)) => {}
            }
            mode = CoordinationMode::current();
        }
        mode == CoordinationMode::Primary
    }
}

fn next_execution_millis(
    last_execution_millis: i64,
    execution_interval_millis: i64,
    now: i64,
) -> i64 {
    if last_execution_millis == 0 {
        now
    } else {
        last_execution_millis + execution_interval_millis
    }
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@magicblock-task-scheduler/src/service.rs` around lines 762 - 775, The
is_primary_mode async fn is currently nested inside next_execution_millis which
prevents calling self.is_primary_mode(); move the async fn
is_primary_mode(&self) into the impl TaskSchedulerService (or its own impl) so
it becomes an associated method, and ensure next_execution_millis(...) is a
standalone free function that returns its i64 before the impl block closes; also
rename/swap the two test identifiers so the test exercising periodic cleanup is
named test_stale_crank_completion_* and the one exercising “stale crank
completion does not mutate replacement” is named test_failed_records_* to match
their behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@magicblock-task-scheduler/src/service.rs`:
- Line 1004: The two test functions' identifiers are swapped: the function
currently named test_stale_crank_completion_does_not_mutate_replaced_task
actually tests periodic cleanup, and the function named
test_failed_records_are_cleaned_up_periodically actually tests stale-crank
handling. Rename the functions so the names match their behavior — change the
function currently declared as
test_stale_crank_completion_does_not_mutate_replaced_task to
test_failed_records_are_cleaned_up_periodically, and change the one declared as
test_failed_records_are_cleaned_up_periodically to
test_stale_crank_completion_does_not_mutate_replaced_task — keeping their
bodies, assertions, and any attributes unchanged (refer to the two test fn
symbols to locate and swap the identifiers).

---

Outside diff comments:
In `@magicblock-task-scheduler/src/service.rs`:
- Around line 762-775: The is_primary_mode async fn is currently nested inside
next_execution_millis which prevents calling self.is_primary_mode(); move the
async fn is_primary_mode(&self) into the impl TaskSchedulerService (or its own
impl) so it becomes an associated method, and ensure next_execution_millis(...)
is a standalone free function that returns its i64 before the impl block closes;
also rename/swap the two test identifiers so the test exercising periodic
cleanup is named test_stale_crank_completion_* and the one exercising “stale
crank completion does not mutate replacement” is named test_failed_records_* to
match their behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7b2b401f-742a-4009-8732-f044b43a254a

📥 Commits

Reviewing files that changed from the base of the PR and between bc9162e and 9e3855e.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • magicblock-task-scheduler/src/service.rs

Comment thread magicblock-task-scheduler/src/service.rs
Copy link
Copy Markdown
Contributor

@taco-paco taco-paco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comment on design, but never the less this should work as well

pub async fn start(
mut self,
) -> TaskSchedulerResult<JoinHandle<TaskSchedulerResult<()>>> {
if self.is_primary_mode().await {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest doing it on calling site level(magic_validator.rs). It seems like task scheduler wether works or not, so from point crank service it shouldn't be aware of mode at the moment.

Copy link
Copy Markdown
Collaborator

@bmuddha bmuddha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but may addressing the concern by @taco-paco is worth it before the merge

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@magicblock-api/src/magic_validator.rs`:
- Around line 944-981: The current code consumes task_scheduler inside the
startup-time primary check so if the process boots as a replica the scheduler is
dropped and never started after a Replica→Primary promotion; change the logic to
not consume task_scheduler at startup but instead defer starting it until
CoordinationMode transitions to Primary (e.g., wrap task_scheduler in a shared
container like Arc<Mutex<Option<TaskScheduler>>> or keep it as an Option on the
struct and spawn a background task that waits using CoordinationMode::current()
/ token.cancelled() and only calls task_scheduler.start() when mode ==
CoordinationMode::Primary), ensure the join_handle is awaited / logged as in the
existing tokio::spawn block, and remove the early drop so promotions
(ReplicationMode::StandBy → Primary) can start the scheduler later.

In `@magicblock-task-scheduler/src/service.rs`:
- Around line 1036-1053: The test
test_task_scheduler_does_not_start_on_standby_mode is asserting behavior that
TaskSchedulerService::start no longer provides (start now always spawns run()),
so the timeout+unwrap chain will fail; remove or move this test to the
higher-level magic_validator suite where replica/primary gating lives, and in
this file replace it with a service-level test that asserts service.start()
returns a running handle (i.e., does not immediately complete) and that run()
can be cleanly cancelled or stopped via the service stop/shutdown API; update
references to service.start(), TaskSchedulerService::start, and run() in the new
test to reflect the service contract rather than mode-gating.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 50144489-6dd1-438d-939b-df3c5b914578

📥 Commits

Reviewing files that changed from the base of the PR and between 9e3855e and 897ac4d.

📒 Files selected for processing (2)
  • magicblock-api/src/magic_validator.rs
  • magicblock-task-scheduler/src/service.rs

Comment thread magicblock-api/src/magic_validator.rs
Comment thread magicblock-task-scheduler/src/service.rs Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: crank is not aware of coordination mode

3 participants