Skip to content

feat: add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead#3262

Open
julienrbrt wants to merge 12 commits into
mainfrom
julien/edge-case-p2p-stall
Open

feat: add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead#3262
julienrbrt wants to merge 12 commits into
mainfrom
julien/edge-case-p2p-stall

Conversation

@julienrbrt
Copy link
Copy Markdown
Member

@julienrbrt julienrbrt commented Apr 16, 2026

Overview

Add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead

Early WIP, i do not like what the LLM has produced.

Summary by CodeRabbit

  • New Features

    • Added automatic DA retriever walkback mechanism: when P2P synchronization stalls while data availability blocks advance too far ahead, the system automatically rewinds to recover from divergence and restore synchronization stability. Includes cooldown protection to prevent oscillation.
  • Chores

    • Updated CHANGELOG to document new walkback behavior.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 16, 2026

📝 Walkthrough

Walkthrough

This PR implements automatic DA retriever walkback when P2P synchronization stalls. It introduces a callback mechanism that allows the Syncer to request DA height rewinds through the follower and subscriber layers when P2P is stuck and DA blocks have advanced too far ahead, enabling re-fetching of blocks at earlier heights once P2P recovers.

Changes

DA Retriever Walkback on P2P Stall

Layer / File(s) Summary
Handler interface contract and async retriever implementation
block/internal/da/subscriber.go, block/internal/da/async_block_retriever.go
SubscriberHandler.HandleCatchup now returns []common.DAHeightEvent alongside error instead of only error. asyncBlockRetriever implementation is updated to match the new signature, adapting early exits and final returns to two-value type.
Subscriber walkback infrastructure and RewindTo mechanism
block/internal/da/subscriber.go, block/internal/da/subscriber_test.go
Subscriber stores optional WalkbackChecker callback and implements RewindTo(daHeight) for atomic height rewind with head-reached state reset. runCatchup invokes the callback after successful HandleCatchup and calls RewindTo when callback returns non-zero height. Tests verify RewindTo is no-op for equal/higher targets and properly updates height and clears state for lower targets.
DA Follower walkback callback propagation and tests
block/internal/syncing/da_follower.go, block/internal/syncing/da_follower_test.go
DAFollower interface adds RewindTo method; DAFollowerConfig adds WalkbackChecker callback forwarded to underlying subscriber. daFollower implementation delegates RewindTo to subscriber. Tests updated to capture returned events from HandleCatchup; helper makeHeader(height) added for test setup.
Syncer P2P stall tracking, walkback logic, and tests
block/internal/syncing/syncer.go, block/internal/syncing/syncer_test.go
Syncer adds atomic state for p2pStalled flag, walkbackActive flag, and walkbackCooldownUntil timestamp. P2P error handler distinguishes external cancellation from genuine failures, setting p2pStalled only on non-external errors and resetting to false on success. walkbackCheck callback gates on p2pStalled, DA bounds, and cooldown; computes minimum DA event height and returns daHeight - 1 to rewind when height gap is detected. Tests verify activation on gap, persistence across empty events, stop on contiguity, and clearance on recovery.
Documentation update
CHANGELOG.md
Entry describing automatic DA retriever walkback when P2P stalls and DA blocks advance too far ahead.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • tac0turtle
  • chatton

Poem

🐰 When P2P falters and heights diverge too wide,
DA blocks press on, leaving blocks far behind.
Now Syncer tracks stalls with atomic care,
Rewinds the retriever through callback layers fair.
Hop back, fetch again, and sync shall prevail!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The description provides a brief overview matching the title but lacks substantive explanation of context, rationale, goals, and issue linkage required by the template. Expand the Overview section to explain the problem being solved, the technical approach, and link to issue #3262; clarify concerns about LLM-generated content or remove placeholder text.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main feature being introduced: automatic DA retriever walkback triggered by P2P stalls and diverging DA block heights.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch julien/edge-case-p2p-stall

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 16, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedMay 18, 2026, 10:05 AM

@julienrbrt julienrbrt changed the title feat: add automatic DA retriever walkback when P2P stalls and DA blocks feat: add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead Apr 16, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 16, 2026

Claude finished @julienrbrt's task in 3m 12s —— View job


Code Review

  • Gather context and read changed files
  • Review block/internal/da/async_block_retriever.go
  • Review block/internal/da/subscriber.go and tests
  • Review block/internal/syncing/da_follower.go and tests
  • Review block/internal/syncing/syncer.go and tests
  • Compile overall findings

Summary

The PR introduces a DA retriever walkback mechanism: when P2P stalls, the syncer walks the DA subscriber back one height at a time until it finds a DA height whose block events are contiguous with the node's current height. The overall concept is sound, but there are several correctness, robustness, and design issues worth addressing.


Issues

🔴 Potential panic — nil Header in walkbackCheck

syncer.go:561-566:

minHeight := events[0].Header.Height()
for _, e := range events[1:] {
    if e.Header.Height() < minHeight {
        minHeight = e.Header.Height()
    }
}

DAHeightEvent.Header is a *types.SignedHeader and can be nil (data-only events have a nil header). Calling .Height() on a nil pointer panics. A nil guard is required before accessing Header:

minHeight := uint64(math.MaxUint64)
for _, e := range events {
    if e.Header == nil {
        continue
    }
    if h := e.Header.Height(); h < minHeight {
        minHeight = h
    }
}
if minHeight == math.MaxUint64 {
    // all data-only, no block heights to compare
    ...
}

Fix this →


🟡 Walkback oscillates and stalls for large gaps

syncer.go:583: return daHeight - 1 — the mechanism steps back exactly one DA height per cooldown tick (2 s). If the gap is N DA heights wide, convergence takes O(N × 2s) but, worse, creates a sawtooth: after rewinding to daHeight-1, the catchup loop immediately re-processes daHeight-1 then re-processes daHeight, potentially re-triggering walkback after the cooldown expires. This bounces forever between two DA heights while P2P remains stalled.

Consider: (a) computing the walkback target directly from nodeHeight rather than decrementing by 1, e.g. binary-search for the right DA height or use a larger step size; or (b) at minimum documenting this as a known limitation with a TODO. Fix this →


🟡 walkbackCheck borrows s.ctx instead of the caller's context

syncer.go:554: s.store.Height(s.ctx) uses the syncer root context rather than the context passed down through runCatchup. If the subscriber's per-run context is a stricter child (e.g. has a deadline), cancellation won't propagate here. More importantly, the WalkbackChecker function type carries no context.Context:

WalkbackChecker func(daHeight uint64, events []common.DAHeightEvent) uint64

This forces the implementation to capture s.ctx — a design smell for any function that does I/O. Either thread context.Context through the callback or, if I/O in the callback is intentional, document why the root context is correct. Fix this →


🟡 No lower bound on walkback — can reach DA genesis

syncer.go:546:

if daHeight <= s.daRetrieverHeight.Load() {
    return 0
}

When walking back, daHeight is decremented until this guard fires. In a worst-case scenario (large P2P stall, DA genesis is 0, daRetrieverHeight was set to genesis), the walkback would walk all the way to DA genesis, re-fetching years of data. A more conservative lower bound (e.g. max(daRetrieverHeight, daHeight - someMaxWalkback)) would bound the damage. Fix this →


🟡 p2pStalled not cleared on external P2P cancellation

syncer.go:504-522:

wasExternallyCancelled := waitCtx.Err() != nil
...
if !wasExternallyCancelled {
    s.p2pStalled.Store(true)  // set on genuine failure
}
// no Store(false) on external cancellation
...
s.p2pStalled.Store(false)    // only on success

If a genuine failure sets p2pStalled = true, and then DA cancels the next P2P wait externally (meaning DA is now servicing blocks), p2pStalled remains true through the external-cancellation iteration. The walkback check will fire and see events contiguous with nodeHeight (since DA is filling in blocks), so it self-corrects in practice. But the intent of "P2P is stalled" is imprecise — it's actually "P2P failed at least once recently". Consider renaming to p2pLastFailedGenuinely or resetting the flag on successful external cancellation. Fix this →


🟡 Priority-height events silently discarded in HandleCatchup

da_follower.go:150: _, err := f.fetchAndPipeHeight(ctx, priorityHeight) — the events from priority-height fetches are ignored. Only the sequential-height events are returned to the WalkbackChecker. This means: if a priority hint pointed to a DA height containing the missing block, the walkback check won't see it. This is probably intentional (priority hints are best-effort), but it's undocumented and could cause the walkback to continue past the correct DA height when a priority fetch already resolved the gap. Fix this →


🟡 walkbackCooldownUntil unit not documented

syncer.go:98: walkbackCooldownUntil atomic.Int64 // unix nanoseconds — the comment is present but the field is directly compared with time.Now().UnixNano(). This is correct, but walkbackCooldownUntil.Store(0) as "not set" relies on the fact that Unix time 0 (1970) is always in the past. Add a brief comment explaining this sentinel. Fix this →


🟢 _ = events placeholder in test

da_follower_test.go:184: The returned events from HandleCatchup are discarded. Now that the signature returns events, the test should validate them (especially for the priority-hint and sequential-fetch cases). At minimum, remove _ = events and check the result is non-nil/correct for cases that should return events.


🟢 break // continue with daHeight comment removed without replacement

da_follower.go:158: The break is now uncommented. The previous // continue with daHeight was useful to clarify the loop structure. The new code has no comment explaining why the loop breaks after a successful priority fetch (vs. continuing to the sequential fetch below). Fix this →


🟢 Log level for recurring walkback trigger

syncer.go:578: s.logger.Info() fires every 2 seconds during a sustained stall. This is noisy in production logs. logger.Warn() on first activation + logger.Debug() on subsequent steps would be more appropriate.


🟢 asyncBlockRetriever.HandleCatchup always returns nil, nil

async_block_retriever.go:217: The signature change is purely mechanical to satisfy the updated interface. The nil, nil return is fine since no WalkbackChecker is wired for this subscriber. A brief comment here (e.g. // no events produced; walkback not applicable) would make this more self-documenting.


Architecture observation

The WalkbackChecker callback is passed through three layers (Syncer → DAFollowerConfig → SubscriberConfig) to get from Syncer.walkbackCheck to Subscriber.runCatchup. This introduces coupling: the Subscriber and DAFollower packages now import common.DAHeightEvent specifically to carry this callback, whereas before they were clean. An alternative would be a Subscriber.SetWalkbackHook method or an event bus, but given the PR's scope, the current approach is acceptable if the interface is documented.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.89%. Comparing base (8904387) to head (5e908de).

Files with missing lines Patch % Lines
block/internal/syncing/syncer.go 57.77% 16 Missing and 3 partials ⚠️
block/internal/syncing/da_follower.go 77.27% 4 Missing and 1 partial ⚠️
block/internal/da/subscriber.go 78.57% 2 Missing and 1 partial ⚠️
block/internal/da/async_block_retriever.go 66.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3262      +/-   ##
==========================================
+ Coverage   60.87%   60.89%   +0.01%     
==========================================
  Files         127      127              
  Lines       13762    13815      +53     
==========================================
+ Hits         8378     8412      +34     
- Misses       4473     4487      +14     
- Partials      911      916       +5     
Flag Coverage Δ
combined 60.89% <66.66%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@julienrbrt julienrbrt marked this pull request as ready for review May 18, 2026 10:05
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@CHANGELOG.md`:
- Line 14: Fix the grammatical error in the changelog entry that reads "Add
automatic DA retriever walkback when P2P stalls and DA blocks too far ahead" by
inserting the missing verb—e.g., change it to "Add automatic DA retriever
walkback when P2P stalls and DA blocks are too far ahead" (or another equivalent
phrasing) so the sentence is grammatically correct; update the entry text
exactly where that line appears.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 52376b7e-4f8a-4af0-b287-a0669edfe473

📥 Commits

Reviewing files that changed from the base of the PR and between 8904387 and 5e908de.

📒 Files selected for processing (8)
  • CHANGELOG.md
  • block/internal/da/async_block_retriever.go
  • block/internal/da/subscriber.go
  • block/internal/da/subscriber_test.go
  • block/internal/syncing/da_follower.go
  • block/internal/syncing/da_follower_test.go
  • block/internal/syncing/syncer.go
  • block/internal/syncing/syncer_test.go

Comment thread CHANGELOG.md

### Changes

- Add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead [#3262](https://github.com/evstack/ev-node/pull/3262)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix grammatical issue in the changelog entry.

The phrase "DA blocks too far ahead" is missing a verb.

📝 Proposed fix
-- Add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead [`#3262`](https://github.com/evstack/ev-node/pull/3262)
+- Add automatic DA retriever walkback when P2P stalls and DA blocks are too far ahead [`#3262`](https://github.com/evstack/ev-node/pull/3262)

Alternative phrasing:

-- Add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead [`#3262`](https://github.com/evstack/ev-node/pull/3262)
+- Add automatic DA retriever walkback when P2P stalls and DA blocks get too far ahead [`#3262`](https://github.com/evstack/ev-node/pull/3262)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- Add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead [#3262](https://github.com/evstack/ev-node/pull/3262)
- Add automatic DA retriever walkback when P2P stalls and DA blocks are too far ahead [`#3262`](https://github.com/evstack/ev-node/pull/3262)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CHANGELOG.md` at line 14, Fix the grammatical error in the changelog entry
that reads "Add automatic DA retriever walkback when P2P stalls and DA blocks
too far ahead" by inserting the missing verb—e.g., change it to "Add automatic
DA retriever walkback when P2P stalls and DA blocks are too far ahead" (or
another equivalent phrasing) so the sentence is grammatically correct; update
the entry text exactly where that line appears.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant