
fix(storer): unstick radius decrement on live networks #5463

Open

martinconic wants to merge 1 commit into master from fix/5396-radius-decrement

Conversation

martinconic (Contributor) commented May 13, 2026

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

Nodes whose local storageRadius drifts above the neighborhood consensus stay stuck there indefinitely. Salud (pkg/salud/salud.go:243) compares CommittedDepth = capacityDoubling + Radius against the network mode with zero tolerance; any mismatch flips the node unhealthy, and redistribution stops playing rounds. The reporter of #5428 observed this on bee 2.7.1-rc2 after an RPC instability event; the reporter of #5396 reproduced the same condition through code-level analysis.
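
For illustration, a minimal sketch of that zero-tolerance comparison. Only the formula CommittedDepth = capacityDoubling + Radius and the comparison against the network mode come from the description above; the function and variable names are assumptions, not the actual salud code:

```go
package main

import "fmt"

// Sketch of salud's zero-tolerance depth check. Only the formula and
// the mode comparison are from the PR text; names are illustrative.
func healthy(capacityDoubling, radius, networkMode uint8) bool {
	committedDepth := capacityDoubling + radius
	// Zero tolerance: any mismatch with the neighborhood mode flips
	// the node unhealthy, and redistribution stops playing rounds.
	return committedDepth == networkMode
}

func main() {
	fmt.Println(healthy(1, 10, 11)) // true: committed depth 11 matches mode 11
	fmt.Println(healthy(1, 11, 11)) // false: radius drifted one above consensus
}
```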

The reserve worker's radius-decrement gate db.syncer.SyncRate() == 0 is structurally unreachable on a live mainnet. SyncRate() measures historical sync only, but historical sync never goes quiet for the full 15-minute rate window, because (1) peer churn keeps starting new cursor-based historical workers and (2) every successful decrement calls puller.onChange → resetIntervals(prevRadius), which discards non-neighbor sync history and forces historical sync to restart, keeping SyncRate above zero for the next ~15-30 minutes. The gate therefore locks closed permanently. A restart doesn't help, because waitNetworkRFunc re-loads the wrong local value.
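
A condensed sketch of that gate shape, assuming a threshold of half capacity (see change 1 below); the syncRate == 0 condition and the threshold check are from the description, everything else is illustrative:

```go
package storer

// oldDecrement condenses the pre-fix decrement path. The gate and the
// threshold check are from the PR description; the function shape and
// names are illustrative, not the repository's actual code.
func oldDecrement(count, capacity int, radius uint8, syncRate float64) uint8 {
	// On a live network syncRate never stays at zero for a full
	// 15-minute window, so this branch is unreachable and the radius
	// can never come back down.
	if syncRate == 0 && count < threshold(capacity) {
		return radius - 1
	}
	return radius
}

// threshold: half of capacity must sit within radius, per change (1).
func threshold(capacity int) int { return capacity / 2 }
```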

Capacity-doubled nodes are structurally worse off: they pull more data, so their SyncRate floor is higher, and the +1 from doubling amplifies any Radius mismatch into a CommittedDepth mismatch.

Three changes, applied together:

  1. Remove the SyncRate gate. The threshold check (count < threshold(capacity)) is already a correctness signal: if less than half of capacity sits within radius, the radius is wrong, regardless of sync activity. The unreserve path raises the radius without consulting SyncRate; the asymmetry was unjustified.

  2. Allow a two-step decrement when count is far below threshold (count*4 <= threshold). Under an approximately uniform CAC bin distribution, each step doubles count-within-radius, so two steps land at or below threshold rather than overshooting. The step is capped at two to keep behavior bounded; each radius change triggers one resetIntervals, so a multi-step decrement in one tick is strictly cheaper than the equivalent single-step ticks 5 minutes apart. (See the sketch after this list.)

  3. Reduce reserveWakeUpDuration from 15 to 5 minutes. With the gate gone, faster ticks bound worst-case recovery; tick cost is one countWithinRadius iteration.
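
Taken together, a minimal sketch of the reworked step under those three changes. The gating conditions (count < threshold, count*4 <= threshold), the two-step cap, and the 5-minute wake-up are from the list above; the function shape and the minimumRadius handling are assumptions:

```go
package storer

import "time"

const reserveWakeUpDuration = 5 * time.Minute // change (3): was 15 minutes

// decrementStep sketches the post-fix decrement logic; note that
// SyncRate is no longer consulted anywhere (change 1).
func decrementStep(count, capacity int, radius, minimumRadius uint8) uint8 {
	t := capacity / 2 // threshold(capacity): half of capacity
	if radius <= minimumRadius || count >= t {
		return radius // at the floor, or enough chunks sit within radius
	}
	step := uint8(1)
	// Change (2): far below threshold, take two steps. Under a roughly
	// uniform CAC bin distribution each step doubles count-within-radius,
	// so count*4 <= t means two steps still land at or below threshold.
	if count*4 <= t && radius-1 > minimumRadius {
		step = 2
	}
	return radius - step
}
```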

The existing test "radius doesn't change due to non-zero pull rate" encoded the buggy behavior as an invariant. It has been renamed and inverted to assert the new contract: with an empty reserve, the worker decrements toward minimumRadius regardless of the syncer's reported rate.
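
A hypothetical condensed version of that test contract; the mock and all names here are assumptions, and only the asserted behavior (with an empty reserve, decrement toward minimumRadius regardless of the reported rate) comes from the description:

```go
package storer_test

import "testing"

type mockSyncer struct{ rate float64 }

func (m *mockSyncer) SyncRate() float64 { return m.rate }

func TestRadiusDecrementsDespiteNonZeroPullRate(t *testing.T) {
	syncer := &mockSyncer{rate: 1.0} // a live network never reports zero
	const minimumRadius = 0

	radius := 8
	for tick := 0; tick < 16 && radius > minimumRadius; tick++ {
		// New contract: the decrement path no longer consults the
		// rate; an empty reserve always sits below threshold, so
		// every tick steps the radius down.
		_ = syncer.SyncRate()
		radius--
	}
	if radius != minimumRadius {
		t.Fatalf("radius = %d, want %d", radius, minimumRadius)
	}
}
```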

No protocol, storage layout, persistence format, or interface changes. The Syncer.SyncRate() interface stays: it is still used for the "fully synced" determination at startup and for the status protocol's pullsyncRate field, both of which remain correct uses.

Fixes #5396
Fixes #5428

Open API Spec Version Changes (if applicable)

Motivation and Context (Optional)

Related Issue (Optional)

Screenshots (if appropriate):

AI Disclosure

  • This PR contains code that has been generated by an LLM.
  • I have reviewed the AI generated code thoroughly.
  • I possess the technical expertise to responsibly review the code generated in this PR.

@martinconic martinconic self-assigned this May 13, 2026
@martinconic martinconic added this to the 2026 milestone May 13, 2026
