fix(storer): unstick radius decrement on live networks#5463
Open
martinconic wants to merge 1 commit into
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Checklist
Description
Nodes whose local
storageRadiusdrifts above the neighborhood consensus stay stuck there indefinitely. Salud (pkg/salud/salud.go:243) comparesCommittedDepth = capacityDoubling + Radiusagainst the network mode with zero tolerance, flips the node unhealthy on any mismatch, and redistribution stops playing rounds. The reporter of #5428 observed this on bee 2.7.1-rc2 after an RPC instability event; the reporter of #5396 reproduced the same condition with code-level analysis.The reserve worker's radius-decrement gate
db.syncer.SyncRate() == 0is structurally unreachable on a live mainnet. SyncRate() measures historical sync only, but historical sync never goes quiet for the 15-minute rate window because (1) peer churn keeps starting new cursor-based historical workers and (2) every successful decrement calls puller.onChange → resetIntervals(prevRadius) which discards non-neighbor sync history and forces re-historical-sync, bumping SyncRate above zero for the next ~15-30 min. The gate locks closed permanently. Nodes whose local Radius drifts above the neighborhood consensus stay there indefinitely — salud (zero-tolerance comparison of CommittedDepth = capacityDoubling + Radius) flips the node unhealthy and redistribution stops playing rounds. Restart doesn't help because waitNetworkRFunc re-loads the wrong local value.Capacity-doubled nodes are structurally worse off: they pull more data, their SyncRate floor is higher, and the +1 from doubling amplifies any Radius mismatch into CommittedDepth mismatch.
Three changes, applied together:
Remove the SyncRate gate. The threshold check (count < threshold(capacity)) is already a correctness signal — if less than half of capacity sits within radius, the radius is wrong, regardless of sync activity. The unreserve path raises radius without consulting SyncRate; the asymmetry was unjustified.
Allow a two-step decrement when count is far below threshold (count*4 <= threshold). Under approximately uniform CAC bin distribution, each step doubles count-within-radius, so two steps land at or below threshold rather than overshooting. Capped at two to keep behavior bounded; each radius change triggers one resetIntervals so multi-step in one tick is strictly cheaper than the equivalent single-step ticks 5 minutes apart.
Reduce reserveWakeUpDuration from 15 to 5 minutes. With the gate gone, faster ticks bound worst-case recovery; tick cost is one countWithinRadius iteration.
The existing test "radius doesn't change due to non-zero pull rate encoded the buggy behavior as an invariant. Renamed and inverted to assert the new contract: with an empty reserve, the worker decrements toward minimumRadius regardless of the syncer's reported rate.
No protocol, storage layout, persistence format, or interface changes. The Syncer.SyncRate() interface stays — it's still used for "fully synced" determination at startup and for the status protocol's pullsyncRate field, both correct uses.
Fixes #5396
Fixes #5428
Open API Spec Version Changes (if applicable)
Motivation and Context (Optional)
Related Issue (Optional)
Screenshots (if appropriate):
AI Disclosure