feat: add scale-down cooldown period #284
Open
FocalChord wants to merge 1 commit into
77b8d2b to e69552b
What
This PR adds a new optional config field, scale_down_cool_down_period, that prevents successive scale-down (taint) actions for a configurable duration after each taint event. It mirrors the existing scale_up_cool_down_period, which already rate-limits scale-up actions.
When set, the cooldown engages after nodes are tainted and blocks further tainting until the duration expires. Unlike the scale-up lock, it only gates scale-down. Scale-up, starvation checks, and max-node-age rotation continue to operate normally while the scale-down cooldown is held.
Defaults to no cooldown when the field is omitted, empty, or set to "0s". Existing deployments that do not set this field behave identically to today.
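The default handling described above could look roughly like the following. This is a minimal sketch, not the code in this PR: parseCooldown is a hypothetical helper name, and rejecting negative durations is an assumption the description does not state.

```go
package main

import (
	"fmt"
	"time"
)

// parseCooldown is a hypothetical helper (not taken from the PR).
// It maps the raw config string to a duration, where a zero result
// means "no cooldown". An omitted or empty field and "0s" both yield zero.
func parseCooldown(raw string) (time.Duration, error) {
	if raw == "" {
		return 0, nil // field omitted or empty: cooldown disabled
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid cooldown %q: %w", raw, err)
	}
	if d < 0 {
		// Assumption: negative durations are treated as a config error.
		return 0, fmt.Errorf("cooldown %q must not be negative", raw)
	}
	return d, nil
}

func main() {
	for _, raw := range []string{"", "0s", "5m"} {
		d, err := parseCooldown(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q -> cooldown enabled: %v\n", raw, d > 0)
	}
}
```

Treating zero as "disabled" keeps a single code path: callers only need to check whether the parsed duration is positive before engaging any lock.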
Why
Escalator currently has no rate-limiting on scale-down. After tainting nodes, the very next scan (30 seconds later) runs the full decision logic against a denominator that just shrank, because tainted nodes are excluded from the capacity calculation while their pods are still counted in the numerator. On workloads where pods do not drain immediately after tainting, this creates a feedback loop: taint nodes, utilisation rises artificially, taint more nodes, repeat.
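The shrinking denominator can be illustrated with a small calculation. This is a simplified sketch that models a single resource; the node count (92) and taint rate (3 per scan) come from this description, while cpuPerNode and requestedCPU are made-up numbers, and utilisationPercent is a hypothetical function rather than Escalator's actual calculation.

```go
package main

import "fmt"

// utilisationPercent models the ratio described above: pod requests
// (numerator) over the capacity of untainted nodes only (denominator).
// It returns 0 when there is no untainted capacity, purely as a guard
// against dividing by zero in this sketch.
func utilisationPercent(requestedCPU, untaintedNodes, cpuPerNode int) float64 {
	if untaintedNodes <= 0 || cpuPerNode <= 0 {
		return 0
	}
	return 100 * float64(requestedCPU) / float64(untaintedNodes*cpuPerNode)
}

func main() {
	const (
		totalNodes   = 92   // from the PR description
		taintPerScan = 3    // slow_node_removal_rate from the PR description
		cpuPerNode   = 64   // hypothetical value
		requestedCPU = 2000 // hypothetical; held constant because pods have not drained yet
	)
	// Each scan taints three more nodes while the requests stay the same,
	// so the reported utilisation climbs without any real change in load.
	for scan := 0; scan <= 4; scan++ {
		untainted := totalNodes - scan*taintPerScan
		fmt.Printf("scan %d: %d untainted nodes, utilisation %.1f%%\n",
			scan, untainted, utilisationPercent(requestedCPU, untainted, cpuPerNode))
	}
}
```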
We observed this in production on a cluster running ~92 bare-metal nodes with slow_node_removal_rate: 3:
The scale-up path already accounts for this kind of runaway behavior through scale_up_cool_down_period and the scaleLock mechanism. The scale-down path has no equivalent. This PR reuses the existing scaleLock struct to provide one.
Testing
Four test cases covering: cooldown blocks further tainting while locked, cooldown does not block scale-up, empty string and "0s" both preserve existing behavior, and lock expires after the configured duration.
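The four behaviours listed above can be sketched with a small stand-in for the lock. This is an illustration under stated assumptions, not Escalator's scaleLock or the PR's test code: scaleDownGate, decide, and the action strings are all hypothetical names, and the scan decision is reduced to a signed node delta.

```go
package main

import (
	"fmt"
	"time"
)

// scaleDownGate is an illustrative stand-in for a cooldown lock.
// A zero period means the cooldown is disabled.
type scaleDownGate struct {
	period      time.Duration
	lockedUntil time.Time
}

// locked reports whether the cooldown is still being held at time now.
func (g *scaleDownGate) locked(now time.Time) bool {
	return now.Before(g.lockedUntil)
}

// engage starts the cooldown at time now, if one is configured.
func (g *scaleDownGate) engage(now time.Time) {
	if g.period > 0 {
		g.lockedUntil = now.Add(g.period)
	}
}

// decide returns the action for one scan. A positive nodesDelta means
// scale up, a negative one means scale down (taint), zero means nothing to do.
func decide(g *scaleDownGate, nodesDelta int, now time.Time) string {
	switch {
	case nodesDelta > 0:
		return "scale-up" // never gated by the scale-down cooldown
	case nodesDelta < 0 && g.locked(now):
		return "skip-taint" // cooldown held: no further tainting
	case nodesDelta < 0:
		g.engage(now) // tainting engages the cooldown
		return "taint"
	default:
		return "no-op"
	}
}

func main() {
	g := &scaleDownGate{period: 2 * time.Minute}
	start := time.Now()
	steps := []struct {
		offset time.Duration
		delta  int
	}{
		{0, -3},
		{30 * time.Second, -3},
		{60 * time.Second, 2},
		{2 * time.Minute, -3},
	}
	for _, s := range steps {
		fmt.Printf("t+%v delta %+d -> %s\n", s.offset, s.delta, decide(g, s.delta, start.Add(s.offset)))
	}
}
```

Passing the current time in as an argument, rather than calling time.Now() inside the gate, makes expiry testable without sleeping.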