feat: add scale-down cooldown period #284

Open
FocalChord wants to merge 1 commit into atlassian:master from FocalChord:feat/scale-down-cooldown

FocalChord commented Mar 6, 2026

Contributor

What

This PR adds a new optional config field scale_down_cool_down_period that prevents successive scale-down (taint) actions for a configurable duration after each taint event. It mirrors the existing scale_up_cool_down_period which already rate-limits scale-up actions.

When set, the cooldown engages after nodes are tainted and blocks further tainting until the duration expires. Unlike the scale-up lock, it only gates scale-down. Scale-up, starvation checks, and max-node-age rotation continue to operate normally while the scale-down cooldown is held.

The cooldown is disabled by default: omitting the field, leaving it empty, or setting it to "0s" all mean no cooldown. Existing deployments that do not set this field behave identically to today.

Why

Escalator currently has no rate-limiting on scale-down. After tainting nodes, the very next scan (30 seconds later) runs the full decision logic against a denominator that just shrank, because tainted nodes are excluded from the capacity calculation while their pods are still counted in the numerator. On workloads where pods do not drain immediately after tainting, this creates a feedback loop: taint nodes, utilisation rises artificially, taint more nodes, repeat.

We observed this in production on a cluster running ~92 bare-metal nodes with slow_node_removal_rate: 3:

  • Utilisation dropped to 47.8% during a demand transition
  • Escalator began tainting 3 nodes every 30-second cycle
  • Over 11 minutes, 24 nodes were tainted
  • The shrinking denominator caused a 26-point utilisation spike in a single scan cycle (48% to 74.5%) with no change in actual demand
  • Escalator reversed and untainted 22 of the 24 nodes, but 12 EC2 instances had already been terminated and needed to be replaced
  • The same pattern repeated roughly an hour later on the same cluster

The scale-up path already accounts for this kind of runaway behavior through scale_up_cool_down_period and the scaleLock mechanism. The scale-down path has no equivalent. This PR reuses the existing scaleLock struct to provide one.

Testing

Four test cases cover the following: the cooldown blocks further tainting while locked; the cooldown does not block scale-up; an empty string and "0s" both preserve existing behavior; and the lock expires after the configured duration.



FocalChord changed the title from "Add configurable scale-down cooldown period to rate-limit successive …" to "(feat) Add configurable scale-down cooldown period to rate-limit successive …" on Mar 6, 2026
FocalChord changed the title from "(feat) Add configurable scale-down cooldown period to rate-limit successive …" to "feat: add scale-down cooldown period" on Mar 6, 2026
FocalChord force-pushed the feat/scale-down-cooldown branch from 77b8d2b to e69552b on March 10, 2026 23:34
atlassian deleted a comment from atlassian-cla-bot (bot) on Mar 10, 2026
FocalChord requested a review from awprice on March 13, 2026 18:03
