feat: add scale-down cooldown period by FocalChord · Pull Request #284 · atlassian/escalator

FocalChord · 2026-03-06T17:54:23Z

What

This PR adds a new optional config field, scale_down_cool_down_period, that prevents successive scale-down (taint) actions for a configurable duration after each taint event. It mirrors the existing scale_up_cool_down_period, which already rate-limits scale-up actions.

When set, the cooldown engages after nodes are tainted and blocks further tainting until the duration expires. Unlike the scale-up lock, it only gates scale-down. Scale-up, starvation checks, and max-node-age rotation continue to operate normally while the scale-down cooldown is held.

The cooldown is disabled by default: when the field is omitted, empty, or set to "0s", no cooldown is applied. Existing deployments that do not set this field behave identically to today.

Why

Escalator currently has no rate-limiting on scale-down. After tainting nodes, the very next scan (30 seconds later) runs the full decision logic against a denominator that just shrank, because tainted nodes are excluded from the capacity calculation while their pods are still counted in the numerator. On workloads where pods do not drain immediately after tainting, this creates a feedback loop: taint nodes, utilisation rises artificially, taint more nodes, repeat.

We observed this in production on a cluster running ~92 bare-metal nodes with slow_node_removal_rate: 3:

- Utilisation dropped to 47.8% during a demand transition
- Escalator began tainting 3 nodes every 30-second cycle
- Over 11 minutes, 24 nodes were tainted
- The shrinking denominator caused a 26-point utilisation spike in a single scan cycle (48% to 74.5%) with no change in actual demand
- Escalator reversed and untainted 22 of the 24 nodes, but 12 EC2 instances had already been terminated and needed to be replaced
- The same pattern repeated roughly an hour later on the same cluster
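The denominator effect can be illustrated with a deliberately simplified model. It assumes 92 equally sized nodes, constant pod requests, and no pods draining after a taint; the utilisation function and its parameters are hypothetical, and the model is not expected to reproduce the exact production figures above.

```go
package main

import "fmt"

// utilisation returns requested capacity as a percentage of untainted capacity.
// requested is expressed in node-equivalents; tainted nodes drop out of the
// denominator while their pods stay in the numerator.
func utilisation(requested float64, totalNodes, taintedNodes int) float64 {
	untainted := totalNodes - taintedNodes
	if untainted <= 0 {
		return 0
	}
	return requested / float64(untainted) * 100
}

func main() {
	const totalNodes = 92
	// Demand equivalent to 47.8% of the full cluster, held constant.
	requested := 0.478 * totalNodes

	// Taint 3 nodes per scan, as with slow_node_removal_rate: 3.
	for tainted := 0; tainted <= 24; tainted += 3 {
		fmt.Printf("tainted=%2d utilisation=%.1f%%\n",
			tainted, utilisation(requested, totalNodes, tainted))
	}
}
```

Under these assumptions, reported utilisation climbs from 47.8% to roughly 65% after 24 taints even though demand never changes, which is the feedback the cooldown is meant to interrupt.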

The scale-up path already accounts for this kind of runaway behavior through scale_up_cool_down_period and the scaleLock mechanism. The scale-down path has no equivalent. This PR reuses the existing scaleLock struct to provide one.

Testing

Four test cases covering: cooldown blocks further tainting while locked, cooldown does not block scale-up, empty string and "0s" both preserve existing behavior, and lock expires after the configured duration.

FocalChord changed the title ~~Add configurable scale-down cooldown period to rate-limit successive …~~ (feat) Add configurable scale-down cooldown period to rate-limit successive … Mar 6, 2026

FocalChord changed the title ~~(feat) Add configurable scale-down cooldown period to rate-limit successive …~~ feat: add scale-down cooldown period Mar 6, 2026

Add configurable scale-down cooldown period to rate-limit successive …

e69552b

…taint actions

FocalChord force-pushed the feat/scale-down-cooldown branch from 77b8d2b to e69552b Compare March 10, 2026 23:34

atlassian deleted a comment from atlassian-cla-bot Bot Mar 10, 2026

FocalChord requested a review from awprice March 13, 2026 18:03

FocalChord wants to merge 1 commit into atlassian:master from FocalChord:feat/scale-down-cooldown