CNTRLPLANE-3040: add regression tests for CAS pause/unpause replica clamping by jparrill · Pull Request #8262 · openshift/hypershift

jparrill · 2026-04-16T14:49:34Z

Summary

Add unit and E2E regression tests for OCPBUGS-78152 / CNTRLPLANE-3040
Prevents regression of the bug where CAS accumulated replica decrements on paused MachineDeployments, causing catastrophic scale-down on unpause
Extracts shared test helpers (getFirstNodePool, waitForAutoscalerDeploymentReady) and constants to reduce duplication across autoscaling tests

Changes

Unit tests (`capi_test.go`)

TestPauseUnpauseCycle — 9 table-driven test cases:

MachineDeployment (Replace) and MachineSet (InPlace) upgrade types
Replica clamping: unchanged, decremented to min, below min, above max
Non-AWS platform min=0 → effectiveMin=1 enforcement (KubeVirt)
Verifies: pause annotation set/removed, replicas clamped within [min,max], autoscaler annotations preserved

E2E test (`autoscaling_test.go`)

testAutoscalerRespectsNodePoolPause — placed between testAutoscaling and testAutoscalingBalancing:

Scale up NodePool to max with workload
Configure aggressive CAS timers (unneeded=60s, delayAfterAdd=30s)
Pause NodePool → verify MachineDeployment pause annotation
Delete workload → wait 3min for CAS to attempt scale-down
Verify replicas unchanged while paused
Unpause → verify stability (no catastrophic scale-down)
Clean up to neutral state

DRY improvements

getFirstNodePool() — replaces 5-line list+assert pattern in 3 test functions
waitForAutoscalerDeploymentReady() — replaces 15-line EventuallyObject pattern in 2 test functions
Constants: aggressiveDelayAfterAddSeconds, aggressiveUnneededDurationSeconds, casScaleDownWaitDuration
Uses capiv1.PausedAnnotation constant instead of string literal

Test plan

go vet passes on both packages
TestPauseUnpauseCycle — 9/9 cases PASS
E2E compiles cleanly (go vet -tags e2e ./test/e2e/)
E2E verified on live cluster: replicas stayed at 3 during 3-minute pause, unpause was clean
CI e2e-autoscaling job passes

Refs: OCPBUGS-78152, CNTRLPLANE-3040

🤖 Generated with Claude Code

Summary by CodeRabbit

Tests
- Added a regression test for the NodePool pause → replica drift → unpause cycle, verifying pause propagation, autoscaler annotation preservation, and replica clamping on reconcile.
- Added an end-to-end test confirming the cluster-autoscaler detects paused node groups, prevents scale-down while paused, and resumes normal scaling after unpause.
Refactor
- Consolidated autoscaling test helpers and shared timing constants to reduce duplication and stabilize timing-sensitive assertions.

openshift-merge-bot · 2026-04-16T14:49:37Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

coderabbitai · 2026-04-16T14:49:49Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Added a unit test TestPauseUnpauseCycle and an e2e subtest TestAutoscalerRespectsNodePoolPause to validate NodePool pause/unpause behavior with autoscaling. The unit test exercises pausing CAPI MachineDeployment and MachineSet objects, simulates autoscaler-driven replica drift while paused, invokes the reconcile that should unpause, and asserts pause annotation removal, preservation of autoscaler min/max annotations (with effective min clamped to 1 on non‑AWS when autoscalingMin==0), and that Spec.Replicas is clamped to [effectiveMin, autoscalingMax]. The e2e test configures aggressive autoscaler timing, scales a NodePool up via a workload, pauses the NodePool (setting Spec.PausedUntil and checking the CAPI paused annotation), verifies Cluster Autoscaler detects the paused node group and does not scale down while paused, then unpauses and verifies cluster convergence. Several test helpers and timing constants were added and duplicate listing logic was consolidated.

Sequence Diagram(s)

sequenceDiagram
  participant Test as Test Harness
  participant NP as NodePool
  participant CAPI as CAPI (MachineDeployment / MachineSet)
  participant Work as Workload
  participant CA as Cluster Autoscaler
  participant Cluster as Cluster Nodes

  Test->>NP: enable autoscaling & set pausedUntil / pause
  Test->>Work: create workload to increase load
  Work->>Cluster: create pods -> triggers scale up
  Cluster->>CAPI: increase Spec.Replicas (autoscaler effect)
  Test->>CAPI: verify PausedAnnotation present
  Test->>CA: ensure CA ready (aggressive timers)
  Test->>Work: delete workload
  CA->>CA: evaluate nodegroups
  CA->>CAPI: detect paused node group (logs "discovered a paused node group")
  Note right of CAPI: while paused, Spec.Replicas remains unchanged
  Test->>NP: remove pausedUntil / unpause
  Test->>CAPI: trigger reconcile -> unpause (remove annotation)
  CAPI->>CAPI: clamp Spec.Replicas to [effectiveMin, autoscalingMax]
  CAPI->>Cluster: adjust node count
  Cluster->>Test: converge to expected numNodes

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (3 warnings, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality	⚠️ Warning	New tests lack descriptive failure messages on critical assertions and some Eventually-style waits omit explicit timeouts, hindering failure diagnosis and risking test hangs.	Add message parameters to all critical assertions and explicit timeouts to Eventually/Consistently calls; refactor repeated setup/teardown into BeforeEach/AfterEach.
Single Node Openshift (Sno) Test Compatibility	⚠️ Warning	The testAutoscalerRespectsNodePoolPause test makes explicit multi-node assumptions by scaling from 1 to 3 nodes and distributing workloads across multiple nodes, with no SNO guard mechanisms.	Add [Skipped:SingleReplicaTopology] label to parent test or exutil.IsSingleNode() check with g.Skip() to skip on SNO platforms.
Stable And Deterministic Test Names	❓ Inconclusive	Test files referenced in the PR summary are not accessible in the repository context, preventing verification of deterministic test naming.	Provide access to hypershift-operator/controllers/nodepool/capi_test.go and test/e2e/autoscaling_test.go to verify test names use only static strings without dynamic values.

✅ Passed checks (6 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically summarizes the main change: adding regression tests for Cluster Autoscaler pause/unpause replica clamping, directly aligned with the PR's core objective.
Microshift Test Compatibility	✅ Passed	The new e2e test uses standard Go testing patterns (t.Run() and Gomega) rather than Ginkgo patterns (It(), Describe(), Context(), When()), so the Ginkgo-specific check does not apply.
Topology-Aware Scheduling Compatibility	✅ Passed	PR adds test-only regression tests for CAS pause/unpause replica clamping without introducing topology-specific constraints or deployment manifests.
Ote Binary Stdout Contract	✅ Passed	The PR adds no process-level stdout writes violating the OTE Binary Stdout Contract. All test output uses appropriate mechanisms: t.Logf() for test functions and controller-runtime logging defaults to stderr.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	New tests use standard Go testing framework with no IPv4 assumptions, external connectivity requirements, hardcoded addresses, or external registry URLs.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-04-16T14:50:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jparrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [jparrill]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/autoscaling_test.go`:
- Around line 551-592: The test incorrectly requires the MachineDeployment
replica count to equal replicasBeforePause and waits for `max` ready nodes
immediately after unpause; remove the strict equality check comparing
`replicasAfterUnpause` to `replicasBeforePause` and drop the immediate
`e2eutil.WaitForNReadyNodes(..., max, ...)` assertion. Instead, after confirming
the MachineDeployment is unpaused (the existing `EventuallyObject` block), log
the current replica count if desired and proceed directly to wait for
convergence back to `numNodes` using `e2eutil.WaitForNReadyNodes(t, ctx,
guestClient, numNodes, hostedCluster.Spec.Platform.Type)` (the existing Step 9),
ensuring the test only asserts eventual scale-down to `numNodes` and not an
immediate preservation of `max`. Reference symbols: `replicasAfterUnpause`,
`replicasBeforePause`, `max`, `numNodes`, and `e2eutil.WaitForNReadyNodes`.
- Around line 438-493: This subtest never enables autoscaling on the NodePool
returned by getFirstNodePool, so make the test explicitly enable autoscaling and
clear fixed replicas before waiting for scale-up: fetch the NodePool (nodepool),
set nodepool.Spec.AutoScaling to the desired hyperv1.NodePoolAutoScaling config
and set nodepool.Spec.Replicas = nil (or remove fixed replicas), then persist
the change with mgtClient.Update inside a retry.RetryOnConflict block (use
crclient.ObjectKeyFromObject(nodepool) to refetch) so the NodePool is
independently configured for autoscaling even if other subtests are not run or
reordered.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 32012c66-572c-455c-ac17-d0c2ff29d53a

📥 Commits

Reviewing files that changed from the base of the PR and between 846f2e9 and 1da08b9.

📒 Files selected for processing (2)

hypershift-operator/controllers/nodepool/capi_test.go
test/e2e/autoscaling_test.go

codecov · 2026-04-16T15:04:44Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 39.85%. Comparing base (96a2bcb) to head (ec5bc72).
⚠️ Report is 141 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8262      +/-   ##
==========================================
- Coverage   39.85%   39.85%   -0.01%     
==========================================
  Files         751      751              
  Lines       92554    92556       +2     
==========================================
- Hits        36889    36888       -1     
- Misses      52992    52994       +2     
- Partials     2673     2674       +1

see 1 file with indirect coverage changes

Flag	Coverage Δ
cmd-support	`34.03% <ø> (ø)`
cpo-hostedcontrolplane	`40.52% <ø> (-0.02%)`	⬇️
cpo-other	`40.08% <ø> (ø)`
hypershift-operator	`50.38% <ø> (ø)`
other	`30.70% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

openshift-ci-robot · 2026-04-16T15:33:57Z

@jparrill: This pull request references CNTRLPLANE-3040 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Add unit and E2E regression tests for OCPBUGS-78152 / CNTRLPLANE-3040

Prevents regression of the bug where CAS accumulated replica decrements on paused MachineDeployments, causing catastrophic scale-down on unpause

Extracts shared test helpers (getFirstNodePool, waitForAutoscalerDeploymentReady) and constants to reduce duplication across autoscaling tests

Changes

Unit tests (capi_test.go)

TestPauseUnpauseCycle — 9 table-driven test cases:

MachineDeployment (Replace) and MachineSet (InPlace) upgrade types

Replica clamping: unchanged, decremented to min, below min, above max

Non-AWS platform min=0 → effectiveMin=1 enforcement (KubeVirt)

Verifies: pause annotation set/removed, replicas clamped within [min,max], autoscaler annotations preserved

E2E test (autoscaling_test.go)

testAutoscalerRespectsNodePoolPause — placed between testAutoscaling and testAutoscalingBalancing:

Scale up NodePool to max with workload

Configure aggressive CAS timers (unneeded=60s, delayAfterAdd=30s)

Pause NodePool → verify MachineDeployment pause annotation

Delete workload → wait 3min for CAS to attempt scale-down

Verify replicas unchanged while paused

Unpause → verify stability (no catastrophic scale-down)

Clean up to neutral state

DRY improvements

getFirstNodePool() — replaces 5-line list+assert pattern in 3 test functions

waitForAutoscalerDeploymentReady() — replaces 15-line EventuallyObject pattern in 2 test functions

Constants: aggressiveDelayAfterAddSeconds, aggressiveUnneededDurationSeconds, casScaleDownWaitDuration

Uses capiv1.PausedAnnotation constant instead of string literal

Test plan

go vet passes on both packages

TestPauseUnpauseCycle — 9/9 cases PASS

E2E compiles cleanly (go vet -tags e2e ./test/e2e/)

E2E verified on live cluster: replicas stayed at 3 during 3-minute pause, unpause was clean

CI e2e-autoscaling job passes

Refs: OCPBUGS-78152, CNTRLPLANE-3040

🤖 Generated with Claude Code

Summary by CodeRabbit

Tests

Added test coverage for NodePool pause→unpause cycle, validating autoscaler annotations and replica clamping behavior during pause state transitions.

Added end-to-end test verifying cluster-autoscaler respects NodePool pause, preventing unwanted scale-down operations while paused.

Refactored test helpers for improved code reusability in autoscaling test suite.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2026-04-16T15:49:57Z

@jparrill: This pull request references CNTRLPLANE-3040 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Add unit and E2E regression tests for OCPBUGS-78152 / CNTRLPLANE-3040

Prevents regression of the bug where CAS accumulated replica decrements on paused MachineDeployments, causing catastrophic scale-down on unpause

Extracts shared test helpers (getFirstNodePool, waitForAutoscalerDeploymentReady) and constants to reduce duplication across autoscaling tests

Changes

Unit tests (capi_test.go)

TestPauseUnpauseCycle — 9 table-driven test cases:

MachineDeployment (Replace) and MachineSet (InPlace) upgrade types

Replica clamping: unchanged, decremented to min, below min, above max

Non-AWS platform min=0 → effectiveMin=1 enforcement (KubeVirt)

Verifies: pause annotation set/removed, replicas clamped within [min,max], autoscaler annotations preserved

E2E test (autoscaling_test.go)

testAutoscalerRespectsNodePoolPause — placed between testAutoscaling and testAutoscalingBalancing:

Scale up NodePool to max with workload

Configure aggressive CAS timers (unneeded=60s, delayAfterAdd=30s)

Pause NodePool → verify MachineDeployment pause annotation

Delete workload → wait 3min for CAS to attempt scale-down

Verify replicas unchanged while paused

Unpause → verify stability (no catastrophic scale-down)

Clean up to neutral state

DRY improvements

getFirstNodePool() — replaces 5-line list+assert pattern in 3 test functions

waitForAutoscalerDeploymentReady() — replaces 15-line EventuallyObject pattern in 2 test functions

Constants: aggressiveDelayAfterAddSeconds, aggressiveUnneededDurationSeconds, casScaleDownWaitDuration

Uses capiv1.PausedAnnotation constant instead of string literal

Test plan

go vet passes on both packages

TestPauseUnpauseCycle — 9/9 cases PASS

E2E compiles cleanly (go vet -tags e2e ./test/e2e/)

E2E verified on live cluster: replicas stayed at 3 during 3-minute pause, unpause was clean

CI e2e-autoscaling job passes

Refs: OCPBUGS-78152, CNTRLPLANE-3040

🤖 Generated with Claude Code

Summary by CodeRabbit

Tests

Added regression test covering a full NodePool pause → simulated drift → unpause cycle, validating pause propagation, preservation of autoscaler annotations, and replica clamping on reconcile.

Added end-to-end test ensuring the cluster-autoscaler does not scale down while a NodePool is paused and that normal scaling resumes after unpause.

Refactored autoscaling test helpers and added shared timing constants to reduce duplication and stabilize timing-sensitive assertions.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2026-04-16T15:59:04Z

@jparrill: This pull request references CNTRLPLANE-3040 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Add unit and E2E regression tests for OCPBUGS-78152 / CNTRLPLANE-3040

Prevents regression of the bug where CAS accumulated replica decrements on paused MachineDeployments, causing catastrophic scale-down on unpause

Extracts shared test helpers (getFirstNodePool, waitForAutoscalerDeploymentReady) and constants to reduce duplication across autoscaling tests

Changes

Unit tests (capi_test.go)

TestPauseUnpauseCycle — 9 table-driven test cases:

MachineDeployment (Replace) and MachineSet (InPlace) upgrade types

Replica clamping: unchanged, decremented to min, below min, above max

Non-AWS platform min=0 → effectiveMin=1 enforcement (KubeVirt)

Verifies: pause annotation set/removed, replicas clamped within [min,max], autoscaler annotations preserved

E2E test (autoscaling_test.go)

testAutoscalerRespectsNodePoolPause — placed between testAutoscaling and testAutoscalingBalancing:

Scale up NodePool to max with workload

Configure aggressive CAS timers (unneeded=60s, delayAfterAdd=30s)

Pause NodePool → verify MachineDeployment pause annotation

Delete workload → wait 3min for CAS to attempt scale-down

Verify replicas unchanged while paused

Unpause → verify stability (no catastrophic scale-down)

Clean up to neutral state

DRY improvements

getFirstNodePool() — replaces 5-line list+assert pattern in 3 test functions

waitForAutoscalerDeploymentReady() — replaces 15-line EventuallyObject pattern in 2 test functions

Constants: aggressiveDelayAfterAddSeconds, aggressiveUnneededDurationSeconds, casScaleDownWaitDuration

Uses capiv1.PausedAnnotation constant instead of string literal

Test plan

go vet passes on both packages

TestPauseUnpauseCycle — 9/9 cases PASS

E2E compiles cleanly (go vet -tags e2e ./test/e2e/)

E2E verified on live cluster: replicas stayed at 3 during 3-minute pause, unpause was clean

CI e2e-autoscaling job passes

Refs: OCPBUGS-78152, CNTRLPLANE-3040

🤖 Generated with Claude Code

Summary by CodeRabbit

Tests

Added a regression test covering a full NodePool pause → simulated drift → unpause cycle, validating pause propagation, preservation of autoscaler annotations, and replica clamping during reconcile.

Added an end-to-end test ensuring the cluster-autoscaler does not scale down while a NodePool is paused and that normal scaling resumes after unpause.

Refactored autoscaling test helpers and introduced shared timing constants to reduce duplication and stabilize timing-sensitive assertions.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

jparrill · 2026-04-17T06:58:19Z

/test e2e-aws

devguyio

Solid regression test coverage overall. The 9 table-driven unit cases systematically hit every branch of setMachineDeploymentReplicas and both MachineDeployment/MachineSet code paths. The DRY refactors are clean.

⚠️ Consider merging testAutoscaling and testAutoscalerRespectsNodePoolPause

As-is, adding testAutoscalerRespectsNodePoolPause as a separate subtest has significant cost:

~10 extra minutes of CI wall time — a full scale-up/down cycle on real AWS infrastructure, duplicating what testAutoscaling already does
~10 extra minutes of EC2 spend per run — provisioning and draining max nodes a second time
Doubles the flake surface — two independent scale-up/down cycles means two chances for node provisioning timeouts, API throttling, or slow kubelet registration to cause spurious failures
Multiplied across every PR — this runs on every PR that triggers the autoscaling e2e job

The two subtests follow very similar flows:

testAutoscaling (existing):

Enable autoscaling (min=numNodes, max=numNodes+2)
Create workload → scale up to max
Delete workload → verify scale-down to numNodes

testAutoscalerRespectsNodePoolPause (new):

Enable autoscaling (same min/max)
Configure aggressive CAS timers
Create workload → scale up to max
Pause → delete workload → verify NO scale-down
Unpause → verify scale-down to numNodes

Steps 1-3 and the final scale-down are essentially duplicated. A merged flow could be:

Enable autoscaling + configure aggressive CAS timers
Create workload → scale up to max (proves scale-up works)
Pause NodePool → delete workload → verify replicas unchanged (proves pause blocks CAS)
Unpause → verify scale-down to numNodes (proves scale-down works after unpause)

One scale-up/down cycle instead of two. Same coverage, tells a stronger story: "CAS scales up, respects pause, and resumes scale-down cleanly" as one coherent lifecycle.

The tradeoff is fault isolation — if the pause logic breaks, a merged test would also mask whether basic scale-up/down still works. That said, the unit tests already cover the controller logic thoroughly, so the e2e is really validating the CAS integration end-to-end where the merged flow makes more sense.

Review generated with Claude Code

cwbotbot · 2026-04-17T09:15:04Z

Test Results

e2e-aws

Status: ✅ PASS
Started: 2026-04-21T06:40:37Z
View Job
View Job History

openshift-ci-robot · 2026-04-17T10:51:54Z

@jparrill: This pull request references CNTRLPLANE-3040 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Add unit and E2E regression tests for OCPBUGS-78152 / CNTRLPLANE-3040

Prevents regression of the bug where CAS accumulated replica decrements on paused MachineDeployments, causing catastrophic scale-down on unpause

Extracts shared test helpers (getFirstNodePool, waitForAutoscalerDeploymentReady) and constants to reduce duplication across autoscaling tests

Changes

Unit tests (capi_test.go)

TestPauseUnpauseCycle — 9 table-driven test cases:

MachineDeployment (Replace) and MachineSet (InPlace) upgrade types

Replica clamping: unchanged, decremented to min, below min, above max

Non-AWS platform min=0 → effectiveMin=1 enforcement (KubeVirt)

Verifies: pause annotation set/removed, replicas clamped within [min,max], autoscaler annotations preserved

E2E test (autoscaling_test.go)

testAutoscalerRespectsNodePoolPause — placed between testAutoscaling and testAutoscalingBalancing:

Scale up NodePool to max with workload

Configure aggressive CAS timers (unneeded=60s, delayAfterAdd=30s)

Pause NodePool → verify MachineDeployment pause annotation

Delete workload → wait 3min for CAS to attempt scale-down

Verify replicas unchanged while paused

Unpause → verify stability (no catastrophic scale-down)

Clean up to neutral state

DRY improvements

getFirstNodePool() — replaces 5-line list+assert pattern in 3 test functions

waitForAutoscalerDeploymentReady() — replaces 15-line EventuallyObject pattern in 2 test functions

Constants: aggressiveDelayAfterAddSeconds, aggressiveUnneededDurationSeconds, casScaleDownWaitDuration

Uses capiv1.PausedAnnotation constant instead of string literal

Test plan

go vet passes on both packages

TestPauseUnpauseCycle — 9/9 cases PASS

E2E compiles cleanly (go vet -tags e2e ./test/e2e/)

E2E verified on live cluster: replicas stayed at 3 during 3-minute pause, unpause was clean

CI e2e-autoscaling job passes

Refs: OCPBUGS-78152, CNTRLPLANE-3040

🤖 Generated with Claude Code

Summary by CodeRabbit

Tests

Added a regression test covering a full NodePool pause → simulated replica drift → unpause cycle, validating pause propagation, preservation of autoscaler annotations, and replica clamping on reconcile.

Added an end-to-end test ensuring the cluster-autoscaler detects paused node groups, avoids scale-down while paused, and that normal scaling resumes after unpause.

Refactor

Consolidated autoscaling test helpers and introduced shared timing constants to reduce duplication and stabilize timing-sensitive assertions.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

test/e2e/autoscaling_test.go (1)

420-478: Make the log-poll helper honor ctx.Done().

The raw sleeps here keep waiting even after the subtest has been canceled, which can burn the full three-minute window on failure paths.

♻️ Suggested direction

 	cfg, err := e2eutil.GetConfig()
 	if err != nil {
 		t.Logf("Failed to get management cluster config for CAS log polling, falling back to timeout: %v", err)
-		time.Sleep(timeout)
+		select {
+		case <-ctx.Done():
+		case <-time.After(timeout):
+		}
 		return false
 	}
 	kubeClient, err := kubeclient.NewForConfig(cfg)
 	if err != nil {
 		t.Logf("Failed to create kubeclient for CAS log polling, falling back to timeout: %v", err)
-		time.Sleep(timeout)
+		select {
+		case <-ctx.Done():
+		case <-time.After(timeout):
+		}
 		return false
 	}
@@
 		pods, err := kubeClient.CoreV1().Pods(controlPlaneNamespace).List(ctx, metav1.ListOptions{
 			LabelSelector: "app=cluster-autoscaler",
 		})
 		if err != nil || len(pods.Items) == 0 {
-			time.Sleep(casLogPollInterval)
+			select {
+			case <-ctx.Done():
+				return false
+			case <-time.After(casLogPollInterval):
+			}
 			continue
 		}
@@
 		stream, err := kubeClient.CoreV1().Pods(controlPlaneNamespace).GetLogs(pods.Items[0].Name, logOpts).Stream(ctx)
 		if err != nil {
-			time.Sleep(casLogPollInterval)
+			select {
+			case <-ctx.Done():
+				return false
+			case <-time.After(casLogPollInterval):
+			}
 			continue
 		}
@@
-		time.Sleep(casLogPollInterval)
+		select {
+		case <-ctx.Done():
+			return false
+		case <-time.After(casLogPollInterval):
+		}
 	}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/e2e/autoscaling_test.go` around lines 420 - 478,
pollCASLogsForPausedNodeGroup currently uses time.Sleep calls that ignore ctx
cancellation; update the function so every sleep or blocking wait honors
ctx.Done() (e.g., replace time.Sleep(timeout) and time.Sleep(casLogPollInterval)
with a select that returns early on <-ctx.Done(), and when waiting for the pod
log stream use context-aware GetLogs(...).Stream(ctx) already accepts ctx but
also ensure the scanner/reading loop breaks if ctx is done); specifically modify
the error/fallback branches and the loop waiting on casLogPollInterval to select
between time.After(casLogPollInterval) and <-ctx.Done(), and return false (or
exit) immediately when ctx is cancelled so pollCASLogsForPausedNodeGroup stops
promptly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/autoscaling_test.go`:
- Around line 576-585: After successfully pausing the NodePool in the
retry.RetryOnConflict block, add a deferred cleanup that will unpause the same
nodepool to avoid leaving the shared fixture paused; implement the defer to
fetch the latest NodePool (using mgtClient.Get or the same retry.RetryOnConflict
pattern), clear nodepool.Spec.PausedUntil (set to nil/""), and update it
(mgtClient.Update) with retries on conflict to ensure the unpause runs even if
the test fails before the explicit unpause at Line 637; reference the same
symbols nodepool, mgtClient.Get, mgtClient.Update, and retry.RetryOnConflict to
locate where to insert the defer.

---

Nitpick comments:
In `@test/e2e/autoscaling_test.go`:
- Around line 420-478: pollCASLogsForPausedNodeGroup currently uses time.Sleep
calls that ignore ctx cancellation; update the function so every sleep or
blocking wait honors ctx.Done() (e.g., replace time.Sleep(timeout) and
time.Sleep(casLogPollInterval) with a select that returns early on <-ctx.Done(),
and when waiting for the pod log stream use context-aware
GetLogs(...).Stream(ctx) already accepts ctx but also ensure the scanner/reading
loop breaks if ctx is done); specifically modify the error/fallback branches and
the loop waiting on casLogPollInterval to select between
time.After(casLogPollInterval) and <-ctx.Done(), and return false (or exit)
immediately when ctx is cancelled so pollCASLogsForPausedNodeGroup stops
promptly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 90440ea1-460c-4d9b-90ce-ace734345bb7

📥 Commits

Reviewing files that changed from the base of the PR and between 32754b4 and 7396132.

📒 Files selected for processing (2)

hypershift-operator/controllers/nodepool/capi_test.go
test/e2e/autoscaling_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

hypershift-operator/controllers/nodepool/capi_test.go

openshift-ci-robot · 2026-04-17T12:09:02Z

@jparrill: This pull request references CNTRLPLANE-3040 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Add unit and E2E regression tests for OCPBUGS-78152 / CNTRLPLANE-3040

Prevents regression of the bug where CAS accumulated replica decrements on paused MachineDeployments, causing catastrophic scale-down on unpause

Extracts shared test helpers (getFirstNodePool, waitForAutoscalerDeploymentReady) and constants to reduce duplication across autoscaling tests

Changes

Unit tests (capi_test.go)

TestPauseUnpauseCycle — 9 table-driven test cases:

MachineDeployment (Replace) and MachineSet (InPlace) upgrade types

Replica clamping: unchanged, decremented to min, below min, above max

Non-AWS platform min=0 → effectiveMin=1 enforcement (KubeVirt)

Verifies: pause annotation set/removed, replicas clamped within [min,max], autoscaler annotations preserved

E2E test (autoscaling_test.go)

testAutoscalerRespectsNodePoolPause — placed between testAutoscaling and testAutoscalingBalancing:

Scale up NodePool to max with workload

Configure aggressive CAS timers (unneeded=60s, delayAfterAdd=30s)

Pause NodePool → verify MachineDeployment pause annotation

Delete workload → wait 3min for CAS to attempt scale-down

Verify replicas unchanged while paused

Unpause → verify stability (no catastrophic scale-down)

Clean up to neutral state

DRY improvements

getFirstNodePool() — replaces 5-line list+assert pattern in 3 test functions

waitForAutoscalerDeploymentReady() — replaces 15-line EventuallyObject pattern in 2 test functions

Constants: aggressiveDelayAfterAddSeconds, aggressiveUnneededDurationSeconds, casScaleDownWaitDuration

Uses capiv1.PausedAnnotation constant instead of string literal

Test plan

go vet passes on both packages

TestPauseUnpauseCycle — 9/9 cases PASS

E2E compiles cleanly (go vet -tags e2e ./test/e2e/)

E2E verified on live cluster: replicas stayed at 3 during 3-minute pause, unpause was clean

CI e2e-autoscaling job passes

Refs: OCPBUGS-78152, CNTRLPLANE-3040

🤖 Generated with Claude Code

Summary by CodeRabbit

Tests

Added a regression test for the NodePool pause → replica drift → unpause cycle, verifying pause propagation, autoscaler annotation preservation, and replica clamping on reconcile.

Added an end-to-end test confirming the cluster-autoscaler detects paused node groups, prevents scale-down while paused, and resumes normal scaling after unpause.

Refactor

Consolidated autoscaling test helpers and shared timing constants to reduce duplication and stabilize timing-sensitive assertions.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai

♻️ Duplicate comments (1)

test/e2e/autoscaling_test.go (1)

569-578: ⚠️ Potential issue | 🟠 Major

Add a deferred unpause cleanup to prevent fixture corruption.

If any assertion fails between pause (line 577) and unpause (line 637), the shared NodePool remains paused and subsequent sibling subtests (e.g., TestAutoscalingBalancing) inherit a broken fixture. Add a deferred cleanup immediately after the pause succeeds.

🛡️ Recommended safeguard

 		err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
 			if err := mgtClient.Get(ctx, crclient.ObjectKeyFromObject(nodepool), nodepool); err != nil {
 				return err
 			}
 			nodepool.Spec.PausedUntil = ptr.To("true")
 			return mgtClient.Update(ctx, nodepool)
 		})
 		g.Expect(err).NotTo(HaveOccurred(), "failed to pause NodePool")
 		t.Logf("Paused NodePool %s", nodepool.Name)
+
+		// Ensure unpause on test failure to avoid breaking subsequent subtests.
+		defer func() {
+			cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 2*time.Minute)
+			defer cleanupCancel()
+			_ = retry.RetryOnConflict(retry.DefaultRetry, func() error {
+				if err := mgtClient.Get(cleanupCtx, crclient.ObjectKeyFromObject(nodepool), nodepool); err != nil {
+					if apierrors.IsNotFound(err) {
+						return nil
+					}
+					return err
+				}
+				if nodepool.Spec.PausedUntil == nil {
+					return nil // Already unpaused
+				}
+				nodepool.Spec.PausedUntil = nil
+				return mgtClient.Update(cleanupCtx, nodepool)
+			})
+		}()

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/e2e/autoscaling_test.go` around lines 569 - 578, After successfully
pausing the NodePool, add a deferred cleanup that unpauses it to avoid leaving
the shared fixture paused if the test fails; specifically, immediately after the
block that sets nodepool.Spec.PausedUntil = ptr.To("true") and checks err, add a
defer that uses retry.RetryOnConflict (same pattern) with mgtClient.Get + set
nodepool.Spec.PausedUntil = nil (or remove the pause flag) and mgtClient.Update
to restore the NodePool, logging failures but not crashing the test harness;
reference the existing nodepool variable, mgtClient, and retry.RetryOnConflict
to implement this cleanup.

🧹 Nitpick comments (1)

hypershift-operator/controllers/nodepool/capi_test.go (1)

2541-2548: Consider using platform-appropriate templates for KubeVirt test cases.

Using capiaws.AWSMachineTemplate for KubeVirt platform test cases (lines 2379-2387 and 2419-2427) is technically inconsistent, though it works because the test focuses on replica clamping logic rather than infrastructure references.

♻️ Optional: use KubeVirt template for non-AWS platforms

+			var template client.Object
+			if tc.platformType == hyperv1.KubevirtPlatform {
+				template = &capikubevirt.KubevirtMachineTemplate{
+					ObjectMeta: metav1.ObjectMeta{
+						Name:      "test-template",
+						Namespace: controlPlaneNamespace,
+					},
+				}
+			} else {
+				template = &capiaws.AWSMachineTemplate{
+					ObjectMeta: metav1.ObjectMeta{
+						Name:      "test-template",
+						Namespace: controlPlaneNamespace,
+					},
+				}
+			}
-			template := &capiaws.AWSMachineTemplate{
-				ObjectMeta: metav1.ObjectMeta{
-					Name:      "test-template",
-					Namespace: controlPlaneNamespace,
-				},
-			}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/nodepool/capi_test.go` around lines 2541 -
2548, The test uses capiaws.AWSMachineTemplate even for KubeVirt-platform cases;
update those KubeVirt-specific test instances to use the corresponding KubeVirt
template type (e.g., capikubevirt.KubevirtMachineTemplate) instead of
capiaws.AWSMachineTemplate so the infrastructure objects match the platform
under test—locate the declarations named template in capi_test.go and replace
the AWSMachineTemplate type with the KubeVirt machine template type, keeping the
same metadata setup and test logic.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@test/e2e/autoscaling_test.go`:
- Around line 569-578: After successfully pausing the NodePool, add a deferred
cleanup that unpauses it to avoid leaving the shared fixture paused if the test
fails; specifically, immediately after the block that sets
nodepool.Spec.PausedUntil = ptr.To("true") and checks err, add a defer that uses
retry.RetryOnConflict (same pattern) with mgtClient.Get + set
nodepool.Spec.PausedUntil = nil (or remove the pause flag) and mgtClient.Update
to restore the NodePool, logging failures but not crashing the test harness;
reference the existing nodepool variable, mgtClient, and retry.RetryOnConflict
to implement this cleanup.

---

Nitpick comments:
In `@hypershift-operator/controllers/nodepool/capi_test.go`:
- Around line 2541-2548: The test uses capiaws.AWSMachineTemplate even for
KubeVirt-platform cases; update those KubeVirt-specific test instances to use
the corresponding KubeVirt template type (e.g.,
capikubevirt.KubevirtMachineTemplate) instead of capiaws.AWSMachineTemplate so
the infrastructure objects match the platform under test—locate the declarations
named template in capi_test.go and replace the AWSMachineTemplate type with the
KubeVirt machine template type, keeping the same metadata setup and test logic.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: baa9ddf1-0440-4794-8cf5-c3a9b4754469

📥 Commits

Reviewing files that changed from the base of the PR and between 7396132 and 98e69f1.

📒 Files selected for processing (2)

hypershift-operator/controllers/nodepool/capi_test.go
test/e2e/autoscaling_test.go

jparrill · 2026-04-17T13:08:38Z

/test e2e-aws

hypershift-jira-solve-ci · 2026-04-17T15:20:22Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2045127312260730880 | Cost: $3.3262295 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

jparrill · 2026-04-30T10:16:34Z

Rebased

…amping Add unit and E2E tests for OCPBUGS-78152 / CNTRLPLANE-3040 to prevent CAS from accumulating replica decrements on paused MachineDeployments. Unit tests: 9 table-driven cases covering replica clamping for both MachineDeployment and MachineSet upgrade types, including non-AWS effectiveMin=1 enforcement. E2E test: verifies CAS respects pause annotation by polling CAS logs for paused node group detection, with deferred unpause cleanup. Extracts shared helpers and constants to reduce duplication across autoscaling tests. Refs: OCPBUGS-78152, CNTRLPLANE-3040 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>

jparrill · 2026-05-12T09:07:53Z

Rebased again

devguyio · 2026-05-20T10:10:34Z

/lgtm
/hold leaving it for @enxebre to unhold

openshift-merge-bot · 2026-05-20T10:11:30Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

enxebre · 2026-05-20T12:17:32Z

+//   - When reconcileMachineDeployment/reconcileMachineSet runs after unpause, the
+//     pause annotation is removed and autoscaler annotations are preserved
+//   - setMachineDeploymentReplicas clamps replicas within [min, max] bounds regardless
+//     of external modifications (e.g. CAS decrementing replicas while paused)


isn't all this really effectively the same than adding a few test cases for TestSetMachineDeploymentReplicas and TestSetMachineSetReplicas?

Fair point — the clamping logic itself is the same. The difference is that TestPauseUnpauseCycle exercises the full reconcile flow end-to-end:

Pause annotation set on NodePool │ ▼ External replica modification (CAS decrements while paused) │ ▼ reconcileMachineDeployment / reconcileMachineSet │ ▼ Pause annotation removed │ ▼ Replicas clamped back to [min, max]

Adding cases to TestSetMachineDeploymentReplicas/TestSetMachineSetReplicas would cover the clamping in isolation, but wouldn't validate that all the pieces work together through the reconcile cycle. I think both are worth keeping — unit cases for the function, integration-style test for the flow.

enxebre · 2026-05-20T12:17:49Z

 }

+const (
+	aggressiveDelayAfterAddSeconds    int32 = 30


is this setting aggressive config for existing tests?

Right — from the point this test runs until the end of the autoscaling suite, the aggressive policy stays set on the HostedCluster. I could restore the original config after the test, but is it worth it? The e2e tests are passing with the aggressive timers and we're saving wall time on scale-down waits for the remaining subtests. What do you think?

enxebre · 2026-05-20T12:18:28Z

/hold cancel

hypershift-jira-solve-ci · 2026-05-20T12:46:02Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2057041823968268288 | Cost: $3.4503719999999998 | Failed step: hypershift-azure-run-e2e

View full analysis report

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

jparrill · 2026-05-26T10:12:29Z

/retest-required

jparrill · 2026-05-26T10:12:54Z

/verified by e2e

openshift-ci-robot · 2026-05-26T10:13:06Z

@jparrill: This PR has been marked as verified by e2e.

Details

In response to this:

/verified by e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

hypershift-jira-solve-ci · 2026-05-26T12:23:22Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2059216110523256832 | Cost: $4.240028000000001 | Failed step: hypershift-azure-run-e2e

View full analysis report

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

hypershift-jira-solve-ci · 2026-05-26T12:28:45Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2059216110842023936 | Cost: $2.6145214999999995 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

hypershift-jira-solve-ci · 2026-05-26T12:44:35Z

I now have all the evidence needed. The picture is clear. Here is the final report:

Test Failure Analysis Complete

Job Information

Prow Job: pull-ci-openshift-hypershift-main-e2e-aws
Build ID: 2059216110842023936
Target: e2e-aws
PR: CNTRLPLANE-3040: add regression tests for CAS pause/unpause replica clamping #8262 — "CNTRLPLANE-3040: add regression tests for CAS pause/unpause replica clamping"
Test Results: 569 total, 534 passed, 10 failed, 25 skipped
PR's new test (TestAutoscalerRespectsNodePoolPause): ✅ PASSED (680.15s)

Test Failure Analysis

Error

DONE 569 tests, 25 skipped, 10 failures in 0.030s

4 distinct root failures (6 are parent-test propagation):
1. TestKarpenterUpgradeControlPlane/Main — NodeClaim failed to register after drift (5m timeout)
2. TestNTOMachineConfigAppliedInPlace — DaemonSet pod count mismatch (expected 2, got 1)
3. TestCreateClusterPrivateWithRouteKAS/Teardown — AWS resource cleanup timeout (3 resources)
4. TestKarpenter/Teardown — AWS resource cleanup timeout (10 resources)

Summary

None of the 10 test failures are related to PR #8262. The PR adds unit tests in capi_test.go and a new E2E test TestAutoscalerRespectsNodePoolPause in the TestAutoscaling suite — all of which passed successfully. The failures are in four unrelated tests: a Karpenter node registration flake during control plane upgrade drift replacement, a DaemonSet scheduling issue caused by management cluster resource pressure (310 FailedScheduling events), and two AWS infrastructure teardown timeouts where EBS volumes and NLBs were not deleted within the cleanup deadline. These are all known pre-existing flaky test patterns in the HyperShift e2e-aws suite. A re-run should pass.

Root Cause

All 10 failures decompose into 4 independent root causes, none related to the PR:

1. TestKarpenterUpgradeControlPlane/Main — After the control plane upgrade completed and Karpenter detected drift on NodeClaim on-demand-jp8qh, a replacement NodeClaim on-demand-jhwh8 was launched (EC2 instance provisioned) but never registered with the cluster (Registered=Unknown: NodeNotFound). The test observed 2 NodeClaims (old drifted + new unregistered) instead of the expected 1. This is a known race condition where EC2 instance launch succeeds but kubelet bootstrap or node registration times out within the 5-minute window.

2. TestNodePool/HostedCluster0/Main/TestNTOMachineConfigAppliedInPlace — The NTO MachineConfig in-place rollout test created a 2-replica NodePool, applied a Tuned configuration (config update completed in 4m20s), but only 1 of 2 verification DaemonSet pods reached Ready state after 15 minutes. The management cluster was under significant scheduling pressure with 310 FailedScheduling events due to anti-affinity rules and too-many-pods conditions across the 6 management nodes running many parallel hosted clusters. The parent test TestNodePool/HostedCluster0 also reported a benign condition-validation race (ClusterVersionSucceeding=True arrived before the test framework's 2s validation window expected it to still be progressing).

3. TestCreateClusterPrivateWithRouteKAS/Teardown — The hosted cluster private-x4lpn was destroyed successfully, but 3 AWS resources (2 EC2 node volumes, 1 NLB for router-default) were not deleted within the teardown timeout of ~22 minutes. This is an AWS API eventual-consistency issue during rapid teardown.

4. TestKarpenter/Teardown — Same class of issue as #3 but with 10 remaining AWS resources spanning multiple Karpenter NodePools (version-test, kubelet-config-test, instance-profile-test, on-demand, capacity-reservation-test, arbitrary-subnet-test). Teardown took ~25 minutes. The bastion destruction also failed because the HostedCluster CR was already deleted.

Recommendations

Re-run the job — All failures are transient infrastructure/timing issues that should not recur consistently. The PR's changes are clean.
No code changes needed in this PR — The new TestAutoscalerRespectsNodePoolPause test passed, and all unit tests in capi_test.go compiled and ran successfully.
Consider merging — The 10 failures are all pre-existing flaky tests unrelated to the CAS pause/unpause changes. If re-run passes, this PR is safe to merge.

Evidence

Evidence	Detail
PR's new test result	`TestAutoscaling/Main/TestAutoscalerRespectsNodePoolPause` — PASSED (680.15s)
PR's parent test result	`TestAutoscaling` — PASSED (4689.04s), including all subtests and teardown
Failure 1	`TestKarpenterUpgradeControlPlane/Main` — NodeClaim `on-demand-jhwh8` launched but `Registered=Unknown: NodeNotFound`, 2 NodeClaims observed instead of 1
Failure 2	`TestNTOMachineConfigAppliedInPlace` — DaemonSet `kube-system/node-pool-v4ptt-test-ntomachineconfig-inplace` expected 2 pods, got 1 after 15m timeout
Failure 3	`TestCreateClusterPrivateWithRouteKAS/Teardown` — 3 AWS resources not cleaned up (2 EBS volumes + 1 NLB) after 1362s
Failure 4	`TestKarpenter/Teardown` — 10 AWS resources not cleaned up after 1535s
Management cluster pressure	310 FailedScheduling events observed (anti-affinity + too-many-pods across 6 nodes)
Overall test stats	569 tests: 534 passed, 10 failed, 25 skipped
Failed step	`hypershift-aws-run-e2e-nested` (test phase) — exit code 1
Pre/post phases	Pre phase passed (16m46s), post phase passed (27m31s)
Prow artifacts	`.work/prow-job-analyze-test-failure/2059216110842023936/logs/`

openshift-merge-bot · 2026-05-26T12:57:58Z

/retest-required

Remaining retests: 0 against base HEAD fa30196 and 2 for PR HEAD ec5bc72 in total

openshift-ci · 2026-05-26T15:54:35Z

@jparrill: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot added the do-not-merge/needs-area label Apr 16, 2026

openshift-ci Bot added area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Apr 16, 2026

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 16, 2026

openshift-ci Bot requested review from bryan-cox and devguyio April 16, 2026 14:50

coderabbitai Bot reviewed Apr 16, 2026

View reviewed changes

Comment thread test/e2e/autoscaling_test.go

Comment thread test/e2e/autoscaling_test.go Outdated

jparrill changed the title ~~test(nodepool): add regression tests for CAS pause/unpause replica clamping~~ CNTRLPLANE-3040: add regression tests for CAS pause/unpause replica clamping Apr 16, 2026

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 16, 2026

jparrill force-pushed the CNTRLPLANE-3040 branch from 1da08b9 to dca8bd9 Compare April 16, 2026 15:47

jparrill force-pushed the CNTRLPLANE-3040 branch from dca8bd9 to 32754b4 Compare April 16, 2026 15:56

devguyio reviewed Apr 17, 2026

View reviewed changes

Comment thread test/e2e/autoscaling_test.go Outdated

Comment thread test/e2e/autoscaling_test.go Outdated

jparrill force-pushed the CNTRLPLANE-3040 branch from 32754b4 to 7396132 Compare April 17, 2026 10:49

coderabbitai Bot reviewed Apr 17, 2026

View reviewed changes

Comment thread test/e2e/autoscaling_test.go Outdated

jparrill force-pushed the CNTRLPLANE-3040 branch from 7396132 to 98e69f1 Compare April 17, 2026 12:07

coderabbitai Bot reviewed Apr 17, 2026

View reviewed changes

openshift-ci Bot added the area/documentation Indicates the PR includes changes for documentation label Apr 28, 2026

bryan-cox reviewed Apr 29, 2026

View reviewed changes

jparrill force-pushed the CNTRLPLANE-3040 branch 2 times, most recently from ba544cb to 09ad081 Compare April 30, 2026 10:16

jparrill force-pushed the CNTRLPLANE-3040 branch from 09ad081 to 58ae2a9 Compare May 12, 2026 07:50

jparrill force-pushed the CNTRLPLANE-3040 branch from 58ae2a9 to ec5bc72 Compare May 12, 2026 08:30

openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 20, 2026

openshift-ci Bot assigned devguyio May 20, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 20, 2026

enxebre reviewed May 20, 2026

View reviewed changes

openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 20, 2026

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 26, 2026

openshift-merge-bot Bot merged commit 014ce23 into openshift:main May 26, 2026
42 checks passed

hypershift-jira-solve-ci Bot mentioned this pull request May 27, 2026

OCPBUGS-86238: add CPO overrides for ARO swift-nic resource limits #8610

Merged

2 tasks

Conversation

jparrill commented Apr 16, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Unit tests (capi_test.go)

E2E test (autoscaling_test.go)

DRY improvements

Test plan

Summary by CodeRabbit

Uh oh!

openshift-merge-bot Bot commented Apr 16, 2026

Uh oh!

coderabbitai Bot commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Sequence Diagram(s)

Pre-merge checks failed

❌ Failed checks (3 warnings, 1 inconclusive)

Uh oh!

openshift-ci Bot commented Apr 16, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

openshift-ci-robot commented Apr 16, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Unit tests (capi_test.go)

E2E test (autoscaling_test.go)

DRY improvements

Test plan

Summary by CodeRabbit

Uh oh!

openshift-ci-robot commented Apr 16, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Unit tests (capi_test.go)

E2E test (autoscaling_test.go)

DRY improvements

Test plan

Summary by CodeRabbit

Uh oh!

openshift-ci-robot commented Apr 16, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Unit tests (capi_test.go)

E2E test (autoscaling_test.go)

DRY improvements

Test plan

Summary by CodeRabbit

Uh oh!

jparrill commented Apr 17, 2026

Uh oh!

devguyio left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

cwbotbot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results

e2e-aws

Uh oh!

openshift-ci-robot commented Apr 17, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Unit tests (capi_test.go)

jparrill commented Apr 16, 2026 •

edited by coderabbitai Bot

Loading

Unit tests (`capi_test.go`)

E2E test (`autoscaling_test.go`)

coderabbitai Bot commented Apr 16, 2026 •

edited

Loading

codecov Bot commented Apr 16, 2026 •

edited

Loading

openshift-ci-robot commented Apr 16, 2026 •

edited by openshift-ci Bot

Loading

Unit tests (`capi_test.go`)

E2E test (`autoscaling_test.go`)

openshift-ci-robot commented Apr 16, 2026 •

edited by openshift-ci Bot

Loading

Unit tests (`capi_test.go`)

E2E test (`autoscaling_test.go`)

openshift-ci-robot commented Apr 16, 2026 •

edited by openshift-ci Bot

Loading

Unit tests (`capi_test.go`)

E2E test (`autoscaling_test.go`)

devguyio left a comment •

edited

Loading

cwbotbot commented Apr 17, 2026 •

edited

Loading

openshift-ci-robot commented Apr 17, 2026 •

edited by openshift-ci Bot

Loading

Unit tests (`capi_test.go`)

E2E test (`autoscaling_test.go`)

openshift-ci-robot commented Apr 17, 2026 •

edited by openshift-ci Bot

Loading

Unit tests (`capi_test.go`)

E2E test (`autoscaling_test.go`)

jparrill May 20, 2026 •

edited

Loading