Skip to content

CNTRLPLANE-3040: add regression tests for CAS pause/unpause replica clamping#8262

Merged
openshift-merge-bot[bot] merged 1 commit into
openshift:mainfrom
jparrill:CNTRLPLANE-3040
May 26, 2026
Merged

CNTRLPLANE-3040: add regression tests for CAS pause/unpause replica clamping#8262
openshift-merge-bot[bot] merged 1 commit into
openshift:mainfrom
jparrill:CNTRLPLANE-3040

Conversation

@jparrill

@jparrill jparrill commented Apr 16, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add unit and E2E regression tests for OCPBUGS-78152 / CNTRLPLANE-3040
  • Prevents regression of the bug where CAS accumulated replica decrements on paused MachineDeployments, causing catastrophic scale-down on unpause
  • Extracts shared test helpers (getFirstNodePool, waitForAutoscalerDeploymentReady) and constants to reduce duplication across autoscaling tests

Changes

Unit tests (capi_test.go)

TestPauseUnpauseCycle — 9 table-driven test cases:

  • MachineDeployment (Replace) and MachineSet (InPlace) upgrade types
  • Replica clamping: unchanged, decremented to min, below min, above max
  • Non-AWS platform min=0effectiveMin=1 enforcement (KubeVirt)
  • Verifies: pause annotation set/removed, replicas clamped within [min,max], autoscaler annotations preserved

E2E test (autoscaling_test.go)

testAutoscalerRespectsNodePoolPause — placed between testAutoscaling and testAutoscalingBalancing:

  1. Scale up NodePool to max with workload
  2. Configure aggressive CAS timers (unneeded=60s, delayAfterAdd=30s)
  3. Pause NodePool → verify MachineDeployment pause annotation
  4. Delete workload → wait 3min for CAS to attempt scale-down
  5. Verify replicas unchanged while paused
  6. Unpause → verify stability (no catastrophic scale-down)
  7. Clean up to neutral state

DRY improvements

  • getFirstNodePool() — replaces 5-line list+assert pattern in 3 test functions
  • waitForAutoscalerDeploymentReady() — replaces 15-line EventuallyObject pattern in 2 test functions
  • Constants: aggressiveDelayAfterAddSeconds, aggressiveUnneededDurationSeconds, casScaleDownWaitDuration
  • Uses capiv1.PausedAnnotation constant instead of string literal

Test plan

  • go vet passes on both packages
  • TestPauseUnpauseCycle — 9/9 cases PASS
  • E2E compiles cleanly (go vet -tags e2e ./test/e2e/)
  • E2E verified on live cluster: replicas stayed at 3 during 3-minute pause, unpause was clean
  • CI e2e-autoscaling job passes

Refs: OCPBUGS-78152, CNTRLPLANE-3040

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Added a regression test for the NodePool pause → replica drift → unpause cycle, verifying pause propagation, autoscaler annotation preservation, and replica clamping on reconcile.
    • Added an end-to-end test confirming the cluster-autoscaler detects paused node groups, prevents scale-down while paused, and resumes normal scaling after unpause.
  • Refactor
    • Consolidated autoscaling test helpers and shared timing constants to reduce duplication and stabilize timing-sensitive assertions.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai

coderabbitai Bot commented Apr 16, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Added a unit test TestPauseUnpauseCycle and an e2e subtest TestAutoscalerRespectsNodePoolPause to validate NodePool pause/unpause behavior with autoscaling. The unit test exercises pausing CAPI MachineDeployment and MachineSet objects, simulates autoscaler-driven replica drift while paused, invokes the reconcile that should unpause, and asserts pause annotation removal, preservation of autoscaler min/max annotations (with effective min clamped to 1 on non‑AWS when autoscalingMin==0), and that Spec.Replicas is clamped to [effectiveMin, autoscalingMax]. The e2e test configures aggressive autoscaler timing, scales a NodePool up via a workload, pauses the NodePool (setting Spec.PausedUntil and checking the CAPI paused annotation), verifies Cluster Autoscaler detects the paused node group and does not scale down while paused, then unpauses and verifies cluster convergence. Several test helpers and timing constants were added and duplicate listing logic was consolidated.

Sequence Diagram(s)

sequenceDiagram
  participant Test as Test Harness
  participant NP as NodePool
  participant CAPI as CAPI (MachineDeployment / MachineSet)
  participant Work as Workload
  participant CA as Cluster Autoscaler
  participant Cluster as Cluster Nodes

  Test->>NP: enable autoscaling & set pausedUntil / pause
  Test->>Work: create workload to increase load
  Work->>Cluster: create pods -> triggers scale up
  Cluster->>CAPI: increase Spec.Replicas (autoscaler effect)
  Test->>CAPI: verify PausedAnnotation present
  Test->>CA: ensure CA ready (aggressive timers)
  Test->>Work: delete workload
  CA->>CA: evaluate nodegroups
  CA->>CAPI: detect paused node group (logs "discovered a paused node group")
  Note right of CAPI: while paused, Spec.Replicas remains unchanged
  Test->>NP: remove pausedUntil / unpause
  Test->>CAPI: trigger reconcile -> unpause (remove annotation)
  CAPI->>CAPI: clamp Spec.Replicas to [effectiveMin, autoscalingMax]
  CAPI->>Cluster: adjust node count
  Cluster->>Test: converge to expected numNodes
Loading

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (3 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning New tests lack descriptive failure messages on critical assertions and some Eventually-style waits omit explicit timeouts, hindering failure diagnosis and risking test hangs. Add message parameters to all critical assertions and explicit timeouts to Eventually/Consistently calls; refactor repeated setup/teardown into BeforeEach/AfterEach.
Single Node Openshift (Sno) Test Compatibility ⚠️ Warning The testAutoscalerRespectsNodePoolPause test makes explicit multi-node assumptions by scaling from 1 to 3 nodes and distributing workloads across multiple nodes, with no SNO guard mechanisms. Add [Skipped:SingleReplicaTopology] label to parent test or exutil.IsSingleNode() check with g.Skip() to skip on SNO platforms.
Stable And Deterministic Test Names ❓ Inconclusive Test files referenced in the PR summary are not accessible in the repository context, preventing verification of deterministic test naming. Provide access to hypershift-operator/controllers/nodepool/capi_test.go and test/e2e/autoscaling_test.go to verify test names use only static strings without dynamic values.
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main change: adding regression tests for Cluster Autoscaler pause/unpause replica clamping, directly aligned with the PR's core objective.
Microshift Test Compatibility ✅ Passed The new e2e test uses standard Go testing patterns (t.Run() and Gomega) rather than Ginkgo patterns (It(), Describe(), Context(), When()), so the Ginkgo-specific check does not apply.
Topology-Aware Scheduling Compatibility ✅ Passed PR adds test-only regression tests for CAS pause/unpause replica clamping without introducing topology-specific constraints or deployment manifests.
Ote Binary Stdout Contract ✅ Passed The PR adds no process-level stdout writes violating the OTE Binary Stdout Contract. All test output uses appropriate mechanisms: t.Logf() for test functions and controller-runtime logging defaults to stderr.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed New tests use standard Go testing framework with no IPv4 assumptions, external connectivity requirements, hardcoded addresses, or external registry URLs.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Apr 16, 2026
@openshift-ci

openshift-ci Bot commented Apr 16, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jparrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 16, 2026
@openshift-ci openshift-ci Bot requested review from bryan-cox and devguyio April 16, 2026 14:50

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/autoscaling_test.go`:
- Around line 551-592: The test incorrectly requires the MachineDeployment
replica count to equal replicasBeforePause and waits for `max` ready nodes
immediately after unpause; remove the strict equality check comparing
`replicasAfterUnpause` to `replicasBeforePause` and drop the immediate
`e2eutil.WaitForNReadyNodes(..., max, ...)` assertion. Instead, after confirming
the MachineDeployment is unpaused (the existing `EventuallyObject` block), log
the current replica count if desired and proceed directly to wait for
convergence back to `numNodes` using `e2eutil.WaitForNReadyNodes(t, ctx,
guestClient, numNodes, hostedCluster.Spec.Platform.Type)` (the existing Step 9),
ensuring the test only asserts eventual scale-down to `numNodes` and not an
immediate preservation of `max`. Reference symbols: `replicasAfterUnpause`,
`replicasBeforePause`, `max`, `numNodes`, and `e2eutil.WaitForNReadyNodes`.
- Around line 438-493: This subtest never enables autoscaling on the NodePool
returned by getFirstNodePool, so make the test explicitly enable autoscaling and
clear fixed replicas before waiting for scale-up: fetch the NodePool (nodepool),
set nodepool.Spec.AutoScaling to the desired hyperv1.NodePoolAutoScaling config
and set nodepool.Spec.Replicas = nil (or remove fixed replicas), then persist
the change with mgtClient.Update inside a retry.RetryOnConflict block (use
crclient.ObjectKeyFromObject(nodepool) to refetch) so the NodePool is
independently configured for autoscaling even if other subtests are not run or
reordered.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 32012c66-572c-455c-ac17-d0c2ff29d53a

📥 Commits

Reviewing files that changed from the base of the PR and between 846f2e9 and 1da08b9.

📒 Files selected for processing (2)
  • hypershift-operator/controllers/nodepool/capi_test.go
  • test/e2e/autoscaling_test.go

Comment thread test/e2e/autoscaling_test.go
Comment thread test/e2e/autoscaling_test.go Outdated
@codecov

codecov Bot commented Apr 16, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 39.85%. Comparing base (96a2bcb) to head (ec5bc72).
⚠️ Report is 141 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8262      +/-   ##
==========================================
- Coverage   39.85%   39.85%   -0.01%     
==========================================
  Files         751      751              
  Lines       92554    92556       +2     
==========================================
- Hits        36889    36888       -1     
- Misses      52992    52994       +2     
- Partials     2673     2674       +1     

see 1 file with indirect coverage changes

Flag Coverage Δ
cmd-support 34.03% <ø> (ø)
cpo-hostedcontrolplane 40.52% <ø> (-0.02%) ⬇️
cpo-other 40.08% <ø> (ø)
hypershift-operator 50.38% <ø> (ø)
other 30.70% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jparrill jparrill changed the title test(nodepool): add regression tests for CAS pause/unpause replica clamping CNTRLPLANE-3040: add regression tests for CAS pause/unpause replica clamping Apr 16, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 16, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 16, 2026

Copy link
Copy Markdown

@jparrill: This pull request references CNTRLPLANE-3040 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Add unit and E2E regression tests for OCPBUGS-78152 / CNTRLPLANE-3040
  • Prevents regression of the bug where CAS accumulated replica decrements on paused MachineDeployments, causing catastrophic scale-down on unpause
  • Extracts shared test helpers (getFirstNodePool, waitForAutoscalerDeploymentReady) and constants to reduce duplication across autoscaling tests

Changes

Unit tests (capi_test.go)

TestPauseUnpauseCycle — 9 table-driven test cases:

  • MachineDeployment (Replace) and MachineSet (InPlace) upgrade types
  • Replica clamping: unchanged, decremented to min, below min, above max
  • Non-AWS platform min=0effectiveMin=1 enforcement (KubeVirt)
  • Verifies: pause annotation set/removed, replicas clamped within [min,max], autoscaler annotations preserved

E2E test (autoscaling_test.go)

testAutoscalerRespectsNodePoolPause — placed between testAutoscaling and testAutoscalingBalancing:

  1. Scale up NodePool to max with workload
  2. Configure aggressive CAS timers (unneeded=60s, delayAfterAdd=30s)
  3. Pause NodePool → verify MachineDeployment pause annotation
  4. Delete workload → wait 3min for CAS to attempt scale-down
  5. Verify replicas unchanged while paused
  6. Unpause → verify stability (no catastrophic scale-down)
  7. Clean up to neutral state

DRY improvements

  • getFirstNodePool() — replaces 5-line list+assert pattern in 3 test functions
  • waitForAutoscalerDeploymentReady() — replaces 15-line EventuallyObject pattern in 2 test functions
  • Constants: aggressiveDelayAfterAddSeconds, aggressiveUnneededDurationSeconds, casScaleDownWaitDuration
  • Uses capiv1.PausedAnnotation constant instead of string literal

Test plan

  • go vet passes on both packages
  • TestPauseUnpauseCycle — 9/9 cases PASS
  • E2E compiles cleanly (go vet -tags e2e ./test/e2e/)
  • E2E verified on live cluster: replicas stayed at 3 during 3-minute pause, unpause was clean
  • CI e2e-autoscaling job passes

Refs: OCPBUGS-78152, CNTRLPLANE-3040

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
  • Added test coverage for NodePool pause→unpause cycle, validating autoscaler annotations and replica clamping behavior during pause state transitions.
  • Added end-to-end test verifying cluster-autoscaler respects NodePool pause, preventing unwanted scale-down operations while paused.
  • Refactored test helpers for improved code reusability in autoscaling test suite.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Apr 16, 2026

Copy link
Copy Markdown

@jparrill: This pull request references CNTRLPLANE-3040 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Add unit and E2E regression tests for OCPBUGS-78152 / CNTRLPLANE-3040
  • Prevents regression of the bug where CAS accumulated replica decrements on paused MachineDeployments, causing catastrophic scale-down on unpause
  • Extracts shared test helpers (getFirstNodePool, waitForAutoscalerDeploymentReady) and constants to reduce duplication across autoscaling tests

Changes

Unit tests (capi_test.go)

TestPauseUnpauseCycle — 9 table-driven test cases:

  • MachineDeployment (Replace) and MachineSet (InPlace) upgrade types
  • Replica clamping: unchanged, decremented to min, below min, above max
  • Non-AWS platform min=0effectiveMin=1 enforcement (KubeVirt)
  • Verifies: pause annotation set/removed, replicas clamped within [min,max], autoscaler annotations preserved

E2E test (autoscaling_test.go)

testAutoscalerRespectsNodePoolPause — placed between testAutoscaling and testAutoscalingBalancing:

  1. Scale up NodePool to max with workload
  2. Configure aggressive CAS timers (unneeded=60s, delayAfterAdd=30s)
  3. Pause NodePool → verify MachineDeployment pause annotation
  4. Delete workload → wait 3min for CAS to attempt scale-down
  5. Verify replicas unchanged while paused
  6. Unpause → verify stability (no catastrophic scale-down)
  7. Clean up to neutral state

DRY improvements

  • getFirstNodePool() — replaces 5-line list+assert pattern in 3 test functions
  • waitForAutoscalerDeploymentReady() — replaces 15-line EventuallyObject pattern in 2 test functions
  • Constants: aggressiveDelayAfterAddSeconds, aggressiveUnneededDurationSeconds, casScaleDownWaitDuration
  • Uses capiv1.PausedAnnotation constant instead of string literal

Test plan

  • go vet passes on both packages
  • TestPauseUnpauseCycle — 9/9 cases PASS
  • E2E compiles cleanly (go vet -tags e2e ./test/e2e/)
  • E2E verified on live cluster: replicas stayed at 3 during 3-minute pause, unpause was clean
  • CI e2e-autoscaling job passes

Refs: OCPBUGS-78152, CNTRLPLANE-3040

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
  • Added regression test covering a full NodePool pause → simulated drift → unpause cycle, validating pause propagation, preservation of autoscaler annotations, and replica clamping on reconcile.
  • Added end-to-end test ensuring the cluster-autoscaler does not scale down while a NodePool is paused and that normal scaling resumes after unpause.
  • Refactored autoscaling test helpers and added shared timing constants to reduce duplication and stabilize timing-sensitive assertions.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Apr 16, 2026

Copy link
Copy Markdown

@jparrill: This pull request references CNTRLPLANE-3040 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Add unit and E2E regression tests for OCPBUGS-78152 / CNTRLPLANE-3040
  • Prevents regression of the bug where CAS accumulated replica decrements on paused MachineDeployments, causing catastrophic scale-down on unpause
  • Extracts shared test helpers (getFirstNodePool, waitForAutoscalerDeploymentReady) and constants to reduce duplication across autoscaling tests

Changes

Unit tests (capi_test.go)

TestPauseUnpauseCycle — 9 table-driven test cases:

  • MachineDeployment (Replace) and MachineSet (InPlace) upgrade types
  • Replica clamping: unchanged, decremented to min, below min, above max
  • Non-AWS platform min=0effectiveMin=1 enforcement (KubeVirt)
  • Verifies: pause annotation set/removed, replicas clamped within [min,max], autoscaler annotations preserved

E2E test (autoscaling_test.go)

testAutoscalerRespectsNodePoolPause — placed between testAutoscaling and testAutoscalingBalancing:

  1. Scale up NodePool to max with workload
  2. Configure aggressive CAS timers (unneeded=60s, delayAfterAdd=30s)
  3. Pause NodePool → verify MachineDeployment pause annotation
  4. Delete workload → wait 3min for CAS to attempt scale-down
  5. Verify replicas unchanged while paused
  6. Unpause → verify stability (no catastrophic scale-down)
  7. Clean up to neutral state

DRY improvements

  • getFirstNodePool() — replaces 5-line list+assert pattern in 3 test functions
  • waitForAutoscalerDeploymentReady() — replaces 15-line EventuallyObject pattern in 2 test functions
  • Constants: aggressiveDelayAfterAddSeconds, aggressiveUnneededDurationSeconds, casScaleDownWaitDuration
  • Uses capiv1.PausedAnnotation constant instead of string literal

Test plan

  • go vet passes on both packages
  • TestPauseUnpauseCycle — 9/9 cases PASS
  • E2E compiles cleanly (go vet -tags e2e ./test/e2e/)
  • E2E verified on live cluster: replicas stayed at 3 during 3-minute pause, unpause was clean
  • CI e2e-autoscaling job passes

Refs: OCPBUGS-78152, CNTRLPLANE-3040

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
  • Added a regression test covering a full NodePool pause → simulated drift → unpause cycle, validating pause propagation, preservation of autoscaler annotations, and replica clamping during reconcile.
  • Added an end-to-end test ensuring the cluster-autoscaler does not scale down while a NodePool is paused and that normal scaling resumes after unpause.
  • Refactored autoscaling test helpers and introduced shared timing constants to reduce duplication and stabilize timing-sensitive assertions.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jparrill

Copy link
Copy Markdown
Contributor Author

/test e2e-aws

@devguyio devguyio left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid regression test coverage overall. The 9 table-driven unit cases systematically hit every branch of setMachineDeploymentReplicas and both MachineDeployment/MachineSet code paths. The DRY refactors are clean.

⚠️ Consider merging testAutoscaling and testAutoscalerRespectsNodePoolPause

As-is, adding testAutoscalerRespectsNodePoolPause as a separate subtest has significant cost:

  • ~10 extra minutes of CI wall time — a full scale-up/down cycle on real AWS infrastructure, duplicating what testAutoscaling already does
  • ~10 extra minutes of EC2 spend per run — provisioning and draining max nodes a second time
  • Doubles the flake surface — two independent scale-up/down cycles means two chances for node provisioning timeouts, API throttling, or slow kubelet registration to cause spurious failures
  • Multiplied across every PR — this runs on every PR that triggers the autoscaling e2e job

The two subtests follow very similar flows:

testAutoscaling (existing):

  1. Enable autoscaling (min=numNodes, max=numNodes+2)
  2. Create workload → scale up to max
  3. Delete workload → verify scale-down to numNodes

testAutoscalerRespectsNodePoolPause (new):

  1. Enable autoscaling (same min/max)
  2. Configure aggressive CAS timers
  3. Create workload → scale up to max
  4. Pause → delete workload → verify NO scale-down
  5. Unpause → verify scale-down to numNodes

Steps 1-3 and the final scale-down are essentially duplicated. A merged flow could be:

  1. Enable autoscaling + configure aggressive CAS timers
  2. Create workload → scale up to max (proves scale-up works)
  3. Pause NodePool → delete workload → verify replicas unchanged (proves pause blocks CAS)
  4. Unpause → verify scale-down to numNodes (proves scale-down works after unpause)

One scale-up/down cycle instead of two. Same coverage, tells a stronger story: "CAS scales up, respects pause, and resumes scale-down cleanly" as one coherent lifecycle.

The tradeoff is fault isolation — if the pause logic breaks, a merged test would also mask whether basic scale-up/down still works. That said, the unit tests already cover the controller logic thoroughly, so the e2e is really validating the CAS integration end-to-end where the merged flow makes more sense.


Review generated with Claude Code

Comment thread test/e2e/autoscaling_test.go Outdated
Comment thread test/e2e/autoscaling_test.go Outdated
@cwbotbot

cwbotbot commented Apr 17, 2026

Copy link
Copy Markdown

Test Results

e2e-aws

@openshift-ci-robot

openshift-ci-robot commented Apr 17, 2026

Copy link
Copy Markdown

@jparrill: This pull request references CNTRLPLANE-3040 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Add unit and E2E regression tests for OCPBUGS-78152 / CNTRLPLANE-3040
  • Prevents regression of the bug where CAS accumulated replica decrements on paused MachineDeployments, causing catastrophic scale-down on unpause
  • Extracts shared test helpers (getFirstNodePool, waitForAutoscalerDeploymentReady) and constants to reduce duplication across autoscaling tests

Changes

Unit tests (capi_test.go)

TestPauseUnpauseCycle — 9 table-driven test cases:

  • MachineDeployment (Replace) and MachineSet (InPlace) upgrade types
  • Replica clamping: unchanged, decremented to min, below min, above max
  • Non-AWS platform min=0effectiveMin=1 enforcement (KubeVirt)
  • Verifies: pause annotation set/removed, replicas clamped within [min,max], autoscaler annotations preserved

E2E test (autoscaling_test.go)

testAutoscalerRespectsNodePoolPause — placed between testAutoscaling and testAutoscalingBalancing:

  1. Scale up NodePool to max with workload
  2. Configure aggressive CAS timers (unneeded=60s, delayAfterAdd=30s)
  3. Pause NodePool → verify MachineDeployment pause annotation
  4. Delete workload → wait 3min for CAS to attempt scale-down
  5. Verify replicas unchanged while paused
  6. Unpause → verify stability (no catastrophic scale-down)
  7. Clean up to neutral state

DRY improvements

  • getFirstNodePool() — replaces 5-line list+assert pattern in 3 test functions
  • waitForAutoscalerDeploymentReady() — replaces 15-line EventuallyObject pattern in 2 test functions
  • Constants: aggressiveDelayAfterAddSeconds, aggressiveUnneededDurationSeconds, casScaleDownWaitDuration
  • Uses capiv1.PausedAnnotation constant instead of string literal

Test plan

  • go vet passes on both packages
  • TestPauseUnpauseCycle — 9/9 cases PASS
  • E2E compiles cleanly (go vet -tags e2e ./test/e2e/)
  • E2E verified on live cluster: replicas stayed at 3 during 3-minute pause, unpause was clean
  • CI e2e-autoscaling job passes

Refs: OCPBUGS-78152, CNTRLPLANE-3040

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
  • Added a regression test covering a full NodePool pause → simulated replica drift → unpause cycle, validating pause propagation, preservation of autoscaler annotations, and replica clamping on reconcile.
  • Added an end-to-end test ensuring the cluster-autoscaler detects paused node groups, avoids scale-down while paused, and that normal scaling resumes after unpause.
  • Refactor
  • Consolidated autoscaling test helpers and introduced shared timing constants to reduce duplication and stabilize timing-sensitive assertions.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
test/e2e/autoscaling_test.go (1)

420-478: Make the log-poll helper honor ctx.Done().

The raw sleeps here keep waiting even after the subtest has been canceled, which can burn the full three-minute window on failure paths.

♻️ Suggested direction
 	cfg, err := e2eutil.GetConfig()
 	if err != nil {
 		t.Logf("Failed to get management cluster config for CAS log polling, falling back to timeout: %v", err)
-		time.Sleep(timeout)
+		select {
+		case <-ctx.Done():
+		case <-time.After(timeout):
+		}
 		return false
 	}
 	kubeClient, err := kubeclient.NewForConfig(cfg)
 	if err != nil {
 		t.Logf("Failed to create kubeclient for CAS log polling, falling back to timeout: %v", err)
-		time.Sleep(timeout)
+		select {
+		case <-ctx.Done():
+		case <-time.After(timeout):
+		}
 		return false
 	}
@@
 		pods, err := kubeClient.CoreV1().Pods(controlPlaneNamespace).List(ctx, metav1.ListOptions{
 			LabelSelector: "app=cluster-autoscaler",
 		})
 		if err != nil || len(pods.Items) == 0 {
-			time.Sleep(casLogPollInterval)
+			select {
+			case <-ctx.Done():
+				return false
+			case <-time.After(casLogPollInterval):
+			}
 			continue
 		}
@@
 		stream, err := kubeClient.CoreV1().Pods(controlPlaneNamespace).GetLogs(pods.Items[0].Name, logOpts).Stream(ctx)
 		if err != nil {
-			time.Sleep(casLogPollInterval)
+			select {
+			case <-ctx.Done():
+				return false
+			case <-time.After(casLogPollInterval):
+			}
 			continue
 		}
@@
-		time.Sleep(casLogPollInterval)
+		select {
+		case <-ctx.Done():
+			return false
+		case <-time.After(casLogPollInterval):
+		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/autoscaling_test.go` around lines 420 - 478,
pollCASLogsForPausedNodeGroup currently uses time.Sleep calls that ignore ctx
cancellation; update the function so every sleep or blocking wait honors
ctx.Done() (e.g., replace time.Sleep(timeout) and time.Sleep(casLogPollInterval)
with a select that returns early on <-ctx.Done(), and when waiting for the pod
log stream use context-aware GetLogs(...).Stream(ctx) already accepts ctx but
also ensure the scanner/reading loop breaks if ctx is done); specifically modify
the error/fallback branches and the loop waiting on casLogPollInterval to select
between time.After(casLogPollInterval) and <-ctx.Done(), and return false (or
exit) immediately when ctx is cancelled so pollCASLogsForPausedNodeGroup stops
promptly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/autoscaling_test.go`:
- Around line 576-585: After successfully pausing the NodePool in the
retry.RetryOnConflict block, add a deferred cleanup that will unpause the same
nodepool to avoid leaving the shared fixture paused; implement the defer to
fetch the latest NodePool (using mgtClient.Get or the same retry.RetryOnConflict
pattern), clear nodepool.Spec.PausedUntil (set to nil/""), and update it
(mgtClient.Update) with retries on conflict to ensure the unpause runs even if
the test fails before the explicit unpause at Line 637; reference the same
symbols nodepool, mgtClient.Get, mgtClient.Update, and retry.RetryOnConflict to
locate where to insert the defer.

---

Nitpick comments:
In `@test/e2e/autoscaling_test.go`:
- Around line 420-478: pollCASLogsForPausedNodeGroup currently uses time.Sleep
calls that ignore ctx cancellation; update the function so every sleep or
blocking wait honors ctx.Done() (e.g., replace time.Sleep(timeout) and
time.Sleep(casLogPollInterval) with a select that returns early on <-ctx.Done(),
and when waiting for the pod log stream use context-aware
GetLogs(...).Stream(ctx) already accepts ctx but also ensure the scanner/reading
loop breaks if ctx is done); specifically modify the error/fallback branches and
the loop waiting on casLogPollInterval to select between
time.After(casLogPollInterval) and <-ctx.Done(), and return false (or exit)
immediately when ctx is cancelled so pollCASLogsForPausedNodeGroup stops
promptly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 90440ea1-460c-4d9b-90ce-ace734345bb7

📥 Commits

Reviewing files that changed from the base of the PR and between 32754b4 and 7396132.

📒 Files selected for processing (2)
  • hypershift-operator/controllers/nodepool/capi_test.go
  • test/e2e/autoscaling_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • hypershift-operator/controllers/nodepool/capi_test.go

Comment thread test/e2e/autoscaling_test.go Outdated
@openshift-ci-robot

openshift-ci-robot commented Apr 17, 2026

Copy link
Copy Markdown

@jparrill: This pull request references CNTRLPLANE-3040 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Add unit and E2E regression tests for OCPBUGS-78152 / CNTRLPLANE-3040
  • Prevents regression of the bug where CAS accumulated replica decrements on paused MachineDeployments, causing catastrophic scale-down on unpause
  • Extracts shared test helpers (getFirstNodePool, waitForAutoscalerDeploymentReady) and constants to reduce duplication across autoscaling tests

Changes

Unit tests (capi_test.go)

TestPauseUnpauseCycle — 9 table-driven test cases:

  • MachineDeployment (Replace) and MachineSet (InPlace) upgrade types
  • Replica clamping: unchanged, decremented to min, below min, above max
  • Non-AWS platform min=0effectiveMin=1 enforcement (KubeVirt)
  • Verifies: pause annotation set/removed, replicas clamped within [min,max], autoscaler annotations preserved

E2E test (autoscaling_test.go)

testAutoscalerRespectsNodePoolPause — placed between testAutoscaling and testAutoscalingBalancing:

  1. Scale up NodePool to max with workload
  2. Configure aggressive CAS timers (unneeded=60s, delayAfterAdd=30s)
  3. Pause NodePool → verify MachineDeployment pause annotation
  4. Delete workload → wait 3min for CAS to attempt scale-down
  5. Verify replicas unchanged while paused
  6. Unpause → verify stability (no catastrophic scale-down)
  7. Clean up to neutral state

DRY improvements

  • getFirstNodePool() — replaces 5-line list+assert pattern in 3 test functions
  • waitForAutoscalerDeploymentReady() — replaces 15-line EventuallyObject pattern in 2 test functions
  • Constants: aggressiveDelayAfterAddSeconds, aggressiveUnneededDurationSeconds, casScaleDownWaitDuration
  • Uses capiv1.PausedAnnotation constant instead of string literal

Test plan

  • go vet passes on both packages
  • TestPauseUnpauseCycle — 9/9 cases PASS
  • E2E compiles cleanly (go vet -tags e2e ./test/e2e/)
  • E2E verified on live cluster: replicas stayed at 3 during 3-minute pause, unpause was clean
  • CI e2e-autoscaling job passes

Refs: OCPBUGS-78152, CNTRLPLANE-3040

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
  • Added a regression test for the NodePool pause → replica drift → unpause cycle, verifying pause propagation, autoscaler annotation preservation, and replica clamping on reconcile.
  • Added an end-to-end test confirming the cluster-autoscaler detects paused node groups, prevents scale-down while paused, and resumes normal scaling after unpause.
  • Refactor
  • Consolidated autoscaling test helpers and shared timing constants to reduce duplication and stabilize timing-sensitive assertions.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
test/e2e/autoscaling_test.go (1)

569-578: ⚠️ Potential issue | 🟠 Major

Add a deferred unpause cleanup to prevent fixture corruption.

If any assertion fails between pause (line 577) and unpause (line 637), the shared NodePool remains paused and subsequent sibling subtests (e.g., TestAutoscalingBalancing) inherit a broken fixture. Add a deferred cleanup immediately after the pause succeeds.

🛡️ Recommended safeguard
 		err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
 			if err := mgtClient.Get(ctx, crclient.ObjectKeyFromObject(nodepool), nodepool); err != nil {
 				return err
 			}
 			nodepool.Spec.PausedUntil = ptr.To("true")
 			return mgtClient.Update(ctx, nodepool)
 		})
 		g.Expect(err).NotTo(HaveOccurred(), "failed to pause NodePool")
 		t.Logf("Paused NodePool %s", nodepool.Name)
+
+		// Ensure unpause on test failure to avoid breaking subsequent subtests.
+		defer func() {
+			cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 2*time.Minute)
+			defer cleanupCancel()
+			_ = retry.RetryOnConflict(retry.DefaultRetry, func() error {
+				if err := mgtClient.Get(cleanupCtx, crclient.ObjectKeyFromObject(nodepool), nodepool); err != nil {
+					if apierrors.IsNotFound(err) {
+						return nil
+					}
+					return err
+				}
+				if nodepool.Spec.PausedUntil == nil {
+					return nil // Already unpaused
+				}
+				nodepool.Spec.PausedUntil = nil
+				return mgtClient.Update(cleanupCtx, nodepool)
+			})
+		}()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/autoscaling_test.go` around lines 569 - 578, After successfully
pausing the NodePool, add a deferred cleanup that unpauses it to avoid leaving
the shared fixture paused if the test fails; specifically, immediately after the
block that sets nodepool.Spec.PausedUntil = ptr.To("true") and checks err, add a
defer that uses retry.RetryOnConflict (same pattern) with mgtClient.Get + set
nodepool.Spec.PausedUntil = nil (or remove the pause flag) and mgtClient.Update
to restore the NodePool, logging failures but not crashing the test harness;
reference the existing nodepool variable, mgtClient, and retry.RetryOnConflict
to implement this cleanup.
🧹 Nitpick comments (1)
hypershift-operator/controllers/nodepool/capi_test.go (1)

2541-2548: Consider using platform-appropriate templates for KubeVirt test cases.

Using capiaws.AWSMachineTemplate for KubeVirt platform test cases (lines 2379-2387 and 2419-2427) is technically inconsistent, though it works because the test focuses on replica clamping logic rather than infrastructure references.

♻️ Optional: use KubeVirt template for non-AWS platforms
+			var template client.Object
+			if tc.platformType == hyperv1.KubevirtPlatform {
+				template = &capikubevirt.KubevirtMachineTemplate{
+					ObjectMeta: metav1.ObjectMeta{
+						Name:      "test-template",
+						Namespace: controlPlaneNamespace,
+					},
+				}
+			} else {
+				template = &capiaws.AWSMachineTemplate{
+					ObjectMeta: metav1.ObjectMeta{
+						Name:      "test-template",
+						Namespace: controlPlaneNamespace,
+					},
+				}
+			}
-			template := &capiaws.AWSMachineTemplate{
-				ObjectMeta: metav1.ObjectMeta{
-					Name:      "test-template",
-					Namespace: controlPlaneNamespace,
-				},
-			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/nodepool/capi_test.go` around lines 2541 -
2548, The test uses capiaws.AWSMachineTemplate even for KubeVirt-platform cases;
update those KubeVirt-specific test instances to use the corresponding KubeVirt
template type (e.g., capikubevirt.KubevirtMachineTemplate) instead of
capiaws.AWSMachineTemplate so the infrastructure objects match the platform
under test—locate the declarations named template in capi_test.go and replace
the AWSMachineTemplate type with the KubeVirt machine template type, keeping the
same metadata setup and test logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@test/e2e/autoscaling_test.go`:
- Around line 569-578: After successfully pausing the NodePool, add a deferred
cleanup that unpauses it to avoid leaving the shared fixture paused if the test
fails; specifically, immediately after the block that sets
nodepool.Spec.PausedUntil = ptr.To("true") and checks err, add a defer that uses
retry.RetryOnConflict (same pattern) with mgtClient.Get + set
nodepool.Spec.PausedUntil = nil (or remove the pause flag) and mgtClient.Update
to restore the NodePool, logging failures but not crashing the test harness;
reference the existing nodepool variable, mgtClient, and retry.RetryOnConflict
to implement this cleanup.

---

Nitpick comments:
In `@hypershift-operator/controllers/nodepool/capi_test.go`:
- Around line 2541-2548: The test uses capiaws.AWSMachineTemplate even for
KubeVirt-platform cases; update those KubeVirt-specific test instances to use
the corresponding KubeVirt template type (e.g.,
capikubevirt.KubevirtMachineTemplate) instead of capiaws.AWSMachineTemplate so
the infrastructure objects match the platform under test—locate the declarations
named template in capi_test.go and replace the AWSMachineTemplate type with the
KubeVirt machine template type, keeping the same metadata setup and test logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: baa9ddf1-0440-4794-8cf5-c3a9b4754469

📥 Commits

Reviewing files that changed from the base of the PR and between 7396132 and 98e69f1.

📒 Files selected for processing (2)
  • hypershift-operator/controllers/nodepool/capi_test.go
  • test/e2e/autoscaling_test.go

@jparrill

Copy link
Copy Markdown
Contributor Author

/test e2e-aws

@hypershift-jira-solve-ci

Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2045127312260730880 | Cost: $3.3262295 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@openshift-ci openshift-ci Bot added the area/documentation Indicates the PR includes changes for documentation label Apr 28, 2026
Comment thread hypershift-operator/controllers/nodepool/capi_test.go
Comment thread hypershift-operator/controllers/nodepool/capi_test.go
Comment thread docs/mkdocs.yml
Comment thread test/e2e/autoscaling_test.go Outdated
Comment thread test/e2e/autoscaling_test.go Outdated
Comment thread test/e2e/autoscaling_test.go Outdated
@jparrill jparrill force-pushed the CNTRLPLANE-3040 branch 2 times, most recently from ba544cb to 09ad081 Compare April 30, 2026 10:16
@jparrill

Copy link
Copy Markdown
Contributor Author

Rebased

…amping

Add unit and E2E tests for OCPBUGS-78152 / CNTRLPLANE-3040 to prevent
CAS from accumulating replica decrements on paused MachineDeployments.

Unit tests: 9 table-driven cases covering replica clamping for both
MachineDeployment and MachineSet upgrade types, including non-AWS
effectiveMin=1 enforcement.

E2E test: verifies CAS respects pause annotation by polling CAS logs
for paused node group detection, with deferred unpause cleanup.

Extracts shared helpers and constants to reduce duplication across
autoscaling tests.

Refs: OCPBUGS-78152, CNTRLPLANE-3040

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
@jparrill

Copy link
Copy Markdown
Contributor Author

Rebased again

@devguyio

Copy link
Copy Markdown
Contributor

/lgtm
/hold leaving it for @enxebre to unhold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 20, 2026
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 20, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

// - When reconcileMachineDeployment/reconcileMachineSet runs after unpause, the
// pause annotation is removed and autoscaler annotations are preserved
// - setMachineDeploymentReplicas clamps replicas within [min, max] bounds regardless
// of external modifications (e.g. CAS decrementing replicas while paused)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't all this really effectively the same than adding a few test cases for TestSetMachineDeploymentReplicas and TestSetMachineSetReplicas?

@jparrill jparrill May 20, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point — the clamping logic itself is the same. The difference is that TestPauseUnpauseCycle exercises the full reconcile flow end-to-end:

Pause annotation set on NodePool
        │
        ▼
External replica modification (CAS decrements while paused)
        │
        ▼
reconcileMachineDeployment / reconcileMachineSet
        │
        ▼
Pause annotation removed
        │
        ▼
Replicas clamped back to [min, max]

Adding cases to TestSetMachineDeploymentReplicas/TestSetMachineSetReplicas would cover the clamping in isolation, but wouldn't validate that all the pieces work together through the reconcile cycle. I think both are worth keeping — unit cases for the function, integration-style test for the flow.

}

const (
aggressiveDelayAfterAddSeconds int32 = 30

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this setting aggressive config for existing tests?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right — from the point this test runs until the end of the autoscaling suite, the aggressive policy stays set on the HostedCluster. I could restore the original config after the test, but is it worth it? The e2e tests are passing with the aggressive timers and we're saving wall time on scale-down waits for the remaining subtests. What do you think?

@enxebre

enxebre commented May 20, 2026

Copy link
Copy Markdown
Member

/hold cancel

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 20, 2026
@hypershift-jira-solve-ci

Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2057041823968268288 | Cost: $3.4503719999999998 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@jparrill

Copy link
Copy Markdown
Contributor Author

/retest-required

@jparrill

Copy link
Copy Markdown
Contributor Author

/verified by e2e

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 26, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@jparrill: This PR has been marked as verified by e2e.

Details

In response to this:

/verified by e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@hypershift-jira-solve-ci

Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2059216110523256832 | Cost: $4.240028000000001 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2059216110842023936 | Cost: $2.6145214999999995 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented May 26, 2026

Copy link
Copy Markdown

I now have all the evidence needed. The picture is clear. Here is the final report:

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

DONE 569 tests, 25 skipped, 10 failures in 0.030s

4 distinct root failures (6 are parent-test propagation):
1. TestKarpenterUpgradeControlPlane/Main — NodeClaim failed to register after drift (5m timeout)
2. TestNTOMachineConfigAppliedInPlace — DaemonSet pod count mismatch (expected 2, got 1)
3. TestCreateClusterPrivateWithRouteKAS/Teardown — AWS resource cleanup timeout (3 resources)
4. TestKarpenter/Teardown — AWS resource cleanup timeout (10 resources)

Summary

None of the 10 test failures are related to PR #8262. The PR adds unit tests in capi_test.go and a new E2E test TestAutoscalerRespectsNodePoolPause in the TestAutoscaling suite — all of which passed successfully. The failures are in four unrelated tests: a Karpenter node registration flake during control plane upgrade drift replacement, a DaemonSet scheduling issue caused by management cluster resource pressure (310 FailedScheduling events), and two AWS infrastructure teardown timeouts where EBS volumes and NLBs were not deleted within the cleanup deadline. These are all known pre-existing flaky test patterns in the HyperShift e2e-aws suite. A re-run should pass.

Root Cause

All 10 failures decompose into 4 independent root causes, none related to the PR:

1. TestKarpenterUpgradeControlPlane/Main — After the control plane upgrade completed and Karpenter detected drift on NodeClaim on-demand-jp8qh, a replacement NodeClaim on-demand-jhwh8 was launched (EC2 instance provisioned) but never registered with the cluster (Registered=Unknown: NodeNotFound). The test observed 2 NodeClaims (old drifted + new unregistered) instead of the expected 1. This is a known race condition where EC2 instance launch succeeds but kubelet bootstrap or node registration times out within the 5-minute window.

2. TestNodePool/HostedCluster0/Main/TestNTOMachineConfigAppliedInPlace — The NTO MachineConfig in-place rollout test created a 2-replica NodePool, applied a Tuned configuration (config update completed in 4m20s), but only 1 of 2 verification DaemonSet pods reached Ready state after 15 minutes. The management cluster was under significant scheduling pressure with 310 FailedScheduling events due to anti-affinity rules and too-many-pods conditions across the 6 management nodes running many parallel hosted clusters. The parent test TestNodePool/HostedCluster0 also reported a benign condition-validation race (ClusterVersionSucceeding=True arrived before the test framework's 2s validation window expected it to still be progressing).

3. TestCreateClusterPrivateWithRouteKAS/Teardown — The hosted cluster private-x4lpn was destroyed successfully, but 3 AWS resources (2 EC2 node volumes, 1 NLB for router-default) were not deleted within the teardown timeout of ~22 minutes. This is an AWS API eventual-consistency issue during rapid teardown.

4. TestKarpenter/Teardown — Same class of issue as #3 but with 10 remaining AWS resources spanning multiple Karpenter NodePools (version-test, kubelet-config-test, instance-profile-test, on-demand, capacity-reservation-test, arbitrary-subnet-test). Teardown took ~25 minutes. The bastion destruction also failed because the HostedCluster CR was already deleted.

Recommendations
  1. Re-run the job — All failures are transient infrastructure/timing issues that should not recur consistently. The PR's changes are clean.
  2. No code changes needed in this PR — The new TestAutoscalerRespectsNodePoolPause test passed, and all unit tests in capi_test.go compiled and ran successfully.
  3. Consider merging — The 10 failures are all pre-existing flaky tests unrelated to the CAS pause/unpause changes. If re-run passes, this PR is safe to merge.
Evidence
Evidence Detail
PR's new test result TestAutoscaling/Main/TestAutoscalerRespectsNodePoolPausePASSED (680.15s)
PR's parent test result TestAutoscalingPASSED (4689.04s), including all subtests and teardown
Failure 1 TestKarpenterUpgradeControlPlane/Main — NodeClaim on-demand-jhwh8 launched but Registered=Unknown: NodeNotFound, 2 NodeClaims observed instead of 1
Failure 2 TestNTOMachineConfigAppliedInPlace — DaemonSet kube-system/node-pool-v4ptt-test-ntomachineconfig-inplace expected 2 pods, got 1 after 15m timeout
Failure 3 TestCreateClusterPrivateWithRouteKAS/Teardown — 3 AWS resources not cleaned up (2 EBS volumes + 1 NLB) after 1362s
Failure 4 TestKarpenter/Teardown — 10 AWS resources not cleaned up after 1535s
Management cluster pressure 310 FailedScheduling events observed (anti-affinity + too-many-pods across 6 nodes)
Overall test stats 569 tests: 534 passed, 10 failed, 25 skipped
Failed step hypershift-aws-run-e2e-nested (test phase) — exit code 1
Pre/post phases Pre phase passed (16m46s), post phase passed (27m31s)
Prow artifacts .work/prow-job-analyze-test-failure/2059216110842023936/logs/

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

/retest-required

Remaining retests: 0 against base HEAD fa30196 and 2 for PR HEAD ec5bc72 in total

@openshift-ci

openshift-ci Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor

@jparrill: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 014ce23 into openshift:main May 26, 2026
42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants