Skip to content

CNTRLPLANE-3318: Support targeted node scale-down for NodePools#8369

Open
jparrill wants to merge 5 commits into
openshift:mainfrom
jparrill:CNTRLPLANE-3318
Open

CNTRLPLANE-3318: Support targeted node scale-down for NodePools#8369
jparrill wants to merge 5 commits into
openshift:mainfrom
jparrill:CNTRLPLANE-3318

Conversation

@jparrill

@jparrill jparrill commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add hypershift.openshift.io/scale-down annotation constant to the API (NodeScaleDownAnnotation)
  • Implement HCCO Node Controller logic to sync the annotation to CAPI's cluster.x-k8s.io/delete-machine on the corresponding Machine, giving it top priority for deletion during scale-down
  • Removing the annotation from the Node causes the controller to remove delete-machine from the Machine (change-of-mind support)
  • Add E2E coverage in the existing autoscaling test: annotate the oldest node, verify HCCO syncs, verify autoscaler deletes the annotated node
  • Add user-facing documentation under how-to/automated-machine-management/targeted-scale-down.md

Jira

CNTRLPLANE-3318

Test plan

  • Unit tests for reconcileDeleteMachineAnnotation (10 table-driven cases + patch error injection)
  • Unit tests for getMachineForNode (4 cases)
  • E2E: annotate oldest node → verify HCCO sets delete-machine on Machine → autoscaler scale-down → verify annotated node was deleted
  • go build -tags e2e ./test/e2e/... passes
  • go vet -tags e2e ./test/e2e/... passes
  • Integration test script validated on real cluster (4 tests: set annotation, remove annotation, strict value matching, full happy path)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added an annotation users can apply to individual nodes to mark them for prioritized deletion during NodePool scale-down; the controller propagates that priority to the corresponding management Machine so scale-down prefers annotated nodes.
  • Documentation

    • Added a how-to guide with usage examples, constraints (value must be exactly "true"), and operational notes on targeted scale-down.
  • Tests

    • Expanded unit and e2e tests to cover setting/removing the annotation, error paths, and validating targeted scale-down behavior.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 29, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 29, 2026

Copy link
Copy Markdown

@jparrill: This pull request references CNTRLPLANE-3318 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Add hypershift.openshift.io/scale-down annotation constant to the API (NodeScaleDownAnnotation)
  • Implement HCCO Node Controller logic to sync the annotation to CAPI's cluster.x-k8s.io/delete-machine on the corresponding Machine, giving it top priority for deletion during scale-down
  • Removing the annotation from the Node causes the controller to remove delete-machine from the Machine (change-of-mind support)
  • Add E2E coverage in the existing autoscaling test: annotate the oldest node, verify HCCO syncs, verify autoscaler deletes the annotated node
  • Add user-facing documentation under how-to/automated-machine-management/targeted-scale-down.md

Jira

CNTRLPLANE-3318

Test plan

  • Unit tests for reconcileDeleteMachineAnnotation (10 table-driven cases + patch error injection)
  • Unit tests for getMachineForNode (4 cases)
  • E2E: annotate oldest node → verify HCCO sets delete-machine on Machine → autoscaler scale-down → verify annotated node was deleted
  • go build -tags e2e ./test/e2e/... passes
  • go vet -tags e2e ./test/e2e/... passes
  • Integration test script validated on real cluster (4 tests: set annotation, remove annotation, strict value matching, full happy path)

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Apr 29, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

Adds a new exported annotation constant (NodeScaleDownAnnotation) and implements reconciliation in the HCCO Node Controller to mirror that Node annotation to the owning management-cluster Machine as cluster.x-k8s.io/delete-machine=yes during NodePool scale-down. The reconciler retrieves the Machine up front, applies or removes the Machine delete annotation based on the Node annotation (skipping when Machine lacks nodePool info or is being deleted), and bases label syncing on the Machine. Unit tests and an e2e autoscaling test were added/updated. Documentation describing the annotation-driven targeted scale-down workflow was added.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Node as Node (hosted cluster)
    participant Controller as HCCO Node Controller
    participant Machine as Machine (management cluster)
    participant Autoscaler as Scale-Down/Autoscaler

    User->>Node: Annotate with hypershift.openshift.io/scale-down=true
    Controller->>Node: Watch & Reconcile
    Node-->>Controller: Return annotation value
    Controller->>Machine: Retrieve owning Machine
    Machine-->>Controller: Return Machine object
    alt Annotation == "true"
        Controller->>Machine: Patch cluster.x-k8s.io/delete-machine=yes
    else Annotation missing/false
        Controller->>Machine: Remove cluster.x-k8s.io/delete-machine annotation
    end
    Machine-->>Controller: Patch acknowledged
    Autoscaler->>Machine: Check for deletion-priority annotation
    Machine-->>Autoscaler: Deletion annotation present/absent
    Autoscaler->>Node: Prioritize for deletion during scale-down (if present)
Loading
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Single Node Openshift (Sno) Test Compatibility ⚠️ Warning E2E tests in autoscaling_test.go lack SNO compatibility protections and will fail on single-node deployments where scaling beyond one node is impossible. Add [Skipped:SingleReplicaTopology] labels to TestAutoscaling and TestNodePoolAutoscalingScaleFromZero, or add runtime IsSingleNode() checks to skip on SNO clusters.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main feature being added: support for targeted node scale-down in NodePools, which is the primary objective across all changeset modifications.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Reviewed Ginkgo test definitions for dynamic or non-deterministic information in test titles.
Test Structure And Quality ✅ Passed PR contains well-structured Go unit tests with single responsibility, proper setup/cleanup, explicit timeouts, meaningful assertions, and consistency with existing codebase patterns.
Microshift Test Compatibility ✅ Passed PR modifies existing Go standard library tests, not adding new Ginkgo e2e tests, so the MicroShift-incompatible APIs check does not apply.
Topology-Aware Scheduling Compatibility ✅ Passed PR introduces no scheduling constraints assuming standard HA topology; changes are confined to per-node annotation synchronization mechanism working across all supported topologies.
Ote Binary Stdout Contract ✅ Passed The PR does not introduce any violations of the OTE Binary Stdout Contract. The added TestMain function only calls capiv1.AddToScheme() and os.Exit(), neither of which produce non-JSON stdout output.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The pull request does not introduce Ginkgo e2e tests, and the modified e2e test in autoscaling_test.go contains no IPv4 assumptions or external connectivity requirements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

Review ran into problems

🔥 Problems

Timed out fetching pipeline failures after 30000ms


Review rate limit: 9/10 reviews remaining, refill in 6 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from sdminonne and sjenning April 29, 2026 12:02
@openshift-ci openshift-ci Bot added area/api Indicates the PR includes changes for the API area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/documentation Indicates the PR includes changes for documentation area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Apr 29, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
control-plane-operator/hostedclusterconfigoperator/controllers/node/node_test.go (1)

158-159: Assert the delete-machine value, not just its presence.

The happy-path cases still pass if the controller writes any value here. Since reconcileDeleteMachineAnnotation is supposed to set cluster.x-k8s.io/delete-machine=yes, the positive assertions should check for "yes" as well.

Possible test tightening
-			_, hasDeleteAnnotation := updatedMachine.Annotations[capiv1.DeleteMachineAnnotation]
+			value, hasDeleteAnnotation := updatedMachine.Annotations[capiv1.DeleteMachineAnnotation]
 			g.Expect(hasDeleteAnnotation).To(Equal(tc.expectDeleteMachineOnMachine))
+			if tc.expectDeleteMachineOnMachine {
+				g.Expect(value).To(Equal("yes"))
+			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/node/node_test.go`
around lines 158 - 159, The test currently only checks presence of
capiv1.DeleteMachineAnnotation on updatedMachine but must assert the actual
value set by reconcileDeleteMachineAnnotation; update the assertion to read the
annotation value (e.g., val, ok :=
updatedMachine.Annotations[capiv1.DeleteMachineAnnotation]) and if
tc.expectDeleteMachineOnMachine is true assert ok is true and val equals "yes",
otherwise assert ok is false (or that val is not "yes"). This ensures the test
verifies the controller writes "cluster.x-k8s.io/delete-machine=yes" rather than
any value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@control-plane-operator/hostedclusterconfigoperator/controllers/node/node.go`:
- Around line 50-58: The current reconciliation swallows transport/cache/context
errors from r.getMachineForNode; update getMachineForNode to return a sentinel
error (e.g., errMissingMachineReference) when CAPI annotations are missing, and
change the caller logic in Node reconciliation to treat only that sentinel and
apierrors.IsNotFound as expected (returning ctrl.Result{}, nil) while
propagating any other errors (including transport/context/cache errors from
r.client.Get) by returning ctrl.Result{}, err. Reference getMachineForNode,
errMissingMachineReference, and r.client.Get when making these changes.

In `@docs/content/how-to/automated-machine-management/targeted-scale-down.md`:
- Around line 21-49: The markdown code fences in targeted-scale-down.md are
using a style that trips markdownlint MD046; update all three code blocks (the
ones showing the oc --kubeconfig annotate node command, the oc -n
<nodepool-namespace> scale nodepool command, and the multi-node annotate
example) to use the repository's configured fenced-code language tag (e.g.,
change the current ```bash fences to the project's expected tag such as ```sh or
```console) so they match the repo’s code block style and stop triggering MD046.

---

Nitpick comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/node/node_test.go`:
- Around line 158-159: The test currently only checks presence of
capiv1.DeleteMachineAnnotation on updatedMachine but must assert the actual
value set by reconcileDeleteMachineAnnotation; update the assertion to read the
annotation value (e.g., val, ok :=
updatedMachine.Annotations[capiv1.DeleteMachineAnnotation]) and if
tc.expectDeleteMachineOnMachine is true assert ok is true and val equals "yes",
otherwise assert ok is false (or that val is not "yes"). This ensures the test
verifies the controller writes "cluster.x-k8s.io/delete-machine=yes" rather than
any value.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 1586a3e4-175d-4d38-bb7e-d047ddb23423

📥 Commits

Reviewing files that changed from the base of the PR and between 60802b1 and 91d6822.

⛔ Files ignored due to path filters (1)
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (6)
  • api/hypershift/v1beta1/nodepool_types.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/node/node.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/node/node_test.go
  • docs/content/how-to/automated-machine-management/targeted-scale-down.md
  • docs/mkdocs.yml
  • test/e2e/autoscaling_test.go

Comment on lines +50 to +58
machine, err := r.getMachineForNode(ctx, node)
if err != nil {
var apiErr *apierrors.StatusError
if errors.As(err, &apiErr) && !apierrors.IsNotFound(err) {
// Return error and retry only if the API interaction failed. Other errors are because the nodeToNodePoolName expected
// annotations are not in place yet, so we'll reconcile triggered by the event which sets them in the Node.
return ctrl.Result{}, err
} else {
log.Error(err, "failed to get nodePool name from Node")
return ctrl.Result{}, nil
}
log.Error(err, "failed to get Machine for Node, CAPI annotations may not be set yet")
return ctrl.Result{}, nil
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Propagate unexpected machine lookup failures.

This branch only surfaces StatusErrors. If r.client.Get fails with a transport/cache/context error, it gets treated as “annotations may not be set yet” and reconciliation exits cleanly, which can silently stop both label sync and delete-machine sync until another Node event arrives.

Suggested direction
 	machine, err := r.getMachineForNode(ctx, node)
 	if err != nil {
-		var apiErr *apierrors.StatusError
-		if errors.As(err, &apiErr) && !apierrors.IsNotFound(err) {
-			return ctrl.Result{}, err
-		}
-		log.Error(err, "failed to get Machine for Node, CAPI annotations may not be set yet")
-		return ctrl.Result{}, nil
+		switch {
+		case apierrors.IsNotFound(err), errors.Is(err, errMissingMachineReference):
+			log.Info("Machine reference for Node is not available yet", "node", node.Name)
+			return ctrl.Result{}, nil
+		default:
+			return ctrl.Result{}, err
+		}
 	}

getMachineForNode would need to return a sentinel like errMissingMachineReference for the missing-annotation cases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/hostedclusterconfigoperator/controllers/node/node.go`
around lines 50 - 58, The current reconciliation swallows
transport/cache/context errors from r.getMachineForNode; update
getMachineForNode to return a sentinel error (e.g., errMissingMachineReference)
when CAPI annotations are missing, and change the caller logic in Node
reconciliation to treat only that sentinel and apierrors.IsNotFound as expected
(returning ctrl.Result{}, nil) while propagating any other errors (including
transport/context/cache errors from r.client.Get) by returning ctrl.Result{},
err. Reference getMachineForNode, errMissingMachineReference, and r.client.Get
when making these changes.

Comment on lines +21 to +49
```bash
oc --kubeconfig <hosted-cluster-kubeconfig> annotate node <node-name> hypershift.openshift.io/scale-down=true
```

Then scale down the NodePool:

```bash
oc -n <nodepool-namespace> scale nodepool <nodepool-name> --replicas=<current - 1>
```

The annotated node will be selected for deletion before any non-annotated nodes.

### Changing your mind

If you annotated a node but decide you no longer want it removed, simply remove the annotation **before** the scale-down completes:

```bash
oc --kubeconfig <hosted-cluster-kubeconfig> annotate node <node-name> hypershift.openshift.io/scale-down-
```

The controller will remove the `delete-machine` annotation from the corresponding Machine, returning it to normal deletion priority.

### Marking multiple nodes

You can annotate multiple nodes simultaneously. When a scale-down occurs, CAPI will prioritize all annotated machines for deletion. If the scale-down removes fewer machines than are annotated, CAPI selects among the annotated machines using its default ordering (newest first).

```bash
oc --kubeconfig <hosted-cluster-kubeconfig> annotate node node-1 node-2 hypershift.openshift.io/scale-down=true
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Match the configured Markdown code block style.

markdownlint is already flagging these examples with MD046, so this page will keep docs lint noisy until the blocks are converted to the repo’s expected style.

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 21-21: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)


[warning] 27-27: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)


[warning] 37-37: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)


[warning] 47-47: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/content/how-to/automated-machine-management/targeted-scale-down.md`
around lines 21 - 49, The markdown code fences in targeted-scale-down.md are
using a style that trips markdownlint MD046; update all three code blocks (the
ones showing the oc --kubeconfig annotate node command, the oc -n
<nodepool-namespace> scale nodepool command, and the multi-node annotate
example) to use the repository's configured fenced-code language tag (e.g.,
change the current ```bash fences to the project's expected tag such as ```sh or
```console) so they match the repo’s code block style and stop triggering MD046.

@jparrill jparrill changed the title CNTRLPLANE-3318: Support targeted node deletion on NodePool scale-down CNTRLPLANE-3318: Support targeted node scale-down for NodePools Apr 29, 2026
@codecov

codecov Bot commented Apr 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 63.15789% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 36.47%. Comparing base (60802b1) to head (39dc245).
⚠️ Report is 510 commits behind head on main.

Files with missing lines Patch % Lines
...stedclusterconfigoperator/controllers/node/node.go 63.15% 13 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8369   +/-   ##
=======================================
  Coverage   36.46%   36.47%           
=======================================
  Files         765      765           
  Lines       93256    93274   +18     
=======================================
+ Hits        34010    34024   +14     
- Misses      56532    56535    +3     
- Partials     2714     2715    +1     
Files with missing lines Coverage Δ
...stedclusterconfigoperator/controllers/node/node.go 44.64% <63.15%> (+6.34%) ⬆️
Flag Coverage Δ
cmd-support 30.34% <ø> (ø)
cpo-hostedcontrolplane 37.05% <ø> (ø)
cpo-other 35.75% <63.15%> (+0.06%) ⬆️
hypershift-operator 47.89% <ø> (ø)
other 28.37% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Add hypershift.openshift.io/scale-down annotation constant to the
NodePool API types. When set to "true" on a Node in the hosted cluster,
the HCCO Node Controller will set cluster.x-k8s.io/delete-machine=yes
on the corresponding Machine, giving it top priority for deletion
during NodePool scale-down.

Ref: CNTRLPLANE-3318

Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
Add documentation explaining how to use the
hypershift.openshift.io/scale-down annotation for targeted node
deletion during NodePool scale-down operations.

Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
docs/content/how-to/automated-machine-management/targeted-scale-down.md (1)

21-49: ⚠️ Potential issue | 🟡 Minor

Code block style still violates MD046 in this doc.

Line 21, Line 27, Line 37, and Line 47 are still fenced blocks and continue to trigger markdownlint (Expected: indented; Actual: fenced).

Proposed lint-compliant edit
-```bash
-oc --kubeconfig <hosted-cluster-kubeconfig> annotate node <node-name> hypershift.openshift.io/scale-down=true
-```
+    oc --kubeconfig <hosted-cluster-kubeconfig> annotate node <node-name> hypershift.openshift.io/scale-down=true
@@
-```bash
-oc -n <nodepool-namespace> scale nodepool <nodepool-name> --replicas=<current - 1>
-```
+    oc -n <nodepool-namespace> scale nodepool <nodepool-name> --replicas=<current - 1>
@@
-```bash
-oc --kubeconfig <hosted-cluster-kubeconfig> annotate node <node-name> hypershift.openshift.io/scale-down-
-```
+    oc --kubeconfig <hosted-cluster-kubeconfig> annotate node <node-name> hypershift.openshift.io/scale-down-
@@
-```bash
-oc --kubeconfig <hosted-cluster-kubeconfig> annotate node node-1 node-2 hypershift.openshift.io/scale-down=true
-```
+    oc --kubeconfig <hosted-cluster-kubeconfig> annotate node node-1 node-2 hypershift.openshift.io/scale-down=true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/content/how-to/automated-machine-management/targeted-scale-down.md`
around lines 21 - 49, Replace the fenced triple-backtick bash blocks with
indented code blocks to satisfy MD046: convert the four fenced snippets that
contain the oc commands (the annotate node <node-name>
hypershift.openshift.io/scale-down=true command, the oc -n <nodepool-namespace>
scale nodepool <nodepool-name> --replicas=<current - 1> command, the annotate
node <node-name> hypershift.openshift.io/scale-down- command, and the annotate
node node-1 node-2 hypershift.openshift.io/scale-down=true command) to use
four-space indentation rather than ``` fences so markdownlint no longer flags
MD046.
control-plane-operator/hostedclusterconfigoperator/controllers/node/node.go (1)

50-58: ⚠️ Potential issue | 🟠 Major

Do not swallow non-NotFound machine lookup failures.

Line 52-57 still drops many unexpected errors (e.g., cache/client/context failures) by logging and returning success. This can silently stop both label sync and delete-machine sync until another event. Treat only “missing reference”/NotFound as expected; return other errors.

Suggested fix direction
+var errMissingMachineReference = errors.New("missing machine reference annotations")
+
 func (r *reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
@@
 	machine, err := r.getMachineForNode(ctx, node)
 	if err != nil {
-		var apiErr *apierrors.StatusError
-		if errors.As(err, &apiErr) && !apierrors.IsNotFound(err) {
-			return ctrl.Result{}, err
-		}
-		log.Error(err, "failed to get Machine for Node, CAPI annotations may not be set yet")
-		return ctrl.Result{}, nil
+		switch {
+		case apierrors.IsNotFound(err), errors.Is(err, errMissingMachineReference):
+			log.Info("Machine reference for Node is not available yet", "node", node.Name)
+			return ctrl.Result{}, nil
+		default:
+			return ctrl.Result{}, err
+		}
 	}
@@
 func (r *reconciler) getMachineForNode(ctx context.Context, node *corev1.Node) (*capiv1.Machine, error) {
 	machineName, ok := node.GetAnnotations()[capiv1.MachineAnnotation]
 	if !ok || machineName == "" {
-		return nil, fmt.Errorf("failed to find MachineAnnotation on Node %q", node.Name)
+		return nil, fmt.Errorf("%w: missing %s on Node %q", errMissingMachineReference, capiv1.MachineAnnotation, node.Name)
 	}
@@
 	machineNamespace, ok := node.GetAnnotations()[capiv1.ClusterNamespaceAnnotation]
 	if !ok || machineNamespace == "" {
-		return nil, fmt.Errorf("failed to find ClusterNamespaceAnnotation on Node %q", node.Name)
+		return nil, fmt.Errorf("%w: missing %s on Node %q", errMissingMachineReference, capiv1.ClusterNamespaceAnnotation, node.Name)
 	}

Also applies to: 150-159

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/hostedclusterconfigoperator/controllers/node/node.go`
around lines 50 - 58, The current error handling around r.getMachineForNode in
node.go swallows non-NotFound failures by logging and returning success; change
it so only apierrors.IsNotFound (missing reference) is treated as expected and
returns ctrl.Result{}, nil, while any other error (including
context/cache/client failures) is returned up the stack (return ctrl.Result{},
err). Update the block that inspects errors returned from r.getMachineForNode
(and the analogous block around lines 150-159) to use errors.As to detect
*apierrors.StatusError and then call apierrors.IsNotFound to decide; for all
non-NotFound errors, do not log-and-continue—return the error. Ensure log
messages remain informative when returning errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/autoscaling_test.go`:
- Around line 203-208: The test currently verifies scale-down by inspecting only
the slice returned from e2eutil.WaitForNReadyNodes, which misses nodes that are
NotReady; change the check to inspect all cluster nodes instead: replace the
call to e2eutil.WaitForNReadyNodes/remainingNodes with a listing of all nodes
(e.g., via guestClient.CoreV1().Nodes().List or an existing helper like
e2eutil.WaitForNNodes) and then assert that none of those nodes have Name ==
targetNodeName; update the loop that references remainingNodes and the t.Logf
accordingly so the assertion covers every Node object, not just Ready ones (keep
references to guestClient, targetNodeName, and the assertion using g.Expect).

---

Duplicate comments:
In `@control-plane-operator/hostedclusterconfigoperator/controllers/node/node.go`:
- Around line 50-58: The current error handling around r.getMachineForNode in
node.go swallows non-NotFound failures by logging and returning success; change
it so only apierrors.IsNotFound (missing reference) is treated as expected and
returns ctrl.Result{}, nil, while any other error (including
context/cache/client failures) is returned up the stack (return ctrl.Result{},
err). Update the block that inspects errors returned from r.getMachineForNode
(and the analogous block around lines 150-159) to use errors.As to detect
*apierrors.StatusError and then call apierrors.IsNotFound to decide; for all
non-NotFound errors, do not log-and-continue—return the error. Ensure log
messages remain informative when returning errors.

In `@docs/content/how-to/automated-machine-management/targeted-scale-down.md`:
- Around line 21-49: Replace the fenced triple-backtick bash blocks with
indented code blocks to satisfy MD046: convert the four fenced snippets that
contain the oc commands (the annotate node <node-name>
hypershift.openshift.io/scale-down=true command, the oc -n <nodepool-namespace>
scale nodepool <nodepool-name> --replicas=<current - 1> command, the annotate
node <node-name> hypershift.openshift.io/scale-down- command, and the annotate
node node-1 node-2 hypershift.openshift.io/scale-down=true command) to use
four-space indentation rather than ``` fences so markdownlint no longer flags
MD046.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 2351a309-7791-437b-8ee4-64b336d361b6

📥 Commits

Reviewing files that changed from the base of the PR and between d13c955 and 3c909a0.

⛔ Files ignored due to path filters (2)
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (6)
  • api/hypershift/v1beta1/nodepool_types.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/node/node.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/node/node_test.go
  • docs/content/how-to/automated-machine-management/targeted-scale-down.md
  • docs/mkdocs.yml
  • test/e2e/autoscaling_test.go
✅ Files skipped from review due to trivial changes (2)
  • docs/mkdocs.yml
  • api/hypershift/v1beta1/nodepool_types.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • control-plane-operator/hostedclusterconfigoperator/controllers/node/node_test.go

Comment on lines +203 to +208
remainingNodes := e2eutil.WaitForNReadyNodes(t, ctx, guestClient, numNodes, hostedCluster.Spec.Platform.Type)
for _, n := range remainingNodes {
g.Expect(n.Name).NotTo(Equal(targetNodeName),
"annotated node %s should have been deleted during scale-down", targetNodeName)
}
t.Logf("Targeted scale-down verified: annotated node %s was deleted", targetNodeName)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Assert deletion against all Nodes, not only Ready Nodes.

At Line 203-208, the check only scans the ready-node slice. The test can pass if the targeted node still exists but is NotReady while another node was deleted.

Proposed hardening
-		remainingNodes := e2eutil.WaitForNReadyNodes(t, ctx, guestClient, numNodes, hostedCluster.Spec.Platform.Type)
-		for _, n := range remainingNodes {
-			g.Expect(n.Name).NotTo(Equal(targetNodeName),
-				"annotated node %s should have been deleted during scale-down", targetNodeName)
-		}
+		_ = e2eutil.WaitForNReadyNodes(t, ctx, guestClient, numNodes, hostedCluster.Spec.Platform.Type)
+		e2eutil.EventuallyObjects(t, ctx,
+			fmt.Sprintf("node %s to be removed from cluster", targetNodeName),
+			func(ctx context.Context) ([]*corev1.Node, error) {
+				nodeList := &corev1.NodeList{}
+				if err := guestClient.List(ctx, nodeList); err != nil {
+					return nil, err
+				}
+				items := make([]*corev1.Node, len(nodeList.Items))
+				for i := range nodeList.Items {
+					items[i] = &nodeList.Items[i]
+				}
+				return items, nil
+			},
+			[]e2eutil.Predicate[[]*corev1.Node]{
+				func(nodes []*corev1.Node) (bool, string, error) {
+					for _, n := range nodes {
+						if n.Name == targetNodeName {
+							return false, "target node still exists", nil
+						}
+					}
+					return true, "target node removed", nil
+				},
+			},
+			nil,
+			e2eutil.WithTimeout(5*time.Minute),
+		)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/autoscaling_test.go` around lines 203 - 208, The test currently
verifies scale-down by inspecting only the slice returned from
e2eutil.WaitForNReadyNodes, which misses nodes that are NotReady; change the
check to inspect all cluster nodes instead: replace the call to
e2eutil.WaitForNReadyNodes/remainingNodes with a listing of all nodes (e.g., via
guestClient.CoreV1().Nodes().List or an existing helper like
e2eutil.WaitForNNodes) and then assert that none of those nodes have Name ==
targetNodeName; update the loop that references remainingNodes and the t.Logf
accordingly so the assertion covers every Node object, not just Ready ones (keep
references to guestClient, targetNodeName, and the assertion using g.Expect).

@JoelSpeed JoelSpeed left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

For API change

@openshift-ci

openshift-ci Bot commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed, jparrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 29, 2026
Extend testAutoscaling to verify CNTRLPLANE-3318: after scaling to 3
nodes, annotate the oldest node with hypershift.openshift.io/scale-down
and assert HCCO propagates cluster.x-k8s.io/delete-machine to the
corresponding Machine. After autoscaler scales back to 1, verify the
annotated node was the one deleted.

Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
@openshift-ci-robot

openshift-ci-robot commented Apr 29, 2026

Copy link
Copy Markdown

@jparrill: This pull request references CNTRLPLANE-3318 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Add hypershift.openshift.io/scale-down annotation constant to the API (NodeScaleDownAnnotation)
  • Implement HCCO Node Controller logic to sync the annotation to CAPI's cluster.x-k8s.io/delete-machine on the corresponding Machine, giving it top priority for deletion during scale-down
  • Removing the annotation from the Node causes the controller to remove delete-machine from the Machine (change-of-mind support)
  • Add E2E coverage in the existing autoscaling test: annotate the oldest node, verify HCCO syncs, verify autoscaler deletes the annotated node
  • Add user-facing documentation under how-to/automated-machine-management/targeted-scale-down.md

Jira

CNTRLPLANE-3318

Test plan

  • Unit tests for reconcileDeleteMachineAnnotation (10 table-driven cases + patch error injection)
  • Unit tests for getMachineForNode (4 cases)
  • E2E: annotate oldest node → verify HCCO sets delete-machine on Machine → autoscaler scale-down → verify annotated node was deleted
  • go build -tags e2e ./test/e2e/... passes
  • go vet -tags e2e ./test/e2e/... passes
  • Integration test script validated on real cluster (4 tests: set annotation, remove annotation, strict value matching, full happy path)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

  • Added an annotation users can apply to individual nodes to mark them for prioritized deletion during NodePool scale-down; the controller propagates that priority to the corresponding management Machine so scale-down prefers annotated nodes.

  • Documentation

  • Added a how-to guide with usage examples, constraints (value must be exactly "true"), and operational notes on targeted scale-down.

  • Tests

  • Expanded unit and e2e tests to cover setting/removing the annotation, error paths, and validating targeted scale-down behavior.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
test/e2e/autoscaling_test.go (1)

203-208: ⚠️ Potential issue | 🟡 Minor

Still only asserting deletion against Ready nodes (can false-pass).

This was previously flagged and still applies: checking only WaitForNReadyNodes can miss the target node if it remains in the cluster as NotReady. Verify absence across all nodes instead.

Proposed hardening
-		remainingNodes := e2eutil.WaitForNReadyNodes(t, ctx, guestClient, numNodes, hostedCluster.Spec.Platform.Type)
-		for _, n := range remainingNodes {
-			g.Expect(n.Name).NotTo(Equal(targetNodeName),
-				"annotated node %s should have been deleted during scale-down", targetNodeName)
-		}
+		_ = e2eutil.WaitForNReadyNodes(t, ctx, guestClient, numNodes, hostedCluster.Spec.Platform.Type)
+		e2eutil.EventuallyObjects(t, ctx,
+			fmt.Sprintf("node %s to be removed from cluster", targetNodeName),
+			func(ctx context.Context) ([]*corev1.Node, error) {
+				nodeList := &corev1.NodeList{}
+				if err := guestClient.List(ctx, nodeList); err != nil {
+					return nil, err
+				}
+				items := make([]*corev1.Node, len(nodeList.Items))
+				for i := range nodeList.Items {
+					items[i] = &nodeList.Items[i]
+				}
+				return items, nil
+			},
+			[]e2eutil.Predicate[[]*corev1.Node]{
+				func(nodes []*corev1.Node) (bool, string, error) {
+					for _, n := range nodes {
+						if n.Name == targetNodeName {
+							return false, "target node still exists", nil
+						}
+					}
+					return true, "target node removed", nil
+				},
+			},
+			nil,
+			e2eutil.WithTimeout(5*time.Minute),
+		)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/autoscaling_test.go` around lines 203 - 208, The test currently only
verifies deletion against the slice returned by e2eutil.WaitForNReadyNodes
(ready nodes) which can false-pass if the annotated node becomes NotReady;
modify the check to list all cluster nodes via the Kubernetes client
(guestClient.CoreV1().Nodes().List or equivalent helper) and assert that no
returned Node.Name equals targetNodeName, replacing or augmenting the loop over
remainingNodes from WaitForNReadyNodes; keep the existing log message but
perform the absence assertion against the full node list to ensure the annotated
node is truly gone.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@test/e2e/autoscaling_test.go`:
- Around line 203-208: The test currently only verifies deletion against the
slice returned by e2eutil.WaitForNReadyNodes (ready nodes) which can false-pass
if the annotated node becomes NotReady; modify the check to list all cluster
nodes via the Kubernetes client (guestClient.CoreV1().Nodes().List or equivalent
helper) and assert that no returned Node.Name equals targetNodeName, replacing
or augmenting the loop over remainingNodes from WaitForNReadyNodes; keep the
existing log message but perform the absence assertion against the full node
list to ensure the annotated node is truly gone.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 76ae52b2-c998-42f1-b779-845e9526e159

📥 Commits

Reviewing files that changed from the base of the PR and between 3c909a0 and 39dc245.

📒 Files selected for processing (1)
  • test/e2e/autoscaling_test.go

@enxebre

enxebre commented Apr 30, 2026

Copy link
Copy Markdown
Member

/hold

The current proposal in the PR just moves that annotation to Nodes. It doesn't enable any practical enhancement feature wise. This is a known approach and has limitations, e.g. what happens if you have multiple nodes annotated.

Different consumers are going to have different expectations. We can not just add annotations for every upcoming use case.
We need to consider a solution that scales and enables multiple business cases. Current specific narrow use case is unblocked by the existing workarounds (machine annotation / make node unhealthy).

We might want to consider having something CAPI upstream more flexible and scalable like
a scale down strategy and consider external-scale-down-priority on each Machine, enabling holistic ordering based on different criteria...

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 30, 2026
@openshift-ci

openshift-ci Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

@jparrill: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@jparrill

Copy link
Copy Markdown
Contributor Author

Hey @enxebre it's there any outcome from the upstream discussion regarding this topic/PR?

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 11, 2026
@openshift-ci

openshift-ci Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jun 11, 2026

Copy link
Copy Markdown

Test Failure Analysis Complete

Job Information

  • Prow Job: tide (merge automation)
  • Build ID: N/A (tide merge gate, not a CI test run)
  • PR: #8369CNTRLPLANE-3318: Support targeted node scale-down for NodePools
  • State: ERROR — "Not mergeable. PR has a merge conflict."
  • Branch: CNTRLPLANE-3318 (last rebased 2026-04-28, 43 days stale)

Test Failure Analysis

Error

tide: Not mergeable. PR has a merge conflict.
mergeStateStatus: DIRTY
mergeable: CONFLICTING

Summary

This is not a test failure — no Prow CI test jobs ran or failed. The PR is blocked by tide (the merge automation bot) because the branch has fallen 43 days behind main and now has git merge conflicts in at least 4 files. Seven commits merged to main since the PR's base SHA (60802b1b) have modified the same files this PR touches, making a clean merge impossible. The most impactful conflicts are: the CAPI 1.11 upgrade rewrote large sections of autoscaling_test.go (73 additions/17 deletions) where this PR also adds a targeted scale-down test; the CAPI 1.11 import path change in node.go; the osImageStream feature addition to nodepool_types.go; and multiple regenerations of aggregated-docs.md.

Root Cause

The PR branch CNTRLPLANE-3318 was last rebased on 2026-04-28 (base SHA 60802b1b18c8). Since then, 7 commits have landed on main that modify the same files this PR touches, creating irreconcilable merge conflicts:

  1. api/hypershift/v1beta1/nodepool_types.go (and its vendored copy): PR CNTRLPLANE-3022: Add osImageStream to NodePool spec and status #8675 (CNTRLPLANE-3022, merged 2026-06-09) added 37 lines introducing OSImageStream fields to NodePoolSpec and NodePoolStatus. This PR adds NodeScaleDownAnnotation constants to the same file. While they touch different sections, the structural changes to the type definitions likely cause a conflict.

  2. test/e2e/autoscaling_test.go: This is the most severe conflict. Two commits modified this file substantially:

  3. control-plane-operator/hostedclusterconfigoperator/controllers/node/node.go: The CAPI 1.11 upgrade changed the import from capiv1 "sigs.k8s.io/cluster-api/api/v1beta1" to capiv1 "sigs.k8s.io/cluster-api/api/core/v1beta1". This PR modifies the same file to add scale-down annotation reconciliation logic, and uses the old import path.

  4. docs/content/reference/aggregated-docs.md: Auto-generated documentation file modified by at least 3 commits on main (v2 docs, CNTRLPLANE-3022, Azure topology Phase 1). This file conflicts on any concurrent API change.

Recommendations
  1. Rebase the PR onto current main — This is the only fix. Run:

    git fetch upstream
    git rebase upstream/main

    Resolve conflicts in the 4 files listed above. Pay special attention to:

    • Update CAPI import paths to sigs.k8s.io/cluster-api/api/core/v1beta1 in node.go
    • Reconcile autoscaling_test.go with the new CAPI 1.11 native capacity logic
    • Regenerate aggregated-docs.md after resolving API type conflicts
    • Ensure nodepool_types.go constants coexist with the new OSImageStream fields
  2. Verify unit tests pass locally after rebase — The CAPI 1.11 import path change will require updating the PR's test code and controller code to use the new module path.

  3. Consider rebasing more frequently — The branch was 43 days stale. For a repo with this level of activity (~7 conflicting merges in 6 weeks), rebasing weekly would avoid large conflict resolution sessions.

  4. After rebasing, push to trigger Prow CI jobs — Currently no e2e jobs (e2e-aws, e2e-kubevirt-aws-ovn-reduced, etc.) have run because tide blocks all CI when the PR is unmergeable.

Evidence
Evidence Detail
GitHub merge state mergeable: false, mergeable_state: dirty, mergeStateStatus: DIRTY
tide status state: error — "Not mergeable. PR has a merge conflict."
PR base SHA 60802b1b18c8 (merged 2026-04-28 — 43 days ago)
PR head SHA 39dc245ed9ec (last commit 2026-04-29)
Conflicting file: nodepool_types.go PR #8675 (CNTRLPLANE-3022, 2026-06-09) added OSImageStream fields (+37 lines to spec/status)
Conflicting file: autoscaling_test.go Commit e9bfcaa4c065 (2026-06-02) rewrote testScaleFromZero with CAPI 1.11 native capacity (+73/-17 lines)
Conflicting file: node.go Commit 33db39700007 (2026-06-02) changed CAPI import path: api/v1beta1api/core/v1beta1
Conflicting file: aggregated-docs.md 3 commits modified auto-generated docs (2026-06-04, 2026-06-09, 2026-06-10)
CI jobs not running All Prow e2e jobs show "Waiting for pipeline condition" — blocked by merge conflict

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/api Indicates the PR includes changes for the API area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/documentation Indicates the PR includes changes for documentation area/testing Indicates the PR includes changes for e2e testing do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants