CNTRLPLANE-3318: Support targeted node scale-down for NodePools#8369
CNTRLPLANE-3318: Support targeted node scale-down for NodePools#8369jparrill wants to merge 5 commits into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@jparrill: This pull request references CNTRLPLANE-3318 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
📝 WalkthroughWalkthroughAdds a new exported annotation constant ( Sequence Diagram(s)sequenceDiagram
actor User
participant Node as Node (hosted cluster)
participant Controller as HCCO Node Controller
participant Machine as Machine (management cluster)
participant Autoscaler as Scale-Down/Autoscaler
User->>Node: Annotate with hypershift.openshift.io/scale-down=true
Controller->>Node: Watch & Reconcile
Node-->>Controller: Return annotation value
Controller->>Machine: Retrieve owning Machine
Machine-->>Controller: Return Machine object
alt Annotation == "true"
Controller->>Machine: Patch cluster.x-k8s.io/delete-machine=yes
else Annotation missing/false
Controller->>Machine: Remove cluster.x-k8s.io/delete-machine annotation
end
Machine-->>Controller: Patch acknowledged
Autoscaler->>Machine: Check for deletion-priority annotation
Machine-->>Autoscaler: Deletion annotation present/absent
Autoscaler->>Node: Prioritize for deletion during scale-down (if present)
🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsTimed out fetching pipeline failures after 30000ms Review rate limit: 9/10 reviews remaining, refill in 6 minutes. Comment |
8ff992c to
91d6822
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
control-plane-operator/hostedclusterconfigoperator/controllers/node/node_test.go (1)
158-159: Assert the delete-machine value, not just its presence.The happy-path cases still pass if the controller writes any value here. Since
reconcileDeleteMachineAnnotationis supposed to setcluster.x-k8s.io/delete-machine=yes, the positive assertions should check for"yes"as well.Possible test tightening
- _, hasDeleteAnnotation := updatedMachine.Annotations[capiv1.DeleteMachineAnnotation] + value, hasDeleteAnnotation := updatedMachine.Annotations[capiv1.DeleteMachineAnnotation] g.Expect(hasDeleteAnnotation).To(Equal(tc.expectDeleteMachineOnMachine)) + if tc.expectDeleteMachineOnMachine { + g.Expect(value).To(Equal("yes")) + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/node/node_test.go` around lines 158 - 159, The test currently only checks presence of capiv1.DeleteMachineAnnotation on updatedMachine but must assert the actual value set by reconcileDeleteMachineAnnotation; update the assertion to read the annotation value (e.g., val, ok := updatedMachine.Annotations[capiv1.DeleteMachineAnnotation]) and if tc.expectDeleteMachineOnMachine is true assert ok is true and val equals "yes", otherwise assert ok is false (or that val is not "yes"). This ensures the test verifies the controller writes "cluster.x-k8s.io/delete-machine=yes" rather than any value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@control-plane-operator/hostedclusterconfigoperator/controllers/node/node.go`:
- Around line 50-58: The current reconciliation swallows transport/cache/context
errors from r.getMachineForNode; update getMachineForNode to return a sentinel
error (e.g., errMissingMachineReference) when CAPI annotations are missing, and
change the caller logic in Node reconciliation to treat only that sentinel and
apierrors.IsNotFound as expected (returning ctrl.Result{}, nil) while
propagating any other errors (including transport/context/cache errors from
r.client.Get) by returning ctrl.Result{}, err. Reference getMachineForNode,
errMissingMachineReference, and r.client.Get when making these changes.
In `@docs/content/how-to/automated-machine-management/targeted-scale-down.md`:
- Around line 21-49: The markdown code fences in targeted-scale-down.md are
using a style that trips markdownlint MD046; update all three code blocks (the
ones showing the oc --kubeconfig annotate node command, the oc -n
<nodepool-namespace> scale nodepool command, and the multi-node annotate
example) to use the repository's configured fenced-code language tag (e.g.,
change the current ```bash fences to the project's expected tag such as ```sh or
```console) so they match the repo’s code block style and stop triggering MD046.
---
Nitpick comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/node/node_test.go`:
- Around line 158-159: The test currently only checks presence of
capiv1.DeleteMachineAnnotation on updatedMachine but must assert the actual
value set by reconcileDeleteMachineAnnotation; update the assertion to read the
annotation value (e.g., val, ok :=
updatedMachine.Annotations[capiv1.DeleteMachineAnnotation]) and if
tc.expectDeleteMachineOnMachine is true assert ok is true and val equals "yes",
otherwise assert ok is false (or that val is not "yes"). This ensures the test
verifies the controller writes "cluster.x-k8s.io/delete-machine=yes" rather than
any value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 1586a3e4-175d-4d38-bb7e-d047ddb23423
⛔ Files ignored due to path filters (1)
vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.gois excluded by!vendor/**,!**/vendor/**
📒 Files selected for processing (6)
api/hypershift/v1beta1/nodepool_types.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/node/node.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/node/node_test.godocs/content/how-to/automated-machine-management/targeted-scale-down.mddocs/mkdocs.ymltest/e2e/autoscaling_test.go
| machine, err := r.getMachineForNode(ctx, node) | ||
| if err != nil { | ||
| var apiErr *apierrors.StatusError | ||
| if errors.As(err, &apiErr) && !apierrors.IsNotFound(err) { | ||
| // Return error and retry only if the API interaction failed. Other errors are because the nodeToNodePoolName expected | ||
| // annotations are not in place yet, so we'll reconcile triggered by the event which sets them in the Node. | ||
| return ctrl.Result{}, err | ||
| } else { | ||
| log.Error(err, "failed to get nodePool name from Node") | ||
| return ctrl.Result{}, nil | ||
| } | ||
| log.Error(err, "failed to get Machine for Node, CAPI annotations may not be set yet") | ||
| return ctrl.Result{}, nil | ||
| } |
There was a problem hiding this comment.
Propagate unexpected machine lookup failures.
This branch only surfaces StatusErrors. If r.client.Get fails with a transport/cache/context error, it gets treated as “annotations may not be set yet” and reconciliation exits cleanly, which can silently stop both label sync and delete-machine sync until another Node event arrives.
Suggested direction
machine, err := r.getMachineForNode(ctx, node)
if err != nil {
- var apiErr *apierrors.StatusError
- if errors.As(err, &apiErr) && !apierrors.IsNotFound(err) {
- return ctrl.Result{}, err
- }
- log.Error(err, "failed to get Machine for Node, CAPI annotations may not be set yet")
- return ctrl.Result{}, nil
+ switch {
+ case apierrors.IsNotFound(err), errors.Is(err, errMissingMachineReference):
+ log.Info("Machine reference for Node is not available yet", "node", node.Name)
+ return ctrl.Result{}, nil
+ default:
+ return ctrl.Result{}, err
+ }
}getMachineForNode would need to return a sentinel like errMissingMachineReference for the missing-annotation cases.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@control-plane-operator/hostedclusterconfigoperator/controllers/node/node.go`
around lines 50 - 58, The current reconciliation swallows
transport/cache/context errors from r.getMachineForNode; update
getMachineForNode to return a sentinel error (e.g., errMissingMachineReference)
when CAPI annotations are missing, and change the caller logic in Node
reconciliation to treat only that sentinel and apierrors.IsNotFound as expected
(returning ctrl.Result{}, nil) while propagating any other errors (including
transport/context/cache errors from r.client.Get) by returning ctrl.Result{},
err. Reference getMachineForNode, errMissingMachineReference, and r.client.Get
when making these changes.
| ```bash | ||
| oc --kubeconfig <hosted-cluster-kubeconfig> annotate node <node-name> hypershift.openshift.io/scale-down=true | ||
| ``` | ||
|
|
||
| Then scale down the NodePool: | ||
|
|
||
| ```bash | ||
| oc -n <nodepool-namespace> scale nodepool <nodepool-name> --replicas=<current - 1> | ||
| ``` | ||
|
|
||
| The annotated node will be selected for deletion before any non-annotated nodes. | ||
|
|
||
| ### Changing your mind | ||
|
|
||
| If you annotated a node but decide you no longer want it removed, simply remove the annotation **before** the scale-down completes: | ||
|
|
||
| ```bash | ||
| oc --kubeconfig <hosted-cluster-kubeconfig> annotate node <node-name> hypershift.openshift.io/scale-down- | ||
| ``` | ||
|
|
||
| The controller will remove the `delete-machine` annotation from the corresponding Machine, returning it to normal deletion priority. | ||
|
|
||
| ### Marking multiple nodes | ||
|
|
||
| You can annotate multiple nodes simultaneously. When a scale-down occurs, CAPI will prioritize all annotated machines for deletion. If the scale-down removes fewer machines than are annotated, CAPI selects among the annotated machines using its default ordering (newest first). | ||
|
|
||
| ```bash | ||
| oc --kubeconfig <hosted-cluster-kubeconfig> annotate node node-1 node-2 hypershift.openshift.io/scale-down=true | ||
| ``` |
There was a problem hiding this comment.
Match the configured Markdown code block style.
markdownlint is already flagging these examples with MD046, so this page will keep docs lint noisy until the blocks are converted to the repo’s expected style.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 21-21: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
[warning] 27-27: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
[warning] 37-37: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
[warning] 47-47: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/content/how-to/automated-machine-management/targeted-scale-down.md`
around lines 21 - 49, The markdown code fences in targeted-scale-down.md are
using a style that trips markdownlint MD046; update all three code blocks (the
ones showing the oc --kubeconfig annotate node command, the oc -n
<nodepool-namespace> scale nodepool command, and the multi-node annotate
example) to use the repository's configured fenced-code language tag (e.g.,
change the current ```bash fences to the project's expected tag such as ```sh or
```console) so they match the repo’s code block style and stop triggering MD046.
91d6822 to
d13c955
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8369 +/- ##
=======================================
Coverage 36.46% 36.47%
=======================================
Files 765 765
Lines 93256 93274 +18
=======================================
+ Hits 34010 34024 +14
- Misses 56532 56535 +3
- Partials 2714 2715 +1
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Add hypershift.openshift.io/scale-down annotation constant to the NodePool API types. When set to "true" on a Node in the hosted cluster, the HCCO Node Controller will set cluster.x-k8s.io/delete-machine=yes on the corresponding Machine, giving it top priority for deletion during NodePool scale-down. Ref: CNTRLPLANE-3318 Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
Add documentation explaining how to use the hypershift.openshift.io/scale-down annotation for targeted node deletion during NodePool scale-down operations. Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
d13c955 to
3c909a0
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
docs/content/how-to/automated-machine-management/targeted-scale-down.md (1)
21-49:⚠️ Potential issue | 🟡 MinorCode block style still violates MD046 in this doc.
Line 21, Line 27, Line 37, and Line 47 are still fenced blocks and continue to trigger markdownlint (
Expected: indented; Actual: fenced).Proposed lint-compliant edit
-```bash -oc --kubeconfig <hosted-cluster-kubeconfig> annotate node <node-name> hypershift.openshift.io/scale-down=true -``` + oc --kubeconfig <hosted-cluster-kubeconfig> annotate node <node-name> hypershift.openshift.io/scale-down=true @@ -```bash -oc -n <nodepool-namespace> scale nodepool <nodepool-name> --replicas=<current - 1> -``` + oc -n <nodepool-namespace> scale nodepool <nodepool-name> --replicas=<current - 1> @@ -```bash -oc --kubeconfig <hosted-cluster-kubeconfig> annotate node <node-name> hypershift.openshift.io/scale-down- -``` + oc --kubeconfig <hosted-cluster-kubeconfig> annotate node <node-name> hypershift.openshift.io/scale-down- @@ -```bash -oc --kubeconfig <hosted-cluster-kubeconfig> annotate node node-1 node-2 hypershift.openshift.io/scale-down=true -``` + oc --kubeconfig <hosted-cluster-kubeconfig> annotate node node-1 node-2 hypershift.openshift.io/scale-down=true🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/content/how-to/automated-machine-management/targeted-scale-down.md` around lines 21 - 49, Replace the fenced triple-backtick bash blocks with indented code blocks to satisfy MD046: convert the four fenced snippets that contain the oc commands (the annotate node <node-name> hypershift.openshift.io/scale-down=true command, the oc -n <nodepool-namespace> scale nodepool <nodepool-name> --replicas=<current - 1> command, the annotate node <node-name> hypershift.openshift.io/scale-down- command, and the annotate node node-1 node-2 hypershift.openshift.io/scale-down=true command) to use four-space indentation rather than ``` fences so markdownlint no longer flags MD046.control-plane-operator/hostedclusterconfigoperator/controllers/node/node.go (1)
50-58:⚠️ Potential issue | 🟠 MajorDo not swallow non-NotFound machine lookup failures.
Line 52-57 still drops many unexpected errors (e.g., cache/client/context failures) by logging and returning success. This can silently stop both label sync and delete-machine sync until another event. Treat only “missing reference”/NotFound as expected; return other errors.
Suggested fix direction
+var errMissingMachineReference = errors.New("missing machine reference annotations") + func (r *reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) { @@ machine, err := r.getMachineForNode(ctx, node) if err != nil { - var apiErr *apierrors.StatusError - if errors.As(err, &apiErr) && !apierrors.IsNotFound(err) { - return ctrl.Result{}, err - } - log.Error(err, "failed to get Machine for Node, CAPI annotations may not be set yet") - return ctrl.Result{}, nil + switch { + case apierrors.IsNotFound(err), errors.Is(err, errMissingMachineReference): + log.Info("Machine reference for Node is not available yet", "node", node.Name) + return ctrl.Result{}, nil + default: + return ctrl.Result{}, err + } } @@ func (r *reconciler) getMachineForNode(ctx context.Context, node *corev1.Node) (*capiv1.Machine, error) { machineName, ok := node.GetAnnotations()[capiv1.MachineAnnotation] if !ok || machineName == "" { - return nil, fmt.Errorf("failed to find MachineAnnotation on Node %q", node.Name) + return nil, fmt.Errorf("%w: missing %s on Node %q", errMissingMachineReference, capiv1.MachineAnnotation, node.Name) } @@ machineNamespace, ok := node.GetAnnotations()[capiv1.ClusterNamespaceAnnotation] if !ok || machineNamespace == "" { - return nil, fmt.Errorf("failed to find ClusterNamespaceAnnotation on Node %q", node.Name) + return nil, fmt.Errorf("%w: missing %s on Node %q", errMissingMachineReference, capiv1.ClusterNamespaceAnnotation, node.Name) }Also applies to: 150-159
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/node/node.go` around lines 50 - 58, The current error handling around r.getMachineForNode in node.go swallows non-NotFound failures by logging and returning success; change it so only apierrors.IsNotFound (missing reference) is treated as expected and returns ctrl.Result{}, nil, while any other error (including context/cache/client failures) is returned up the stack (return ctrl.Result{}, err). Update the block that inspects errors returned from r.getMachineForNode (and the analogous block around lines 150-159) to use errors.As to detect *apierrors.StatusError and then call apierrors.IsNotFound to decide; for all non-NotFound errors, do not log-and-continue—return the error. Ensure log messages remain informative when returning errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/autoscaling_test.go`:
- Around line 203-208: The test currently verifies scale-down by inspecting only
the slice returned from e2eutil.WaitForNReadyNodes, which misses nodes that are
NotReady; change the check to inspect all cluster nodes instead: replace the
call to e2eutil.WaitForNReadyNodes/remainingNodes with a listing of all nodes
(e.g., via guestClient.CoreV1().Nodes().List or an existing helper like
e2eutil.WaitForNNodes) and then assert that none of those nodes have Name ==
targetNodeName; update the loop that references remainingNodes and the t.Logf
accordingly so the assertion covers every Node object, not just Ready ones (keep
references to guestClient, targetNodeName, and the assertion using g.Expect).
---
Duplicate comments:
In `@control-plane-operator/hostedclusterconfigoperator/controllers/node/node.go`:
- Around line 50-58: The current error handling around r.getMachineForNode in
node.go swallows non-NotFound failures by logging and returning success; change
it so only apierrors.IsNotFound (missing reference) is treated as expected and
returns ctrl.Result{}, nil, while any other error (including
context/cache/client failures) is returned up the stack (return ctrl.Result{},
err). Update the block that inspects errors returned from r.getMachineForNode
(and the analogous block around lines 150-159) to use errors.As to detect
*apierrors.StatusError and then call apierrors.IsNotFound to decide; for all
non-NotFound errors, do not log-and-continue—return the error. Ensure log
messages remain informative when returning errors.
In `@docs/content/how-to/automated-machine-management/targeted-scale-down.md`:
- Around line 21-49: Replace the fenced triple-backtick bash blocks with
indented code blocks to satisfy MD046: convert the four fenced snippets that
contain the oc commands (the annotate node <node-name>
hypershift.openshift.io/scale-down=true command, the oc -n <nodepool-namespace>
scale nodepool <nodepool-name> --replicas=<current - 1> command, the annotate
node <node-name> hypershift.openshift.io/scale-down- command, and the annotate
node node-1 node-2 hypershift.openshift.io/scale-down=true command) to use
four-space indentation rather than ``` fences so markdownlint no longer flags
MD046.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 2351a309-7791-437b-8ee4-64b336d361b6
⛔ Files ignored due to path filters (2)
docs/content/reference/aggregated-docs.mdis excluded by!docs/content/reference/aggregated-docs.mdvendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.gois excluded by!vendor/**,!**/vendor/**
📒 Files selected for processing (6)
api/hypershift/v1beta1/nodepool_types.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/node/node.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/node/node_test.godocs/content/how-to/automated-machine-management/targeted-scale-down.mddocs/mkdocs.ymltest/e2e/autoscaling_test.go
✅ Files skipped from review due to trivial changes (2)
- docs/mkdocs.yml
- api/hypershift/v1beta1/nodepool_types.go
🚧 Files skipped from review as they are similar to previous changes (1)
- control-plane-operator/hostedclusterconfigoperator/controllers/node/node_test.go
| remainingNodes := e2eutil.WaitForNReadyNodes(t, ctx, guestClient, numNodes, hostedCluster.Spec.Platform.Type) | ||
| for _, n := range remainingNodes { | ||
| g.Expect(n.Name).NotTo(Equal(targetNodeName), | ||
| "annotated node %s should have been deleted during scale-down", targetNodeName) | ||
| } | ||
| t.Logf("Targeted scale-down verified: annotated node %s was deleted", targetNodeName) |
There was a problem hiding this comment.
Assert deletion against all Nodes, not only Ready Nodes.
At Line 203-208, the check only scans the ready-node slice. The test can pass if the targeted node still exists but is NotReady while another node was deleted.
Proposed hardening
- remainingNodes := e2eutil.WaitForNReadyNodes(t, ctx, guestClient, numNodes, hostedCluster.Spec.Platform.Type)
- for _, n := range remainingNodes {
- g.Expect(n.Name).NotTo(Equal(targetNodeName),
- "annotated node %s should have been deleted during scale-down", targetNodeName)
- }
+ _ = e2eutil.WaitForNReadyNodes(t, ctx, guestClient, numNodes, hostedCluster.Spec.Platform.Type)
+ e2eutil.EventuallyObjects(t, ctx,
+ fmt.Sprintf("node %s to be removed from cluster", targetNodeName),
+ func(ctx context.Context) ([]*corev1.Node, error) {
+ nodeList := &corev1.NodeList{}
+ if err := guestClient.List(ctx, nodeList); err != nil {
+ return nil, err
+ }
+ items := make([]*corev1.Node, len(nodeList.Items))
+ for i := range nodeList.Items {
+ items[i] = &nodeList.Items[i]
+ }
+ return items, nil
+ },
+ []e2eutil.Predicate[[]*corev1.Node]{
+ func(nodes []*corev1.Node) (bool, string, error) {
+ for _, n := range nodes {
+ if n.Name == targetNodeName {
+ return false, "target node still exists", nil
+ }
+ }
+ return true, "target node removed", nil
+ },
+ },
+ nil,
+ e2eutil.WithTimeout(5*time.Minute),
+ )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/autoscaling_test.go` around lines 203 - 208, The test currently
verifies scale-down by inspecting only the slice returned from
e2eutil.WaitForNReadyNodes, which misses nodes that are NotReady; change the
check to inspect all cluster nodes instead: replace the call to
e2eutil.WaitForNReadyNodes/remainingNodes with a listing of all nodes (e.g., via
guestClient.CoreV1().Nodes().List or an existing helper like
e2eutil.WaitForNNodes) and then assert that none of those nodes have Name ==
targetNodeName; update the loop that references remainingNodes and the t.Logf
accordingly so the assertion covers every Node object, not just Ready ones (keep
references to guestClient, targetNodeName, and the assertion using g.Expect).
JoelSpeed
left a comment
There was a problem hiding this comment.
/approve
For API change
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: JoelSpeed, jparrill The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Extend testAutoscaling to verify CNTRLPLANE-3318: after scaling to 3 nodes, annotate the oldest node with hypershift.openshift.io/scale-down and assert HCCO propagates cluster.x-k8s.io/delete-machine to the corresponding Machine. After autoscaler scales back to 1, verify the annotated node was the one deleted. Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
3c909a0 to
39dc245
Compare
|
@jparrill: This pull request references CNTRLPLANE-3318 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
test/e2e/autoscaling_test.go (1)
203-208:⚠️ Potential issue | 🟡 MinorStill only asserting deletion against Ready nodes (can false-pass).
This was previously flagged and still applies: checking only
WaitForNReadyNodescan miss the target node if it remains in the cluster as NotReady. Verify absence across all nodes instead.Proposed hardening
- remainingNodes := e2eutil.WaitForNReadyNodes(t, ctx, guestClient, numNodes, hostedCluster.Spec.Platform.Type) - for _, n := range remainingNodes { - g.Expect(n.Name).NotTo(Equal(targetNodeName), - "annotated node %s should have been deleted during scale-down", targetNodeName) - } + _ = e2eutil.WaitForNReadyNodes(t, ctx, guestClient, numNodes, hostedCluster.Spec.Platform.Type) + e2eutil.EventuallyObjects(t, ctx, + fmt.Sprintf("node %s to be removed from cluster", targetNodeName), + func(ctx context.Context) ([]*corev1.Node, error) { + nodeList := &corev1.NodeList{} + if err := guestClient.List(ctx, nodeList); err != nil { + return nil, err + } + items := make([]*corev1.Node, len(nodeList.Items)) + for i := range nodeList.Items { + items[i] = &nodeList.Items[i] + } + return items, nil + }, + []e2eutil.Predicate[[]*corev1.Node]{ + func(nodes []*corev1.Node) (bool, string, error) { + for _, n := range nodes { + if n.Name == targetNodeName { + return false, "target node still exists", nil + } + } + return true, "target node removed", nil + }, + }, + nil, + e2eutil.WithTimeout(5*time.Minute), + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/autoscaling_test.go` around lines 203 - 208, The test currently only verifies deletion against the slice returned by e2eutil.WaitForNReadyNodes (ready nodes) which can false-pass if the annotated node becomes NotReady; modify the check to list all cluster nodes via the Kubernetes client (guestClient.CoreV1().Nodes().List or equivalent helper) and assert that no returned Node.Name equals targetNodeName, replacing or augmenting the loop over remainingNodes from WaitForNReadyNodes; keep the existing log message but perform the absence assertion against the full node list to ensure the annotated node is truly gone.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@test/e2e/autoscaling_test.go`:
- Around line 203-208: The test currently only verifies deletion against the
slice returned by e2eutil.WaitForNReadyNodes (ready nodes) which can false-pass
if the annotated node becomes NotReady; modify the check to list all cluster
nodes via the Kubernetes client (guestClient.CoreV1().Nodes().List or equivalent
helper) and assert that no returned Node.Name equals targetNodeName, replacing
or augmenting the loop over remainingNodes from WaitForNReadyNodes; keep the
existing log message but perform the absence assertion against the full node
list to ensure the annotated node is truly gone.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 76ae52b2-c998-42f1-b779-845e9526e159
📒 Files selected for processing (1)
test/e2e/autoscaling_test.go
|
/hold The current proposal in the PR just moves that annotation to Nodes. It doesn't enable any practical enhancement feature wise. This is a known approach and has limitations, e.g. what happens if you have multiple nodes annotated. Different consumers are going to have different expectations. We can not just add annotations for every upcoming use case. We might want to consider having something CAPI upstream more flexible and scalable like |
|
@jparrill: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Hey @enxebre it's there any outcome from the upstream discussion regarding this topic/PR? |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThis is not a test failure — no Prow CI test jobs ran or failed. The PR is blocked by tide (the merge automation bot) because the branch has fallen 43 days behind Root CauseThe PR branch
Recommendations
Evidence
|
Summary
hypershift.openshift.io/scale-downannotation constant to the API (NodeScaleDownAnnotation)cluster.x-k8s.io/delete-machineon the corresponding Machine, giving it top priority for deletion during scale-downdelete-machinefrom the Machine (change-of-mind support)how-to/automated-machine-management/targeted-scale-down.mdJira
CNTRLPLANE-3318
Test plan
reconcileDeleteMachineAnnotation(10 table-driven cases + patch error injection)getMachineForNode(4 cases)delete-machineon Machine → autoscaler scale-down → verify annotated node was deletedgo build -tags e2e ./test/e2e/...passesgo vet -tags e2e ./test/e2e/...passes🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Tests