Skip to content

CNTRLPLANE-3515: test(e2e): add day-2 label/taint no-rollout verification#8595

Open
mgencur wants to merge 1 commit into
openshift:mainfrom
mgencur:node_labels_and_taints
Open

CNTRLPLANE-3515: test(e2e): add day-2 label/taint no-rollout verification#8595
mgencur wants to merge 1 commit into
openshift:mainfrom
mgencur:node_labels_and_taints

Conversation

@mgencur

@mgencur mgencur commented May 27, 2026

Copy link
Copy Markdown
Contributor

What this PR does / why we need it:

Extend NodePoolDay2TagsTest to patch nodeLabels and taints day-2 and verify no rolling upgrade is triggered via NodePool conditions and MachineDeployment generation.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CNTRLPLANE-3515

Special notes for your reviewer:

The original test also scaled up the NodePool and verified that the new nodes actually have the labels/taints. But I think it's not necessary to test it. The existing NodePool upgrade tests already set labels/taints at the beginning and verify these are properly applied to nodes after initial start. These tests together with the new one from this PR cover the whole original test

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Tests
    • Enhanced end-to-end testing for NodePool day‑2 scenarios.
    • Validate applying EC2 resource tags, node labels, and taints without triggering a rolling update.
    • Monitor update-related NodePool conditions and assert rolling-update is not triggered.
    • Re-check MachineDeployment to ensure its generation remains unchanged after the updates.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 27, 2026
@openshift-ci-robot

openshift-ci-robot commented May 27, 2026

Copy link
Copy Markdown

@mgencur: This pull request references CNTRLPLANE-3515 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Extend NodePoolDay2TagsTest to patch nodeLabels and taints day-2 and verify no rolling upgrade is triggered via NodePool conditions and MachineDeployment generation.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CNTRLPLANE-3515

Special notes for your reviewer:

The original test also scaled up the NodePool and verified that the new nodes actually have the labels/taints. But I think it's not necessary to test it. The existing NodePool upgrade tests already set labels/taints at the beginning and verify these are properly applied to nodes after initial start. These tests together with the new one from this PR cover the whole original test

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added do-not-merge/needs-area area/testing Indicates the PR includes changes for e2e testing labels May 27, 2026
@openshift-ci openshift-ci Bot requested review from devguyio and enxebre May 27, 2026 10:06
@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 8e7e552d-a4b2-4ebe-8341-38bc0cd39fc7

📥 Commits

Reviewing files that changed from the base of the PR and between 4f8b5f2 and b5d3554.

📒 Files selected for processing (1)
  • test/e2e/nodepool_day2_tags_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/nodepool_day2_tags_test.go

📝 Walkthrough

Walkthrough

This PR updates the nodepool_day2_tags e2e test: it adds the k8s sets import, captures the MachineDeployment.Generation before applying changes, applies EC2 resource tag updates plus node label and taint changes, polls NodePool status while tracking updating-related condition types (using a set and a rollingUpdateTriggered flag), asserts rollingUpdateTriggered is false, and re-fetches the MachineDeployment to verify its Generation is unchanged.


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
Container-Privileges ❌ Error PR adds container manifests with privileged:true, hostPID, hostNetwork, and allowPrivilegeEscalation settings in new files without documented justification. Document justification for privilege requirements in kubelet-config-daemonset.yaml and KubeVirt CSI daemonset manifests.
Test Structure And Quality ⚠️ Warning Two assertions lack meaningful failure messages: lines 128 (HaveKeyWithValue) and 138-140 (ContainElement) have no messages to diagnose failures. Add failure messages to lines 128 and 138-140 assertions following the pattern of other assertions in the test (e.g., "expected day-2 tag to be applied to AWSMachine" and "expected EC2 instance tags to contain day-2 tag").
✅ Passed checks (13 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main change: adding day-2 label/taint no-rollout verification to an E2E test, which aligns with the core purpose of the changeset.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Test uses static name "TestNodePoolDay2Tags" with no dynamic values. All dynamic resource names correctly placed in test body, not test titles.
Microshift Test Compatibility ✅ Passed HyperShift (hosted control planes) and MicroShift (edge distribution) are separate projects. This check is not applicable to HyperShift repository tests.
Single Node Openshift (Sno) Test Compatibility ✅ Passed Test creates single-replica NodePool and doesn't assume multiple nodes, multi-node communication, node scaling, or scheduling across different nodes—all operations work on SNO.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only a test file (nodepool_day2_tags_test.go), not deployment manifests, operators, or controllers. No scheduling constraints are introduced in the code changes.
Ote Binary Stdout Contract ✅ Passed File contains no process-level stdout writes. All code runs within test methods; new sets import has no stdout side effects; no init(), TestMain(), or top-level initializers present.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed Test contains no hardcoded IPv4 addresses, IPv4-only IP parsing, IPv4 CIDRs, or external public internet connectivity; AWS API calls are account-internal, and test properly skips on non-AWS platforms.
No-Weak-Crypto ✅ Passed The PR modifies a Kubernetes e2e test file with no cryptographic operations, weak crypto algorithms, custom crypto implementations, or non-constant-time secret comparisons.
No-Sensitive-Data-In-Logs ✅ Passed Test file contains one logging call with a static message and no sensitive data. No credentials, tokens, API keys, PII, or other sensitive data are logged anywhere in the code.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented May 27, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.49%. Comparing base (8ea786c) to head (26c6325).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8595   +/-   ##
=======================================
  Coverage   41.49%   41.49%           
=======================================
  Files         756      756           
  Lines       93648    93648           
=======================================
  Hits        38855    38855           
  Misses      52057    52057           
  Partials     2736     2736           
Flag Coverage Δ
cmd-support 34.87% <ø> (ø)
cpo-hostedcontrolplane 43.50% <ø> (ø)
cpo-other 43.17% <ø> (ø)
hypershift-operator 51.57% <ø> (ø)
other 31.64% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/nodepool_day2_tags_test.go`:
- Around line 79-88: The test currently replaces NodePool.Spec.NodeLabels and
NodePool.Spec.Taints outright; instead modify the test to merge the day-2 values
into existing fields: ensure NodePool.Spec.NodeLabels is non-nil (create if nil)
and set the "e2e.day2.validation" key to "true" without dropping other keys, and
for NodePool.Spec.Taints append the new hyperv1.Taint (Key:"e2e-day2",
Value:"test", Effect:corev1.TaintEffectPreferNoSchedule) only if an equivalent
taint (same Key, Value, Effect) does not already exist, so existing taints are
preserved rather than overwritten.
- Around line 153-154: The MachineDeployment generation assertion is hardcoded
to 1; instead capture the pre-update generation (e.g., store md.Generation in a
variable like initialGeneration or preUpdateGen before performing the day-2
tag/label/taint updates) and then assert that md.Generation is still equal to
that stored value after the update; update the test in
test/e2e/nodepool_day2_tags_test.go to use the stored pre-update generation for
the final assertion against md.Generation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 9ed09c00-4205-40e5-b9d2-1c5734b44f66

📥 Commits

Reviewing files that changed from the base of the PR and between 89e19f8 and 9ce5563.

📒 Files selected for processing (1)
  • test/e2e/nodepool_day2_tags_test.go

Comment thread test/e2e/nodepool_day2_tags_test.go Outdated
Comment thread test/e2e/nodepool_day2_tags_test.go Outdated
@mgencur mgencur force-pushed the node_labels_and_taints branch from 9ce5563 to 4f8b5f2 Compare May 27, 2026 10:44
Comment thread test/e2e/nodepool_day2_tags_test.go Outdated
@mgencur mgencur force-pushed the node_labels_and_taints branch from 4f8b5f2 to b5d3554 Compare May 29, 2026 04:37
@muraee

muraee commented May 29, 2026

Copy link
Copy Markdown
Contributor

/approve

@openshift-ci

openshift-ci Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mgencur, muraee

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 29, 2026
@jiezhao16

Copy link
Copy Markdown

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 29, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@cwbotbot

cwbotbot commented May 29, 2026

Copy link
Copy Markdown

Test Results

e2e-aws

e2e-aks

@mgencur

mgencur commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

/retest

5 similar comments
@mgencur

mgencur commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@mgencur

mgencur commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@mgencur

mgencur commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@mgencur

mgencur commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@mgencur

mgencur commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

/retest

Extend NodePoolDay2TagsTest to patch nodeLabels and taints day-2
and verify no rolling upgrade is triggered via NodePool conditions
and MachineDeployment generation.

Ref: CNTRLPLANE-3515

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mgencur mgencur force-pushed the node_labels_and_taints branch from b5d3554 to 26c6325 Compare June 9, 2026 08:20
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 9, 2026
@mgencur

mgencur commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Rebasing because Konflux enterprise contract checks were constantly failing.

@jiezhao16

Copy link
Copy Markdown

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 9, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@mgencur

mgencur commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

/retest

1 similar comment
@mgencur

mgencur commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@openshift-ci

openshift-ci Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

@mgencur: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aks-4-22 26c6325 link true /test e2e-aks-4-22
ci/prow/e2e-aws-4-22 26c6325 link true /test e2e-aws-4-22

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jun 10, 2026

Copy link
Copy Markdown

Now I have all the evidence needed. Let me produce the final report.

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

Leaf Failure 1: TestNodePool/HostedCluster0/Main/TestNodePoolPrevReleaseN3 (2700.01s)
  util.go:583: Failed to wait for 1 nodes to become ready for NodePool
  e2e-clusters-2sgml/node-pool-sgcxc-5q9x7 in 45m0s: context deadline exceeded
  - observed **v1.Node collection invalid: expected 1 nodes, got 0

Leaf Failure 2: TestKarpenter/Main/Parallel_provisioning_tests/OpenshiftEC2NodeClass_Kubelet_propagation (538.13s)
  karpenter_test.go:1331: Get "https://10.0.139.107:10250/containerLogs/kube-system/
  kubelet-config-checker/checker": http2: client connection lost

Summary

Both failures are pre-existing flakes unrelated to PR #8595. The PR modifies only test/e2e/nodepool_day2_tags_test.go, and the test it changes (TestNodePoolDay2Tags) passed successfully in 403.23s. The two failing tests — TestNodePoolPrevReleaseN3 (a 4.19 NodePool node failing to register within 45 minutes) and Karpenter's OpenshiftEC2NodeClass_Kubelet_propagation (transient HTTP/2 connection loss to the kubelet API) — are in completely different test files and exercise unrelated code paths. The TestNodePool/HostedCluster0 parent-level condition validation failure is a secondary artifact caused by the PrevReleaseN3 timeout consuming the full test budget.

Root Cause

Failure 1 — TestNodePoolPrevReleaseN3 (infrastructure/payload flake):
This test creates a NodePool using the OCP 4.19 CI payload (4.19.0-0.ci-2026-06-03-210413) on a HostedCluster running 4.22. The AWS EC2 instance was provisioned but never registered as a Ready node within the 45-minute timeout — 0 nodes were observed when 1 was expected. The sibling tests PrevReleaseN1 (4.21) and PrevReleaseN2 (4.20) both passed in ~501s, and PrevReleaseN4 (4.18, version-skew-unsupported) passed in 15s. This points to a transient issue specific to the 4.19 CI payload or its node bootstrap process, not a code regression. The 4.19 payload used was from June 3 (7 days old), and stale CI payloads can have image pull or bootstrap issues.

Failure 2 — OpenshiftEC2NodeClass_Kubelet_propagation (transient network error):
The test successfully provisioned a Karpenter node, created a kubelet-config-checker pod, but then hit an http2: client connection lost error when trying to fetch container logs from the kubelet API at 10.0.139.107:10250. This is a well-known transient error caused by HTTP/2 connection instability between the API server proxy and the kubelet. The error occurred 6 times in the build log, indicating intermittent connectivity issues during this test run. This is not caused by any code change.

Parent Failure — TestNodePool/HostedCluster0 condition validation:
After PrevReleaseN3 timed out (consuming 2700s of the ~4800s total HostedCluster0 test window), the framework's post-test teardown validation ran a 2-second condition check. It expected conditions in initial/unknown state but found the cluster fully healthy (e.g., DataPlaneConnectionAvailable=True instead of Unknown). This is a known framework limitation — the teardown condition check uses initial-state expectations rather than final-state expectations — and is a direct consequence of the PrevReleaseN3 timeout, not an independent failure.

Recommendations
  1. Retest / merge with /retest: Both failures are pre-existing flakes unrelated to the PR's code changes. The PR's own test (TestNodePoolDay2Tags) passed. A /retest should clear these.

  2. TestNodePoolPrevReleaseN3: Consider investigating whether the 4.19 CI payload (4.19.0-0.ci-2026-06-03-210413) has known bootstrap issues on newer management clusters. The 7-day-old payload may have stale image references. If this flakes persistently, consider adding a freshness check for N-3 payloads or increasing the timeout with better diagnostics.

  3. Karpenter Kubelet propagation: The http2: client connection lost error is a known transient issue in Kubernetes kubelet API proxy. No code action needed — this is infrastructure-level flakiness.

  4. Framework condition validation: The teardown condition check that expects Unknown/False conditions on a healthy cluster is a known false-positive pattern. Consider fixing the framework to validate final-state conditions (all True/healthy) rather than initial-state conditions at teardown time.

Evidence
Evidence Detail
PR changed file test/e2e/nodepool_day2_tags_test.go (only file modified)
PR's test result TestNodePoolDay2TagsPASSED (403.23s)
Leaf failure 1 TestNodePoolPrevReleaseN3 — node for 4.19 NodePool never came up (45m timeout)
Leaf failure 2 Kubelet_propagationhttp2: client connection lost fetching kubelet logs
PrevRelease N1 (4.21) PASSED (501.02s)
PrevRelease N2 (4.20) PASSED (501.02s)
PrevRelease N3 (4.19) FAILED (2700.01s) — 0/1 nodes ready
PrevRelease N4 (4.18) PASSED (15.01s) — skew-unsupported, fast path
N3 payload 4.19.0-0.ci-2026-06-03-210413 (7 days old)
http2 errors in log 6 occurrences of http2: client connection lost
HostedCluster0 condition failure Teardown expects initial conditions, got healthy cluster — secondary to N3 timeout
Test suite summary 597 tests, 33 skipped, 8 failures (2 leaf, 6 parent propagation)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants