Skip to content

CORENET-6066: test(e2e): add e2e test for zero-worker HyperShift clusters in daemonset rollout#8176

Open
weliang1 wants to merge 5 commits into
openshift:mainfrom
weliang1:add-ovn-zero-workers-test
Open

CORENET-6066: test(e2e): add e2e test for zero-worker HyperShift clusters in daemonset rollout#8176
weliang1 wants to merge 5 commits into
openshift:mainfrom
weliang1:add-ovn-zero-workers-test

Conversation

@weliang1

@weliang1 weliang1 commented Apr 7, 2026

Copy link
Copy Markdown

What this PR does / why we need it:

Adds comprehensive e2e test for OVN control plane with zero workers to verify control plane upgrade capability without worker nodes.

This test validates that OVN control plane components can successfully deploy and upgrade in HyperShift clusters with zero worker nodes, addressing scenarios such as:

  • Data plane hibernation (workers scaled to zero for cost savings)
  • Autoscaling from zero (no workers until workload arrives)
  • Management cluster updates when worker nodes are unreachable

Test coverage:

  1. Initial OVN deployment readiness with zero workers
  2. OVN DaemonSet behavior (not created or reports 0 desired)
  3. Control plane upgrade from version X to Y
  4. OVN pod rollout during upgrade
  5. All control plane components complete rollout
  6. Network ClusterOperator remains healthy
  7. No degradation or pod crashes

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CORENET-6066

Special notes for your reviewer:

  • Test validated on live cluster (hypershift-ci-373084)
  • Covers upgrade scenario: 4.22.0-223038 → 051707
  • All 8 validation steps passed with zero pod restarts
  • Test duration: ~10 minutes

Checklist:

  • Subject and description added to both, commit and PR
  • Relevant issues have been referenced
  • This change includes docs (inline godoc comments)
  • This change includes unit tests (this IS an e2e test)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Added an e2e test validating a highly-available control plane with zero worker replicas. Verifies control-plane deployment rollout and readiness, accepts absent node daemonset or enforces zero-scheduled node state, optionally exercises an upgrade/image change path with rollout checks, waits for the hosted network operator to report healthy availability, and performs final stability checks after rollouts.

@openshift-ci-robot

Copy link
Copy Markdown

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 7, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 7, 2026

Copy link
Copy Markdown

@weliang1: This pull request references CORENET-6064 which is a valid jira issue.

Details

In response to this:

What this PR does / why we need it:

Adds comprehensive e2e test for OVN control plane with zero workers to verify control plane upgrade capability without worker nodes.

This test validates that OVN control plane components can successfully deploy and upgrade in HyperShift clusters with zero worker nodes, addressing scenarios such as:

  • Data plane hibernation (workers scaled to zero for cost savings)
  • Autoscaling from zero (no workers until workload arrives)
  • Management cluster updates when worker nodes are unreachable

Test coverage:

  1. Initial OVN deployment readiness with zero workers
  2. OVN DaemonSet behavior (not created or reports 0 desired)
  3. Control plane upgrade from version X to Y
  4. OVN pod rollout during upgrade
  5. All control plane components complete rollout
  6. Network ClusterOperator remains healthy
  7. No degradation or pod crashes

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CORENET-6064

Special notes for your reviewer:

  • Test validated on live cluster (hypershift-ci-373084)
  • Covers upgrade scenario: 4.22.0-223038 → 051707
  • All 8 validation steps passed with zero pod restarts
  • Test duration: ~10 minutes

Checklist:

  • Subject and description added to both, commit and PR
  • Relevant issues have been referenced
  • This change includes docs (inline godoc comments)
  • This change includes unit tests (this IS an e2e test)

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 7, 2026
@openshift-ci

openshift-ci Bot commented Apr 7, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented Apr 7, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

A new e2e test TestOVNControlPlaneZeroWorkers is added to validate OVN control-plane behavior for HyperShift hosted clusters with NodePoolReplicas=0. The test derives the hosted control-plane namespace, waits for the ovnkube-control-plane Deployment to become ready and have ReadyReplicas>0, verifies the ovnkube-node DaemonSet is either absent or reports zero desired pods and matches observed generation, optionally patches hostedCluster.spec.release.image to trigger an upgrade and waits for rollout and image changes (plus control-plane/version checks for minimum HyperShift versions), creates a guest kube client to poll the hosted network ClusterOperator until Available=True and neither Progressing nor Degraded are true, then re-validates control-plane readiness and node state.

Sequence Diagram(s)

sequenceDiagram
    participant TestHarness as Test Harness
    participant HostAPI as HostedCluster API
    participant CPDeploy as ovnkube-control-plane Deployment
    participant NodeDS as ovnkube-node DaemonSet
    participant GuestAPI as Guest Kube API (hosted)
    participant ClusterOp as network ClusterOperator

    TestHarness->>HostAPI: Derive hosted control-plane namespace
    TestHarness->>CPDeploy: Wait for Deployment Available / ReadyReplicas>0
    TestHarness->>NodeDS: Check DaemonSet presence
    alt DaemonSet missing
        Note right of TestHarness: acceptable
    else DaemonSet present
        TestHarness->>NodeDS: Assert DesiredNumberScheduled, NumberAvailable, NumberUnavailable == 0
        TestHarness->>NodeDS: Assert ObservedGeneration == Generation
    end
    alt Upgrade image provided and differs
        TestHarness->>HostAPI: Patch hostedCluster.spec.release.image
        TestHarness->>CPDeploy: Wait for rollout (generation, ready/updated == desired)
        TestHarness->>CPDeploy: Verify container image changed
        Note right of TestHarness: For supported HyperShift versions also wait for control-plane rollout and ControlPlaneVersion
    end
    TestHarness->>GuestAPI: Create guest kube client
    loop Poll until success
        GuestAPI->>ClusterOp: Get network ClusterOperator (unstructured)
        ClusterOp-->>GuestAPI: Conditions (Available/Progressing/Degraded)
    end
    TestHarness->>CPDeploy: Final readiness check (ReadyReplicas>0)
    TestHarness->>NodeDS: Final absence or zero desired pods check
Loading
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented Apr 7, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: weliang1
Once this PR has been reviewed and has the lgtm label, please assign cblecker for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Apr 7, 2026
@weliang1 weliang1 changed the title [WIP] CORENET-6064: Add e2e test for zero-worker HyperShift clusters in daemonset rollout [WIP] CORENET-6066: Add e2e test for zero-worker HyperShift clusters in daemonset rollout Apr 7, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 7, 2026

Copy link
Copy Markdown

@weliang1: This pull request references CORENET-6066 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Adds comprehensive e2e test for OVN control plane with zero workers to verify control plane upgrade capability without worker nodes.

This test validates that OVN control plane components can successfully deploy and upgrade in HyperShift clusters with zero worker nodes, addressing scenarios such as:

  • Data plane hibernation (workers scaled to zero for cost savings)
  • Autoscaling from zero (no workers until workload arrives)
  • Management cluster updates when worker nodes are unreachable

Test coverage:

  1. Initial OVN deployment readiness with zero workers
  2. OVN DaemonSet behavior (not created or reports 0 desired)
  3. Control plane upgrade from version X to Y
  4. OVN pod rollout during upgrade
  5. All control plane components complete rollout
  6. Network ClusterOperator remains healthy
  7. No degradation or pod crashes

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CORENET-6064

Special notes for your reviewer:

  • Test validated on live cluster (hypershift-ci-373084)
  • Covers upgrade scenario: 4.22.0-223038 → 051707
  • All 8 validation steps passed with zero pod restarts
  • Test duration: ~10 minutes

Checklist:

  • Subject and description added to both, commit and PR
  • Relevant issues have been referenced
  • This change includes docs (inline godoc comments)
  • This change includes unit tests (this IS an e2e test)

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Apr 7, 2026

Copy link
Copy Markdown

@weliang1: This pull request references CORENET-6066 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Adds comprehensive e2e test for OVN control plane with zero workers to verify control plane upgrade capability without worker nodes.

This test validates that OVN control plane components can successfully deploy and upgrade in HyperShift clusters with zero worker nodes, addressing scenarios such as:

  • Data plane hibernation (workers scaled to zero for cost savings)
  • Autoscaling from zero (no workers until workload arrives)
  • Management cluster updates when worker nodes are unreachable

Test coverage:

  1. Initial OVN deployment readiness with zero workers
  2. OVN DaemonSet behavior (not created or reports 0 desired)
  3. Control plane upgrade from version X to Y
  4. OVN pod rollout during upgrade
  5. All control plane components complete rollout
  6. Network ClusterOperator remains healthy
  7. No degradation or pod crashes

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CORENET-6066

Special notes for your reviewer:

  • Test validated on live cluster (hypershift-ci-373084)
  • Covers upgrade scenario: 4.22.0-223038 → 051707
  • All 8 validation steps passed with zero pod restarts
  • Test duration: ~10 minutes

Checklist:

  • Subject and description added to both, commit and PR
  • Relevant issues have been referenced
  • This change includes docs (inline godoc comments)
  • This change includes unit tests (this IS an e2e test)

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@weliang1 weliang1 force-pushed the add-ovn-zero-workers-test branch from b6f7f5d to c2fa99d Compare April 7, 2026 14:42
@weliang1

weliang1 commented Apr 7, 2026

Copy link
Copy Markdown
Author

/jira refresh

@openshift-ci-robot

openshift-ci-robot commented Apr 7, 2026

Copy link
Copy Markdown

@weliang1: This pull request references CORENET-6066 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@weliang1

weliang1 commented Apr 7, 2026

Copy link
Copy Markdown
Author

/jira refresh

@openshift-ci-robot

openshift-ci-robot commented Apr 7, 2026

Copy link
Copy Markdown

@weliang1: This pull request references CORENET-6066 which is a valid jira issue.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@codecov

codecov Bot commented Apr 7, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.50%. Comparing base (a6c3012) to head (5c57528).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8176   +/-   ##
=======================================
  Coverage   41.50%   41.50%           
=======================================
  Files         758      758           
  Lines       93689    93689           
=======================================
  Hits        38882    38882           
  Misses      52070    52070           
  Partials     2737     2737           
Flag Coverage Δ
cmd-support 34.86% <ø> (ø)
cpo-hostedcontrolplane 43.59% <ø> (ø)
cpo-other 43.17% <ø> (ø)
hypershift-operator 51.57% <ø> (ø)
other 31.64% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@weliang1 weliang1 force-pushed the add-ovn-zero-workers-test branch from c2fa99d to dc2b23a Compare April 7, 2026 15:20
@weliang1 weliang1 changed the title [WIP] CORENET-6066: Add e2e test for zero-worker HyperShift clusters in daemonset rollout [WIP] test: CORENET-6066: Add e2e test for zero-worker HyperShift clusters in daemonset rollout Apr 7, 2026
@weliang1

weliang1 commented Apr 8, 2026

Copy link
Copy Markdown
Author

/test all

@weliang1 weliang1 marked this pull request as ready for review April 8, 2026 13:02
@weliang1 weliang1 changed the title [WIP] test: CORENET-6066: Add e2e test for zero-worker HyperShift clusters in daemonset rollout test: CORENET-6066: Add e2e test for zero-worker HyperShift clusters in daemonset rollout Apr 8, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 8, 2026
@weliang1

weliang1 commented Apr 8, 2026

Copy link
Copy Markdown
Author

/remove-label do-not-merge/work-in-progress

@openshift-ci

openshift-ci Bot commented Apr 8, 2026

Copy link
Copy Markdown
Contributor

@weliang1: The label(s) /remove-label do-not-merge/work-in-progress cannot be applied. These labels are supported: acknowledge-critical-fixes-only, platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, px-approved, docs-approved, qe-approved, ux-approved, no-qe, rebase/manual, cluster-config-api-changed, run-integration-tests, verified, approved, backport-risk-assessed, bugzilla/valid-bug, cherry-pick-approved, jira/skip-dependent-bug-check, jira/valid-bug, ok-to-test, stability-fix-approved, staff-eng-approved. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

Details

In response to this:

/remove-label do-not-merge/work-in-progress

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci Bot requested review from devguyio and enxebre April 8, 2026 13:04
@weliang1

weliang1 commented Apr 8, 2026

Copy link
Copy Markdown
Author

/test e2e-aws

@openshift-ci-robot

Copy link
Copy Markdown

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-21
/test e2e-aws-4-21
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@weliang1

weliang1 commented Apr 9, 2026

Copy link
Copy Markdown
Author

/test e2e-aws

@openshift-ci-robot

openshift-ci-robot commented Apr 9, 2026

Copy link
Copy Markdown

@weliang1: This pull request references CORENET-6066 which is a valid jira issue.

Details

In response to this:

What this PR does / why we need it:

Adds comprehensive e2e test for OVN control plane with zero workers to verify control plane upgrade capability without worker nodes.

This test validates that OVN control plane components can successfully deploy and upgrade in HyperShift clusters with zero worker nodes, addressing scenarios such as:

  • Data plane hibernation (workers scaled to zero for cost savings)
  • Autoscaling from zero (no workers until workload arrives)
  • Management cluster updates when worker nodes are unreachable

Test coverage:

  1. Initial OVN deployment readiness with zero workers
  2. OVN DaemonSet behavior (not created or reports 0 desired)
  3. Control plane upgrade from version X to Y
  4. OVN pod rollout during upgrade
  5. All control plane components complete rollout
  6. Network ClusterOperator remains healthy
  7. No degradation or pod crashes

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CORENET-6066

Special notes for your reviewer:

  • Test validated on live cluster (hypershift-ci-373084)
  • Covers upgrade scenario: 4.22.0-223038 → 051707
  • All 8 validation steps passed with zero pod restarts
  • Test duration: ~10 minutes

Checklist:

  • Subject and description added to both, commit and PR
  • Relevant issues have been referenced
  • This change includes docs (inline godoc comments)
  • This change includes unit tests (this IS an e2e test)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
  • Added an e2e test validating a highly-available control plane with zero worker replicas. Verifies control-plane deployment rollout and readiness, accepts absent node daemonset or enforces zero-scheduled node state, optionally exercises an upgrade/image change path with rollout checks, waits for the hosted network operator to report healthy availability, and performs final stability checks after rollouts.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
test/e2e/ovn_control_plane_zero_workers_test.go (1)

126-131: Don't skip the entire test when only the upgrade image is missing.

Line 130 turns the whole test into SKIP, which also drops the non-upgrade coverage from Steps 1-2 and the post-upgrade-independent health checks later in the test. It would be better to gate only the upgrade-specific steps (or split them into a subtest) so zero-worker OVN validation still runs in jobs without LatestReleaseImage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/ovn_control_plane_zero_workers_test.go` around lines 126 - 131, The
test currently calls t.Skip() when upgradeImage (globalOpts.LatestReleaseImage)
is empty or equal to baselineImage, which skips the entire test; instead, change
the flow so only upgrade-specific steps are gated: check upgradeImage and if
missing/equal only skip or return from the upgrade-related block (the steps that
perform the upgrade and post-upgrade validation) or move those steps into a
subtest (t.Run("upgrade", ...)) that is skipped, while allowing the initial
zero-worker OVN validation and post-upgrade-independent health checks to always
run; update references to upgradeImage, baselineImage and any t.Skip calls
accordingly so the rest of the test is still executed when LatestReleaseImage is
not provided.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/ovn_control_plane_zero_workers_test.go`:
- Around line 154-210: The rollout predicate for the "ovnkube-control-plane"
Deployment can return true on the pre-upgrade revision; update the Eventually
check in the goroutine that reads deployment (the block creating deployment :=
&appsv1.Deployment{} inside g.Eventually) to also verify the pod image has
changed from the recorded baselineImage before returning true: after checking
ready==desired, updated==desired and observedGeneration==generation, fetch the
first container image from deployment.Spec.Template.Spec.Containers[0].Image
and, if baselineImage is non-empty, require newImage != baselineImage (or skip
the image check only when baselineImage is empty) so Eventually only succeeds
once the Deployment rollout actually reflects the new image.

---

Nitpick comments:
In `@test/e2e/ovn_control_plane_zero_workers_test.go`:
- Around line 126-131: The test currently calls t.Skip() when upgradeImage
(globalOpts.LatestReleaseImage) is empty or equal to baselineImage, which skips
the entire test; instead, change the flow so only upgrade-specific steps are
gated: check upgradeImage and if missing/equal only skip or return from the
upgrade-related block (the steps that perform the upgrade and post-upgrade
validation) or move those steps into a subtest (t.Run("upgrade", ...)) that is
skipped, while allowing the initial zero-worker OVN validation and
post-upgrade-independent health checks to always run; update references to
upgradeImage, baselineImage and any t.Skip calls accordingly so the rest of the
test is still executed when LatestReleaseImage is not provided.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 4b2e3eab-4f92-42a7-a589-99ea89428359

📥 Commits

Reviewing files that changed from the base of the PR and between 997a620 and ec4c5c9.

📒 Files selected for processing (1)
  • test/e2e/ovn_control_plane_zero_workers_test.go

Comment thread test/e2e/ovn_control_plane_zero_workers_test.go
@weliang1

weliang1 commented Apr 9, 2026

Copy link
Copy Markdown
Author

/test e2e-aws

1 similar comment
@weliang1

Copy link
Copy Markdown
Author

/test e2e-aws

@weliang1

Copy link
Copy Markdown
Author

/test verify-deps

@weliang1

Copy link
Copy Markdown
Author

/cc @kyrtapz
Please help review the e2e test case for openshift/cluster-network-operator#2897, thanks!

@openshift-ci openshift-ci Bot requested a review from kyrtapz April 14, 2026 13:56
@weliang1

Copy link
Copy Markdown
Author

@enxebre @devguyio
Please help review the e2e test case for openshift/cluster-network-operator#2897, thanks!

@enxebre

enxebre commented May 13, 2026

Copy link
Copy Markdown
Member

Should this rather be an additional sequential validation for TestUpgradeControlPlane? so we don't create a new HC. i.e after existing validations, scale down to zero and run this.

Besides, should we enable a way to create HCs with no infra?
@devguyio @sjenning

@weliang1

Copy link
Copy Markdown
Author

/test e2e-aws

2 similar comments
@weliang1

Copy link
Copy Markdown
Author

/test e2e-aws

@weliang1

Copy link
Copy Markdown
Author

/test e2e-aws

weliang1 added a commit to weliang1/hypershift that referenced this pull request May 21, 2026
Switch TestUpgradeControlPlane to use ExecuteWithoutEnsureValidation to
avoid HostedCluster condition validation race after scaling workers back
from zero.

After the zero-worker validation completes and workers are scaled back to
2 replicas, cluster operators (image-registry, ingress) need additional
time to reconcile before HostedCluster conditions reflect healthy state.
Node Ready status does not guarantee operator availability.

The ExecuteWithoutEnsureValidation method was created specifically for
this scenario but was not being used, causing test timeouts on the
EnsureHostedCluster validation step.

Fixes: openshift#8176 (comment)
@weliang1

Copy link
Copy Markdown
Author

Should this rather be an additional sequential validation for TestUpgradeControlPlane? so we don't create a new HC. i.e after existing validations, scale down to zero and run this.

Besides, should we enable a way to create HCs with no infra? @devguyio @sjenning

@enxebre Your feedback was addressed as integrating the test into TestUpgradeControlPlane. cc: @devguyio @sjenning

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 26, 2026
@openshift-ci

openshift-ci Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor

@weliang1: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aks-4-21 997a620 link true /test e2e-aks-4-21
ci/prow/e2e-aks 997a620 link true /test e2e-aks
ci/prow/e2e-aws-4-21 997a620 link true /test e2e-aws-4-21

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@weliang1 weliang1 force-pushed the add-ovn-zero-workers-test branch from c4ef3c9 to 63b92a7 Compare May 26, 2026 21:49
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 26, 2026
@weliang1

weliang1 commented Jun 8, 2026

Copy link
Copy Markdown
Author

/retest-failed

@weliang1

weliang1 commented Jun 9, 2026

Copy link
Copy Markdown
Author

/retest

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown

Now I have all the evidence needed. Here is the final report:

Test Failure Analysis Complete

Job Information

  • Prow Job: Red Hat Konflux / hypershift-operator-main-enterprise-contract / hypershift-operator-main
  • Build ID: hypershift-operator-main-enterprise-contract-pkpsj (and hypershift-operator-enterprise-contract-jvmhz)
  • Snapshot: hypershift-operator-20260609-134426-000
  • PR: CORENET-6066: test(e2e): add e2e test for zero-worker HyperShift clusters in daemonset rollout #8176CORENET-6066: test(e2e): add e2e test for zero-worker HyperShift clusters in daemonset rollout
  • Branch: add-ovn-zero-workers-test (732 commits behind main)
  • Note: Both failing checks (hypershift-operator-main-enterprise-contract and hypershift-operator-enterprise-contract) exhibit the identical failure (42 failures, 248 successes, 8 warnings) because they validate the same snapshot against two EC policy scenarios.

Test Failure Analysis

Error

Integration test for component hypershift-operator-main snapshot
hypershift-operator-20260609-134426-000 and scenario
hypershift-operator-main-enterprise-contract has failed

verify task: 248 success(es), 8 warning(s), 42 failure(s)

Summary

The Enterprise Contract (EC) verification failures are not caused by the code changes in this PR (which only modify test files under test/e2e/). The failures are caused by the PR branch being 732 commits behind main, resulting in the Konflux build using severely outdated base images and incorrect product version metadata from the stale Containerfile.operator. When the EC verify task validates the built image against current release policies, 42 checks fail because the image contains deprecated base image RPMs and mismatched version labels.

Root Cause

PR #8176's branch diverged from main at commit 899fd2ad, which predates two critical changes:

  1. Apr 20, 2026 — MCE 5.0 branch cut (b9ed901e): Updated product version labels from version=4.22 / cpe=2.12 to version=5.0 / cpe=5.0.
  2. May 6, 2026 — Base image bump (3051552f): Updated go-toolset from 1.25.7-1773318690 to 1.25.9-1778054913 and ubi-minimal from 9.7-1773204619 to 9.7-1777857961.

Because the PR branch's Containerfile.operator still references the old base images and labels, the Konflux build produces an image that:

  • Contains RPMs from ubi-minimal:9.7-1773204619 that may be deprecated, unsigned, or no longer meet current EC policy requirements
  • Uses go-toolset:1.25.7 instead of the expected 1.25.9
  • Has incorrect product metadata (version=4.22, cpe=2.12) instead of the expected 5.0
  • Uses Go 1.25.3 (from go.mod) instead of the current 1.25.7

For comparison, PR #8701 (merged the same day, only 8 commits behind main) passed EC with 0 failures (256 successes, 22 warnings) — confirming the issue is branch staleness, not an EC infrastructure problem.

These failures are unrelated to the PR's code changes, which only touch test/e2e/control_plane_upgrade_test.go and test/e2e/util/hypershift_framework.go — Go test files that are not compiled into the operator container image.

Recommendations
  1. Rebase the PR branch onto current main: This is the only fix needed. Run git rebase main or git merge main to incorporate the updated Containerfile.operator with current base images and version labels. The EC checks will pass after rebuild.

  2. Re-run checks after rebase: After rebasing, push the updated branch and the Konflux pipeline will build a new snapshot with the correct base images and metadata.

  3. Consider enabling automatic rebase/merge-queue: For long-lived PRs (this one was opened Apr 7, 2026), a periodic rebase reminder or merge-queue enforcement would prevent this class of stale-branch failures.

Evidence
Evidence Detail
PR branch staleness 732 commits behind main, merge base at 899fd2ad
PR files changed Only test/e2e/control_plane_upgrade_test.go and test/e2e/util/hypershift_framework.go (no production code)
PR 8176 builder image ubi9/go-toolset:1.25.7-1773318690 (outdated)
Current main builder image ubi9/go-toolset:1.25.9-1778054913
PR 8176 runtime image ubi9/ubi-minimal:9.7-1773204619 (outdated)
Current main runtime image ubi9/ubi-minimal:9.7-1777857961
PR 8176 version label version=4.22, cpe=2.12
Current main version label version=5.0, cpe=5.0
PR 8176 Go version go 1.25.3
Current main Go version go 1.25.7
PR 8701 EC result (same day) 256 successes, 22 warnings, 0 failures (passing)
PR 8176 EC result (same day) 248 successes, 8 warnings, 42 failures (failing)
MCE 5.0 branch cut commit b9ed901e (Apr 20, 2026) — predates PR branch base
Base image bump commit 3051552f (May 6, 2026) — predates PR branch base
Both EC scenarios identical hypershift-operator-main-enterprise-contract and hypershift-operator-enterprise-contract both show 42 failures

weliang1 and others added 5 commits June 9, 2026 11:50
…in daemonset rollout

Verifies that OVN control plane components can successfully upgrade
in HyperShift clusters with zero worker nodes.

This test validates:
- Initial OVN deployment readiness with zero workers
- OVN DaemonSet behavior (not created or reports 0 desired)
- Control plane upgrade from version X to Y
- OVN pod rollout during upgrade
- All control plane components complete rollout
- Network ClusterOperator remains healthy
- No degradation or pod crashes

The test addresses scenarios such as:
- Data plane hibernation (workers scaled to zero for cost savings)
- Autoscaling from zero (no workers until workload arrives)
- Management cluster updates when worker nodes are unreachable

Validated on live cluster:
- Cluster: hypershift-ci-373084
- Upgrade: 4.22.0-223038 → 051707
- Workers: 0 throughout test
- Duration: ~10 minutes
- Result: All 8 steps passed, 0 pod restarts

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…rker test

Addressed CodeRabbit review feedback:
1. Use cancellable ctx instead of testContext in WaitForGuestClient
2. Add safe type assertions with comma-ok checks for condition parsing
3. Fix confusing log output by removing negated booleans

Framework fix:
4. Use NonePlatform instead of globalOpts.Platform to skip framework
   validation that expects worker nodes. This matches the approach used
   by TestHAEtcdChaos for zero-worker scenarios.

The test validates OVN control plane behavior with zero workers, which
is platform-agnostic. NonePlatform allows the test to focus on OVN-specific
validation without requiring cloud provider resources or worker nodes.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
NonePlatform does not deploy OVN-Kubernetes components, causing the test
to fail when looking for ovnkube-control-plane deployment. The test needs
a real platform (AWS) that deploys OVN networking components.

The framework validation correctly handles zero-worker clusters through
clusterOpts.ExpectedNodeCount(), adjusting condition expectations for
clusters without worker nodes.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Address CodeRabbit finding: The rollout predicate could return true on the
pre-upgrade revision if the deployment was already ready with the old image.

Changes:
- Capture baseline generation in addition to baseline image
- Verify deployment.Generation has changed from baseline
- Verify container image has changed from baseline
- Only return true when both generation and image have changed AND
  all replicas are ready/updated

This ensures Eventually waits for the actual upgrade rollout to complete
rather than returning immediately on the pre-upgrade state.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…tests

The standard Execute() method runs EnsureHostedCluster validation in the
after() phase, which incorrectly defaults hasWorkerNodes=true for private
or non-public clusters. This causes ValidateHostedClusterConditions to
expect worker-dependent conditions (DataPlaneConnectionAvailable,
ControlPlaneConnectionAvailable, ClusterVersionAvailable) that cannot be
satisfied in zero-worker cluster configurations.

This commit adds ExecuteWithoutEnsureValidation() method that:
- Skips the problematic after() validation (EnsureHostedCluster)
- Still runs before() validation which correctly uses opts.ExpectedNodeCount()
- Allows tests to provide their own comprehensive validation
- Is specifically designed for non-standard cluster configurations

The TestOVNControlPlaneZeroWorkers test is updated to use this new method,
as it already provides comprehensive Steps 1-8 validation for OVN components
in zero-worker clusters.

This fixes the CI failure where the test timed out waiting for conditions
that cannot be met without worker nodes.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@weliang1 weliang1 force-pushed the add-ovn-zero-workers-test branch from 63b92a7 to 5c57528 Compare June 9, 2026 15:50
@weliang1

Copy link
Copy Markdown
Author

Should this rather be an additional sequential validation for TestUpgradeControlPlane? so we don't create a new HC. i.e after existing validations, scale down to zero and run this.
Besides, should we enable a way to create HCs with no infra? @devguyio @sjenning

@enxebre Your feedback was addressed as integrating the test into TestUpgradeControlPlane. cc: @devguyio @sjenning

@enxebre can we lgtm and approve this PR for 4.22? Thanks!
cc: @rpattath

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants