CORENET-6066: test(e2e): add e2e test for zero-worker HyperShift clusters in daemonset rollout#8176
CORENET-6066: test(e2e): add e2e test for zero-worker HyperShift clusters in daemonset rollout#8176weliang1 wants to merge 5 commits into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@weliang1: This pull request references CORENET-6064 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughA new e2e test Sequence Diagram(s)sequenceDiagram
participant TestHarness as Test Harness
participant HostAPI as HostedCluster API
participant CPDeploy as ovnkube-control-plane Deployment
participant NodeDS as ovnkube-node DaemonSet
participant GuestAPI as Guest Kube API (hosted)
participant ClusterOp as network ClusterOperator
TestHarness->>HostAPI: Derive hosted control-plane namespace
TestHarness->>CPDeploy: Wait for Deployment Available / ReadyReplicas>0
TestHarness->>NodeDS: Check DaemonSet presence
alt DaemonSet missing
Note right of TestHarness: acceptable
else DaemonSet present
TestHarness->>NodeDS: Assert DesiredNumberScheduled, NumberAvailable, NumberUnavailable == 0
TestHarness->>NodeDS: Assert ObservedGeneration == Generation
end
alt Upgrade image provided and differs
TestHarness->>HostAPI: Patch hostedCluster.spec.release.image
TestHarness->>CPDeploy: Wait for rollout (generation, ready/updated == desired)
TestHarness->>CPDeploy: Verify container image changed
Note right of TestHarness: For supported HyperShift versions also wait for control-plane rollout and ControlPlaneVersion
end
TestHarness->>GuestAPI: Create guest kube client
loop Poll until success
GuestAPI->>ClusterOp: Get network ClusterOperator (unstructured)
ClusterOp-->>GuestAPI: Conditions (Available/Progressing/Degraded)
end
TestHarness->>CPDeploy: Final readiness check (ReadyReplicas>0)
TestHarness->>NodeDS: Final absence or zero desired pods check
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: weliang1 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@weliang1: This pull request references CORENET-6066 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@weliang1: This pull request references CORENET-6066 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
b6f7f5d to
c2fa99d
Compare
|
/jira refresh |
|
@weliang1: This pull request references CORENET-6066 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@weliang1: This pull request references CORENET-6066 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #8176 +/- ##
=======================================
Coverage 41.50% 41.50%
=======================================
Files 758 758
Lines 93689 93689
=======================================
Hits 38882 38882
Misses 52070 52070
Partials 2737 2737
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
c2fa99d to
dc2b23a
Compare
|
/test all |
|
/remove-label do-not-merge/work-in-progress |
|
@weliang1: The label(s) DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/test e2e-aws |
|
Scheduling tests matching the |
|
/test e2e-aws |
|
@weliang1: This pull request references CORENET-6066 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/e2e/ovn_control_plane_zero_workers_test.go (1)
126-131: Don't skip the entire test when only the upgrade image is missing.
Line 130turns the whole test intoSKIP, which also drops the non-upgrade coverage from Steps 1-2 and the post-upgrade-independent health checks later in the test. It would be better to gate only the upgrade-specific steps (or split them into a subtest) so zero-worker OVN validation still runs in jobs withoutLatestReleaseImage.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/ovn_control_plane_zero_workers_test.go` around lines 126 - 131, The test currently calls t.Skip() when upgradeImage (globalOpts.LatestReleaseImage) is empty or equal to baselineImage, which skips the entire test; instead, change the flow so only upgrade-specific steps are gated: check upgradeImage and if missing/equal only skip or return from the upgrade-related block (the steps that perform the upgrade and post-upgrade validation) or move those steps into a subtest (t.Run("upgrade", ...)) that is skipped, while allowing the initial zero-worker OVN validation and post-upgrade-independent health checks to always run; update references to upgradeImage, baselineImage and any t.Skip calls accordingly so the rest of the test is still executed when LatestReleaseImage is not provided.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/ovn_control_plane_zero_workers_test.go`:
- Around line 154-210: The rollout predicate for the "ovnkube-control-plane"
Deployment can return true on the pre-upgrade revision; update the Eventually
check in the goroutine that reads deployment (the block creating deployment :=
&appsv1.Deployment{} inside g.Eventually) to also verify the pod image has
changed from the recorded baselineImage before returning true: after checking
ready==desired, updated==desired and observedGeneration==generation, fetch the
first container image from deployment.Spec.Template.Spec.Containers[0].Image
and, if baselineImage is non-empty, require newImage != baselineImage (or skip
the image check only when baselineImage is empty) so Eventually only succeeds
once the Deployment rollout actually reflects the new image.
---
Nitpick comments:
In `@test/e2e/ovn_control_plane_zero_workers_test.go`:
- Around line 126-131: The test currently calls t.Skip() when upgradeImage
(globalOpts.LatestReleaseImage) is empty or equal to baselineImage, which skips
the entire test; instead, change the flow so only upgrade-specific steps are
gated: check upgradeImage and if missing/equal only skip or return from the
upgrade-related block (the steps that perform the upgrade and post-upgrade
validation) or move those steps into a subtest (t.Run("upgrade", ...)) that is
skipped, while allowing the initial zero-worker OVN validation and
post-upgrade-independent health checks to always run; update references to
upgradeImage, baselineImage and any t.Skip calls accordingly so the rest of the
test is still executed when LatestReleaseImage is not provided.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 4b2e3eab-4f92-42a7-a589-99ea89428359
📒 Files selected for processing (1)
test/e2e/ovn_control_plane_zero_workers_test.go
|
/test e2e-aws |
1 similar comment
|
/test e2e-aws |
|
/test verify-deps |
|
/cc @kyrtapz |
|
@enxebre @devguyio |
|
/test e2e-aws |
2 similar comments
|
/test e2e-aws |
|
/test e2e-aws |
Switch TestUpgradeControlPlane to use ExecuteWithoutEnsureValidation to avoid HostedCluster condition validation race after scaling workers back from zero. After the zero-worker validation completes and workers are scaled back to 2 replicas, cluster operators (image-registry, ingress) need additional time to reconcile before HostedCluster conditions reflect healthy state. Node Ready status does not guarantee operator availability. The ExecuteWithoutEnsureValidation method was created specifically for this scenario but was not being used, causing test timeouts on the EnsureHostedCluster validation step. Fixes: openshift#8176 (comment)
@enxebre Your feedback was addressed as integrating the test into TestUpgradeControlPlane. cc: @devguyio @sjenning |
|
@weliang1: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
c4ef3c9 to
63b92a7
Compare
|
/retest-failed |
|
/retest |
|
Now I have all the evidence needed. Here is the final report: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe Enterprise Contract (EC) verification failures are not caused by the code changes in this PR (which only modify test files under Root CausePR #8176's branch diverged from
Because the PR branch's
For comparison, PR #8701 (merged the same day, only 8 commits behind These failures are unrelated to the PR's code changes, which only touch Recommendations
Evidence
|
…in daemonset rollout Verifies that OVN control plane components can successfully upgrade in HyperShift clusters with zero worker nodes. This test validates: - Initial OVN deployment readiness with zero workers - OVN DaemonSet behavior (not created or reports 0 desired) - Control plane upgrade from version X to Y - OVN pod rollout during upgrade - All control plane components complete rollout - Network ClusterOperator remains healthy - No degradation or pod crashes The test addresses scenarios such as: - Data plane hibernation (workers scaled to zero for cost savings) - Autoscaling from zero (no workers until workload arrives) - Management cluster updates when worker nodes are unreachable Validated on live cluster: - Cluster: hypershift-ci-373084 - Upgrade: 4.22.0-223038 → 051707 - Workers: 0 throughout test - Duration: ~10 minutes - Result: All 8 steps passed, 0 pod restarts Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…rker test Addressed CodeRabbit review feedback: 1. Use cancellable ctx instead of testContext in WaitForGuestClient 2. Add safe type assertions with comma-ok checks for condition parsing 3. Fix confusing log output by removing negated booleans Framework fix: 4. Use NonePlatform instead of globalOpts.Platform to skip framework validation that expects worker nodes. This matches the approach used by TestHAEtcdChaos for zero-worker scenarios. The test validates OVN control plane behavior with zero workers, which is platform-agnostic. NonePlatform allows the test to focus on OVN-specific validation without requiring cloud provider resources or worker nodes. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
NonePlatform does not deploy OVN-Kubernetes components, causing the test to fail when looking for ovnkube-control-plane deployment. The test needs a real platform (AWS) that deploys OVN networking components. The framework validation correctly handles zero-worker clusters through clusterOpts.ExpectedNodeCount(), adjusting condition expectations for clusters without worker nodes. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Address CodeRabbit finding: The rollout predicate could return true on the pre-upgrade revision if the deployment was already ready with the old image. Changes: - Capture baseline generation in addition to baseline image - Verify deployment.Generation has changed from baseline - Verify container image has changed from baseline - Only return true when both generation and image have changed AND all replicas are ready/updated This ensures Eventually waits for the actual upgrade rollout to complete rather than returning immediately on the pre-upgrade state. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…tests The standard Execute() method runs EnsureHostedCluster validation in the after() phase, which incorrectly defaults hasWorkerNodes=true for private or non-public clusters. This causes ValidateHostedClusterConditions to expect worker-dependent conditions (DataPlaneConnectionAvailable, ControlPlaneConnectionAvailable, ClusterVersionAvailable) that cannot be satisfied in zero-worker cluster configurations. This commit adds ExecuteWithoutEnsureValidation() method that: - Skips the problematic after() validation (EnsureHostedCluster) - Still runs before() validation which correctly uses opts.ExpectedNodeCount() - Allows tests to provide their own comprehensive validation - Is specifically designed for non-standard cluster configurations The TestOVNControlPlaneZeroWorkers test is updated to use this new method, as it already provides comprehensive Steps 1-8 validation for OVN components in zero-worker clusters. This fixes the CI failure where the test timed out waiting for conditions that cannot be met without worker nodes. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
63b92a7 to
5c57528
Compare
@enxebre can we lgtm and approve this PR for 4.22? Thanks! |
What this PR does / why we need it:
Adds comprehensive e2e test for OVN control plane with zero workers to verify control plane upgrade capability without worker nodes.
This test validates that OVN control plane components can successfully deploy and upgrade in HyperShift clusters with zero worker nodes, addressing scenarios such as:
Test coverage:
Which issue(s) this PR fixes:
Fixes https://redhat.atlassian.net/browse/CORENET-6066
Special notes for your reviewer:
Checklist:
🤖 Generated with Claude Code
Summary by CodeRabbit