OCPSTRAT-3298: Predictable NodePool rollout control#8698
Conversation
…ig changes Introduce RolloutHash and RolloutHashWithoutVersion that hash only spec-driven inputs (user MachineConfigs, release version, pull secret, trust bundle, global config), excluding management-side content like HAProxy image digests. Rollout decisions in both Replace and InPlace paths now compare rollout hashes against a new nodePoolCurrentRolloutConfig annotation instead of comparing data secret names. This prevents automated HAProxy image digest bumps from triggering full worker node replacement while ensuring new nodes still receive the latest payload when they ARE replaced for spec-driven reasons. On first reconcile after operator upgrade the annotation is seeded without triggering a rollout, and isUpdatingConfig returns false when the annotation is absent to prevent condition flip-flop. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add unit tests for RolloutHash behavior verifying that management-side changes (haproxy) do not affect the rollout hash while spec-driven changes (user config, pull secret, trust bundle, global config, version) do. Add annotation seeding tests and e2e v2 test scaffolding for predictable rollout control. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@csrwng: This pull request references OCPSTRAT-3298 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the feature to target either version "5.0." or "openshift-5.0.", but it targets "openshift-5.1" instead. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
📝 WalkthroughWalkthroughThis PR implements rollout-specific hashing to prevent management-side HAProxy and image digest changes from triggering unnecessary node replacement. A new Suggested reviewers
🚥 Pre-merge checks | ✅ 10 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: csrwng The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@hypershift-operator/controllers/nodepool/config_test.go`:
- Around line 742-794: Replace the manual annotation writes and plain equality
checks in TestRolloutHashAnnotationSeeding with calls into the real production
seeding helper (the function that computes/returns the rollout hash and a seeded
bool) and assert the boolean return; specifically, when
nodePoolAnnotationCurrentRolloutConfig is absent call the seeding helper and
expect seeded==true and the returned hash to be stored on nodePool.Annotations,
and when the annotation already exists call the helper and expect seeded==false;
also update TestGenerateMCORawConfig to consume and assert on rolloutConfigsRaw
(rather than discarding it) so changes to the generated rollout-only config
(e.g., reintroducing HAProxy) will fail tests.
In `@hypershift-operator/controllers/nodepool/config.go`:
- Around line 155-192: generateMCORawConfig currently only strips inline
cg.haproxyRawConfig via parseWithoutHaproxy, but when cg.haproxyRawConfig == ""
getCoreConfigs still returns the legacy HAProxy core ConfigMap which then ends
up in rolloutConfigsRaw; change generateMCORawConfig so that before calling
parseWithoutHaproxy it builds a filtered configs slice that excludes the legacy
HAProxy core ConfigMap (the same ConfigMap returned by getCoreConfigs for
HAProxy — detect it by the same unique identifier used in getCoreConfigs, e.g.,
its ConfigMap name/label or by comparing its data to cg.haproxyRawConfig), then
pass that filtered slice to parseWithoutHaproxy while still passing the full
configs slice to parse so fullConfig is unchanged (use functions/fields:
generateMCORawConfig, getCoreConfigs, getUserConfigs, getNTOGeneratedConfig,
cg.haproxyRawConfig, parse, parseWithoutHaproxy).
In `@hypershift-operator/controllers/nodepool/nodepool_controller.go`:
- Around line 402-409: The current seeding writes
nodePoolAnnotationCurrentRolloutConfig using token.RolloutHashWithoutVersion()
unconditionally, which can suppress user-driven rollouts; change the logic in
the reconcile path that sets nodePoolAnnotationCurrentRolloutConfig so it
prefers the last-applied/legacy value (check any legacy annotation like the
prior "lastApplied" key or bootstrap state) and only auto-seed from
token.RolloutHashWithoutVersion() when you can prove no rollout is pending
(e.g., verify isUpdatingConfig() is false or compare current vs target rollout
hashes and ensure they differ/are stable). Update the code that currently
checks/creates nodePool.Annotations and assigns
nodePoolAnnotationCurrentRolloutConfig to first try restoring from
legacy/last-applied state, and fallback to auto-seed from
token.RolloutHashWithoutVersion() only when no pending rollout is detected.
In `@test/e2e/v2/tests/nodepool_rollout_control_test.go`:
- Around line 67-174: The test ManagementImageChangeNoRolloutTest is mutating
the shared default NodePool via getDefaultNodePool; instead create an isolated
NodePool using buildTestNodePool (mirroring SpecDrivenChangeTriggersRolloutTest)
and register cleanup with DeferCleanup to call cleanupNodePool(...) so the test
operates on its own NodePool; update the teardown logic to only remove the
HAProxy annotation on the test NodePool (no IsNotFound branch) and use the test
NodePool object in all subsequent references instead of
defaultNP/getDefaultNodePool.
- Around line 287-396: The test OperatorUpgradeNoRolloutTest mutates the shared
default NodePool via getDefaultNodePool; change it to create and use a dedicated
test NodePool like SpecDrivenChangeTriggersRolloutTest does by calling
buildTestNodePool (or the same factory used there), wait for the NodePool to
become ready (so annotations are seeded), then remove the rollout annotation on
that test NodePool to simulate pre-upgrade state, and register cleanup with
DeferCleanup to call cleanupNodePool (or the same cleanup helper) to delete the
test NodePool and restore state; ensure all references to np and baseline node
lists use the newly created NodePool instead of getDefaultNodePool.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: ad3269e8-032f-403d-8aef-f75206c60cb9
📒 Files selected for processing (7)
hypershift-operator/controllers/nodepool/capi.gohypershift-operator/controllers/nodepool/conditions.gohypershift-operator/controllers/nodepool/config.gohypershift-operator/controllers/nodepool/config_test.gohypershift-operator/controllers/nodepool/nodepool_controller.gotest/e2e/v2/tests/nodepool_lifecycle_test.gotest/e2e/v2/tests/nodepool_rollout_control_test.go
| func TestRolloutHashAnnotationSeeding(t *testing.T) { | ||
| t.Run("When the nodePoolCurrentRolloutConfig annotation is absent it should not signal isUpdatingConfig", func(t *testing.T) { | ||
| g := NewWithT(t) | ||
|
|
||
| rolloutHash := "abc123" | ||
| nodePool := &hyperv1.NodePool{ | ||
| ObjectMeta: metav1.ObjectMeta{ | ||
| Annotations: map[string]string{}, | ||
| }, | ||
| } | ||
|
|
||
| _, seeded := nodePool.Annotations[nodePoolAnnotationCurrentRolloutConfig] | ||
| g.Expect(seeded).To(BeFalse(), "annotation should be absent before seeding") | ||
|
|
||
| nodePool.Annotations[nodePoolAnnotationCurrentRolloutConfig] = rolloutHash | ||
|
|
||
| g.Expect(nodePool.Annotations[nodePoolAnnotationCurrentRolloutConfig]).To(Equal(rolloutHash), | ||
| "annotation should be set to the computed rollout hash") | ||
| }) | ||
|
|
||
| t.Run("When the nodePoolCurrentRolloutConfig annotation matches the current hash it should not trigger a rollout", func(t *testing.T) { | ||
| g := NewWithT(t) | ||
|
|
||
| rolloutHash := "abc123" | ||
| nodePool := &hyperv1.NodePool{ | ||
| ObjectMeta: metav1.ObjectMeta{ | ||
| Annotations: map[string]string{ | ||
| nodePoolAnnotationCurrentRolloutConfig: rolloutHash, | ||
| }, | ||
| }, | ||
| } | ||
|
|
||
| isUpdating := rolloutHash != nodePool.Annotations[nodePoolAnnotationCurrentRolloutConfig] | ||
| g.Expect(isUpdating).To(BeFalse(), "no rollout should be triggered when hashes match") | ||
| }) | ||
|
|
||
| t.Run("When the nodePoolCurrentRolloutConfig annotation differs from the current hash it should trigger a rollout", func(t *testing.T) { | ||
| g := NewWithT(t) | ||
|
|
||
| oldHash := "abc123" | ||
| newHash := "def456" | ||
| nodePool := &hyperv1.NodePool{ | ||
| ObjectMeta: metav1.ObjectMeta{ | ||
| Annotations: map[string]string{ | ||
| nodePoolAnnotationCurrentRolloutConfig: oldHash, | ||
| }, | ||
| }, | ||
| } | ||
|
|
||
| isUpdating := newHash != nodePool.Annotations[nodePoolAnnotationCurrentRolloutConfig] | ||
| g.Expect(isUpdating).To(BeTrue(), "rollout should be triggered when hashes differ") | ||
| }) | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
These tests don't exercise the production seeding/rollout path yet.
TestRolloutHashAnnotationSeeding just writes the annotation map and reimplements newHash != oldHash, and TestGenerateMCORawConfig discards rolloutConfigsRaw. A regression that stops seeding, switches back to the wrong hash source, or reintroduces HAProxy into the rollout-only config will still pass here. Please assert the second return value and drive the real helper/controller path for the missing-annotation case.
As per coding guidelines, **/*_test.go: Unit test any code changes and additions.
Also applies to: 1608-1619
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@hypershift-operator/controllers/nodepool/config_test.go` around lines 742 -
794, Replace the manual annotation writes and plain equality checks in
TestRolloutHashAnnotationSeeding with calls into the real production seeding
helper (the function that computes/returns the rollout hash and a seeded bool)
and assert the boolean return; specifically, when
nodePoolAnnotationCurrentRolloutConfig is absent call the seeding helper and
expect seeded==true and the returned hash to be stored on nodePool.Annotations,
and when the annotation already exists call the helper and expect seeded==false;
also update TestGenerateMCORawConfig to consume and assert on rolloutConfigsRaw
(rather than discarding it) so changes to the generated rollout-only config
(e.g., reintroducing HAProxy) will fail tests.
Source: Coding guidelines
| // generateMCORawConfig generates a mco consumable artifact of the mco Config. | ||
| func (cg *ConfigGenerator) generateMCORawConfig(ctx context.Context, caps *hyperv1.Capabilities) (configsRaw string, err error) { | ||
| // It returns two strings: the full config (including haproxy) and the rollout config (excluding haproxy). | ||
| func (cg *ConfigGenerator) generateMCORawConfig(ctx context.Context, caps *hyperv1.Capabilities) (configsRaw, rolloutConfigsRaw string, err error) { | ||
| var configs []corev1.ConfigMap | ||
|
|
||
| // Look for core ignition configs in the control plane namespace. | ||
| coreConfigs, err := cg.getCoreConfigs(ctx) | ||
| if err != nil { | ||
| return "", err | ||
| return "", "", err | ||
| } | ||
| configs = append(configs, coreConfigs...) | ||
|
|
||
| userConfig, err := cg.getUserConfigs(ctx) | ||
| if err != nil { | ||
| return "", err | ||
| return "", "", err | ||
| } | ||
| configs = append(configs, userConfig...) | ||
|
|
||
| if capabilities.IsNodeTuningCapabilityEnabled(caps) { | ||
| // Look for NTO generated MachineConfigs from the hosted control plane namespace | ||
| nodeTuningGeneratedConfigs, err := getNTOGeneratedConfig(ctx, cg) | ||
| if err != nil { | ||
| return "", err | ||
| return "", "", err | ||
| } | ||
| configs = append(configs, nodeTuningGeneratedConfigs...) | ||
| } | ||
|
|
||
| return cg.parse(configs) | ||
| fullConfig, err := cg.parse(configs) | ||
| if err != nil { | ||
| return "", "", err | ||
| } | ||
|
|
||
| rolloutConfig, err := cg.parseWithoutHaproxy(configs) | ||
| if err != nil { | ||
| return "", "", err | ||
| } | ||
|
|
||
| return fullConfig, rolloutConfig, nil |
There was a problem hiding this comment.
Exclude legacy HAProxy core configs from rolloutConfigsRaw too.
parseWithoutHaproxy() only drops the inline cg.haproxyRawConfig. When haproxyRawConfig == "", getCoreConfigs() still expects the legacy HAProxy core ConfigMap, and that manifest is still parsed into configs, so RolloutHash* will continue to move on old-CPO / upgrade-skew clusters. That means management-side HAProxy changes can still trigger worker rollouts in exactly the path this PR is trying to stabilize.
Based on PR objectives, rollout hashing should stay insensitive to HAProxy-only management changes during upgrade skew as well.
Also applies to: 262-266
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@hypershift-operator/controllers/nodepool/config.go` around lines 155 - 192,
generateMCORawConfig currently only strips inline cg.haproxyRawConfig via
parseWithoutHaproxy, but when cg.haproxyRawConfig == "" getCoreConfigs still
returns the legacy HAProxy core ConfigMap which then ends up in
rolloutConfigsRaw; change generateMCORawConfig so that before calling
parseWithoutHaproxy it builds a filtered configs slice that excludes the legacy
HAProxy core ConfigMap (the same ConfigMap returned by getCoreConfigs for
HAProxy — detect it by the same unique identifier used in getCoreConfigs, e.g.,
its ConfigMap name/label or by comparing its data to cg.haproxyRawConfig), then
pass that filtered slice to parseWithoutHaproxy while still passing the full
configs slice to parse so fullConfig is unchanged (use functions/fields:
generateMCORawConfig, getCoreConfigs, getUserConfigs, getNTOGeneratedConfig,
cg.haproxyRawConfig, parse, parseWithoutHaproxy).
| // Seed the rollout config annotation on first reconcile after operator upgrade. | ||
| // This must happen before any rollout decision to prevent spurious rollouts. | ||
| if _, ok := nodePool.Annotations[nodePoolAnnotationCurrentRolloutConfig]; !ok { | ||
| if nodePool.Annotations == nil { | ||
| nodePool.Annotations = make(map[string]string) | ||
| } | ||
| nodePool.Annotations[nodePoolAnnotationCurrentRolloutConfig] = token.RolloutHashWithoutVersion() | ||
| } |
There was a problem hiding this comment.
Don't seed nodePoolCurrentRolloutConfig from the target hash unconditionally.
On the first reconcile after upgrade, this writes the new rollout hash before you recover what the workers are actually running. If a user-driven config change is pending during that window, isUpdatingConfig() and the CAPI rollout gates will see current == target and can skip the requested rollout entirely. Seed from the last applied state (legacy annotations / current bootstrap state) or only auto-seed when you can prove no rollout is pending.
Based on PR objectives, the upgrade seeding path must avoid suppressing user-driven rollouts while controller migration is in progress.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@hypershift-operator/controllers/nodepool/nodepool_controller.go` around lines
402 - 409, The current seeding writes nodePoolAnnotationCurrentRolloutConfig
using token.RolloutHashWithoutVersion() unconditionally, which can suppress
user-driven rollouts; change the logic in the reconcile path that sets
nodePoolAnnotationCurrentRolloutConfig so it prefers the last-applied/legacy
value (check any legacy annotation like the prior "lastApplied" key or bootstrap
state) and only auto-seed from token.RolloutHashWithoutVersion() when you can
prove no rollout is pending (e.g., verify isUpdatingConfig() is false or compare
current vs target rollout hashes and ensure they differ/are stable). Update the
code that currently checks/creates nodePool.Annotations and assigns
nodePoolAnnotationCurrentRolloutConfig to first try restoring from
legacy/last-applied state, and fallback to auto-seed from
token.RolloutHashWithoutVersion() only when no pending rollout is detected.
| func ManagementImageChangeNoRolloutTest(getTestCtx internal.TestContextGetter) { | ||
| When("a management-side image changes on an existing NodePool", func() { | ||
| It("should NOT trigger a rollout when only the HAProxy image annotation changes", func() { | ||
| testCtx := getTestCtx() | ||
| testCtx.ValidateHostedClusterClient() | ||
|
|
||
| hc := testCtx.GetHostedCluster() | ||
| hcClient := testCtx.GetHostedClusterClient() | ||
| ctx := testCtx.Context | ||
|
|
||
| defaultNP := getDefaultNodePool(ctx, testCtx.MgmtClient, hc) | ||
| Expect(defaultNP).NotTo(BeNil(), "default NodePool should exist") | ||
|
|
||
| np := &hyperv1.NodePool{} | ||
| Expect(testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(defaultNP), np)).To(Succeed()) | ||
|
|
||
| Expect(np.Annotations).To(HaveKey(nodePoolCurrentRolloutConfigAnnotation), | ||
| "NodePool %s should have the rollout config annotation set by the controller", np.Name) | ||
| baselineRolloutHash := np.Annotations[nodePoolCurrentRolloutConfigAnnotation] | ||
| Expect(baselineRolloutHash).NotTo(BeEmpty()) | ||
|
|
||
| baselineConfigHash := np.Annotations[nodePoolCurrentConfigAnnotation] | ||
|
|
||
| foundUpdatingConfig := false | ||
| for _, cond := range np.Status.Conditions { | ||
| if cond.Type == hyperv1.NodePoolUpdatingConfigConditionType { | ||
| foundUpdatingConfig = true | ||
| Expect(string(cond.Status)).To(Equal(string(metav1.ConditionFalse)), | ||
| "UpdatingConfig should be False before test mutation") | ||
| break | ||
| } | ||
| } | ||
| Expect(foundUpdatingConfig).To(BeTrue(), | ||
| "NodePool %s should have UpdatingConfig condition", np.Name) | ||
|
|
||
| baselineNodes := &corev1.NodeList{} | ||
| Expect(hcClient.List(ctx, baselineNodes, crclient.MatchingLabels{ | ||
| hyperv1.NodePoolLabel: np.Name, | ||
| })).To(Succeed()) | ||
| Expect(baselineNodes.Items).NotTo(BeEmpty(), | ||
| "NodePool %s should have at least one ready node", np.Name) | ||
| baselineNodeNames := nodeNames(baselineNodes) | ||
|
|
||
| By("patching the HAProxy image annotation to simulate a management-side image change") | ||
| original := np.DeepCopy() | ||
| if np.Annotations == nil { | ||
| np.Annotations = map[string]string{} | ||
| } | ||
| np.Annotations[hyperv1.NodePoolHAProxyImageAnnotation] = "quay.io/openshift/origin-haproxy-router:e2e-dummy-digest" | ||
| Expect(testCtx.MgmtClient.Patch(ctx, np, crclient.MergeFrom(original))).To(Succeed(), | ||
| "failed to patch NodePool %s with HAProxy image annotation", np.Name) | ||
| GinkgoWriter.Printf("Patched NodePool %s with dummy HAProxy image annotation\n", np.Name) | ||
|
|
||
| DeferCleanup(func() { | ||
| fresh := &hyperv1.NodePool{} | ||
| err := testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(np), fresh) | ||
| if apierrors.IsNotFound(err) { | ||
| return | ||
| } | ||
| Expect(err).NotTo(HaveOccurred(), "cleanup: failed to get NodePool %s", np.Name) | ||
| patch := fresh.DeepCopy() | ||
| delete(patch.Annotations, hyperv1.NodePoolHAProxyImageAnnotation) | ||
| Expect(testCtx.MgmtClient.Patch(ctx, patch, crclient.MergeFrom(fresh))).To(Succeed(), | ||
| "cleanup: failed to remove HAProxy image annotation from NodePool %s", np.Name) | ||
| }) | ||
|
|
||
| By("verifying that no rollout is triggered over 2 minutes") | ||
| Consistently(func(g Gomega) { | ||
| fresh := &hyperv1.NodePool{} | ||
| g.Expect(testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(np), fresh)).To(Succeed()) | ||
|
|
||
| g.Expect(fresh.Annotations).To(HaveKeyWithValue( | ||
| nodePoolCurrentRolloutConfigAnnotation, baselineRolloutHash), | ||
| "rollout config annotation should not change for management-side image changes") | ||
|
|
||
| for _, cond := range fresh.Status.Conditions { | ||
| if cond.Type == hyperv1.NodePoolUpdatingConfigConditionType { | ||
| g.Expect(string(cond.Status)).To(Equal(string(metav1.ConditionFalse)), | ||
| "UpdatingConfig should remain False — management-side image change must not trigger rollout") | ||
| break | ||
| } | ||
| } | ||
|
|
||
| currentNodes := &corev1.NodeList{} | ||
| g.Expect(hcClient.List(ctx, currentNodes, crclient.MatchingLabels{ | ||
| hyperv1.NodePoolLabel: np.Name, | ||
| })).To(Succeed()) | ||
| g.Expect(len(currentNodes.Items)).To(Equal(len(baselineNodes.Items)), | ||
| "node count should not change") | ||
| }, 2*time.Minute, 15*time.Second).Should(Succeed()) | ||
|
|
||
| By("verifying node identity is unchanged (no replacement)") | ||
| finalNodes := &corev1.NodeList{} | ||
| Expect(hcClient.List(ctx, finalNodes, crclient.MatchingLabels{ | ||
| hyperv1.NodePoolLabel: np.Name, | ||
| })).To(Succeed()) | ||
| Expect(nodeNames(finalNodes)).To(ConsistOf(baselineNodeNames), | ||
| "node names should be identical — no node replacement should have occurred") | ||
|
|
||
| By("verifying the payload config hash may have changed but rollout hash did not") | ||
| finalNP := &hyperv1.NodePool{} | ||
| Expect(testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(np), finalNP)).To(Succeed()) | ||
| Expect(finalNP.Annotations[nodePoolCurrentRolloutConfigAnnotation]).To(Equal(baselineRolloutHash), | ||
| "rollout hash annotation must remain stable") | ||
| _ = baselineConfigHash | ||
| }) | ||
| }) | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift
Create a dedicated test NodePool instead of mutating the default NodePool.
This test retrieves and mutates the default NodePool (lines 77, 116) rather than creating a dedicated test NodePool. While lifecycle tests are allowed to mutate cluster state, operating on the shared default NodePool risks interference with other tests running in parallel.
Follow the pattern established by SpecDrivenChangeTriggersRolloutTest (lines 193-210): create a dedicated NodePool using buildTestNodePool and register cleanup with DeferCleanup(func() { cleanupNodePool(...) }).
♻️ Suggested refactor to create dedicated NodePool
Replace lines 77-119 with:
- defaultNP := getDefaultNodePool(ctx, testCtx.MgmtClient, hc)
- Expect(defaultNP).NotTo(BeNil(), "default NodePool should exist")
-
- np := &hyperv1.NodePool{}
- Expect(testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(defaultNP), np)).To(Succeed())
+ defaultNP := getDefaultNodePool(ctx, testCtx.MgmtClient, hc)
+ Expect(defaultNP).NotTo(BeNil(), "default NodePool should exist")
+
+ var oneReplica int32 = 1
+ np := buildTestNodePool(defaultNP, "mgmt-no-rollout", func(pool *hyperv1.NodePool) {
+ pool.Spec.Replicas = &oneReplica
+ pool.Spec.Management.Replace = &hyperv1.ReplaceUpgrade{
+ Strategy: hyperv1.UpgradeStrategyRollingUpdate,
+ RollingUpdate: &hyperv1.RollingUpdate{
+ MaxUnavailable: ptr.To(intstr.FromInt32(0)),
+ MaxSurge: ptr.To(intstr.FromInt32(oneReplica)),
+ },
+ }
+ })
+
+ err := testCtx.MgmtClient.Create(ctx, np)
+ Expect(err).NotTo(HaveOccurred(), "failed to create NodePool %s", np.Name)
+ GinkgoWriter.Printf("Created NodePool %s\n", np.Name)
+ DeferCleanup(func() {
+ cleanupNodePool(ctx, testCtx.MgmtClient, np)
+ })
+
+ e2eutil.WaitForReadyNodesByNodePool(GinkgoTB(), ctx, hcClient, np, hc.Spec.Platform.Type)
+
+ Expect(testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(np), np)).To(Succeed())Then update the cleanup at lines 120-131 to simply remove the HAProxy annotation without checking IsNotFound:
DeferCleanup(func() {
- fresh := &hyperv1.NodePool{}
- err := testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(np), fresh)
- if apierrors.IsNotFound(err) {
- return
- }
- Expect(err).NotTo(HaveOccurred(), "cleanup: failed to get NodePool %s", np.Name)
- patch := fresh.DeepCopy()
- delete(patch.Annotations, hyperv1.NodePoolHAProxyImageAnnotation)
- Expect(testCtx.MgmtClient.Patch(ctx, patch, crclient.MergeFrom(fresh))).To(Succeed(),
- "cleanup: failed to remove HAProxy image annotation from NodePool %s", np.Name)
+ // NodePool cleanup happens via cleanupNodePool registered above
})🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/e2e/v2/tests/nodepool_rollout_control_test.go` around lines 67 - 174,
The test ManagementImageChangeNoRolloutTest is mutating the shared default
NodePool via getDefaultNodePool; instead create an isolated NodePool using
buildTestNodePool (mirroring SpecDrivenChangeTriggersRolloutTest) and register
cleanup with DeferCleanup to call cleanupNodePool(...) so the test operates on
its own NodePool; update the teardown logic to only remove the HAProxy
annotation on the test NodePool (no IsNotFound branch) and use the test NodePool
object in all subsequent references instead of defaultNP/getDefaultNodePool.
| func OperatorUpgradeNoRolloutTest(getTestCtx internal.TestContextGetter) { | ||
| When("the HyperShift operator is upgraded to support rollout hash", func() { | ||
| It("should seed the rollout hash annotation on first reconcile without triggering a rollout", func() { | ||
| testCtx := getTestCtx() | ||
| testCtx.ValidateHostedClusterClient() | ||
|
|
||
| hc := testCtx.GetHostedCluster() | ||
| hcClient := testCtx.GetHostedClusterClient() | ||
| ctx := testCtx.Context | ||
|
|
||
| defaultNP := getDefaultNodePool(ctx, testCtx.MgmtClient, hc) | ||
| Expect(defaultNP).NotTo(BeNil(), "default NodePool should exist") | ||
|
|
||
| np := &hyperv1.NodePool{} | ||
| Expect(testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(defaultNP), np)).To(Succeed()) | ||
|
|
||
| Expect(np.Annotations).To(HaveKey(nodePoolCurrentRolloutConfigAnnotation), | ||
| "NodePool %s should have the rollout config annotation before test", np.Name) | ||
|
|
||
| baselineConfigHash := np.Annotations[nodePoolCurrentConfigAnnotation] | ||
|
|
||
| foundUpdatingConfig := false | ||
| for _, cond := range np.Status.Conditions { | ||
| if cond.Type == hyperv1.NodePoolUpdatingConfigConditionType { | ||
| foundUpdatingConfig = true | ||
| Expect(string(cond.Status)).To(Equal(string(metav1.ConditionFalse)), | ||
| "UpdatingConfig should be False before simulating upgrade") | ||
| break | ||
| } | ||
| } | ||
| Expect(foundUpdatingConfig).To(BeTrue(), | ||
| "NodePool %s should have UpdatingConfig condition", np.Name) | ||
|
|
||
| baselineNodes := &corev1.NodeList{} | ||
| Expect(hcClient.List(ctx, baselineNodes, crclient.MatchingLabels{ | ||
| hyperv1.NodePoolLabel: np.Name, | ||
| })).To(Succeed()) | ||
| Expect(baselineNodes.Items).NotTo(BeEmpty(), | ||
| "NodePool %s should have at least one ready node", np.Name) | ||
| baselineNodeNames := nodeNames(baselineNodes) | ||
|
|
||
| By("removing the rollout config annotation to simulate pre-upgrade state") | ||
| original := np.DeepCopy() | ||
| delete(np.Annotations, nodePoolCurrentRolloutConfigAnnotation) | ||
| Expect(testCtx.MgmtClient.Patch(ctx, np, crclient.MergeFrom(original))).To(Succeed(), | ||
| "failed to remove rollout config annotation from NodePool %s", np.Name) | ||
| GinkgoWriter.Printf("Removed rollout config annotation from NodePool %s to simulate pre-upgrade state\n", np.Name) | ||
|
|
||
| By("forcing a reconciliation via a dummy annotation") | ||
| Expect(testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(np), np)).To(Succeed()) | ||
| original = np.DeepCopy() | ||
| np.Annotations["hypershift.openshift.io/e2e-reconcile-trigger"] = fmt.Sprintf("%d", time.Now().Unix()) | ||
| Expect(testCtx.MgmtClient.Patch(ctx, np, crclient.MergeFrom(original))).To(Succeed(), | ||
| "failed to add reconcile trigger annotation to NodePool %s", np.Name) | ||
| DeferCleanup(func() { | ||
| fresh := &hyperv1.NodePool{} | ||
| err := testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(np), fresh) | ||
| if apierrors.IsNotFound(err) { | ||
| return | ||
| } | ||
| Expect(err).NotTo(HaveOccurred()) | ||
| patch := fresh.DeepCopy() | ||
| delete(patch.Annotations, "hypershift.openshift.io/e2e-reconcile-trigger") | ||
| Expect(testCtx.MgmtClient.Patch(ctx, patch, crclient.MergeFrom(fresh))).To(Succeed(), | ||
| "cleanup: failed to remove reconcile trigger annotation") | ||
| }) | ||
|
|
||
| By("waiting for the controller to re-seed the rollout config annotation") | ||
| Eventually(func(g Gomega) { | ||
| fresh := &hyperv1.NodePool{} | ||
| g.Expect(testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(np), fresh)).To(Succeed()) | ||
| g.Expect(fresh.Annotations).To(HaveKey(nodePoolCurrentRolloutConfigAnnotation), | ||
| "controller should re-seed the rollout config annotation") | ||
| g.Expect(fresh.Annotations[nodePoolCurrentRolloutConfigAnnotation]).NotTo(BeEmpty()) | ||
| }, 1*time.Minute, 10*time.Second).Should(Succeed()) | ||
|
|
||
| By("verifying that no rollout was triggered during re-seeding") | ||
| Consistently(func(g Gomega) { | ||
| fresh := &hyperv1.NodePool{} | ||
| g.Expect(testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(np), fresh)).To(Succeed()) | ||
|
|
||
| for _, cond := range fresh.Status.Conditions { | ||
| if cond.Type == hyperv1.NodePoolUpdatingConfigConditionType { | ||
| g.Expect(string(cond.Status)).To(Equal(string(metav1.ConditionFalse)), | ||
| "UpdatingConfig should remain False — annotation seeding must not trigger rollout") | ||
| break | ||
| } | ||
| } | ||
|
|
||
| g.Expect(fresh.Annotations[nodePoolCurrentConfigAnnotation]).To(Equal(baselineConfigHash), | ||
| "current config annotation should not change during seeding") | ||
|
|
||
| currentNodes := &corev1.NodeList{} | ||
| g.Expect(hcClient.List(ctx, currentNodes, crclient.MatchingLabels{ | ||
| hyperv1.NodePoolLabel: np.Name, | ||
| })).To(Succeed()) | ||
| g.Expect(len(currentNodes.Items)).To(Equal(len(baselineNodes.Items)), | ||
| "node count should not change during annotation seeding") | ||
| }, 2*time.Minute, 15*time.Second).Should(Succeed()) | ||
|
|
||
| By("verifying node identity is unchanged (no replacement)") | ||
| finalNodes := &corev1.NodeList{} | ||
| Expect(hcClient.List(ctx, finalNodes, crclient.MatchingLabels{ | ||
| hyperv1.NodePoolLabel: np.Name, | ||
| })).To(Succeed()) | ||
| Expect(nodeNames(finalNodes)).To(ConsistOf(baselineNodeNames), | ||
| "node names should be identical — no node replacement should have occurred") | ||
| }) | ||
| }) | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift
Create a dedicated test NodePool instead of mutating the default NodePool.
This test retrieves and mutates the default NodePool (lines 297, 330) by removing the rollout config annotation to simulate a pre-upgrade state. While lifecycle tests are allowed to mutate cluster state, operating on the shared default NodePool risks interference with other tests.
Follow the pattern established by SpecDrivenChangeTriggersRolloutTest (lines 193-210): create a dedicated NodePool using buildTestNodePool, wait for it to become ready (which will seed the annotation), then remove the annotation to test the seeding behavior, and register cleanup with DeferCleanup(func() { cleanupNodePool(...) }).
♻️ Suggested refactor to create dedicated NodePool
Replace lines 297-333 with:
defaultNP := getDefaultNodePool(ctx, testCtx.MgmtClient, hc)
Expect(defaultNP).NotTo(BeNil(), "default NodePool should exist")
- np := &hyperv1.NodePool{}
- Expect(testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(defaultNP), np)).To(Succeed())
+ var oneReplica int32 = 1
+ np := buildTestNodePool(defaultNP, "upgrade-seed", func(pool *hyperv1.NodePool) {
+ pool.Spec.Replicas = &oneReplica
+ })
+
+ err := testCtx.MgmtClient.Create(ctx, np)
+ Expect(err).NotTo(HaveOccurred(), "failed to create NodePool %s", np.Name)
+ GinkgoWriter.Printf("Created NodePool %s\n", np.Name)
+ DeferCleanup(func() {
+ cleanupNodePool(ctx, testCtx.MgmtClient, np)
+ })
+
+ e2eutil.WaitForReadyNodesByNodePool(GinkgoTB(), ctx, hcClient, np, hc.Spec.Platform.Type)
+
+ Expect(testCtx.MgmtClient.Get(ctx, crclient.ObjectKeyFromObject(np), np)).To(Succeed())🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/e2e/v2/tests/nodepool_rollout_control_test.go` around lines 287 - 396,
The test OperatorUpgradeNoRolloutTest mutates the shared default NodePool via
getDefaultNodePool; change it to create and use a dedicated test NodePool like
SpecDrivenChangeTriggersRolloutTest does by calling buildTestNodePool (or the
same factory used there), wait for the NodePool to become ready (so annotations
are seeded), then remove the rollout annotation on that test NodePool to
simulate pre-upgrade state, and register cleanup with DeferCleanup to call
cleanupNodePool (or the same cleanup helper) to delete the test NodePool and
restore state; ensure all references to np and baseline node lists use the newly
created NodePool instead of getDefaultNodePool.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #8698 +/- ##
==========================================
- Coverage 41.49% 41.46% -0.03%
==========================================
Files 756 756
Lines 93648 93609 -39
==========================================
- Hits 38855 38817 -38
+ Misses 52057 52056 -1
Partials 2736 2736
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryTwo unit tests fail because the PR changed the Root CauseThe PR introduces a new "rollout-only" config hash to prevent management-side image changes (e.g., HAProxy digest bumps) from triggering unnecessary node rollouts. As part of this change:
The fix is straightforward: update the test fixtures in both files to set Recommendations
Evidence
|
What this PR does / why we need it:
Decouples NodePool rollout triggering from management-side configuration changes (e.g. HAProxy image digest bumps) so that only user-driven spec changes cause worker node replacement.
Today the NodePool controller uses a single hash over the entire rendered ignition config — including management-side image references — to drive rollout decisions. Any change to this hash triggers a full Replace/InPlace rollout. This means automated HAProxy image updates (which happen on every HyperShift operator upgrade) cause unnecessary node churn.
This PR introduces:
RolloutHash()/RolloutHashWithoutVersion(): New hash methods that include only spec-driven inputs (user MachineConfigs, release version, pull secret, trust bundle, global config), excluding management-side content like HAProxynodePoolCurrentRolloutConfigannotation: Tracks the current rollout config hash, used for rollout decisions instead of comparing data secret namesisUpdatingConfig()safety: Returns false when the annotation is absent to prevent condition flip-flop during upgradeThe existing
Hash()and payload secret naming are unchanged — new nodes always get the latest payload including management-side content when they ARE replaced for spec-driven reasons.Which issue(s) this PR fixes:
Fixes OCPSTRAT-3298
Special notes for your reviewer:
RolloutHashWithoutVersion()intentionally includesglobalConfigunlike the existingHashWithoutVersion()which omits it. Proxy/image config changes should trigger rollouts.parse()→doParse()refactor is necessary because haproxy content is prepended insideparse()and cannot be stripped from the final string after the fact.secret_janitor_test.gobuild error is pre-existing and unrelated to this PR.Checklist:
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes