OCPBUGS-87818: Prevent spurious ovnkube-node DaemonSet rollout during install #3022

Open
mkowalski wants to merge 1 commit into openshift:master from mkowalski:fix/ocpbugs-87818-master

mkowalski (Contributor) commented Jun 8, 2026

Problem

During cluster install, the CNO reconciles multiple times in quick succession (~33s apart). The 008-script-lib.yaml ConfigMap template renders with slightly different content across passes because template variables populated from live cluster state (bootstrapResult) may not have fully settled. This causes the ovnkube-script-lib-hash annotation on the ovnkube-node DaemonSet pod template to change, triggering an unnecessary rolling update.

On nodes where the gen-1 pod init container (kubecfg-setup) is still pulling the ~1.4 GB OVN image when the rolling update begins, the pod becomes a zombie — it has a deletionTimestamp but cannot be terminated because the init container is blocked on the CRI-level image pull. The DaemonSet controller sees the zombie pod on that node and refuses to create a gen-2 replacement (because maxUnavailable: 10% rounds to 0 for small clusters). This permanently blocks CNI initialization on that node, causing cascading install failure.

Root Cause

The OVNKubeConfigHash is computed by rendering 008-script-lib.yaml as a Go template and SHA1-hashing the entire JSON-marshaled manifest (including metadata). Some template variables resolve differently between the first and second CNO reconciliation pass during bootstrap, producing different hashes even though the actual ConfigMap .data content is the same. The different hash updates the DaemonSet pod template annotation → DaemonSet generation bump → rolling update → zombie pod deadlock.

Fix

  1. Hash only the .data section of the rendered ConfigMap instead of the full manifest including metadata — eliminates hash instability from metadata differences.
  2. Compare the rendered ConfigMap .data with the deployed ConfigMap in the cluster. If they match, preserve the existing DaemonSet hash annotation rather than writing the newly computed one.

This ensures the DaemonSet annotation only changes when the ConfigMap content has actually been updated, preventing spurious rolling updates during install.
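
As a rough illustration of the first step, the sketch below hashes only a ConfigMap's .data map. It is not the operator's code: the function name `hashConfigMapData` and the `script.sh` key are hypothetical, and FNV-128a is used because the review summary on this PR names it as the hasher.

```go
package main

import (
	"encoding/hex"
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// hashConfigMapData (hypothetical name) returns a hex-encoded FNV-128a digest
// of a ConfigMap's .data only. Metadata never enters the hash, and
// json.Marshal sorts map keys, so Go's map iteration order does not matter.
func hashConfigMapData(data map[string]string) (string, error) {
	raw, err := json.Marshal(data)
	if err != nil {
		return "", fmt.Errorf("failed to marshal ConfigMap .data: %w", err)
	}
	h := fnv.New128a()
	h.Write(raw) // hash.Hash.Write never returns an error
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	// "script.sh" is a placeholder key, not the real ConfigMap content.
	data := map[string]string{"script.sh": "echo hello"}
	first, _ := hashConfigMapData(data)
	second, _ := hashConfigMapData(data)
	fmt.Println("stable across passes:", first == second)
}
```

Because the input is the marshaled .data alone, two reconcile passes that differ only in rendered metadata would produce the same digest under this scheme.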

Impact

  • 6 Component Readiness regressions in 4.22-main (41655, 41656, 41681, 41700, 41701, 41702)
  • 100% reproducible across AWS amd64/arm64, both Default and TechPreview FeatureSets
  • Sippy Triage: 576
  • Jira: OCPBUGS-87818

🤖 This PR was generated by AI on behalf of @mkowalski, who has reviewed it.

Summary by CodeRabbit

  • Bug Fixes

    • Refined config-hash logic: hashing now uses FNV-128a and is based only on ConfigMap data. During reconciliation the operator compares rendered vs deployed ConfigMap data (by key counts and per-key string equality) and preserves an existing rollout annotation when data is unchanged to avoid unnecessary DaemonSet restarts. Mismatches or missing info cause the hash to be recomputed.
  • Tests

    • Added tests verifying hash preservation when data matches and hash updates when data differs.

coderabbitai Bot commented Jun 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b2b857e9-f687-4da4-a74f-9904708324e4

📥 Commits

Reviewing files that changed from the base of the PR and between 399d700 and 4a9bfb8.

📒 Files selected for processing (2)
  • pkg/network/ovn_kubernetes.go
  • pkg/network/ovn_kubernetes_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/network/ovn_kubernetes_test.go
  • pkg/network/ovn_kubernetes.go

Walkthrough

RenderOVNKubernetes now hashes only rendered ConfigMap .data with FNV-128a, compares that .data to the deployed ovnkube-script-lib ConfigMap, and preserves or updates the ovnkube-node ovnkube-script-lib-hash annotation accordingly.

Changes

OVN Kubernetes ConfigMap hash stabilization

| Layer / File(s) | Summary |
| --- | --- |
| Import & hasher change (pkg/network/ovn_kubernetes.go) | Replace SHA1 import with hash/fnv and prepare FNV-128a usage. |
| Render and hash ConfigMap .data; compare deployed ConfigMap (pkg/network/ovn_kubernetes.go) | Render 008-script-lib.yaml ConfigMap(s), json.Marshal the rendered .data, compute FNV-128a hashes, record rendered .data for ovnkube-script-lib, fetch deployed ConfigMap and compare .data; preserve ovnkube-node annotation when data matches, otherwise use newly computed hash. Also switch node and control-plane hashers to FNV-128a. |
| Tests: helpers and hash-behavior tests (pkg/network/ovn_kubernetes_test.go) | Add getOVNKubeConfigHashFromObjs, testOVNBootstrapResult, and two tests validating that the DaemonSet annotation is preserved when deployed .data matches rendered .data and updated when it differs. |

sequenceDiagram
  participant Renderer
  participant KubeAPI
  participant DeployedConfigMap as "ConfigMap(ovnkube-script-lib)"
  participant DeployedDaemonSet as "DaemonSet(ovnkube-node)"
  Renderer->>Renderer: render ConfigMap templates -> get rendered .data
  Renderer->>Renderer: json.Marshal(rendered .data) -> compute newHash (FNV-128a)
  Renderer->>KubeAPI: GET ConfigMap ovnkube-script-lib
  KubeAPI->>DeployedConfigMap: return .data
  Renderer->>DeployedConfigMap: compare rendered .data == deployed .data
  alt match
    Renderer->>KubeAPI: GET DaemonSet ovnkube-node
    KubeAPI->>DeployedDaemonSet: return pod-template annotations
    DeployedDaemonSet->>Renderer: provide ovnkube-script-lib-hash -> use as configHash
  else differ
    Renderer->>Renderer: use newHash as configHash
  end
  Renderer->>Renderer: set data.Data["OVNKubeConfigHash"] = configHash
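
The decision in the diagram can be sketched in a few lines of Go. This is a simplified illustration under stated assumptions, not the code in this PR: `dataEqual` and `chooseConfigHash` are hypothetical names, a nil `deployed` map stands for "ConfigMap not found", and an empty `existingHash` stands for "annotation missing".

```go
package main

import "fmt"

// dataEqual compares two ConfigMap .data maps by key count and then by
// per-key string equality.
func dataEqual(rendered, deployed map[string]string) bool {
	if len(rendered) != len(deployed) {
		return false
	}
	for k, v := range rendered {
		if dv, ok := deployed[k]; !ok || dv != v {
			return false
		}
	}
	return true
}

// chooseConfigHash keeps the hash already on the DaemonSet when the deployed
// data matches the rendered data. In every other case (no deployed
// ConfigMap, no existing annotation, or differing data) it returns newHash.
func chooseConfigHash(rendered, deployed map[string]string, existingHash, newHash string) string {
	if deployed != nil && existingHash != "" && dataEqual(rendered, deployed) {
		return existingHash
	}
	return newHash
}

func main() {
	rendered := map[string]string{"script.sh": "echo hello"}
	deployed := map[string]string{"script.sh": "echo hello"}
	fmt.Println(chooseConfigHash(rendered, deployed, "old-hash", "new-hash"))

	deployed["script.sh"] = "echo changed"
	fmt.Println(chooseConfigHash(rendered, deployed, "old-hash", "new-hash"))
}
```

The first call keeps "old-hash" because the data matches; the second returns "new-hash" because one value differs.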

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly addresses the main objective: preventing spurious ovnkube-node DaemonSet rollout during install by fixing hash computation and comparison logic for the OVNKubeConfigHash.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed ovn_kubernetes_test.go contains no Ginkgo It/Describe/Context/When constructs; newly added tests use static function names and no dynamic values in titles.
Test Structure And Quality ✅ Passed pkg/network/ovn_kubernetes_test.go contains no Ginkgo (no Ginkgo/It/Describe blocks); the new hash tests use testing.T+Gomega, only fake clients, no Eventually/Consistently, and no NotTo(HaveOccurr...
Microshift Test Compatibility ✅ Passed PR adds Go unit tests only (e.g., ovn_kubernetes_test.go uses testing.T + Gomega); no *_e2e_test.go and no Ginkgo/Ginkgo APIs found in changed files, so MicroShift Ginkgo check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR #3022 changes only pkg/network/ovn_kubernetes.go and a unit test (testing.T/Gomega); no new Ginkgo e2e Describe/It blocks, so no SNO multi-node assumptions to flag.
Topology-Aware Scheduling Compatibility ✅ Passed Inspected pkg/network/ovn_kubernetes.go and ovn_kubernetes_test.go: PR changes OVNKube* hash computation/preservation (FNV-128a) only; no new scheduling constraints (affinity/anti-affinity/topology...
Ote Binary Stdout Contract ✅ Passed PR #3022 only changes pkg/network/ovn_kubernetes.go and ovn_kubernetes_test.go; no OTE binary main/init/TestMain/suite setup or stdout writes are introduced there.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR #3022 changes only pkg/network/ovn_kubernetes.go and ovn_kubernetes_test.go; diff contains no Ginkgo e2e constructs (It/Describe/Context/When) or hardcoded IPv4/external connectivity patterns.
No-Weak-Crypto ✅ Passed Reviewed pkg/network/ovn_kubernetes*.go for MD5/SHA1/DES/RC4/3DES/Blowfish/ECB; none found. Hashing uses standard hash/fnv (fnv.New128a) only.
Container-Privileges ✅ Passed In pkg/network/ovn_kubernetes.go and ovn_kubernetes_test.go, no occurrences of privileged/hostPID/hostNetwork/hostIPC/SYS_ADMIN/allowPrivilegeEscalation/securityContext; changes are limited to hash...
No-Sensitive-Data-In-Logs ✅ Passed PR diff adds only one klog line that prints the ovnkube-script-lib-hash value; no password/token/PII/hostname or other sensitive data appears in the changed hunks.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/Masterminds/semver@v1.5.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/Masterminds/sprig/v3@v3.2.3: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/containernetworking/cni@v0.8.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ghodss/yaml@v1.0.1-0.20190212211648-25d852aebe32: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/go-bindata/go-bindata@v3.1.2+incompatible: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/onsi/gomega@v1.39.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ope

... [truncated 17357 characters] ...

red in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/gengo/v2@v2.0.0-20251215205346-5ee0d033ba5b: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kms@v0.35.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kube-aggregator@v0.35.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/randfill@v1.0.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/structured-merge-diff/v6@v6.3.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n"


openshift-ci Bot requested review from kyrtapz and marty-power on June 8, 2026 at 15:31

openshift-ci Bot commented Jun 8, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mkowalski
Once this PR has been reviewed and has the lgtm label, please assign kyrtapz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/network/ovn_kubernetes.go`:
- Around line 496-523: The loop over cmPaths currently ignores the error
returned by uns.NestedMap and proceeds with an empty map, producing incorrect
rollout hashes; change the code in the block that calls uns.NestedMap (inside
the cmPaths loop after render.RenderTemplate) to capture and check the returned
error, and if non-nil return it (wrapped with context like "failed to read
ConfigMap .data for template %q") instead of treating the data as {}—apply this
to the use site of uns.NestedMap, keeping the existing json.Marshal, h.Write and
renderedCMData logic otherwise so the function fails fast on type/path errors
rather than generating a bogus hash.
- Around line 529-559: The current code swallows any error from the two
client.Default().CRClient().Get calls (for existingCM and existingDS) and treats
transient errors like NotFound, which can cause spurious rollouts; update the
Get error handling in the ovnkube-script-lib/ovnkube-node lookup path so that
you only treat apierrors.IsNotFound(err) as a safe fallback and for any other
non-nil error return the error (or requeue) to abort/retry the reconcile instead
of silently falling back to newHash; specifically modify the Get for existingCM
and the nested Get for existingDS to check apierrors.IsNotFound(err) and
propagate other errors, while preserving the configHash assignment logic when
existingHash is present.
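
Both findings come down to not swallowing errors. The sketch below shows the two patterns with the standard library only, so it stays runnable on its own. Everything in it is an assumption for illustration: `nestedMap` is a simplified stand-in for unstructured.NestedMap, and the sentinel errors `errNotFound` and `errTransient` stand in for what the real code would classify with apierrors.IsNotFound.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for API server responses.
var (
	errNotFound  = errors.New("not found")
	errTransient = errors.New("transient API error")
)

// nestedMap is a simplified stand-in for unstructured.NestedMap. It returns
// an error when the field exists but is not a map.
func nestedMap(obj map[string]interface{}, field string) (map[string]interface{}, bool, error) {
	val, ok := obj[field]
	if !ok {
		return nil, false, nil
	}
	m, ok := val.(map[string]interface{})
	if !ok {
		return nil, false, fmt.Errorf("field %q is %T, not a map", field, val)
	}
	return m, true, nil
}

// readConfigMapData fails fast on a malformed .data field instead of
// hashing an empty map.
func readConfigMapData(obj map[string]interface{}, templatePath string) (map[string]interface{}, error) {
	data, _, err := nestedMap(obj, "data")
	if err != nil {
		return nil, fmt.Errorf("failed to read ConfigMap .data for template %q: %w", templatePath, err)
	}
	return data, nil
}

// resolveHash uses newHash only when the deployed object is absent or carries
// no hash. Any other lookup error is returned so the reconcile can retry.
func resolveHash(get func() (string, error), newHash string) (string, error) {
	existing, err := get()
	switch {
	case err == nil && existing != "":
		return existing, nil
	case err == nil || errors.Is(err, errNotFound):
		return newHash, nil
	default:
		return "", fmt.Errorf("failed to look up deployed hash: %w", err)
	}
}

func main() {
	_, err := readConfigMapData(map[string]interface{}{"data": "oops"}, "008-script-lib.yaml")
	fmt.Println("malformed .data:", err)

	hash, err := resolveHash(func() (string, error) { return "", errNotFound }, "new-hash")
	fmt.Println("not found:", hash, err)

	_, err = resolveHash(func() (string, error) { return "", errTransient }, "new-hash")
	fmt.Println("transient:", err)
}
```

For brevity, `resolveHash` leaves out the .data comparison step and only illustrates the error classification.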

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d0e41bc1-9f74-4a02-97e9-e7b82f543196

📥 Commits

Reviewing files that changed from the base of the PR and between 6dc1804 and 7a87041.

📒 Files selected for processing (1)
  • pkg/network/ovn_kubernetes.go

mkowalski force-pushed the fix/ocpbugs-87818-master branch from 7a87041 to 96a26f5 on June 8, 2026 at 15:41

Copilot AI left a comment

Pull request overview

This PR updates how the OVN Kubernetes “script lib” ConfigMap is hashed and how that hash is propagated into the ovnkube-node DaemonSet rollout annotation, with the goal of preventing unnecessary DaemonSet rollouts during cluster install when template inputs may temporarily fluctuate.

Changes:

  • Hashes only the rendered ConfigMap .data (instead of the full rendered object including metadata) to avoid hash instability from metadata changes.
  • When the rendered ConfigMap .data matches the deployed ovnkube-script-lib ConfigMap .data, preserves the existing ovnkube-script-lib-hash annotation from the deployed ovnkube-node DaemonSet instead of using a newly computed hash.

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

mkowalski force-pushed the fix/ocpbugs-87818-master branch from af3f05d to a18c491 on June 8, 2026 at 16:30

coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/network/ovn_kubernetes_test.go`:
- Around line 5015-5017: The test currently ignores the error returned by
uns.NestedStringMap when extracting annotations (the call that assigns ann, _, _
:= uns.NestedStringMap(...)); capture the error (e.g., ann, found, err :=
uns.NestedStringMap(...) or ann, _, err := ...) and assert it is nil using the
test helper in this file (e.g., require.NoError(t, err) or t.Fatalf/require with
a clear message) before using
ann["network.operator.openshift.io/ovnkube-script-lib-hash"]; do the same for
the other occurrence that currently discards the error (the similar
uns.NestedStringMap call around the second path) so malformed rendered objects
cause test failures instead of being masked.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 93d50e2f-8dad-4478-acec-bd34e0a1bace

📥 Commits

Reviewing files that changed from the base of the PR and between af3f05d and a18c491.

📒 Files selected for processing (2)
  • pkg/network/ovn_kubernetes.go
  • pkg/network/ovn_kubernetes_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/network/ovn_kubernetes.go

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

mkowalski force-pushed the fix/ocpbugs-87818-master branch from a18c491 to 3287522 on June 8, 2026 at 16:59

mkowalski (Contributor, Author) commented:

@coderabbitai full review

coderabbitai Bot commented Jun 8, 2026

✅ Action performed

Full review finished.

coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/network/ovn_kubernetes.go`:
- Around line 539-543: The code uses context.TODO() in two CRClient().Get calls
inside renderOVNKubernetes (fetching "ovnkube-script-lib" ConfigMap and
"ovnkube-node" DaemonSet) — change these to use the reconcile context by adding
a context.Context parameter to network.Render → renderDefaultNetwork →
renderOVNKubernetes (propagate the existing ctx from the caller, e.g.,
UpdateOperConfig/reconcile), and replace context.TODO() with that ctx in both
CRClient().Get calls so the reconciler context is properly threaded through.
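
A minimal sketch of what threading the context looks like follows. The function names echo the ones in the comment, but the signatures are simplified assumptions, and `fetchDeployedHash` is a hypothetical stand-in for a CRClient().Get call.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fetchDeployedHash is a hypothetical stand-in for a CRClient().Get call. It
// gives up as soon as ctx is canceled or past its deadline.
func fetchDeployedHash(ctx context.Context) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", fmt.Errorf("lookup aborted: %w", err)
	}
	return "deployed-hash", nil
}

// Each layer of the render chain accepts ctx and passes it down unchanged,
// so no layer needs context.TODO().
func renderOVNKubernetes(ctx context.Context) (string, error) {
	return fetchDeployedHash(ctx)
}

func renderDefaultNetwork(ctx context.Context) (string, error) {
	return renderOVNKubernetes(ctx)
}

func render(ctx context.Context) (string, error) {
	return renderDefaultNetwork(ctx)
}

// reconcile owns the context and bounds the whole render with a timeout.
func reconcile(timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return render(ctx)
}

func main() {
	hash, err := reconcile(time.Second)
	fmt.Println(hash, err)

	// A zero timeout yields a context whose deadline has already passed.
	hash, err = reconcile(0)
	fmt.Println(hash, err)
}
```

With the context passed through, a canceled or timed-out reconcile stops the API lookups instead of letting them run detached from the caller.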

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 58f36345-c760-4dfe-ae92-caa021e22614

📥 Commits

Reviewing files that changed from the base of the PR and between 6dc1804 and 3287522.

📒 Files selected for processing (2)
  • pkg/network/ovn_kubernetes.go
  • pkg/network/ovn_kubernetes_test.go

mkowalski force-pushed the fix/ocpbugs-87818-master branch from 3287522 to 819c53e on June 8, 2026 at 17:17
… install

During install, the CNO reconciles multiple times in quick succession.
The 008-script-lib.yaml ConfigMap template may render with slightly
different content across passes (due to template variables populated
from live cluster state that has not fully settled). This causes the
ovnkube-script-lib-hash annotation on the ovnkube-node DaemonSet pod
template to change, triggering an unnecessary rolling update within
~33 seconds of initial pod creation.

On nodes where the gen-1 pod init container is still pulling the
~1.4 GB OVN image when the rolling update begins, the pod becomes a
zombie: it has a deletionTimestamp but cannot be terminated (the init
container is blocked on the CRI-level image pull). The DaemonSet
controller sees the zombie pod and refuses to create a replacement,
permanently blocking CNI initialization on that node.

Fix this by:
1. Hashing only the .data section of the rendered ConfigMap (instead
   of the full manifest including metadata) to eliminate hash
   instability from metadata differences.
2. Comparing the rendered ConfigMap .data with the deployed ConfigMap
   in the cluster. If they match, preserving the existing DaemonSet
   hash annotation rather than writing the newly computed one.

This ensures the DaemonSet annotation only changes when the ConfigMap
content has actually been updated, preventing spurious rolling updates.

Signed-off-by: Mateusz Kowalski <mko@redhat.com>
Generated-by: OpenClaw
AI-model: claude-opus-4
mkowalski force-pushed the fix/ocpbugs-87818-master branch from 819c53e to d654286 on June 8, 2026 at 18:04

mkowalski (Contributor, Author) commented:

@coderabbitai full review

mkowalski force-pushed the fix/ocpbugs-87818-master branch from 4f66adb to d654286 on June 9, 2026 at 06:59

mkowalski (Contributor, Author) commented:

/retest-required

mkowalski changed the title from "Bug 87818: Prevent spurious ovnkube-node DaemonSet rollout during install" to "OCPBUGS-87818: Prevent spurious ovnkube-node DaemonSet rollout during install" on Jun 11, 2026
openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) on Jun 11, 2026
openshift-ci-robot added the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) on Jun 11, 2026

openshift-ci-robot (Contributor) commented:

@mkowalski: This pull request references Jira Issue OCPBUGS-87818, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary by CodeRabbit

  • Bug Fixes

  • Improved configuration-hash handling to compare rendered config data with the deployed config and preserve an existing rollout annotation when data is unchanged, preventing unnecessary DaemonSet restarts and improving cluster stability.

  • Tests

  • Added tests verifying hash preservation when configuration data is identical and hash recomputation when data differs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

mkowalski force-pushed the fix/ocpbugs-87818-master branch from d654286 to 613492f on June 11, 2026 at 15:53

mkowalski (Contributor, Author) commented:

/jira refresh

openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) and removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) on Jun 11, 2026

openshift-ci-robot (Contributor) commented:

@mkowalski: This pull request references Jira Issue OCPBUGS-87818, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

mkowalski force-pushed the fix/ocpbugs-87818-master branch from 613492f to 97eca53 on June 11, 2026 at 15:58

mkowalski (Contributor, Author) commented:

@coderabbitai summary

coderabbitai Bot commented Jun 11, 2026

✅ Action performed

Summary regeneration triggered.

mkowalski (Contributor, Author) commented:

@coderabbitai full review

coderabbitai Bot commented Jun 11, 2026

✅ Action performed

Full review finished.

mkowalski (Contributor, Author) commented:

@coderabbitai resume

coderabbitai Bot commented Jun 11, 2026

✅ Action performed

Reviews resumed.

mkowalski force-pushed the fix/ocpbugs-87818-master branch from 97eca53 to 399d700 on June 11, 2026 at 17:24

mkowalski (Contributor, Author) commented:

@coderabbitai, the existing test file uses testing.T + Gomega (no Ginkgo), and my new tests follow the same pattern. The check flags the tests for using testing.T style instead of Ginkgo Describe/Context/It, but that is the established convention here: the entire ovn_kubernetes_test.go file (4500+ existing lines, 40+ test functions) uses testing.T + Gomega assertions, with zero Ginkgo Describe/Context/It blocks. The check is a false positive for this file.

coderabbitai Bot commented Jun 11, 2026

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the test file for testing patterns - look for Ginkgo vs testing.T usage
rg -c 'func Test' pkg/network/ovn_kubernetes_test.go
rg -c 'Describe\|Context\|It(' pkg/network/ovn_kubernetes_test.go || echo "No Ginkgo blocks found"
rg -n 'func Test' pkg/network/ovn_kubernetes_test.go | head -20
wc -l pkg/network/ovn_kubernetes_test.go

Length of output: 1902



@mkowalski You're absolutely right. The shell script confirms it: pkg/network/ovn_kubernetes_test.go has 39 existing func Test* functions all using testing.T + Gomega, and there are zero Ginkgo Describe/Context/It blocks anywhere in the 5179-line file. Your new tests follow the established convention exactly. I'll withdraw that comment.


✏️ Learnings added
Learnt from: mkowalski
URL: https://github.com/openshift/cluster-network-operator/pull/3022

Timestamp: 2026-06-11T17:28:27.558Z
Learning: In the `openshift/cluster-network-operator` repository, `pkg/network/ovn_kubernetes_test.go` exclusively uses the standard `testing.T` + Gomega assertions pattern (e.g., `RegisterTestingT`, `Expect`). There are no Ginkgo `Describe`/`Context`/`It` blocks in this file. New tests added to this file should follow the same `testing.T + Gomega` convention, not Ginkgo. Flagging tests in this file for not using Ginkgo is a false positive.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

mkowalski (Contributor, Author) commented:

https://github.com/coderabbitai full review

mkowalski (Contributor, Author) commented:

@coderabbitai full review

coderabbitai Bot commented Jun 11, 2026

✅ Action performed

Full review finished.

mkowalski force-pushed the fix/ocpbugs-87818-master branch from 399d700 to 4a9bfb8 on June 11, 2026 at 19:23

openshift-ci Bot commented Jun 12, 2026

@mkowalski: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-metal-ipi-ovn-dualstack-bgp | 4a9bfb8 | link | true | /test e2e-metal-ipi-ovn-dualstack-bgp |
| ci/prow/e2e-gcp-ovn | 4a9bfb8 | link | true | /test e2e-gcp-ovn |
| ci/prow/e2e-aws-ovn-upgrade-ipsec | 4a9bfb8 | link | true | /test e2e-aws-ovn-upgrade-ipsec |
| ci/prow/e2e-ovn-ipsec-step-registry | 4a9bfb8 | link | true | /test e2e-ovn-ipsec-step-registry |
| ci/prow/e2e-aws-ovn-hypershift-conformance | 4a9bfb8 | link | true | /test e2e-aws-ovn-hypershift-conformance |
| ci/prow/5.0-upgrade-from-stable-4.22-e2e-aws-ovn-upgrade | 4a9bfb8 | link | false | /test 5.0-upgrade-from-stable-4.22-e2e-aws-ovn-upgrade |
| ci/prow/e2e-aws-ovn-serial-1of2 | 4a9bfb8 | link | true | /test e2e-aws-ovn-serial-1of2 |
| ci/prow/e2e-aws-ovn-upgrade | 4a9bfb8 | link | true | /test e2e-aws-ovn-upgrade |
| ci/prow/e2e-aws-ovn-rhcos10-techpreview | 4a9bfb8 | link | false | /test e2e-aws-ovn-rhcos10-techpreview |
| ci/prow/e2e-azure-ovn-upgrade | 4a9bfb8 | link | true | /test e2e-azure-ovn-upgrade |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 4a9bfb8 | link | true | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-gcp-ovn-upgrade | 4a9bfb8 | link | true | /test e2e-gcp-ovn-upgrade |
| ci/prow/5.0-upgrade-from-stable-4.22-e2e-azure-ovn-upgrade | 4a9bfb8 | link | false | /test 5.0-upgrade-from-stable-4.22-e2e-azure-ovn-upgrade |
| ci/prow/e2e-metal-ipi-ovn-ipv6-ipsec | 4a9bfb8 | link | true | /test e2e-metal-ipi-ovn-ipv6-ipsec |
| ci/prow/e2e-aws-ovn-serial-2of2 | 4a9bfb8 | link | true | /test e2e-aws-ovn-serial-2of2 |
| ci/prow/security | 4a9bfb8 | link | false | /test security |
| ci/prow/5.0-upgrade-from-stable-4.22-e2e-gcp-ovn-upgrade | 4a9bfb8 | link | false | /test 5.0-upgrade-from-stable-4.22-e2e-gcp-ovn-upgrade |
| ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw | 4a9bfb8 | link | true | /test e2e-metal-ipi-ovn-dualstack-bgp-local-gw |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.

3 participants