OCPBUGS-87818: Prevent spurious ovnkube-node DaemonSet rollout during install#3022
OCPBUGS-87818: Prevent spurious ovnkube-node DaemonSet rollout during install#3022mkowalski wants to merge 1 commit into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
WalkthroughRenderOVNKubernetes now hashes only rendered ConfigMap ChangesOVN Kubernetes ConfigMap hash stabilization
sequenceDiagram
participant Renderer
participant KubeAPI
participant DeployedConfigMap as "ConfigMap(ovnkube-script-lib)"
participant DeployedDaemonSet as "DaemonSet(ovnkube-node)"
Renderer->>Renderer: render ConfigMap templates -> get rendered .data
Renderer->>Renderer: json.Marshal(rendered .data) -> compute newHash (FNV-128a)
Renderer->>KubeAPI: GET ConfigMap ovnkube-script-lib
KubeAPI->>DeployedConfigMap: return .data
Renderer->>DeployedConfigMap: compare rendered .data == deployed .data
alt match
Renderer->>KubeAPI: GET DaemonSet ovnkube-node
KubeAPI->>DeployedDaemonSet: return pod-template annotations
DeployedDaemonSet->>Renderer: provide ovnkube-script-lib-hash -> use as configHash
else differ
Renderer->>Renderer: use newHash as configHash
end
Renderer->>Renderer: set data.Data["OVNKubeConfigHash"] = configHash
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 15✅ Passed checks (15 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.12.2)level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/Masterminds/semver@v1.5.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/Masterminds/sprig/v3@v3.2.3: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/containernetworking/cni@v0.8.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ghodss/yaml@v1.0.1-0.20190212211648-25d852aebe32: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/go-bindata/go-bindata@v3.1.2+incompatible: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/onsi/gomega@v1.39.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ope ... [truncated 17357 characters] ... red in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/gengo/v2@v2.0.0-20251215205346-5ee0d033ba5b: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kms@v0.35.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kube-aggregator@v0.35.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/randfill@v1.0.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/structured-merge-diff/v6@v6.3.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n" Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: mkowalski The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/network/ovn_kubernetes.go`:
- Around line 496-523: The loop over cmPaths currently ignores the error
returned by uns.NestedMap and proceeds with an empty map, producing incorrect
rollout hashes; change the code in the block that calls uns.NestedMap (inside
the cmPaths loop after render.RenderTemplate) to capture and check the returned
error, and if non-nil return it (wrapped with context like "failed to read
ConfigMap .data for template %q") instead of treating the data as {}—apply this
to the use site of uns.NestedMap, keeping the existing json.Marshal, h.Write and
renderedCMData logic otherwise so the function fails fast on type/path errors
rather than generating a bogus hash.
- Around line 529-559: The current code swallows any error from the two
client.Default().CRClient().Get calls (for existingCM and existingDS) and treats
transient errors like NotFound, which can cause spurious rollouts; update the
Get error handling in the ovnkube-script-lib/ovnkube-node lookup path so that
you only treat apierrors.IsNotFound(err) as a safe fallback and for any other
non-nil error return the error (or requeue) to abort/retry the reconcile instead
of silently falling back to newHash; specifically modify the Get for existingCM
and the nested Get for existingDS to check apierrors.IsNotFound(err) and
propagate other errors, while preserving the configHash assignment logic when
existingHash is present.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d0e41bc1-9f74-4a02-97e9-e7b82f543196
📒 Files selected for processing (1)
pkg/network/ovn_kubernetes.go
7a87041 to
96a26f5
Compare
There was a problem hiding this comment.
Pull request overview
This PR updates how the OVN Kubernetes “script lib” ConfigMap is hashed and how that hash is propagated into the ovnkube-node DaemonSet rollout annotation, with the goal of preventing unnecessary DaemonSet rollouts during cluster install when template inputs may temporarily fluctuate.
Changes:
- Hashes only the rendered ConfigMap
.data(instead of the full rendered object including metadata) to avoid hash instability from metadata changes. - When the rendered ConfigMap
.datamatches the deployedovnkube-script-libConfigMap.data, preserves the existingovnkube-script-lib-hashannotation from the deployedovnkube-nodeDaemonSet instead of using a newly computed hash.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
96a26f5 to
af3f05d
Compare
af3f05d to
a18c491
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/network/ovn_kubernetes_test.go`:
- Around line 5015-5017: The test currently ignores the error returned by
uns.NestedStringMap when extracting annotations (the call that assigns ann, _, _
:= uns.NestedStringMap(...)); capture the error (e.g., ann, found, err :=
uns.NestedStringMap(...) or ann, _, err := ...) and assert it is nil using the
test helper in this file (e.g., require.NoError(t, err) or t.Fatalf/require with
a clear message) before using
ann["network.operator.openshift.io/ovnkube-script-lib-hash"]; do the same for
the other occurrence that currently discards the error (the similar
uns.NestedStringMap call around the second path) so malformed rendered objects
cause test failures instead of being masked.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 93d50e2f-8dad-4478-acec-bd34e0a1bace
📒 Files selected for processing (2)
pkg/network/ovn_kubernetes.gopkg/network/ovn_kubernetes_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/network/ovn_kubernetes.go
a18c491 to
3287522
Compare
|
@coderabbitai full review |
✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/network/ovn_kubernetes.go`:
- Around line 539-543: The code uses context.TODO() in two CRClient().Get calls
inside renderOVNKubernetes (fetching "ovnkube-script-lib" ConfigMap and
"ovnkube-node" DaemonSet) — change these to use the reconcile context by adding
a context.Context parameter to network.Render → renderDefaultNetwork →
renderOVNKubernetes (propagate the existing ctx from the caller, e.g.,
UpdateOperConfig/reconcile), and replace context.TODO() with that ctx in both
CRClient().Get calls so the reconciler context is properly threaded through.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 58f36345-c760-4dfe-ae92-caa021e22614
📒 Files selected for processing (2)
pkg/network/ovn_kubernetes.gopkg/network/ovn_kubernetes_test.go
3287522 to
819c53e
Compare
… install During install, the CNO reconciles multiple times in quick succession. The 008-script-lib.yaml ConfigMap template may render with slightly different content across passes (due to template variables populated from live cluster state that has not fully settled). This causes the ovnkube-script-lib-hash annotation on the ovnkube-node DaemonSet pod template to change, triggering an unnecessary rolling update within ~33 seconds of initial pod creation. On nodes where the gen-1 pod init container is still pulling the ~1.4 GB OVN image when the rolling update begins, the pod becomes a zombie: it has a deletionTimestamp but cannot be terminated (the init container is blocked on the CRI-level image pull). The DaemonSet controller sees the zombie pod and refuses to create a replacement, permanently blocking CNI initialization on that node. Fix this by: 1. Hashing only the .data section of the rendered ConfigMap (instead of the full manifest including metadata) to eliminate hash instability from metadata differences. 2. Comparing the rendered ConfigMap .data with the deployed ConfigMap in the cluster. If they match, preserving the existing DaemonSet hash annotation rather than writing the newly computed one. This ensures the DaemonSet annotation only changes when the ConfigMap content has actually been updated, preventing spurious rolling updates. Signed-off-by: Mateusz Kowalski <mko@redhat.com> Generated-by: OpenClaw AI-model: claude-opus-4
819c53e to
d654286
Compare
|
@coderabbitai full review |
4f66adb to
d654286
Compare
|
/retest-required |
|
@mkowalski: This pull request references Jira Issue OCPBUGS-87818, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
d654286 to
613492f
Compare
|
/jira refresh |
|
@mkowalski: This pull request references Jira Issue OCPBUGS-87818, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
613492f to
97eca53
Compare
|
@coderabbitai summary |
✅ Action performedSummary regeneration triggered. |
|
@coderabbitai full review |
✅ Action performedFull review finished. |
|
@coderabbitai resume |
✅ Action performedReviews resumed. |
97eca53 to
399d700
Compare
|
@coderabbitai, The existing test file uses testing.T + Gomega (no Ginkgo), and my new tests follow the exact same pattern. The check is flagging that the tests use testing.T style instead of Ginkgo Describe/Context/It — but that's the established convention in this file. The check is a false positive for this file. The entire ovn_kubernetes_test.go file (4500+ existing lines, 40+ test functions) uses testing.T + Gomega assertions — there are zero Ginkgo Describe/Context/It blocks anywhere in it. My tests follow the exact same pattern as all existing tests in the file. |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
# Check the test file for testing patterns - look for Ginkgo vs testing.T usage
rg -c 'func Test' pkg/network/ovn_kubernetes_test.go
rg -c 'Describe\|Context\|It(' pkg/network/ovn_kubernetes_test.go || echo "No Ginkgo blocks found"
rg -n 'func Test' pkg/network/ovn_kubernetes_test.go | head -20
wc -l pkg/network/ovn_kubernetes_test.goLength of output: 1902
✏️ Learnings added
|
|
https://github.com/coderabbitai full review |
|
@coderabbitai full review |
✅ Action performedFull review finished. |
399d700 to
4a9bfb8
Compare
|
@mkowalski: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Problem
During cluster install, the CNO reconciles multiple times in quick succession (~33s apart). The
008-script-lib.yamlConfigMap template renders with slightly different content across passes because template variables populated from live cluster state (bootstrapResult) may not have fully settled. This causes theovnkube-script-lib-hashannotation on theovnkube-nodeDaemonSet pod template to change, triggering an unnecessary rolling update.On nodes where the gen-1 pod init container (
kubecfg-setup) is still pulling the ~1.4 GB OVN image when the rolling update begins, the pod becomes a zombie — it has adeletionTimestampbut cannot be terminated because the init container is blocked on the CRI-level image pull. The DaemonSet controller sees the zombie pod on that node and refuses to create a gen-2 replacement (becausemaxUnavailable: 10%rounds to 0 for small clusters). This permanently blocks CNI initialization on that node, causing cascading install failure.Root Cause
The
OVNKubeConfigHashis computed by rendering008-script-lib.yamlas a Go template and SHA1-hashing the entire JSON-marshaled manifest (including metadata). Some template variables resolve differently between the first and second CNO reconciliation pass during bootstrap, producing different hashes even though the actual ConfigMap.datacontent is the same. The different hash updates the DaemonSet pod template annotation → DaemonSet generation bump → rolling update → zombie pod deadlock.Fix
.datasection of the rendered ConfigMap instead of the full manifest including metadata — eliminates hash instability from metadata differences..datawith the deployed ConfigMap in the cluster. If they match, preserve the existing DaemonSet hash annotation rather than writing the newly computed one.This ensures the DaemonSet annotation only changes when the ConfigMap content has actually been updated, preventing spurious rolling updates during install.
Impact
🤖 This PR was generated by AI on behalf of @mkowalski, who has reviewed it.
Summary by CodeRabbit
Bug Fixes
Tests