Skip to content

OCPBUGS-85677: avoid Progressing during node reboot pod recreation#3018

Open
jluhrsen wants to merge 1 commit into
openshift:masterfrom
jluhrsen:OCPBUGS-85677
Open

OCPBUGS-85677: avoid Progressing during node reboot pod recreation#3018
jluhrsen wants to merge 1 commit into
openshift:masterfrom
jluhrsen:OCPBUGS-85677

Conversation

@jluhrsen

@jluhrsen jluhrsen commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

The counter check (UpdatedNumberScheduled < CurrentNumberScheduled) correctly detects CNO-initiated rollouts but also triggers during node reboots when pods are recreated without any DaemonSet spec change.

Require generation > observedGeneration to confirm CNO recently updated the DaemonSet, filtering out transient pod churn from node lifecycle events.

Fixes false positives during MCO node reboots with ipsec machine configs.

Summary by CodeRabbit

  • Bug Fixes
    • More accurately detects active DaemonSet rollouts to avoid false positives during deployment monitoring — rollout is now considered active only when installation is incomplete or genuine progress on a new revision is observed, reducing incorrect "in-progress" status reports.

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 1, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@jluhrsen: This pull request references Jira Issue OCPBUGS-85677, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

The counter check (UpdatedNumberScheduled < CurrentNumberScheduled) correctly detects CNO-initiated rollouts but also triggers during node reboots when pods are recreated without any DaemonSet spec change.

Require generation > observedGeneration to confirm CNO recently updated the DaemonSet, filtering out transient pod churn from node lifecycle events.

Fixes false positives during MCO node reboots with ipsec machine configs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e52116d3-d7f9-4866-984c-a8628c739e46

📥 Commits

Reviewing files that changed from the base of the PR and between d4b833b and 194f573.

📒 Files selected for processing (1)
  • pkg/controller/statusmanager/pod_status.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/controller/statusmanager/pod_status.go

Walkthrough

The SetFromPods function's DaemonSet rollout active detection adds a guard condition: rollout activity now requires Generation > ObservedGeneration alongside count-based progress stall indicators, preventing false active status when the controller has not yet observed the latest generation.

Changes

DaemonSet Rollout Status Detection

Layer / File(s) Summary
DaemonSet rollout active detection guard
pkg/controller/statusmanager/pod_status.go
The condition determining when a DaemonSet rollout is active now additionally requires that the DaemonSet's Generation exceeds its ObservedGeneration alongside the existing count-based progress stall check.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly references the bug fix (OCPBUGS-85677) and accurately describes the main change: preventing false Progressing status during node reboot pod recreation by adding a Generation check.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR modifies only production code (pod_status.go), not test files. Repository contains no Ginkgo tests, only standard Go tests. Check is not applicable.
Test Structure And Quality ✅ Passed This PR modifies production code (pod_status.go), not test code. The project uses standard Go testing (testing.T), not Ginkgo, so the Ginkgo test quality check is not applicable.
Microshift Test Compatibility ✅ Passed No new Ginkgo e2e tests were added in this PR. The change is a production code fix to pod_status.go logic only.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This PR contains only controller logic changes to pod_status.go. It adds no new Ginkgo e2e tests (It(), Describe(), etc.). The custom check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed Change modifies status detection logic only; adds a generation guard to DaemonSet rollout detection without introducing any scheduling constraints or topology assumptions.
Ote Binary Stdout Contract ✅ Passed PR modifies pod_status.go with pure logic changes (adding Generation > ObservedGeneration condition). No stdout writes, logging initialization, or process-level code changes are introduced.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR does not add any new Ginkgo e2e tests. It only modifies pod_status.go controller logic to add a generation check for DaemonSet rollout detection, making the check inapplicable.
No-Weak-Crypto ✅ Passed No weak cryptography (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons found in the modified pod_status.go file.
Container-Privileges ✅ Passed PR only modifies Go logic in pod_status.go with no changes to Kubernetes manifests, container configurations, or privilege-related settings.
No-Sensitive-Data-In-Logs ✅ Passed PR modifies only conditional logic in pod_status.go with no new logging. Existing logging contains Kubernetes metadata only, no sensitive data.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/Masterminds/semver@v1.5.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/Masterminds/sprig/v3@v3.2.3: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/containernetworking/cni@v0.8.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ghodss/yaml@v1.0.1-0.20190212211648-25d852aebe32: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/go-bindata/go-bindata@v3.1.2+incompatible: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/onsi/gomega@v1.39.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ope

... [truncated 17357 characters] ...

red in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/gengo/v2@v2.0.0-20251215205346-5ee0d033ba5b: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kms@v0.35.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kube-aggregator@v0.35.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/randfill@v1.0.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/structured-merge-diff/v6@v6.3.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n"


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from martinkennelly and miheer June 1, 2026 20:50
@openshift-ci

openshift-ci Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jluhrsen
Once this PR has been reviewed and has the lgtm label, please assign kyrtapz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jluhrsen

jluhrsen commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

/payload-aggregate periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-ipsec 10

@openshift-ci

openshift-ci Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-ipsec

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9c86cb10-5dfb-11f1-81a6-9b75c8f028d7-0

@jluhrsen

jluhrsen commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 1, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@jluhrsen: This pull request references Jira Issue OCPBUGS-85677, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/controller/statusmanager/pod_status.go (1)

97-105: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard is not applied to the main DaemonSet update-progress condition.

The new Generation > ObservedGeneration check is only folded into dsRolloutActive, but Line 104 still marks progressing on UpdatedNumberScheduled < CurrentNumberScheduled unconditionally. That preserves the node-reboot false-positive path described in this PR.

Suggested fix
-		} else if ds.Status.UpdatedNumberScheduled < ds.Status.CurrentNumberScheduled {
+		} else if ds.Status.UpdatedNumberScheduled < ds.Status.CurrentNumberScheduled && ds.Generation > ds.Status.ObservedGeneration {
 			progressing = append(progressing, fmt.Sprintf("DaemonSet %q update is rolling out (%d out of %d updated)", dsName.String(), ds.Status.UpdatedNumberScheduled, ds.Status.CurrentNumberScheduled))
 			dsProgressing = true
 		} else if ds.Status.NumberUnavailable > 0 {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/statusmanager/pod_status.go` around lines 97 - 105, The
progressing branch incorrectly uses ds.Status.UpdatedNumberScheduled <
ds.Status.CurrentNumberScheduled without respecting the rollout guard; instead
use the previously computed boolean dsRolloutActive (which already includes
generation > ObservedGeneration) when deciding whether to append the
rolling-update message and set dsProgressing. Update the else-if condition that
currently checks UpdatedNumberScheduled < CurrentNumberScheduled to require
dsRolloutActive (e.g., dsRolloutActive && ds.Status.UpdatedNumberScheduled <
ds.Status.CurrentNumberScheduled) so the generation/observedGeneration guard
prevents false-positive node-reboot progress for DaemonSet updates.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@pkg/controller/statusmanager/pod_status.go`:
- Around line 97-105: The progressing branch incorrectly uses
ds.Status.UpdatedNumberScheduled < ds.Status.CurrentNumberScheduled without
respecting the rollout guard; instead use the previously computed boolean
dsRolloutActive (which already includes generation > ObservedGeneration) when
deciding whether to append the rolling-update message and set dsProgressing.
Update the else-if condition that currently checks UpdatedNumberScheduled <
CurrentNumberScheduled to require dsRolloutActive (e.g., dsRolloutActive &&
ds.Status.UpdatedNumberScheduled < ds.Status.CurrentNumberScheduled) so the
generation/observedGeneration guard prevents false-positive node-reboot progress
for DaemonSet updates.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b3ae22f7-9d4a-4c06-a369-336eec3e1a73

📥 Commits

Reviewing files that changed from the base of the PR and between 7d4c17a and d4b833b.

📒 Files selected for processing (1)
  • pkg/controller/statusmanager/pod_status.go

The counter check (UpdatedNumberScheduled < CurrentNumberScheduled) correctly
detects CNO-initiated rollouts but also triggers during node reboots when pods
are recreated without any DaemonSet spec change.

Require generation > observedGeneration to confirm CNO recently updated the
DaemonSet, filtering out transient pod churn from node lifecycle events.

Fixes false positives during MCO node reboots with ipsec machine configs.

Signed-off-by Jamo Luhrsen <jluhrsen@gmail.com>
Co-authored-by Claude Code <anthropic@noreply.com>
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@jluhrsen: This pull request references Jira Issue OCPBUGS-85677, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

The counter check (UpdatedNumberScheduled < CurrentNumberScheduled) correctly detects CNO-initiated rollouts but also triggers during node reboots when pods are recreated without any DaemonSet spec change.

Require generation > observedGeneration to confirm CNO recently updated the DaemonSet, filtering out transient pod churn from node lifecycle events.

Fixes false positives during MCO node reboots with ipsec machine configs.

Summary by CodeRabbit

  • Bug Fixes
  • More accurately detects active DaemonSet rollouts to avoid false positives during deployment monitoring — rollout is now considered active only when installation is incomplete or genuine progress on a new revision is observed, reducing incorrect "in-progress" status reports.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jluhrsen

jluhrsen commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

/payload-aggregate periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-ipsec 10

@openshift-ci

openshift-ci Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-ipsec

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e577e9f0-5e03-11f1-9d4c-ab89f3f0ae9e-0

@jluhrsen

jluhrsen commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

/test 4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade
/test 4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-upgrade
/test e2e-aws-ovn-upgrade
/test e2e-azure-ovn-upgrade
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-ovn-ipsec-step-registry

@openshift-ci

openshift-ci Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

@jluhrsen: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/security 194f573 link false /test security
ci/prow/unit 194f573 link true /test unit
ci/prow/e2e-azure-ovn-upgrade 194f573 link true /test e2e-azure-ovn-upgrade
ci/prow/4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 194f573 link false /test 4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants