Skip to content

OCPBUGS-86306: Stop controllers fighting over HCP status#8562

Merged
openshift-merge-bot[bot] merged 3 commits into
openshift:mainfrom
jkyros:karpenter-fix-hcp-status-race-oplocks
Jun 2, 2026
Merged

OCPBUGS-86306: Stop controllers fighting over HCP status#8562
openshift-merge-bot[bot] merged 3 commits into
openshift:mainfrom
jkyros:karpenter-fix-hcp-status-race-oplocks

Conversation

@jkyros

@jkyros jkyros commented May 20, 2026

Copy link
Copy Markdown
Member

What this PR does / why we need it:

  • We have multiple controllers writing to the HCP's status object with Update() calls, which replaces the entire object
  • Depending on what the controller performing the Update() understands and sees, and when it performs the update, that potentially blows away fields like vcpus from the autonode status that we added here: AUTOSCALE-615: include Karpenter node vCPUs in billing metric #8265
  • This just replaces the Status().Update() calls with calls to Patch calls that use client.MergeFromWithOptions() along with client.MergeFromWithOptimisticLock{} so that the full object updates don't wipe other controllers' status update
  • Testing this explicitly is unpleasant without some kind of refactor, the parts controllers that write the status aren't set up for easy testing, but it's observably possible the issue can happen because of the use of Update() calls across multiple controllers on the same object

Which issue(s) this PR fixes:

Fixes OCPBUGS-86306

Special notes for your reviewer:

The contentious bit here is that:

  • I'm building an RFC 6902 JSON Patch for the HCP status controller's Update() replacement because
  • we have fields in the status (AvailableUpdates, CompletionTime) that are +required and +nullable so
  • our patches get rejected with other patch types (And SSA - the generated applyConfigurations apply "omitempty" to everything so they drop them too. sigh)

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Refactor

    • Switched multiple controllers to optimistic-locking status patches for status writes, improving concurrency handling and updating success/error log messages.
  • Tests

    • Added tests validating JSON Patch generation and application: correct ops for adds/replaces/removes, optimistic-lock test, preservation of unrelated status fields and nil values, and condition array handling.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci

openshift-ci Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 20, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 20, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@jkyros: This pull request references Jira Issue OCPBUGS-86306, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

What this PR does / why we need it:

  • We have multiple controllers writing to the HCP's status object with Update() calls, which replaces the entire object
  • Depending on what the controller performing the Update() understands and sees, and when it performs the update, that potentially blows away fields like vcpus from the autonode status that we added here: AUTOSCALE-615: include Karpenter node vCPUs in billing metric #8265
  • This just replaces the Status().Update() calls with calls to Patch calls that use client.MergeFromWithOptions() along with client.MergeFromWithOptimisticLock{} so that the full object updates don't wipe other controllers' status update
  • Testing this explicitly is unpleasant without some kind of refactor, the parts controllers that write the status aren't set up for easy testing, but it's observably possible the issue can happen because of the use of Update() calls across multiple controllers on the same object

Which issue(s) this PR fixes:

Fixes OCPBUGS-86306

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This pull request replaces direct HostedControlPlane status updates with optimistic-locking status patches across multiple controllers. The hcpstatus controller now generates RFC‑6902 JSON Patch ops (including a resourceVersion test) and issues Status().Patch. Other controllers (hostedcontrolplane, resources, etcdbackup, ignition-server) deep-copy original HCP objects and persist condition changes via Status().Patch with MergeFromWithOptimisticLock. New unit tests validate patch generation and backup-condition patch behavior.

Suggested reviewers

  • enxebre
  • csrwng
🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning Most assertions lack meaningful failure messages. TestBuildStatusPatch: 47/52 assertions (~90%) without messages. TestUpdateHCPBackupCondition: 10/12 assertions (~83%) without messages. Add failure messages to assertions that lack them (e.g., Expect(err).ToNot(HaveOccurred(), "failed to update condition") instead of without message).
✅ Passed checks (10 passed)
Check name Status Explanation
Title check ✅ Passed The title directly addresses the main change: replacing Status().Update() with optimistic-locking Patch calls to prevent controllers from overwriting each other's HCP status fields.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names in PR are stable and deterministic: no pod names, timestamps, UUIDs, or dynamic information; uses standard Go testing (not Ginkgo); all subtests use descriptive static strings.
Topology-Aware Scheduling Compatibility ✅ Passed PR contains no scheduling constraints (affinity, nodeSelector, topology spread, PDB) affecting SNO, Two-Node, HyperShift topologies. Changes are internal controller status-writing refactoring only.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No new Ginkgo e2e tests were added; only standard Go unit tests with mocked Kubernetes clients. Tests contain no IPv4 assumptions or external connectivity requirements.
No-Weak-Crypto ✅ Passed PR contains no weak cryptography usage, custom crypto implementations, or non-constant-time secret comparisons. Changes are purely Kubernetes controller status patching updates.
Container-Privileges ✅ Passed PR contains only Go controller source code refactoring; no container privilege configurations, Kubernetes manifests, or Dockerfiles are modified.
No-Sensitive-Data-In-Logs ✅ Passed No logging statements in this PR expose sensitive data. All new logs are generic operational messages or Kubernetes condition metadata without secrets, tokens, credentials, PII, or customer data.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels May 20, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@jkyros: This pull request references Jira Issue OCPBUGS-86306, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

What this PR does / why we need it:

  • We have multiple controllers writing to the HCP's status object with Update() calls, which replaces the entire object
  • Depending on what the controller performing the Update() understands and sees, and when it performs the update, that potentially blows away fields like vcpus from the autonode status that we added here: AUTOSCALE-615: include Karpenter node vCPUs in billing metric #8265
  • This just replaces the Status().Update() calls with calls to Patch calls that use client.MergeFromWithOptions() along with client.MergeFromWithOptimisticLock{} so that the full object updates don't wipe other controllers' status update
  • Testing this explicitly is unpleasant without some kind of refactor, the parts controllers that write the status aren't set up for easy testing, but it's observably possible the issue can happen because of the use of Update() calls across multiple controllers on the same object

Which issue(s) this PR fixes:

Fixes OCPBUGS-86306

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Refactor

  • Improved status condition handling across hosted cluster controllers for more reliable concurrent updates.

  • Tests

  • Added test coverage for status condition patching behavior to ensure accurate field updates without unintended side effects.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ignition-server/controllers/local_ignitionprovider.go (1)

768-788: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wrap reconcileValidReleaseInfoCondition in retry logic for optimistic-lock conflicts.

This helper now uses Status().Patch(..., client.MergeFromWithOptimisticLock{}), which returns 409 conflicts during concurrent HCP status writes. However, the caller at lines 451-452 only logs the error and continues, leaving IgnitionServerValidReleaseInfo stale indefinitely—there's no guaranteed near-term retry.

Wrap the helper logic in retry.RetryOnConflict(retry.DefaultBackoff, func() { ... }) and re-fetch the HostedControlPlane inside the retry block. This pattern is already used in support/k8sutil/object.go and aligns with Kubernetes best practices.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ignition-server/controllers/local_ignitionprovider.go` around lines 768 -
788, Wrap the existing status-patch logic in
retry.RetryOnConflict(retry.DefaultBackoff, func() error { ... }) so
optimistic-lock 409s are retried: inside the retry block re-fetch the latest
HostedControlPlane using r.Client.Get into hostedControlPlane, create
originalHCP := hostedControlPlane.DeepCopy(), compute/set the
IgnitionServerValidReleaseInfo condition (use
releaseImageProvider.GetMissingImages() and meta.SetStatusCondition as before),
then call r.Client.Status().Patch(ctx, &hostedControlPlane,
client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{}))
and return the patch error; let RetryOnConflict surface a final error if retries
fail.
🧹 Nitpick comments (3)
control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go (2)

19-160: ⚡ Quick win

Add parallelization to align with unit-test execution guidance.

This test suite does not use t.Parallel(), so it misses the repository’s unit-test parallelism requirement.

Suggested change
 func TestMergeFromPatchDoesNotStompUnrelatedFields(t *testing.T) {
+	t.Parallel()
 	t.Run("When hcpstatus controller patches conditions and versionStatus, it should not include autoNode in the patch", func(t *testing.T) {
+		t.Parallel()
 		g := NewWithT(t)
@@
 	t.Run("When hcpstatus controller makes no changes, it should produce an empty patch", func(t *testing.T) {
+		t.Parallel()
 		g := NewWithT(t)
@@
 	t.Run("When a condition-only controller patches, the conditions array should contain ALL conditions from the read", func(t *testing.T) {
+		t.Parallel()
 		g := NewWithT(t)

As per coding guidelines, "Use race detection and parallel execution for unit tests".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go`
around lines 19 - 160, The test suite
TestMergeFromPatchDoesNotStompUnrelatedFields is missing parallelization; add
t.Parallel() at the start of TestMergeFromPatchDoesNotStompUnrelatedFields and
inside each subtest closure (the three t.Run anonymous funcs) so each subtest
and the parent test run with testing parallelism and race detection enabled;
look for the TestMergeFromPatchDoesNotStompUnrelatedFields function and the
three t.Run blocks to insert the t.Parallel() calls.

151-152: ⚡ Quick win

Avoid panic-prone type assertions in tests.

These direct assertions can panic and hide the real failure context. Prefer checked assertions so failures stay diagnostic.

Suggested change
-		status := patchMap["status"].(map[string]interface{})
-		conditions := status["conditions"].([]interface{})
+		status, ok := patchMap["status"].(map[string]interface{})
+		g.Expect(ok).To(BeTrue(), "patch should contain a status object")
+		conditions, ok := status["conditions"].([]interface{})
+		g.Expect(ok).To(BeTrue(), "status.conditions should be a JSON array")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go`
around lines 151 - 152, Replace the unchecked type assertions on
patchMap["status"] and status["conditions"] with guarded checks so the test
fails with a clear message instead of panicking: retrieve the value using the
comma-ok form (e.g., statusIface, ok := patchMap["status"]; if !ok {
t.Fatalf(...) } then status, ok := statusIface.(map[string]interface{}); if !ok
{ t.Fatalf(...) }) and likewise check conditionsIface, ok :=
status["conditions"]; if !ok { t.Fatalf(...) } then conditions, ok :=
conditionsIface.([]interface{}); if !ok { t.Fatalf(...) }); alternatively use
your test helper (e.g., require.IsType/require.NotNil) around the same symbols
(patchMap, status, conditions) to make assertions explicit and diagnostic.
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go (1)

570-576: ⚡ Quick win

Skip no-op recovery status patches.

reconcileClusterRecovery always patches after meta.SetStatusCondition, even when the condition is unchanged. Because this path polls every 120 seconds, steady-state recovery checks become repeated write attempts and avoidable optimistic-lock conflicts on the same HCP status object. Guard on the boolean returned by meta.SetStatusCondition to skip writing unchanged conditions.

♻️ Suggested change
 	originalHCP := hcp.DeepCopy()
-	meta.SetStatusCondition(&hcp.Status.Conditions, *condition)
+	if !meta.SetStatusCondition(&hcp.Status.Conditions, *condition) {
+		if !finished {
+			return ctrl.Result{RequeueAfter: 120 * time.Second}, nil
+		}
+		return ctrl.Result{}, nil
+	}
 	log.Info("setting condition", "type", condition.Type, "status", condition.Status, "message", condition.Message)
 	if err := r.cpClient.Status().Patch(ctx, hcp, client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{})); err != nil {
 		return ctrl.Result{}, fmt.Errorf("failed to patch status on hcp for hosted cluster recovery: %w. Condition error message: %v", err, condition.Message)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`
around lines 570 - 576, reconcileClusterRecovery is always patching HCP status
even when meta.SetStatusCondition didn't change the condition; capture the
boolean return from meta.SetStatusCondition and only call
r.cpClient.Status().Patch (using originalHCP and the same
MergeFromWithOptions/OptimisticLock) if that boolean is true to skip no-op
writes and avoid redundant optimistic-lock updates.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go`:
- Around line 79-80: The patch call on h.mgtClusterClient.Status().Patch (using
crclient.MergeFromWithOptions and MergeFromWithOptimisticLock) must treat
optimistic-lock conflicts differently: after the Patch returns an error, detect
conflicts with apierrors.IsConflict(err) and, if true, return reconcile.Result{}
with a nil error so the controller runtime can requeue with backoff; for
non-conflict errors continue returning the wrapped fmt.Errorf("failed to patch
hcp status: %w", err). Locate the error handling directly after the Patch
invocation in the hcp status reconciliation code and insert the
apierrors.IsConflict check there.

In `@hypershift-operator/controllers/etcdbackup/reconciler.go`:
- Around line 302-305: The patch that sets meta.SetStatusCondition and calls
r.Status().Patch(hcp, client.MergeFromWithOptions(originalHCP,
client.MergeFromWithOptimisticLock{})) should be wrapped in
retry.RetryOnConflict so optimistic-lock 409s are retried: on each retry re-get
the latest HCP (fresh read into a new variable), reapply
condition.ObservedGeneration = hcp.Generation and
meta.SetStatusCondition(&hcp.Status.Conditions, condition) against the fresh
HCP, then call r.Status().Patch with client.MergeFromWithOptions(originalHCP,
client.MergeFromWithOptimisticLock{}) (or use a new merge-from base as
appropriate) until success or non-conflict error; ensure errors are returned if
retries exhaust.

---

Outside diff comments:
In `@ignition-server/controllers/local_ignitionprovider.go`:
- Around line 768-788: Wrap the existing status-patch logic in
retry.RetryOnConflict(retry.DefaultBackoff, func() error { ... }) so
optimistic-lock 409s are retried: inside the retry block re-fetch the latest
HostedControlPlane using r.Client.Get into hostedControlPlane, create
originalHCP := hostedControlPlane.DeepCopy(), compute/set the
IgnitionServerValidReleaseInfo condition (use
releaseImageProvider.GetMissingImages() and meta.SetStatusCondition as before),
then call r.Client.Status().Patch(ctx, &hostedControlPlane,
client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{}))
and return the patch error; let RetryOnConflict surface a final error if retries
fail.

---

Nitpick comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go`:
- Around line 19-160: The test suite
TestMergeFromPatchDoesNotStompUnrelatedFields is missing parallelization; add
t.Parallel() at the start of TestMergeFromPatchDoesNotStompUnrelatedFields and
inside each subtest closure (the three t.Run anonymous funcs) so each subtest
and the parent test run with testing parallelism and race detection enabled;
look for the TestMergeFromPatchDoesNotStompUnrelatedFields function and the
three t.Run blocks to insert the t.Parallel() calls.
- Around line 151-152: Replace the unchecked type assertions on
patchMap["status"] and status["conditions"] with guarded checks so the test
fails with a clear message instead of panicking: retrieve the value using the
comma-ok form (e.g., statusIface, ok := patchMap["status"]; if !ok {
t.Fatalf(...) } then status, ok := statusIface.(map[string]interface{}); if !ok
{ t.Fatalf(...) }) and likewise check conditionsIface, ok :=
status["conditions"]; if !ok { t.Fatalf(...) } then conditions, ok :=
conditionsIface.([]interface{}); if !ok { t.Fatalf(...) }); alternatively use
your test helper (e.g., require.IsType/require.NotNil) around the same symbols
(patchMap, status, conditions) to make assertions explicit and diagnostic.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`:
- Around line 570-576: reconcileClusterRecovery is always patching HCP status
even when meta.SetStatusCondition didn't change the condition; capture the
boolean return from meta.SetStatusCondition and only call
r.cpClient.Status().Patch (using originalHCP and the same
MergeFromWithOptions/OptimisticLock) if that boolean is true to skip no-op
writes and avoid redundant optimistic-lock updates.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: d0b2014e-1faa-4d10-a400-26d60e220b42

📥 Commits

Reviewing files that changed from the base of the PR and between c23dbb8 and 58ac9ee.

📒 Files selected for processing (6)
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
  • hypershift-operator/controllers/etcdbackup/reconciler.go
  • ignition-server/controllers/local_ignitionprovider.go

Comment on lines +79 to +80
if err := h.mgtClusterClient.Status().Patch(ctx, hcp, crclient.MergeFromWithOptions(originalHCP, crclient.MergeFromWithOptimisticLock{})); err != nil {
return reconcile.Result{}, fmt.Errorf("failed to patch hcp status: %w", err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the file at the specified lines
cat -n control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go | sed -n '70,90p'

Repository: openshift/hypershift

Length of output: 1230


🏁 Script executed:

# Let's also see more context around the patch operation
cat -n control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go | sed -n '60,100p'

Repository: openshift/hypershift

Length of output: 2002


🏁 Script executed:

# Check if there are other patch operations in the file to understand the pattern
rg -n "Status\(\)\.Patch|MergeFromWithOptimisticLock" control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go -A 2 -B 2

Repository: openshift/hypershift

Length of output: 450


🏁 Script executed:

# Search for how conflicts are handled elsewhere in the codebase
rg -n "409|Conflict|optimistic" control-plane-operator/ --type go -i | head -20

Repository: openshift/hypershift

Length of output: 4005


🏁 Script executed:

# Check the controller-runtime and kubernetes patterns for conflict handling
rg -n "MergeFromWithOptimisticLock" . --type go | head -10

Repository: openshift/hypershift

Length of output: 1802


🏁 Script executed:

# Check the vendored controller-runtime to understand conflict handling patterns
cat vendor/sigs.k8s.io/controller-runtime/pkg/client/patch.go | sed -n '50,80p'

Repository: openshift/hypershift

Length of output: 1445


🏁 Script executed:

# Search for error handling patterns around optimistic locks
rg -n "apiutil|IsConflict|StatusError" control-plane-operator/ --type go | head -15

Repository: openshift/hypershift

Length of output: 1952


🏁 Script executed:

# Check if there are any examples of conflict detection in the codebase
rg -n "apierrors.IsConflict|StatusError" . --type go | head -20

Repository: openshift/hypershift

Length of output: 3023


🌐 Web query:

Kubernetes controller-runtime conflict 409 error handling optimistic locking best practices

💡 Result:

Handling 409 Conflict errors in Kubernetes is an essential aspect of managing optimistic concurrency [1][2]. When multiple clients (e.g., controllers, users, HPA) modify the same resource, the Kubernetes API server uses the resourceVersion field to ensure updates are based on the latest state [1][3]. If a conflict occurs, the API server rejects the update with a 409 Conflict status [3][4]. Best practices for handling these conflicts include: 1. Let the Controller Reconcile (Recommended for most cases) In standard controller-runtime development, the most idiomatic approach is to return the error from your Reconcile function [5][3]. This allows the controller-runtime's built-in work queue to catch the error, log it, and automatically re-enqueue the request with exponential backoff [6][7][8]. This avoids manual retry loops inside your reconciliation logic and keeps the controller simple and idempotent [9][10]. 2. Use RetryOnConflict (For specific update operations) If you are performing a complex update operation within a reconciliation loop that must succeed before continuing, you can use the k8s.io/client-go/util/retry helper [5][11][12]. This helper manages the re-fetching of the object to get the latest resourceVersion before attempting the update again [1][12][13]. Example: err:= retry.RetryOnConflict(retry.DefaultRetry, func error { // 1. Re-fetch the latest version of the object if err:= r.Get(ctx, req.NamespacedName, latestObj); err!= nil { return err } // 2. Apply modifications to the fresh object latestObj.Status.Phase = "Running" // 3. Attempt update return r.Status.Update(ctx, latestObj) }) 3. Consider Server-Side Apply For complex objects with many fields modified by multiple controllers, Server-Side Apply (SSA) is the preferred alternative to traditional Update calls [5]. SSA allows for field-level ownership, significantly reducing the frequency of 409 conflicts compared to full object updates [5][12]. Key Principles: - Avoid manual, infinite retry loops. - Do not manually manipulate or suppress the resourceVersion field to bypass concurrency checks [2]. - Separate spec updates from status updates; status subresources help minimize conflicts [2]. - Leverage the framework: Return errors to the reconciler to benefit from built-in rate-limiting and backoff [6][8].

Citations:


🏁 Script executed:

# Check for controller-runtime configuration or custom backoff settings
rg -n "DefaultRetry|ExponentialBackoff|SetupWithManager" control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go -A 3

Repository: openshift/hypershift

Length of output: 46


🏁 Script executed:

# Check if there's any custom error handling wrapper in the codebase
rg -n "apierrors.IsConflict" . --type go -B 3 -A 3 | head -40

Repository: openshift/hypershift

Length of output: 3994


Handle optimistic-lock conflicts distinctly from other patch failures.

The current code treats 409 conflicts from optimistic locking the same as other patch errors, which can create misleading error metrics and logs. Conflicts are expected under contention and should be handled separately. Detect conflicts using apierrors.IsConflict(err) and return nil error to allow the framework's work queue to naturally requeue with exponential backoff, keeping other patch errors as genuine failures. See reference implementation in vendor/sigs.k8s.io/cluster-api/util/patch/patch.go for this pattern.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go`
around lines 79 - 80, The patch call on h.mgtClusterClient.Status().Patch (using
crclient.MergeFromWithOptions and MergeFromWithOptimisticLock) must treat
optimistic-lock conflicts differently: after the Patch returns an error, detect
conflicts with apierrors.IsConflict(err) and, if true, return reconcile.Result{}
with a nil error so the controller runtime can requeue with backoff; for
non-conflict errors continue returning the wrapped fmt.Errorf("failed to patch
hcp status: %w", err). Locate the error handling directly after the Patch
invocation in the hcp status reconciliation code and insert the
apierrors.IsConflict check there.

Comment on lines +302 to +305
originalHCP := hcp.DeepCopy()
condition.ObservedGeneration = hcp.Generation
meta.SetStatusCondition(&hcp.Status.Conditions, condition)
return r.Status().Update(ctx, hcp)
return r.Status().Patch(ctx, hcp, client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{}))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cd hypershift-operator/controllers/etcdbackup && wc -l reconciler.go

Repository: openshift/hypershift

Length of output: 82


🏁 Script executed:

# First, let's look at the updateHCPBackupCondition function
cd hypershift-operator/controllers/etcdbackup && sed -n '295,310p' reconciler.go

Repository: openshift/hypershift

Length of output: 968


🏁 Script executed:

# Now examine the callers at lines 193, 452, and 484
cd hypershift-operator/controllers/etcdbackup && sed -n '188,198p' reconciler.go

Repository: openshift/hypershift

Length of output: 396


🏁 Script executed:

cd hypershift-operator/controllers/etcdbackup && sed -n '447,457p' reconciler.go

Repository: openshift/hypershift

Length of output: 391


🏁 Script executed:

cd hypershift-operator/controllers/etcdbackup && sed -n '479,489p' reconciler.go

Repository: openshift/hypershift

Length of output: 393


🏁 Script executed:

# Check imports to verify retry is available
cd hypershift-operator/controllers/etcdbackup && head -30 reconciler.go

Repository: openshift/hypershift

Length of output: 950


Retry optimistic-lock conflicts instead of silently dropping the HCP condition update.

MergeFromWithOptimisticLock{} surfaces 409 Conflict errors when the HCP is concurrently modified, but all callers at lines 193, 452, and 484 only log the error and continue. This means conflicts can permanently stale EtcdBackupSucceeded conditions on all three terminal paths (in-progress, success, failure).

The adjacent function updateHostedClusterBackupURL already handles this correctly with retry.RetryOnConflict, reflecting the same concurrency concern. Wrap this function's patch in retry.RetryOnConflict with a fresh HCP read to safely retry on conflicts.

Proposed fix
 func (r *HCPEtcdBackupReconciler) updateHCPBackupCondition(ctx context.Context, hcp *hyperv1.HostedControlPlane, condition metav1.Condition) error {
-	originalHCP := hcp.DeepCopy()
-	condition.ObservedGeneration = hcp.Generation
-	meta.SetStatusCondition(&hcp.Status.Conditions, condition)
-	return r.Status().Patch(ctx, hcp, client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{}))
+	return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
+		current := &hyperv1.HostedControlPlane{}
+		if err := r.Get(ctx, client.ObjectKeyFromObject(hcp), current); err != nil {
+			return err
+		}
+
+		originalHCP := current.DeepCopy()
+		condition.ObservedGeneration = current.Generation
+		meta.SetStatusCondition(&current.Status.Conditions, condition)
+		return r.Status().Patch(ctx, current, client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{}))
+	})
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/etcdbackup/reconciler.go` around lines 302 -
305, The patch that sets meta.SetStatusCondition and calls r.Status().Patch(hcp,
client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{}))
should be wrapped in retry.RetryOnConflict so optimistic-lock 409s are retried:
on each retry re-get the latest HCP (fresh read into a new variable), reapply
condition.ObservedGeneration = hcp.Generation and
meta.SetStatusCondition(&hcp.Status.Conditions, condition) against the fresh
HCP, then call r.Status().Patch with client.MergeFromWithOptions(originalHCP,
client.MergeFromWithOptimisticLock{}) (or use a new merge-from base as
appropriate) until success or non-conflict error; ensure errors are returned if
retries exhaust.

@codecov

codecov Bot commented May 20, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 50.79365% with 31 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.69%. Comparing base (adfbcdd) to head (fb2ea1c).
⚠️ Report is 43 commits behind head on main.

Files with missing lines Patch % Lines
...rconfigoperator/controllers/hcpstatus/hcpstatus.go 52.38% 13 Missing and 7 partials ⚠️
...ostedcontrolplane/hostedcontrolplane_controller.go 22.22% 5 Missing and 2 partials ⚠️
...rconfigoperator/controllers/resources/resources.go 50.00% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8562      +/-   ##
==========================================
- Coverage   45.84%   40.69%   -5.16%     
==========================================
  Files         440      755     +315     
  Lines       52824    93412   +40588     
==========================================
+ Hits        24218    38010   +13792     
- Misses      26816    52662   +25846     
- Partials     1790     2740     +950     
Files with missing lines Coverage Δ
...hift-operator/controllers/etcdbackup/reconciler.go 76.81% <100.00%> (+0.03%) ⬆️
...ition-server/controllers/local_ignitionprovider.go 38.14% <100.00%> (ø)
...rconfigoperator/controllers/resources/resources.go 56.62% <50.00%> (+0.01%) ⬆️
...ostedcontrolplane/hostedcontrolplane_controller.go 45.05% <22.22%> (+0.02%) ⬆️
...rconfigoperator/controllers/hcpstatus/hcpstatus.go 67.87% <52.38%> (-4.57%) ⬇️

... and 316 files with indirect coverage changes

Flag Coverage Δ
cmd-support 34.70% <ø> (?)
cpo-hostedcontrolplane 41.81% <22.22%> (+<0.01%) ⬆️
cpo-other 41.43% <52.00%> (+0.03%) ⬆️
hypershift-operator 50.81% <100.00%> (-0.01%) ⬇️
other 31.62% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jkyros jkyros force-pushed the karpenter-fix-hcp-status-race-oplocks branch from 58ac9ee to 10d5332 Compare May 20, 2026 23:59

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ignition-server/controllers/local_ignitionprovider.go (1)

767-788: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use retry.RetryOnConflict to prevent silent condition update loss on 409 conflicts.

This function uses optimistic locking but doesn't retry on conflicts. Since runMCO at lines 450-452 only logs the error, any 409 conflict will silently drop the IgnitionServerValidReleaseInfo condition update. Wrap the patch in retry.RetryOnConflict(retry.DefaultBackoff, func() error {...}) with a fresh HCP read inside the closure, following the pattern used in support/k8sutil.UpdateObject and elsewhere in the codebase.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ignition-server/controllers/local_ignitionprovider.go` around lines 767 -
788, The patch of hostedControlPlane’s status can silently fail on 409
conflicts; wrap the r.Client.Status().Patch call in
retry.RetryOnConflict(retry.DefaultBackoff, func() error { ... }) and inside the
closure re-read the current HostedControlPlane (hostedControlPlane) from the API
to compute/merge the IgnitionServerValidReleaseInfo condition (using
releaseImageProvider.GetMissingImages() as before), then call
r.Client.Status().Patch (or reuse support/k8sutil.UpdateObject pattern)
returning any error so conflicts are retried; keep originalHCP only for the
MergeFrom base inside the closure.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go`:
- Around line 20-161: Add parallelization by calling t.Parallel() at the start
of the top-level TestMergeFromPatchDoesNotStompUnrelatedFields function and
inside each subtest's closure (the anonymous func passed to t.Run).
Specifically, insert t.Parallel() as the first statement in
TestMergeFromPatchDoesNotStompUnrelatedFields and as the first statement in each
subtest function (the func(t *testing.T) for the three t.Run blocks) so the test
and its subtests run with Go's t.Parallel.

---

Outside diff comments:
In `@ignition-server/controllers/local_ignitionprovider.go`:
- Around line 767-788: The patch of hostedControlPlane’s status can silently
fail on 409 conflicts; wrap the r.Client.Status().Patch call in
retry.RetryOnConflict(retry.DefaultBackoff, func() error { ... }) and inside the
closure re-read the current HostedControlPlane (hostedControlPlane) from the API
to compute/merge the IgnitionServerValidReleaseInfo condition (using
releaseImageProvider.GetMissingImages() as before), then call
r.Client.Status().Patch (or reuse support/k8sutil.UpdateObject pattern)
returning any error so conflicts are retried; keep originalHCP only for the
MergeFrom base inside the closure.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 6b55879f-4645-4cc2-9269-fc6811eee015

📥 Commits

Reviewing files that changed from the base of the PR and between 58ac9ee and 10d5332.

📒 Files selected for processing (6)
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
  • hypershift-operator/controllers/etcdbackup/reconciler.go
  • ignition-server/controllers/local_ignitionprovider.go

@jkyros

jkyros commented May 21, 2026

Copy link
Copy Markdown
Member Author

/test e2e-aws-autonode

@jkyros

jkyros commented May 21, 2026

Copy link
Copy Markdown
Member Author

/test e2e-aws-autonode

@jkyros

jkyros commented May 21, 2026

Copy link
Copy Markdown
Member Author

Ugh that passed. Great. There's something in here with Patch and fields that are +required +nullable that I think breaks it if we don't mess with it. The grotesque way I solved it with the JSON patch is not ideal. Trying to do better.

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 28, 2026
@jkyros jkyros force-pushed the karpenter-fix-hcp-status-race-oplocks branch from 9995cf5 to 541e646 Compare May 28, 2026 21:23
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 28, 2026
@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Actionable comments posted: 0

@jkyros

jkyros commented May 28, 2026

Copy link
Copy Markdown
Member Author

That totally passed locally, I've never seen that flake but...cache checksum issues?
/test lint
/test verify

@openshift-ci openshift-ci Bot added the area/testing Indicates the PR includes changes for e2e testing label May 30, 2026
@jkyros

jkyros commented May 30, 2026

Copy link
Copy Markdown
Member Author

/test images

@jkyros

jkyros commented May 30, 2026

Copy link
Copy Markdown
Member Author

/test e2e-aws-autonode

@jkyros jkyros marked this pull request as ready for review May 30, 2026 04:04
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 30, 2026
@openshift-ci openshift-ci Bot requested review from Nirshal and cblecker May 30, 2026 04:04
@jkyros

jkyros commented May 30, 2026

Copy link
Copy Markdown
Member Author

/pipeline required

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@jkyros

jkyros commented May 31, 2026

Copy link
Copy Markdown
Member Author

Cleanup/infra failures
/retest-required

@jkyros

jkyros commented May 31, 2026

Copy link
Copy Markdown
Member Author

Infra is still broken, now we're having trouble retrieving image streams 😢

@TwoDCube

TwoDCube commented Jun 1, 2026

Copy link
Copy Markdown
Member

Infra supposedly fixed by ci-tools#5220
/retest-required

@muraee

muraee commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 1, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage.

@jkyros

jkyros commented Jun 1, 2026

Copy link
Copy Markdown
Member Author

/test e2e-aws-4-22

@jkyros

jkyros commented Jun 1, 2026

Copy link
Copy Markdown
Member Author

/verified by e2e-aws-autonode

=== RUN   TestKarpenter/Main/Billing_vCPUs,_consolidation,_and_cluster_deletion_with_blocking_PDB
    karpenter_test.go:1438: Validating AutoNode.VCPUs converges to 0
    karpenter_test.go:1438: Successfully waited for HostedCluster e2e-clusters-fkv2n/karpenter-p52jl AutoNode.VCPUs=0 in 0s
    karpenter_test.go:1439: Asserting AutoNode.VCPUs remains stable at 0 for 30s
    karpenter_test.go:1439: AutoNode.VCPUs remained stable at 0 for 30s
    karpenter_test.go:1443: Baseline billing metric vCPUs from native NodePools: 4
    karpenter_test.go:1454: Created Karpenter NodePool
    karpenter_test.go:1456: Created workloads with 2 replicas
    util.go:578: Successfully waited for 2 nodes to become ready in 5m21s
    karpenter_test.go:1459: Both nodes ready, validating billing vCPUs
    karpenter_test.go:1462: Validating AutoNode.VCPUs converges to 8
    karpenter_test.go:1462: Successfully waited for HostedCluster e2e-clusters-fkv2n/karpenter-p52jl AutoNode.VCPUs=8 in 25ms
    karpenter_test.go:1463: Asserting AutoNode.VCPUs remains stable at 8 for 30s
    karpenter_test.go:1463: AutoNode.VCPUs remained stable at 8 for 30s
    karpenter_test.go:1464: Validating hypershift_cluster_vcpus converges to 12
    karpenter_test.go:1695: hypershift_cluster_vcpus=12 for cluster e2e-clusters-fkv2n/karpenter-p52jl
    karpenter_test.go:1466: Scaling workload to 1 replica to verify deprovisioning and consolidation
    util.go:578: Successfully waited for 1 nodes to become ready in 1m48s
    karpenter_test.go:1473: Karpenter consolidated the extra node
    karpenter_test.go:1476: Validating AutoNode.VCPUs converges to 4
    karpenter_test.go:1476: Successfully waited for HostedCluster e2e-clusters-fkv2n/karpenter-p52jl AutoNode.VCPUs=4 in 25ms
    karpenter_test.go:1477: Asserting AutoNode.VCPUs remains stable at 4 for 30s
    karpenter_test.go:1477: AutoNode.VCPUs remained stable at 4 for 30s
    karpenter_test.go:1478: Validating hypershift_cluster_vcpus converges to 8
    karpenter_test.go:1695: hypershift_cluster_vcpus=8 for cluster e2e-clusters-fkv2n/karpenter-p52jl
    karpenter_test.go:1500: Created cluster-deletion-blocking PodDisruptionBudget

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 1, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@jkyros: This PR has been marked as verified by e2e-aws-autonode.

Details

In response to this:

/verified by e2e-aws-autonode

=== RUN   TestKarpenter/Main/Billing_vCPUs,_consolidation,_and_cluster_deletion_with_blocking_PDB
   karpenter_test.go:1438: Validating AutoNode.VCPUs converges to 0
   karpenter_test.go:1438: Successfully waited for HostedCluster e2e-clusters-fkv2n/karpenter-p52jl AutoNode.VCPUs=0 in 0s
   karpenter_test.go:1439: Asserting AutoNode.VCPUs remains stable at 0 for 30s
   karpenter_test.go:1439: AutoNode.VCPUs remained stable at 0 for 30s
   karpenter_test.go:1443: Baseline billing metric vCPUs from native NodePools: 4
   karpenter_test.go:1454: Created Karpenter NodePool
   karpenter_test.go:1456: Created workloads with 2 replicas
   util.go:578: Successfully waited for 2 nodes to become ready in 5m21s
   karpenter_test.go:1459: Both nodes ready, validating billing vCPUs
   karpenter_test.go:1462: Validating AutoNode.VCPUs converges to 8
   karpenter_test.go:1462: Successfully waited for HostedCluster e2e-clusters-fkv2n/karpenter-p52jl AutoNode.VCPUs=8 in 25ms
   karpenter_test.go:1463: Asserting AutoNode.VCPUs remains stable at 8 for 30s
   karpenter_test.go:1463: AutoNode.VCPUs remained stable at 8 for 30s
   karpenter_test.go:1464: Validating hypershift_cluster_vcpus converges to 12
   karpenter_test.go:1695: hypershift_cluster_vcpus=12 for cluster e2e-clusters-fkv2n/karpenter-p52jl
   karpenter_test.go:1466: Scaling workload to 1 replica to verify deprovisioning and consolidation
   util.go:578: Successfully waited for 1 nodes to become ready in 1m48s
   karpenter_test.go:1473: Karpenter consolidated the extra node
   karpenter_test.go:1476: Validating AutoNode.VCPUs converges to 4
   karpenter_test.go:1476: Successfully waited for HostedCluster e2e-clusters-fkv2n/karpenter-p52jl AutoNode.VCPUs=4 in 25ms
   karpenter_test.go:1477: Asserting AutoNode.VCPUs remains stable at 4 for 30s
   karpenter_test.go:1477: AutoNode.VCPUs remained stable at 4 for 30s
   karpenter_test.go:1478: Validating hypershift_cluster_vcpus converges to 8
   karpenter_test.go:1695: hypershift_cluster_vcpus=8 for cluster e2e-clusters-fkv2n/karpenter-p52jl
   karpenter_test.go:1500: Created cluster-deletion-blocking PodDisruptionBudget

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bryan-cox

Copy link
Copy Markdown
Member

/retest

@jkyros

jkyros commented Jun 2, 2026

Copy link
Copy Markdown
Member Author

codecov/project is blocking me I guess?
/retest

@bryan-cox

Copy link
Copy Markdown
Member

/override codecov/project

We know this is currently broken

@openshift-ci

openshift-ci Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

@bryan-cox: Overrode contexts on behalf of bryan-cox: codecov/project

Details

In response to this:

/override codecov/project

We know this is currently broken

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hypershift-jira-solve-ci

Copy link
Copy Markdown

This confirms everything. Here is the final report:

Test Failure Analysis Complete

Job Information

  • Prow Job: codecov/project
  • Build ID: Check Run 79072867811
  • PR: #8562 — OCPBUGS-86306: Stop controllers fighting over HCP status
  • Base Commit: adfbcdd (43 commits behind main)
  • Head Commit: fb2ea1c

Test Failure Analysis

Error

codecov/project: 40.69% (-5.16%) compared to adfbcdd
Project coverage decreased from 45.84% (base) to 40.69% (head).
Patch coverage: 50.79% with 31 lines missing coverage.

Summary

This is not a real test or coverage regression caused by PR #8562. The codecov/project check failed because Codecov compared the PR head against a stale base commit (adfbcdd, 43 commits behind main). The base coverage report measured 440 files / 52,824 lines at 45.84%, while the head coverage report measured 755 files / 93,412 lines at 40.69% — a delta of +315 files and +40,588 lines entirely unrelated to this PR. The codecov/patch check (which measures only the PR's changed lines) passed, and the check was already overridden by a maintainer (/override codecov/project) confirming this is a known broken state.

Root Cause

The failure is caused by a stale base comparison combined with no configured coverage threshold tolerance in the repository's Codecov configuration.

Mechanism:

  1. Stale base report: The base commit adfbcdd is 43 commits behind the current main. Its coverage report was generated against a different, smaller file set — 440 files covering 52,824 lines at 45.84%.

  2. Expanded file set on PR head: The coverage report for the PR's head commit fb2ea1c includes 755 files covering 93,412 lines. The +315 additional files (+40,588 lines) represent code added to main in intervening commits or included via a different build configuration. Codecov itself flags this: "Report is 43 commits behind head on main" and lists "316 files with indirect coverage changes".

  3. No threshold tolerance configured: Codecov's default behavior uses target: auto (coverage must not decrease) with threshold: 0% (zero tolerance). The apparent -5.16% drop (an artifact of comparing mismatched file sets) exceeds this zero-tolerance threshold.

  4. PR's actual impact is negligible: The codecov/patch check passed with 50.79% patch coverage. The PR only modifies 8 files. A maintainer (bryan-cox) confirmed this is a known broken state and overrode the check with /override codecov/project.

Recommendations
  1. No action needed on this PR — the failure is not caused by PR OCPBUGS-86306: Stop controllers fighting over HCP status #8562. The codecov/patch check passes, and the codecov/project check has already been overridden by a maintainer.

  2. Rebase onto current main — this will update the base commit for Codecov's comparison, resolving the stale file-set mismatch and likely making codecov/project pass on future runs.

  3. To prevent recurrence repo-wide, configure a threshold tolerance in codecov.yml:

    coverage:
      status:
        project:
          default:
            threshold: 1%  # Allow up to 1% decrease
  4. Refresh Codecov's base report — the 43-commit gap suggests Codecov hasn't processed recent main commits. Merging any PR to main with CI coverage upload will update the base.

Evidence
Evidence Detail
Failure type codecov/project — project-wide coverage gate, not a test execution failure
codecov/patch result ✅ Passed (50.79% patch coverage) — PR's own code is adequately covered
codecov/project result ❌ Failed: 40.69% (-5.16%) vs base adfbcdd
Stale base Codecov reports: "Report is 43 commits behind head on main"
File set mismatch Base: 440 files / 52,824 lines → Head: 755 files / 93,412 lines (+315 files, +40,588 lines)
PR scope 8 files changed, 3 commits — cannot account for +315 file delta
Maintainer override bryan-cox ran /override codecov/project with comment: "We know this is currently broken"
Override check run ID 79070332375codecov/project shows conclusion: success via Prow override
Files with missing coverage hcpstatus.go (52.38%), hostedcontrolplane_controller.go (22.22%), resources.go (50.00%) — partial coverage on new patch logic, not a regression

@openshift-ci

openshift-ci Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

@jkyros: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 1146767 into openshift:main Jun 2, 2026
47 of 58 checks passed
@openshift-ci-robot

Copy link
Copy Markdown

@jkyros: Jira Issue OCPBUGS-86306: Some pull requests linked via external trackers have merged:

The following pull request, linked via external tracker, has not merged:

All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-86306 has not been moved to the MODIFIED state.

This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload.

Details

In response to this:

What this PR does / why we need it:

  • We have multiple controllers writing to the HCP's status object with Update() calls, which replaces the entire object
  • Depending on what the controller performing the Update() understands and sees, and when it performs the update, that potentially blows away fields like vcpus from the autonode status that we added here: AUTOSCALE-615: include Karpenter node vCPUs in billing metric #8265
  • This just replaces the Status().Update() calls with calls to Patch calls that use client.MergeFromWithOptions() along with client.MergeFromWithOptimisticLock{} so that the full object updates don't wipe other controllers' status update
  • Testing this explicitly is unpleasant without some kind of refactor, the parts controllers that write the status aren't set up for easy testing, but it's observably possible the issue can happen because of the use of Update() calls across multiple controllers on the same object

Which issue(s) this PR fixes:

Fixes OCPBUGS-86306

Special notes for your reviewer:

The contentious bit here is that:

  • I'm building an RFC 6902 JSON Patch for the HCP status controller's Update() replacement because
  • we have fields in the status (AvailableUpdates, CompletionTime) that are +required and +nullable so
  • our patches get rejected with other patch types (And SSA - the generated applyConfigurations apply "omitempty" to everything so they drop them too. sigh)

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Refactor

  • Switched multiple controllers to optimistic-locking status patches for status writes, improving concurrency handling and updating success/error log messages.

  • Tests

  • Added tests validating JSON Patch generation and application: correct ops for adds/replaces/removes, optimistic-lock test, preservation of unrelated status fields and nil values, and condition array handling.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/testing Indicates the PR includes changes for e2e testing jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants