OCPBUGS-86306: Stop controllers fighting over HCP status by jkyros · Pull Request #8562 · openshift/hypershift

jkyros · 2026-05-20T23:31:33Z

What this PR does / why we need it:

We have multiple controllers writing to the HCP's status object with Update() calls, which replaces the entire object
Depending on what the controller performing the Update() understands and sees, and when it performs the update, that potentially blows away fields like vcpus from the autonode status that we added here: AUTOSCALE-615: include Karpenter node vCPUs in billing metric #8265
This just replaces the Status().Update() calls with calls to Patch calls that use client.MergeFromWithOptions() along with client.MergeFromWithOptimisticLock{} so that the full object updates don't wipe other controllers' status update
Testing this explicitly is unpleasant without some kind of refactor, the parts controllers that write the status aren't set up for easy testing, but it's observably possible the issue can happen because of the use of Update() calls across multiple controllers on the same object

Which issue(s) this PR fixes:

Fixes OCPBUGS-86306

Special notes for your reviewer:

The contentious bit here is that:

I'm building an RFC 6902 JSON Patch for the HCP status controller's Update() replacement because
we have fields in the status (AvailableUpdates, CompletionTime) that are +required and +nullable so
our patches get rejected with other patch types (And SSA - the generated applyConfigurations apply "omitempty" to everything so they drop them too. sigh)

Checklist:

Subject and description added to both, commit and PR.
Relevant issues have been referenced.
This change includes docs.
This change includes unit tests.

Summary by CodeRabbit

Refactor
- Switched multiple controllers to optimistic-locking status patches for status writes, improving concurrency handling and updating success/error log messages.
Tests
- Added tests validating JSON Patch generation and application: correct ops for adds/replaces/removes, optimistic-lock test, preservation of unrelated status fields and nil values, and condition array handling.

openshift-merge-bot · 2026-05-20T23:31:36Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci · 2026-05-20T23:31:37Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot · 2026-05-20T23:31:40Z

@jkyros: This pull request references Jira Issue OCPBUGS-86306, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

What this PR does / why we need it:

We have multiple controllers writing to the HCP's status object with Update() calls, which replaces the entire object

Depending on what the controller performing the Update() understands and sees, and when it performs the update, that potentially blows away fields like vcpus from the autonode status that we added here: AUTOSCALE-615: include Karpenter node vCPUs in billing metric #8265

This just replaces the Status().Update() calls with calls to Patch calls that use client.MergeFromWithOptions() along with client.MergeFromWithOptimisticLock{} so that the full object updates don't wipe other controllers' status update

Testing this explicitly is unpleasant without some kind of refactor, the parts controllers that write the status aren't set up for easy testing, but it's observably possible the issue can happen because of the use of Update() calls across multiple controllers on the same object

Which issue(s) this PR fixes:

Fixes OCPBUGS-86306

Special notes for your reviewer:

Checklist:

Subject and description added to both, commit and PR.

Relevant issues have been referenced.

This change includes docs.

This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-05-20T23:31:46Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

This pull request replaces direct HostedControlPlane status updates with optimistic-locking status patches across multiple controllers. The hcpstatus controller now generates RFC‑6902 JSON Patch ops (including a resourceVersion test) and issues Status().Patch. Other controllers (hostedcontrolplane, resources, etcdbackup, ignition-server) deep-copy original HCP objects and persist condition changes via Status().Patch with MergeFromWithOptimisticLock. New unit tests validate patch generation and backup-condition patch behavior.

Suggested reviewers

enxebre
csrwng

🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Test Structure And Quality	⚠️ Warning	Most assertions lack meaningful failure messages. TestBuildStatusPatch: 47/52 assertions (~90%) without messages. TestUpdateHCPBackupCondition: 10/12 assertions (~83%) without messages.	Add failure messages to assertions that lack them (e.g., Expect(err).ToNot(HaveOccurred(), "failed to update condition") instead of without message).

✅ Passed checks (10 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly addresses the main change: replacing Status().Update() with optimistic-locking Patch calls to prevent controllers from overwriting each other's HCP status fields.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All test names in PR are stable and deterministic: no pod names, timestamps, UUIDs, or dynamic information; uses standard Go testing (not Ginkgo); all subtests use descriptive static strings.
Topology-Aware Scheduling Compatibility	✅ Passed	PR contains no scheduling constraints (affinity, nodeSelector, topology spread, PDB) affecting SNO, Two-Node, HyperShift topologies. Changes are internal controller status-writing refactoring only.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No new Ginkgo e2e tests were added; only standard Go unit tests with mocked Kubernetes clients. Tests contain no IPv4 assumptions or external connectivity requirements.
No-Weak-Crypto	✅ Passed	PR contains no weak cryptography usage, custom crypto implementations, or non-constant-time secret comparisons. Changes are purely Kubernetes controller status patching updates.
Container-Privileges	✅ Passed	PR contains only Go controller source code refactoring; no container privilege configurations, Kubernetes manifests, or Dockerfiles are modified.
No-Sensitive-Data-In-Logs	✅ Passed	No logging statements in this PR expose sensitive data. All new logs are generic operational messages or Kubernetes condition metadata without secrets, tokens, credentials, PII, or customer data.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci-robot · 2026-05-20T23:32:50Z

@jkyros: This pull request references Jira Issue OCPBUGS-86306, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

What this PR does / why we need it:

We have multiple controllers writing to the HCP's status object with Update() calls, which replaces the entire object

Depending on what the controller performing the Update() understands and sees, and when it performs the update, that potentially blows away fields like vcpus from the autonode status that we added here: AUTOSCALE-615: include Karpenter node vCPUs in billing metric #8265

This just replaces the Status().Update() calls with calls to Patch calls that use client.MergeFromWithOptions() along with client.MergeFromWithOptimisticLock{} so that the full object updates don't wipe other controllers' status update

Testing this explicitly is unpleasant without some kind of refactor, the parts controllers that write the status aren't set up for easy testing, but it's observably possible the issue can happen because of the use of Update() calls across multiple controllers on the same object

Which issue(s) this PR fixes:

Fixes OCPBUGS-86306

Special notes for your reviewer:

Checklist:

Subject and description added to both, commit and PR.

Relevant issues have been referenced.

This change includes docs.

This change includes unit tests.

Summary by CodeRabbit

Refactor

Improved status condition handling across hosted cluster controllers for more reliable concurrent updates.

Tests

Added test coverage for status condition patching behavior to ensure accurate field updates without unintended side effects.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

ignition-server/controllers/local_ignitionprovider.go (1)
768-788: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wrap reconcileValidReleaseInfoCondition in retry logic for optimistic-lock conflicts.

This helper now uses Status().Patch(..., client.MergeFromWithOptimisticLock{}), which returns 409 conflicts during concurrent HCP status writes. However, the caller at lines 451-452 only logs the error and continues, leaving IgnitionServerValidReleaseInfo stale indefinitely—there's no guaranteed near-term retry.

Wrap the helper logic in retry.RetryOnConflict(retry.DefaultBackoff, func() { ... }) and re-fetch the HostedControlPlane inside the retry block. This pattern is already used in support/k8sutil/object.go and aligns with Kubernetes best practices.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ignition-server/controllers/local_ignitionprovider.go` around lines 768 -
788, Wrap the existing status-patch logic in
retry.RetryOnConflict(retry.DefaultBackoff, func() error { ... }) so
optimistic-lock 409s are retried: inside the retry block re-fetch the latest
HostedControlPlane using r.Client.Get into hostedControlPlane, create
originalHCP := hostedControlPlane.DeepCopy(), compute/set the
IgnitionServerValidReleaseInfo condition (use
releaseImageProvider.GetMissingImages() and meta.SetStatusCondition as before),
then call r.Client.Status().Patch(ctx, &hostedControlPlane,
client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{}))
and return the patch error; let RetryOnConflict surface a final error if retries
fail.

🧹 Nitpick comments (3)

control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go (2)

19-160: ⚡ Quick win

Add parallelization to align with unit-test execution guidance.

This test suite does not use t.Parallel(), so it misses the repository’s unit-test parallelism requirement.

Suggested change

 func TestMergeFromPatchDoesNotStompUnrelatedFields(t *testing.T) {
+	t.Parallel()
 	t.Run("When hcpstatus controller patches conditions and versionStatus, it should not include autoNode in the patch", func(t *testing.T) {
+		t.Parallel()
 		g := NewWithT(t)
@@
 	t.Run("When hcpstatus controller makes no changes, it should produce an empty patch", func(t *testing.T) {
+		t.Parallel()
 		g := NewWithT(t)
@@
 	t.Run("When a condition-only controller patches, the conditions array should contain ALL conditions from the read", func(t *testing.T) {
+		t.Parallel()
 		g := NewWithT(t)

As per coding guidelines, "Use race detection and parallel execution for unit tests".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go`
around lines 19 - 160, The test suite
TestMergeFromPatchDoesNotStompUnrelatedFields is missing parallelization; add
t.Parallel() at the start of TestMergeFromPatchDoesNotStompUnrelatedFields and
inside each subtest closure (the three t.Run anonymous funcs) so each subtest
and the parent test run with testing parallelism and race detection enabled;
look for the TestMergeFromPatchDoesNotStompUnrelatedFields function and the
three t.Run blocks to insert the t.Parallel() calls.

151-152: ⚡ Quick win

Avoid panic-prone type assertions in tests.

These direct assertions can panic and hide the real failure context. Prefer checked assertions so failures stay diagnostic.

Suggested change

-		status := patchMap["status"].(map[string]interface{})
-		conditions := status["conditions"].([]interface{})
+		status, ok := patchMap["status"].(map[string]interface{})
+		g.Expect(ok).To(BeTrue(), "patch should contain a status object")
+		conditions, ok := status["conditions"].([]interface{})
+		g.Expect(ok).To(BeTrue(), "status.conditions should be a JSON array")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go`
around lines 151 - 152, Replace the unchecked type assertions on
patchMap["status"] and status["conditions"] with guarded checks so the test
fails with a clear message instead of panicking: retrieve the value using the
comma-ok form (e.g., statusIface, ok := patchMap["status"]; if !ok {
t.Fatalf(...) } then status, ok := statusIface.(map[string]interface{}); if !ok
{ t.Fatalf(...) }) and likewise check conditionsIface, ok :=
status["conditions"]; if !ok { t.Fatalf(...) } then conditions, ok :=
conditionsIface.([]interface{}); if !ok { t.Fatalf(...) }); alternatively use
your test helper (e.g., require.IsType/require.NotNil) around the same symbols
(patchMap, status, conditions) to make assertions explicit and diagnostic.

control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go (1)

570-576: ⚡ Quick win

Skip no-op recovery status patches.

reconcileClusterRecovery always patches after meta.SetStatusCondition, even when the condition is unchanged. Because this path polls every 120 seconds, steady-state recovery checks become repeated write attempts and avoidable optimistic-lock conflicts on the same HCP status object. Guard on the boolean returned by meta.SetStatusCondition to skip writing unchanged conditions.

♻️ Suggested change

 	originalHCP := hcp.DeepCopy()
-	meta.SetStatusCondition(&hcp.Status.Conditions, *condition)
+	if !meta.SetStatusCondition(&hcp.Status.Conditions, *condition) {
+		if !finished {
+			return ctrl.Result{RequeueAfter: 120 * time.Second}, nil
+		}
+		return ctrl.Result{}, nil
+	}
 	log.Info("setting condition", "type", condition.Type, "status", condition.Status, "message", condition.Message)
 	if err := r.cpClient.Status().Patch(ctx, hcp, client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{})); err != nil {
 		return ctrl.Result{}, fmt.Errorf("failed to patch status on hcp for hosted cluster recovery: %w. Condition error message: %v", err, condition.Message)
 	}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`
around lines 570 - 576, reconcileClusterRecovery is always patching HCP status
even when meta.SetStatusCondition didn't change the condition; capture the
boolean return from meta.SetStatusCondition and only call
r.cpClient.Status().Patch (using originalHCP and the same
MergeFromWithOptions/OptimisticLock) if that boolean is true to skip no-op
writes and avoid redundant optimistic-lock updates.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go`:
- Around line 79-80: The patch call on h.mgtClusterClient.Status().Patch (using
crclient.MergeFromWithOptions and MergeFromWithOptimisticLock) must treat
optimistic-lock conflicts differently: after the Patch returns an error, detect
conflicts with apierrors.IsConflict(err) and, if true, return reconcile.Result{}
with a nil error so the controller runtime can requeue with backoff; for
non-conflict errors continue returning the wrapped fmt.Errorf("failed to patch
hcp status: %w", err). Locate the error handling directly after the Patch
invocation in the hcp status reconciliation code and insert the
apierrors.IsConflict check there.

In `@hypershift-operator/controllers/etcdbackup/reconciler.go`:
- Around line 302-305: The patch that sets meta.SetStatusCondition and calls
r.Status().Patch(hcp, client.MergeFromWithOptions(originalHCP,
client.MergeFromWithOptimisticLock{})) should be wrapped in
retry.RetryOnConflict so optimistic-lock 409s are retried: on each retry re-get
the latest HCP (fresh read into a new variable), reapply
condition.ObservedGeneration = hcp.Generation and
meta.SetStatusCondition(&hcp.Status.Conditions, condition) against the fresh
HCP, then call r.Status().Patch with client.MergeFromWithOptions(originalHCP,
client.MergeFromWithOptimisticLock{}) (or use a new merge-from base as
appropriate) until success or non-conflict error; ensure errors are returned if
retries exhaust.

---

Outside diff comments:
In `@ignition-server/controllers/local_ignitionprovider.go`:
- Around line 768-788: Wrap the existing status-patch logic in
retry.RetryOnConflict(retry.DefaultBackoff, func() error { ... }) so
optimistic-lock 409s are retried: inside the retry block re-fetch the latest
HostedControlPlane using r.Client.Get into hostedControlPlane, create
originalHCP := hostedControlPlane.DeepCopy(), compute/set the
IgnitionServerValidReleaseInfo condition (use
releaseImageProvider.GetMissingImages() and meta.SetStatusCondition as before),
then call r.Client.Status().Patch(ctx, &hostedControlPlane,
client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{}))
and return the patch error; let RetryOnConflict surface a final error if retries
fail.

---

Nitpick comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go`:
- Around line 19-160: The test suite
TestMergeFromPatchDoesNotStompUnrelatedFields is missing parallelization; add
t.Parallel() at the start of TestMergeFromPatchDoesNotStompUnrelatedFields and
inside each subtest closure (the three t.Run anonymous funcs) so each subtest
and the parent test run with testing parallelism and race detection enabled;
look for the TestMergeFromPatchDoesNotStompUnrelatedFields function and the
three t.Run blocks to insert the t.Parallel() calls.
- Around line 151-152: Replace the unchecked type assertions on
patchMap["status"] and status["conditions"] with guarded checks so the test
fails with a clear message instead of panicking: retrieve the value using the
comma-ok form (e.g., statusIface, ok := patchMap["status"]; if !ok {
t.Fatalf(...) } then status, ok := statusIface.(map[string]interface{}); if !ok
{ t.Fatalf(...) }) and likewise check conditionsIface, ok :=
status["conditions"]; if !ok { t.Fatalf(...) } then conditions, ok :=
conditionsIface.([]interface{}); if !ok { t.Fatalf(...) }); alternatively use
your test helper (e.g., require.IsType/require.NotNil) around the same symbols
(patchMap, status, conditions) to make assertions explicit and diagnostic.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`:
- Around line 570-576: reconcileClusterRecovery is always patching HCP status
even when meta.SetStatusCondition didn't change the condition; capture the
boolean return from meta.SetStatusCondition and only call
r.cpClient.Status().Patch (using originalHCP and the same
MergeFromWithOptions/OptimisticLock) if that boolean is true to skip no-op
writes and avoid redundant optimistic-lock updates.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: d0b2014e-1faa-4d10-a400-26d60e220b42

📥 Commits

Reviewing files that changed from the base of the PR and between c23dbb8 and 58ac9ee.

📒 Files selected for processing (6)

control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go
control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
hypershift-operator/controllers/etcdbackup/reconciler.go
ignition-server/controllers/local_ignitionprovider.go

coderabbitai · 2026-05-20T23:38:11Z

+		if err := h.mgtClusterClient.Status().Patch(ctx, hcp, crclient.MergeFromWithOptions(originalHCP, crclient.MergeFromWithOptimisticLock{})); err != nil {
+			return reconcile.Result{}, fmt.Errorf("failed to patch hcp status: %w", err)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the file at the specified lines cat -n control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go | sed -n '70,90p'

Repository: openshift/hypershift

Length of output: 1230

🏁 Script executed:

# Let's also see more context around the patch operation cat -n control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go | sed -n '60,100p'

Repository: openshift/hypershift

Length of output: 2002

🏁 Script executed:

# Check if there are other patch operations in the file to understand the pattern rg -n "Status\(\)\.Patch|MergeFromWithOptimisticLock" control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go -A 2 -B 2

Repository: openshift/hypershift

Length of output: 450

🏁 Script executed:

# Search for how conflicts are handled elsewhere in the codebase rg -n "409|Conflict|optimistic" control-plane-operator/ --type go -i | head -20

Repository: openshift/hypershift

Length of output: 4005

🏁 Script executed:

# Check the controller-runtime and kubernetes patterns for conflict handling rg -n "MergeFromWithOptimisticLock" . --type go | head -10

Repository: openshift/hypershift

Length of output: 1802

🏁 Script executed:

# Check the vendored controller-runtime to understand conflict handling patterns cat vendor/sigs.k8s.io/controller-runtime/pkg/client/patch.go | sed -n '50,80p'

Repository: openshift/hypershift

Length of output: 1445

🏁 Script executed:

# Search for error handling patterns around optimistic locks rg -n "apiutil|IsConflict|StatusError" control-plane-operator/ --type go | head -15

Repository: openshift/hypershift

Length of output: 1952

🏁 Script executed:

# Check if there are any examples of conflict detection in the codebase rg -n "apierrors.IsConflict|StatusError" . --type go | head -20

Repository: openshift/hypershift

Length of output: 3023

🌐 Web query:

Kubernetes controller-runtime conflict 409 error handling optimistic locking best practices

💡 Result:

Handling 409 Conflict errors in Kubernetes is an essential aspect of managing optimistic concurrency [1][2]. When multiple clients (e.g., controllers, users, HPA) modify the same resource, the Kubernetes API server uses the resourceVersion field to ensure updates are based on the latest state [1][3]. If a conflict occurs, the API server rejects the update with a 409 Conflict status [3][4]. Best practices for handling these conflicts include: 1. Let the Controller Reconcile (Recommended for most cases) In standard controller-runtime development, the most idiomatic approach is to return the error from your Reconcile function [5][3]. This allows the controller-runtime's built-in work queue to catch the error, log it, and automatically re-enqueue the request with exponential backoff [6][7][8]. This avoids manual retry loops inside your reconciliation logic and keeps the controller simple and idempotent [9][10]. 2. Use RetryOnConflict (For specific update operations) If you are performing a complex update operation within a reconciliation loop that must succeed before continuing, you can use the k8s.io/client-go/util/retry helper [5][11][12]. This helper manages the re-fetching of the object to get the latest resourceVersion before attempting the update again [1][12][13]. Example: err:= retry.RetryOnConflict(retry.DefaultRetry, func error { // 1. Re-fetch the latest version of the object if err:= r.Get(ctx, req.NamespacedName, latestObj); err!= nil { return err } // 2. Apply modifications to the fresh object latestObj.Status.Phase = "Running" // 3. Attempt update return r.Status.Update(ctx, latestObj) }) 3. Consider Server-Side Apply For complex objects with many fields modified by multiple controllers, Server-Side Apply (SSA) is the preferred alternative to traditional Update calls [5]. SSA allows for field-level ownership, significantly reducing the frequency of 409 conflicts compared to full object updates [5][12]. Key Principles: - Avoid manual, infinite retry loops. - Do not manually manipulate or suppress the resourceVersion field to bypass concurrency checks [2]. - Separate spec updates from status updates; status subresources help minimize conflicts [2]. - Leverage the framework: Return errors to the reconciler to benefit from built-in rate-limiting and backoff [6][8].

Citations:

1: https://medium.com/@pankajhasija2109/mastering-kubernetes-api-concurrency-7cc3083ffbb9

2: https://oneuptime.com/blog/post/2026-02-09-optimistic-concurrency-resourceversion/view

3: https://github.com/gardener/gardener/blob/master/docs/development/kubernetes-clients.md

4: https://medium.com/@timebertt/kubernetes-controllers-at-scale-clients-caches-conflicts-patches-explained-aa0f7a8b4332

5: How to elegantly solve the update conflict problem kubernetes-sigs/controller-runtime#1748

6: https://dev.to/jamesli/client-go-deep-dive-workqueue-the-reliable-task-queue-for-kubernetes-controllers-3pjc

7: https://oneuptime.com/blog/post/2026-02-09-reconciliation-loops-exponential-backoff/view

8: https://stuartleeks.com/posts/error-back-off-with-controller-runtime/

9: Docs suggestion: Optimistic Concurrency kubernetes-sigs/kubebuilder#617

10: https://book.kubebuilder.io/cronjob-tutorial/controller-implementation.html

11: https://oneuptime.com/blog/post/2026-02-09-retry-logic-api-conflicts-controllers/view

12: https://platformwale.blog/2026/02/24/kubernetes-operators-a-deep-dive-into-the-internals/

13: https://medium.com/@vamshitejanizam/building-resilient-kubernetes-controllers-a-practical-guide-to-retry-mechanisms-0d689160fa51

🏁 Script executed:

# Check for controller-runtime configuration or custom backoff settings rg -n "DefaultRetry|ExponentialBackoff|SetupWithManager" control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go -A 3

Repository: openshift/hypershift

Length of output: 46

🏁 Script executed:

# Check if there's any custom error handling wrapper in the codebase rg -n "apierrors.IsConflict" . --type go -B 3 -A 3 | head -40

Repository: openshift/hypershift

Length of output: 3994

Handle optimistic-lock conflicts distinctly from other patch failures.

The current code treats 409 conflicts from optimistic locking the same as other patch errors, which can create misleading error metrics and logs. Conflicts are expected under contention and should be handled separately. Detect conflicts using apierrors.IsConflict(err) and return nil error to allow the framework's work queue to naturally requeue with exponential backoff, keeping other patch errors as genuine failures. See reference implementation in vendor/sigs.k8s.io/cluster-api/util/patch/patch.go for this pattern.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go` around lines 79 - 80, The patch call on h.mgtClusterClient.Status().Patch (using crclient.MergeFromWithOptions and MergeFromWithOptimisticLock) must treat optimistic-lock conflicts differently: after the Patch returns an error, detect conflicts with apierrors.IsConflict(err) and, if true, return reconcile.Result{} with a nil error so the controller runtime can requeue with backoff; for non-conflict errors continue returning the wrapped fmt.Errorf("failed to patch hcp status: %w", err). Locate the error handling directly after the Patch invocation in the hcp status reconciliation code and insert the apierrors.IsConflict check there.

coderabbitai · 2026-05-20T23:38:11Z

+	originalHCP := hcp.DeepCopy()
 	condition.ObservedGeneration = hcp.Generation
 	meta.SetStatusCondition(&hcp.Status.Conditions, condition)
-	return r.Status().Update(ctx, hcp)
+	return r.Status().Patch(ctx, hcp, client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{}))


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cd hypershift-operator/controllers/etcdbackup && wc -l reconciler.go

Repository: openshift/hypershift

Length of output: 82

🏁 Script executed:

# First, let's look at the updateHCPBackupCondition function cd hypershift-operator/controllers/etcdbackup && sed -n '295,310p' reconciler.go

Repository: openshift/hypershift

Length of output: 968

🏁 Script executed:

# Now examine the callers at lines 193, 452, and 484 cd hypershift-operator/controllers/etcdbackup && sed -n '188,198p' reconciler.go

Repository: openshift/hypershift

Length of output: 396

🏁 Script executed:

cd hypershift-operator/controllers/etcdbackup && sed -n '447,457p' reconciler.go

Repository: openshift/hypershift

Length of output: 391

🏁 Script executed:

cd hypershift-operator/controllers/etcdbackup && sed -n '479,489p' reconciler.go

Repository: openshift/hypershift

Length of output: 393

🏁 Script executed:

# Check imports to verify retry is available cd hypershift-operator/controllers/etcdbackup && head -30 reconciler.go

Repository: openshift/hypershift

Length of output: 950

Retry optimistic-lock conflicts instead of silently dropping the HCP condition update.

MergeFromWithOptimisticLock{} surfaces 409 Conflict errors when the HCP is concurrently modified, but all callers at lines 193, 452, and 484 only log the error and continue. This means conflicts can permanently stale EtcdBackupSucceeded conditions on all three terminal paths (in-progress, success, failure).

The adjacent function updateHostedClusterBackupURL already handles this correctly with retry.RetryOnConflict, reflecting the same concurrency concern. Wrap this function's patch in retry.RetryOnConflict with a fresh HCP read to safely retry on conflicts.

Proposed fix

func (r *HCPEtcdBackupReconciler) updateHCPBackupCondition(ctx context.Context, hcp *hyperv1.HostedControlPlane, condition metav1.Condition) error { - originalHCP := hcp.DeepCopy() - condition.ObservedGeneration = hcp.Generation - meta.SetStatusCondition(&hcp.Status.Conditions, condition) - return r.Status().Patch(ctx, hcp, client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{})) + return retry.RetryOnConflict(retry.DefaultBackoff, func() error { + current := &hyperv1.HostedControlPlane{} + if err := r.Get(ctx, client.ObjectKeyFromObject(hcp), current); err != nil { + return err + } + + originalHCP := current.DeepCopy() + condition.ObservedGeneration = current.Generation + meta.SetStatusCondition(&current.Status.Conditions, condition) + return r.Status().Patch(ctx, current, client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{})) + }) }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/etcdbackup/reconciler.go` around lines 302 - 305, The patch that sets meta.SetStatusCondition and calls r.Status().Patch(hcp, client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{})) should be wrapped in retry.RetryOnConflict so optimistic-lock 409s are retried: on each retry re-get the latest HCP (fresh read into a new variable), reapply condition.ObservedGeneration = hcp.Generation and meta.SetStatusCondition(&hcp.Status.Conditions, condition) against the fresh HCP, then call r.Status().Patch with client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{}) (or use a new merge-from base as appropriate) until success or non-conflict error; ensure errors are returned if retries exhaust.

codecov · 2026-05-20T23:41:49Z

Codecov Report

❌ Patch coverage is 50.79365% with 31 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.69%. Comparing base (adfbcdd) to head (fb2ea1c).
⚠️ Report is 43 commits behind head on main.

Files with missing lines	Patch %	Lines
...rconfigoperator/controllers/hcpstatus/hcpstatus.go	52.38%	13 Missing and 7 partials ⚠️
...ostedcontrolplane/hostedcontrolplane_controller.go	22.22%	5 Missing and 2 partials ⚠️
...rconfigoperator/controllers/resources/resources.go	50.00%	2 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8562      +/-   ##
==========================================
- Coverage   45.84%   40.69%   -5.16%     
==========================================
  Files         440      755     +315     
  Lines       52824    93412   +40588     
==========================================
+ Hits        24218    38010   +13792     
- Misses      26816    52662   +25846     
- Partials     1790     2740     +950

Files with missing lines	Coverage Δ
...hift-operator/controllers/etcdbackup/reconciler.go	`76.81% <100.00%> (+0.03%)`	⬆️
...ition-server/controllers/local_ignitionprovider.go	`38.14% <100.00%> (ø)`
...rconfigoperator/controllers/resources/resources.go	`56.62% <50.00%> (+0.01%)`	⬆️
...ostedcontrolplane/hostedcontrolplane_controller.go	`45.05% <22.22%> (+0.02%)`	⬆️
...rconfigoperator/controllers/hcpstatus/hcpstatus.go	`67.87% <52.38%> (-4.57%)`	⬇️

... and 316 files with indirect coverage changes

Flag	Coverage Δ
cmd-support	`34.70% <ø> (?)`
cpo-hostedcontrolplane	`41.81% <22.22%> (+<0.01%)`	⬆️
cpo-other	`41.43% <52.00%> (+0.03%)`	⬆️
hypershift-operator	`50.81% <100.00%> (-0.01%)`	⬇️
other	`31.62% <100.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

ignition-server/controllers/local_ignitionprovider.go (1)
767-788: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use retry.RetryOnConflict to prevent silent condition update loss on 409 conflicts.

This function uses optimistic locking but doesn't retry on conflicts. Since runMCO at lines 450-452 only logs the error, any 409 conflict will silently drop the IgnitionServerValidReleaseInfo condition update. Wrap the patch in retry.RetryOnConflict(retry.DefaultBackoff, func() error {...}) with a fresh HCP read inside the closure, following the pattern used in support/k8sutil.UpdateObject and elsewhere in the codebase.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ignition-server/controllers/local_ignitionprovider.go` around lines 767 -
788, The patch of hostedControlPlane’s status can silently fail on 409
conflicts; wrap the r.Client.Status().Patch call in
retry.RetryOnConflict(retry.DefaultBackoff, func() error { ... }) and inside the
closure re-read the current HostedControlPlane (hostedControlPlane) from the API
to compute/merge the IgnitionServerValidReleaseInfo condition (using
releaseImageProvider.GetMissingImages() as before), then call
r.Client.Status().Patch (or reuse support/k8sutil.UpdateObject pattern)
returning any error so conflicts are retried; keep originalHCP only for the
MergeFrom base inside the closure.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go`:
- Around line 20-161: Add parallelization by calling t.Parallel() at the start
of the top-level TestMergeFromPatchDoesNotStompUnrelatedFields function and
inside each subtest's closure (the anonymous func passed to t.Run).
Specifically, insert t.Parallel() as the first statement in
TestMergeFromPatchDoesNotStompUnrelatedFields and as the first statement in each
subtest function (the func(t *testing.T) for the three t.Run blocks) so the test
and its subtests run with Go's t.Parallel.

---

Outside diff comments:
In `@ignition-server/controllers/local_ignitionprovider.go`:
- Around line 767-788: The patch of hostedControlPlane’s status can silently
fail on 409 conflicts; wrap the r.Client.Status().Patch call in
retry.RetryOnConflict(retry.DefaultBackoff, func() error { ... }) and inside the
closure re-read the current HostedControlPlane (hostedControlPlane) from the API
to compute/merge the IgnitionServerValidReleaseInfo condition (using
releaseImageProvider.GetMissingImages() as before), then call
r.Client.Status().Patch (or reuse support/k8sutil.UpdateObject pattern)
returning any error so conflicts are retried; keep originalHCP only for the
MergeFrom base inside the closure.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 6b55879f-4645-4cc2-9269-fc6811eee015

📥 Commits

Reviewing files that changed from the base of the PR and between 58ac9ee and 10d5332.

📒 Files selected for processing (6)

control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go
control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
hypershift-operator/controllers/etcdbackup/reconciler.go
ignition-server/controllers/local_ignitionprovider.go

jkyros · 2026-05-21T00:18:53Z

/test e2e-aws-autonode

jkyros · 2026-05-21T17:18:48Z

/test e2e-aws-autonode

jkyros · 2026-05-21T20:17:52Z

Ugh that passed. Great. There's something in here with Patch and fields that are +required +nullable that I think breaks it if we don't mess with it. The grotesque way I solved it with the JSON patch is not ideal. Trying to do better.

coderabbitai · 2026-05-28T21:27:02Z

Actionable comments posted: 0

jkyros · 2026-05-28T21:36:12Z

That totally passed locally, I've never seen that flake but...cache checksum issues?
/test lint
/test verify

jkyros · 2026-05-30T02:05:01Z

/test images

jkyros · 2026-05-30T02:08:07Z

/test e2e-aws-autonode

jkyros · 2026-05-30T04:24:18Z

/pipeline required

openshift-merge-bot · 2026-05-30T04:24:21Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

jkyros · 2026-05-31T21:42:05Z

Cleanup/infra failures
/retest-required

jkyros · 2026-05-31T22:18:30Z

Infra is still broken, now we're having trouble retrieving image streams 😢

TwoDCube · 2026-06-01T11:45:53Z

Infra supposedly fixed by ci-tools#5220
/retest-required

muraee · 2026-06-01T14:33:43Z

/lgtm

openshift-merge-bot · 2026-06-01T14:34:39Z

Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage.

jkyros · 2026-06-01T16:36:11Z

/test e2e-aws-4-22

jkyros · 2026-06-01T21:03:56Z

/verified by e2e-aws-autonode

=== RUN   TestKarpenter/Main/Billing_vCPUs,_consolidation,_and_cluster_deletion_with_blocking_PDB
    karpenter_test.go:1438: Validating AutoNode.VCPUs converges to 0
    karpenter_test.go:1438: Successfully waited for HostedCluster e2e-clusters-fkv2n/karpenter-p52jl AutoNode.VCPUs=0 in 0s
    karpenter_test.go:1439: Asserting AutoNode.VCPUs remains stable at 0 for 30s
    karpenter_test.go:1439: AutoNode.VCPUs remained stable at 0 for 30s
    karpenter_test.go:1443: Baseline billing metric vCPUs from native NodePools: 4
    karpenter_test.go:1454: Created Karpenter NodePool
    karpenter_test.go:1456: Created workloads with 2 replicas
    util.go:578: Successfully waited for 2 nodes to become ready in 5m21s
    karpenter_test.go:1459: Both nodes ready, validating billing vCPUs
    karpenter_test.go:1462: Validating AutoNode.VCPUs converges to 8
    karpenter_test.go:1462: Successfully waited for HostedCluster e2e-clusters-fkv2n/karpenter-p52jl AutoNode.VCPUs=8 in 25ms
    karpenter_test.go:1463: Asserting AutoNode.VCPUs remains stable at 8 for 30s
    karpenter_test.go:1463: AutoNode.VCPUs remained stable at 8 for 30s
    karpenter_test.go:1464: Validating hypershift_cluster_vcpus converges to 12
    karpenter_test.go:1695: hypershift_cluster_vcpus=12 for cluster e2e-clusters-fkv2n/karpenter-p52jl
    karpenter_test.go:1466: Scaling workload to 1 replica to verify deprovisioning and consolidation
    util.go:578: Successfully waited for 1 nodes to become ready in 1m48s
    karpenter_test.go:1473: Karpenter consolidated the extra node
    karpenter_test.go:1476: Validating AutoNode.VCPUs converges to 4
    karpenter_test.go:1476: Successfully waited for HostedCluster e2e-clusters-fkv2n/karpenter-p52jl AutoNode.VCPUs=4 in 25ms
    karpenter_test.go:1477: Asserting AutoNode.VCPUs remains stable at 4 for 30s
    karpenter_test.go:1477: AutoNode.VCPUs remained stable at 4 for 30s
    karpenter_test.go:1478: Validating hypershift_cluster_vcpus converges to 8
    karpenter_test.go:1695: hypershift_cluster_vcpus=8 for cluster e2e-clusters-fkv2n/karpenter-p52jl
    karpenter_test.go:1500: Created cluster-deletion-blocking PodDisruptionBudget

openshift-ci-robot · 2026-06-01T21:04:08Z

@jkyros: This PR has been marked as verified by e2e-aws-autonode.

Details

In response to this:

/verified by e2e-aws-autonode

=== RUN   TestKarpenter/Main/Billing_vCPUs,_consolidation,_and_cluster_deletion_with_blocking_PDB
   karpenter_test.go:1438: Validating AutoNode.VCPUs converges to 0
   karpenter_test.go:1438: Successfully waited for HostedCluster e2e-clusters-fkv2n/karpenter-p52jl AutoNode.VCPUs=0 in 0s
   karpenter_test.go:1439: Asserting AutoNode.VCPUs remains stable at 0 for 30s
   karpenter_test.go:1439: AutoNode.VCPUs remained stable at 0 for 30s
   karpenter_test.go:1443: Baseline billing metric vCPUs from native NodePools: 4
   karpenter_test.go:1454: Created Karpenter NodePool
   karpenter_test.go:1456: Created workloads with 2 replicas
   util.go:578: Successfully waited for 2 nodes to become ready in 5m21s
   karpenter_test.go:1459: Both nodes ready, validating billing vCPUs
   karpenter_test.go:1462: Validating AutoNode.VCPUs converges to 8
   karpenter_test.go:1462: Successfully waited for HostedCluster e2e-clusters-fkv2n/karpenter-p52jl AutoNode.VCPUs=8 in 25ms
   karpenter_test.go:1463: Asserting AutoNode.VCPUs remains stable at 8 for 30s
   karpenter_test.go:1463: AutoNode.VCPUs remained stable at 8 for 30s
   karpenter_test.go:1464: Validating hypershift_cluster_vcpus converges to 12
   karpenter_test.go:1695: hypershift_cluster_vcpus=12 for cluster e2e-clusters-fkv2n/karpenter-p52jl
   karpenter_test.go:1466: Scaling workload to 1 replica to verify deprovisioning and consolidation
   util.go:578: Successfully waited for 1 nodes to become ready in 1m48s
   karpenter_test.go:1473: Karpenter consolidated the extra node
   karpenter_test.go:1476: Validating AutoNode.VCPUs converges to 4
   karpenter_test.go:1476: Successfully waited for HostedCluster e2e-clusters-fkv2n/karpenter-p52jl AutoNode.VCPUs=4 in 25ms
   karpenter_test.go:1477: Asserting AutoNode.VCPUs remains stable at 4 for 30s
   karpenter_test.go:1477: AutoNode.VCPUs remained stable at 4 for 30s
   karpenter_test.go:1478: Validating hypershift_cluster_vcpus converges to 8
   karpenter_test.go:1695: hypershift_cluster_vcpus=8 for cluster e2e-clusters-fkv2n/karpenter-p52jl
   karpenter_test.go:1500: Created cluster-deletion-blocking PodDisruptionBudget

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

bryan-cox · 2026-06-02T01:33:57Z

/retest

jkyros · 2026-06-02T02:24:27Z

codecov/project is blocking me I guess?
/retest

bryan-cox · 2026-06-02T12:24:00Z

/override codecov/project

We know this is currently broken

openshift-ci · 2026-06-02T12:26:46Z

@bryan-cox: Overrode contexts on behalf of bryan-cox: codecov/project

Details

In response to this:

/override codecov/project

We know this is currently broken

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

hypershift-jira-solve-ci · 2026-06-02T12:46:20Z

This confirms everything. Here is the final report:

Test Failure Analysis Complete

Job Information

Prow Job: codecov/project
Build ID: Check Run 79072867811
PR: #8562 — OCPBUGS-86306: Stop controllers fighting over HCP status
Base Commit: adfbcdd (43 commits behind main)
Head Commit: fb2ea1c

Test Failure Analysis

Error

codecov/project: 40.69% (-5.16%) compared to adfbcdd
Project coverage decreased from 45.84% (base) to 40.69% (head).
Patch coverage: 50.79% with 31 lines missing coverage.

Summary

This is not a real test or coverage regression caused by PR #8562. The codecov/project check failed because Codecov compared the PR head against a stale base commit (adfbcdd, 43 commits behind main). The base coverage report measured 440 files / 52,824 lines at 45.84%, while the head coverage report measured 755 files / 93,412 lines at 40.69% — a delta of +315 files and +40,588 lines entirely unrelated to this PR. The codecov/patch check (which measures only the PR's changed lines) passed, and the check was already overridden by a maintainer (/override codecov/project) confirming this is a known broken state.

Root Cause

The failure is caused by a stale base comparison combined with no configured coverage threshold tolerance in the repository's Codecov configuration.

Mechanism:

Stale base report: The base commit adfbcdd is 43 commits behind the current main. Its coverage report was generated against a different, smaller file set — 440 files covering 52,824 lines at 45.84%.
Expanded file set on PR head: The coverage report for the PR's head commit fb2ea1c includes 755 files covering 93,412 lines. The +315 additional files (+40,588 lines) represent code added to main in intervening commits or included via a different build configuration. Codecov itself flags this: "Report is 43 commits behind head on main" and lists "316 files with indirect coverage changes".
No threshold tolerance configured: Codecov's default behavior uses target: auto (coverage must not decrease) with threshold: 0% (zero tolerance). The apparent -5.16% drop (an artifact of comparing mismatched file sets) exceeds this zero-tolerance threshold.
PR's actual impact is negligible: The codecov/patch check passed with 50.79% patch coverage. The PR only modifies 8 files. A maintainer (bryan-cox) confirmed this is a known broken state and overrode the check with /override codecov/project.

Recommendations

No action needed on this PR — the failure is not caused by PR OCPBUGS-86306: Stop controllers fighting over HCP status #8562. The codecov/patch check passes, and the codecov/project check has already been overridden by a maintainer.
Rebase onto current main — this will update the base commit for Codecov's comparison, resolving the stale file-set mismatch and likely making codecov/project pass on future runs.

To prevent recurrence repo-wide, configure a threshold tolerance in codecov.yml:

coverage:
  status:
    project:
      default:
        threshold: 1%  # Allow up to 1% decrease

Refresh Codecov's base report — the 43-commit gap suggests Codecov hasn't processed recent main commits. Merging any PR to main with CI coverage upload will update the base.

Evidence

Evidence	Detail
Failure type	`codecov/project` — project-wide coverage gate, not a test execution failure
codecov/patch result	✅ Passed (50.79% patch coverage) — PR's own code is adequately covered
codecov/project result	❌ Failed: 40.69% (-5.16%) vs base `adfbcdd`
Stale base	Codecov reports: "Report is 43 commits behind head on main"
File set mismatch	Base: 440 files / 52,824 lines → Head: 755 files / 93,412 lines (+315 files, +40,588 lines)
PR scope	8 files changed, 3 commits — cannot account for +315 file delta
Maintainer override	`bryan-cox` ran `/override codecov/project` with comment: "We know this is currently broken"
Override check run	ID `79070332375` — `codecov/project` shows `conclusion: success` via Prow override
Files with missing coverage	`hcpstatus.go` (52.38%), `hostedcontrolplane_controller.go` (22.22%), `resources.go` (50.00%) — partial coverage on new patch logic, not a regression

openshift-ci · 2026-06-02T12:48:50Z

@jkyros: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot · 2026-06-02T12:54:29Z

@jkyros: Jira Issue OCPBUGS-86306: Some pull requests linked via external trackers have merged:

The following pull request, linked via external tracker, has not merged:

openshift/hypershift#8540 is closed

All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-86306 has not been moved to the MODIFIED state.

This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload.

Details

In response to this:

What this PR does / why we need it:

We have multiple controllers writing to the HCP's status object with Update() calls, which replaces the entire object

Depending on what the controller performing the Update() understands and sees, and when it performs the update, that potentially blows away fields like vcpus from the autonode status that we added here: AUTOSCALE-615: include Karpenter node vCPUs in billing metric #8265

This just replaces the Status().Update() calls with calls to Patch calls that use client.MergeFromWithOptions() along with client.MergeFromWithOptimisticLock{} so that the full object updates don't wipe other controllers' status update

Testing this explicitly is unpleasant without some kind of refactor, the parts controllers that write the status aren't set up for easy testing, but it's observably possible the issue can happen because of the use of Update() calls across multiple controllers on the same object

Which issue(s) this PR fixes:

Fixes OCPBUGS-86306

Special notes for your reviewer:

The contentious bit here is that:

I'm building an RFC 6902 JSON Patch for the HCP status controller's Update() replacement because

we have fields in the status (AvailableUpdates, CompletionTime) that are +required and +nullable so

our patches get rejected with other patch types (And SSA - the generated applyConfigurations apply "omitempty" to everything so they drop them too. sigh)

Checklist:

Subject and description added to both, commit and PR.

Relevant issues have been referenced.

This change includes docs.

This change includes unit tests.

Summary by CodeRabbit

Refactor

Switched multiple controllers to optimistic-locking status patches for status writes, improving concurrency handling and updating success/error log messages.

Tests

Added tests validating JSON Patch generation and application: correct ops for adds/replaces/removes, optimistic-lock test, preservation of unrelated status fields and nil values, and condition array handling.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 20, 2026

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 20, 2026

openshift-ci Bot added the do-not-merge/needs-area label May 20, 2026

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

jkyros force-pushed the karpenter-fix-hcp-status-race-oplocks branch from 58ac9ee to 10d5332 Compare May 20, 2026 23:59

coderabbitai Bot reviewed May 21, 2026

View reviewed changes

Comment thread control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go Outdated

jkyros force-pushed the karpenter-fix-hcp-status-race-oplocks branch from 10d5332 to 9995cf5 Compare May 21, 2026 16:37

hypershift-jira-solve-ci Bot mentioned this pull request May 22, 2026

OCPBUGS-85151: Re-enable serviceaccount-pull-secrets controller when registry managementState changes from Removed #8522

Merged

4 tasks

openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 28, 2026

jkyros force-pushed the karpenter-fix-hcp-status-race-oplocks branch from 9995cf5 to 541e646 Compare May 28, 2026 21:23

openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 28, 2026

openshift-ci Bot added the area/testing Indicates the PR includes changes for e2e testing label May 30, 2026

jkyros marked this pull request as ready for review May 30, 2026 04:04

openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 30, 2026

openshift-ci Bot requested review from Nirshal and cblecker May 30, 2026 04:04

openshift-ci Bot assigned muraee Jun 1, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 1, 2026

hypershift-jira-solve-ci Bot mentioned this pull request Jun 1, 2026

CNTRLPLANE-3524: feat: add etcd data re-encryption after encryption key rotation #8219

Open

7 tasks

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 1, 2026

openshift-merge-bot Bot merged commit 1146767 into openshift:main Jun 2, 2026
47 of 58 checks passed

maxcao13 mentioned this pull request Jun 5, 2026

AUTOSCALE-681: update aws-karpenter-provider-aws configs openshift/release#80128

Merged

		if err := h.mgtClusterClient.Status().Patch(ctx, hcp, crclient.MergeFromWithOptions(originalHCP, crclient.MergeFromWithOptimisticLock{})); err != nil {
		return reconcile.Result{}, fmt.Errorf("failed to patch hcp status: %w", err)

Conversation

jkyros commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Checklist:

Summary by CodeRabbit

Uh oh!

openshift-merge-bot Bot commented May 20, 2026

Uh oh!

openshift-ci Bot commented May 20, 2026

Uh oh!

openshift-ci-robot commented May 20, 2026

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Checklist:

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

openshift-ci-robot commented May 20, 2026

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Checklist:

Summary by CodeRabbit

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jkyros commented May 21, 2026

Uh oh!

jkyros commented May 21, 2026

Uh oh!

jkyros commented May 21, 2026

Uh oh!

coderabbitai Bot commented May 28, 2026

Uh oh!

jkyros commented May 28, 2026

Uh oh!

jkyros commented May 30, 2026

Uh oh!

jkyros commented May 30, 2026

Uh oh!

jkyros commented May 30, 2026

Uh oh!

openshift-merge-bot Bot commented May 30, 2026

Uh oh!

jkyros commented May 31, 2026

Uh oh!

jkyros commented May 31, 2026

Uh oh!

TwoDCube commented Jun 1, 2026

Uh oh!

muraee commented Jun 1, 2026

Uh oh!

openshift-merge-bot Bot commented Jun 1, 2026

Uh oh!

jkyros commented Jun 1, 2026

Uh oh!

jkyros commented May 20, 2026 •

edited

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading

codecov Bot commented May 20, 2026 •

edited

Loading