OCPBUGS-84577: clear stale EtcdRecoveryActive failure condition when etcd is healthy by vsolanki12 · Pull Request #8406 · openshift/hypershift

vsolanki12 · 2026-05-04T11:25:52Z

What this PR does / why we need it:

When the etcd recovery job fails but etcd self-heals, the EtcdRecoveryJobFailed condition was never cleared. This caused the OpenShift Console to display a stale error message ("Error in Etcd Recovery job: the Etcd cluster requires manual intervention.") on the HostedCluster overview page, even when the cluster was fully healthy (Available=True, Degraded=False, EtcdAvailable=True).

This fix adds two checks in reconcileETCDMemberRecovery:

When a failed recovery job exists but the etcd StatefulSet is fully available (3/3 replicas), clean up the job and clear the condition.
When no failing etcd pods exist and etcd is healthy, clear any stale EtcdRecoveryJobFailed condition.
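
As a rough illustration of the two checks, the decision can be written as a pure function. This is a minimal sketch, not the code in reconcileETCDMemberRecovery: the `Action` type, its constants, and the `decide` function with its boolean inputs are hypothetical names introduced here, and the real reconciler works with Kubernetes objects rather than flags.

```go
package main

import "fmt"

// Action is a hypothetical label for what the reconciler does next.
type Action string

const (
	CleanupAndClear Action = "delete failed job and clear condition"
	ReportManual    Action = "keep EtcdRecoveryJobFailed (manual intervention)"
	ClearStale      Action = "clear stale EtcdRecoveryJobFailed condition"
	Other           Action = "handled elsewhere (not modeled in this sketch)"
)

// decide mirrors the two checks from the description.
//   jobFailed:    a recovery Job exists and finished unsuccessfully
//   failingPod:   at least one etcd pod is currently failing
//   etcdHealthy:  the etcd StatefulSet is fully available (3/3 replicas)
//   staleFailure: the current condition carries the job-failed reason
func decide(jobFailed, failingPod, etcdHealthy, staleFailure bool) Action {
	if jobFailed {
		if etcdHealthy {
			// Check 1: etcd self-healed despite the failed job.
			return CleanupAndClear
		}
		return ReportManual
	}
	if !failingPod && etcdHealthy && staleFailure {
		// Check 2: nothing is failing, only the old condition is left over.
		return ClearStale
	}
	return Other
}

func main() {
	fmt.Println(decide(true, false, true, true))
	fmt.Println(decide(true, false, false, true))
	fmt.Println(decide(false, false, true, true))
	fmt.Println(decide(false, false, true, false))
}
```

Keeping the decision separate from the API calls is only a presentation choice for this sketch; it makes the four cases easy to enumerate.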

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-84577

Special notes for your reviewer:

The etcd recovery feature is gated behind ENABLE_ETCD_RECOVERY env var and only applies to managed, highly-available etcd clusters.
Both fix paths were verified on a live KubeVirt HCP cluster by simulating the stale condition and confirming it gets cleared.
The etcd_manual_intervention_required metric (which reads this condition) will also correctly reset.

Checklist:

Subject and description added to both, commit and PR.
Relevant issues have been referenced.
This change includes docs.
This change includes unit tests.

Summary by CodeRabbit

Bug Fixes
- Automatically cleans up etcd recovery resources and clears the recovery-active failure state when the etcd cluster reports full readiness, avoiding unnecessary manual-intervention alerts while still signaling failures when the cluster is unhealthy.
- Clears stale recovery failure conditions when there is no failing etcd pod and the cluster is fully available.
Tests
- Added unit tests covering recovery success/failure paths, condition clearing, and cleanup behavior.

openshift-merge-bot · 2026-05-04T11:25:55Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci · 2026-05-04T11:25:56Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot · 2026-05-04T11:25:58Z

@vsolanki12: This pull request references Jira Issue OCPBUGS-84577, which is invalid:

expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-05-04T11:26:08Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

When an existing etcd recovery Job finishes unsuccessfully, the reconciler now fetches the etcd StatefulSet; if the StatefulSet reports ReadyReplicas==3 and AvailableReplicas==3 it deletes the recovery Job and related objects, clears the EtcdRecoveryActive condition (Status=False, Reason=AsExpectedReason) on the HostedCluster, updates HostedCluster status, and returns without marking manual-intervention. If the StatefulSet is not fully ready, the reconciler retains the prior failure handling (EtcdRecoveryJobFailedReason). Separately, when no failing etcd pod is detected and the StatefulSet is fully available, the reconciler clears a stale EtcdRecoveryActive failure condition only if its prior Reason was EtcdRecoveryJobFailedReason, updating status. A unit test verifies these behaviors.

Sequence Diagram

sequenceDiagram
    participant Reconciler
    participant ETCDJob as ETCD Recovery Job
    participant ETCDSet as ETCD StatefulSet
    participant HostedCluster as HostedCluster Status

    Reconciler->>ETCDJob: Check if Job exists and failed
    alt Job Failed
        Reconciler->>ETCDSet: Fetch StatefulSet readiness
        alt ReadyReplicas==3 && AvailableReplicas==3
            Reconciler->>ETCDJob: Delete recovery Job and objects
            Reconciler->>HostedCluster: Set EtcdRecoveryActive=False (Reason: AsExpectedReason)
            Reconciler->>HostedCluster: Update status
        else StatefulSet Not Fully Ready
            Reconciler->>HostedCluster: Set EtcdRecoveryActive=True/Failed (Reason: EtcdRecoveryJobFailedReason)
            Reconciler->>HostedCluster: Update status
        end
    else No Failing Pod Detected
        Reconciler->>ETCDSet: Check if fully available
        alt StatefulSet Fully Available
            Reconciler->>HostedCluster: If prior Reason==EtcdRecoveryJobFailedReason, clear EtcdRecoveryActive (Status=False, Reason=AsExpectedReason)
            Reconciler->>HostedCluster: Update status
        end
    end
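
The readiness test in the walkthrough and diagram (`ReadyReplicas==3 && AvailableReplicas==3`) could look roughly like the following. This is a sketch under assumptions: `statefulSetStatus` is a local stand-in for the two relevant fields of the Kubernetes StatefulSet status, and `isFullyAvailable` and `expectedEtcdReplicas` are hypothetical names, not the helper from the PR.

```go
package main

import "fmt"

// statefulSetStatus is a hypothetical stand-in for the two StatefulSet
// status fields named in the walkthrough.
type statefulSetStatus struct {
	ReadyReplicas     int32
	AvailableReplicas int32
}

// expectedEtcdReplicas is 3 because the recovery feature only applies to
// highly-available etcd (assumption taken from the walkthrough text).
const expectedEtcdReplicas int32 = 3

// isFullyAvailable reports whether every expected replica is both ready
// and available.
func isFullyAvailable(s statefulSetStatus) bool {
	return s.ReadyReplicas == expectedEtcdReplicas &&
		s.AvailableReplicas == expectedEtcdReplicas
}

func main() {
	fmt.Println(isFullyAvailable(statefulSetStatus{ReadyReplicas: 3, AvailableReplicas: 3}))
	fmt.Println(isFullyAvailable(statefulSetStatus{ReadyReplicas: 3, AvailableReplicas: 2}))
	fmt.Println(isFullyAvailable(statefulSetStatus{ReadyReplicas: 2, AvailableReplicas: 2}))
}
```

Requiring both counters guards against the window where a pod is ready but has not yet been counted as available.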

🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality	⚠️ Warning	Test lacks assertion messages. 5 of 6 key assertions missing failure messages to diagnose failures, violating requirement `#4` for meaningful diagnostic messages.	Add descriptive failure messages to all assertions: err check, client.Get check, condition existence checks, and condition reason check. Example: g.Expect(err).ToNot(HaveOccurred(), "reconcileETCDMemberRecovery failed unexpectedly")

✅ Passed checks (10 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and specifically describes the main change: clearing a stale EtcdRecoveryActive failure condition when etcd becomes healthy, which matches the core fix implemented in the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All test names in TestReconcileETCDMemberRecovery are stable and deterministic with no dynamic content like pod names, timestamps, UUIDs, or IP addresses. Test names are descriptive and static.
Microshift Test Compatibility	✅ Passed	The new test TestReconcileETCDMemberRecovery is a standard Go unit test (testing.T), not a Ginkgo e2e test. The check applies only to Ginkgo e2e tests, so this check is not applicable.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	No Ginkgo e2e tests added. The new test is a standard Go unit test using testing.T, not Ginkgo. SNO compatibility check not applicable.
Topology-Aware Scheduling Compatibility	✅ Passed	No scheduling constraints. Controller monitoring logic only, gated by HighlyAvailable topology. No affinity, node selectors, or topology spreads.
Ote Binary Stdout Contract	✅ Passed	Check not applicable. Modified files are HyperShift controller code and unit tests, not OTE test extension binaries. No stdout writes found.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	Check is not applicable. The PR adds TestReconcileETCDMemberRecovery, a standard Go unit test using fake clients, not a Ginkgo e2e test. The custom check applies only to Ginkgo e2e tests.

codecov · 2026-05-04T11:30:37Z

Codecov Report

❌ Patch coverage is 67.64706% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.58%. Comparing base (860b695) to head (c5ada28).
⚠️ Report is 78 commits behind head on main.

Files with missing lines	Patch %	Lines
...perator/controllers/hostedcluster/etcd_recovery.go	67.64%	7 Missing and 4 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8406      +/-   ##
==========================================
+ Coverage   41.27%   41.58%   +0.31%     
==========================================
  Files         755      758       +3     
  Lines       93446    93868     +422     
==========================================
+ Hits        38566    39032     +466     
+ Misses      52148    52083      -65     
- Partials     2732     2753      +21

Files with missing lines	Coverage Δ
...perator/controllers/hostedcluster/etcd_recovery.go	`46.03% <67.64%> (+10.80%)`	⬆️

... and 31 files with indirect coverage changes

Flag	Coverage Δ
cmd-support	`34.96% <ø> (+0.09%)`	⬆️
cpo-hostedcontrolplane	`43.59% <ø> (+0.09%)`	⬆️
cpo-other	`43.17% <ø> (+0.37%)`	⬆️
hypershift-operator	`51.74% <67.64%> (+0.74%)`	⬆️
other	`31.56% <ø> (-0.08%)`	⬇️

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go (1)

6622-6627: ⚡ Quick win

Assert the failed recovery job is gone in the cleanup case.

The recovered-after-failed-job case only checks the condition reason. If cleanup regresses and the stale Job stays behind, this test still passes even though one of the main behaviors in this PR is broken.

Suggested assertion

 	testCases := []struct {
 		name            string
 		objects         []crclient.Object
 		conditions      []metav1.Condition
 		expectedReason  string
 		conditionExists bool
+		expectJobDeleted bool
 	}{
 		{
 			name:            "When failed job exists but etcd recovered it should cleanup job and clear condition",
 			conditions:      []metav1.Condition{staleCondition},
 			objects:         append(healthyEtcdPods(), healthyStatefulSet, failedJob),
 			expectedReason:  hyperv1.AsExpectedReason,
 			conditionExists: true,
+			expectJobDeleted: true,
 		},
 	}
@@
 			if tc.conditionExists {
 				g.Expect(condition).ToNot(BeNil())
 				g.Expect(condition.Reason).To(Equal(tc.expectedReason))
 			} else {
 				g.Expect(condition).To(BeNil())
 			}
+
+			if tc.expectJobDeleted {
+				job := etcdrecoverymanifests.EtcdRecoveryJob(hcpNS)
+				err := client.Get(t.Context(), crclient.ObjectKeyFromObject(job), job)
+				g.Expect(errors2.IsNotFound(err)).To(BeTrue())
+			}
 		})
 	}
 }

Also applies to: 6677-6686

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In
`@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go`
around lines 6622 - 6627, The test cases that use the failedJob fixture (the
case with name "When failed job exists but etcd recovered it should cleanup job
and clear condition" and the similar case around lines 6677-6686) only assert
condition reasons; add an assertion after reconciliation that the failed Job
(failedJob) has been removed—e.g., attempt to Get the Job by failedJob.Name in
the test namespace and assert the client returns NotFound or that listing Jobs
returns zero matching entries. Locate the assertions around
expectedReason/conditionExists in hostedcluster_controller_test.go and add the
cleanup check for failedJob for both test cases so a lingering Job causes the
test to fail.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: d69011b7-ad73-45d5-aa19-6f7fc30bdb73

📥 Commits

Reviewing files that changed from the base of the PR and between 68106f0 and 7f5e921.

📒 Files selected for processing (4)

control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go
control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go
hypershift-operator/controllers/hostedcluster/etcd_recovery.go
hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go

openshift-ci-robot · 2026-05-04T13:57:07Z

@vsolanki12: This pull request references Jira Issue OCPBUGS-84577, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary by CodeRabbit

Bug Fixes

Enhanced error messages during in-place upgrades to include degraded node names and specific failure reasons.

Improved etcd recovery handling to automatically clean up recovery resources when etcd returns to healthy state, reducing manual intervention needs.

Tests

Added test coverage for degraded node scenarios during upgrades and etcd recovery status conditions.

openshift-ci-robot · 2026-05-04T14:02:00Z

@vsolanki12: This pull request references Jira Issue OCPBUGS-84577, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

Summary by CodeRabbit

Bug Fixes

In-place upgrade failures now report the degraded node name and its degraded-reason annotation.

Etcd recovery no longer forces manual intervention when the cluster is healthy; recovery resources and failure condition are cleared when etcd is fully ready.

Tests

Added tests covering degraded-node upgrade errors and etcd recovery status transitions.

vsolanki12 · 2026-05-05T05:33:56Z

/auto-cc

vsolanki12 · 2026-05-05T09:48:06Z

@coderabbitai full review

coderabbitai · 2026-05-05T09:48:13Z

✅ Actions performed

Full review triggered.

coderabbitai

🧹 Nitpick comments (1)

hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go (1)

6661-6666: ⚡ Quick win

Assert recovery Job deletion in the “failed job but etcd recovered” case.

The test validates condition reason changes, but it doesn’t verify that cleanup actually removed the failed Job. Adding that assertion would directly protect the main behavior this PR introduces.

Suggested test assertion

 			condition := meta.FindStatusCondition(updatedHC.Status.Conditions, string(hyperv1.EtcdRecoveryActive))
 			if tc.conditionExists {
 				g.Expect(condition).ToNot(BeNil())
 				g.Expect(condition.Reason).To(Equal(tc.expectedReason))
 			} else {
 				g.Expect(condition).To(BeNil())
 			}
+
+			if tc.name == "When failed job exists but etcd recovered it should cleanup job and clear condition" {
+				job := etcdrecoverymanifests.EtcdRecoveryJob(hcpNS)
+				err := client.Get(t.Context(), crclient.ObjectKeyFromObject(job), job)
+				g.Expect(errors2.IsNotFound(err)).To(BeTrue(), "expected failed recovery job to be deleted")
+			}
 		})
 	}
 }

Also applies to: 6719-6725

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go`
around lines 6661 - 6666, Add assertions to the two test cases that simulate
"failed job but etcd recovered" to ensure the failed Job resource was actually
deleted: after invoking the reconcile/test execution for the case where objects
include failedJob (referencing the test case with name "When failed job exists
but etcd recovered it should cleanup job and clear condition" and the analogous
case at the other location), query the fake client for the Job (the same
failedJob name/namespace used in the test) and assert it returns NotFound; use
the test's existing fake client/helper methods used elsewhere in
hostedcluster_controller_test.go so the assertion verifies cleanup of failedJob
in addition to the condition change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 1c4bd90e-22a0-4d2f-8a27-8b46bb7a4b92

📥 Commits

Reviewing files that changed from the base of the PR and between 68106f0 and 5873d33.

📒 Files selected for processing (2)

hypershift-operator/controllers/hostedcluster/etcd_recovery.go
hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go

bryan-cox

Staff-Level Review

The fix correctly addresses the reported problem — stale EtcdRecoveryJobFailed condition persisting after etcd self-heals. The approach is architecturally sound, isEtcdStatefulSetHealthy is a good extraction, and the metric integration is correct. See inline comments for items to address before merge.

bryan-cox · 2026-05-20T18:37:52Z

+				return false, err
+			}
+			return true, nil
+		}


[blocking] The fallthrough to this existing setEtcdRecoveryCondition call is the "etcd is still unhealthy" case, but it is easy to miss after the new health-check block above. A short comment would help:

// Etcd is still unhealthy after the failed recovery job; report manual intervention needed.

Thank you @bryan-cox,
Done, added the comment.

bryan-cox · 2026-05-20T18:37:52Z


 	oldCondition := meta.FindStatusCondition(hcluster.Status.Conditions, string(hyperv1.EtcdRecoveryActive))
-	if oldCondition == nil || oldCondition.Status != condition.Status {
+	if oldCondition == nil || oldCondition.Status != condition.Status || oldCondition.Reason != condition.Reason {


[blocking] This guard change is necessary for the fix (both old and new condition use Status=False), but the intent is non-obvious. Please add a comment explaining why Reason is now compared:

// Update the condition if the status or reason changed. The reason check
// is needed to transition from EtcdRecoveryJobFailed -> AsExpected when
// etcd self-heals (both use Status=False).

Thank you @bryan-cox,
Done, added the comment.
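
To see why the extra Reason comparison matters, here is a small sketch that contrasts the old and new guard. The `condition` struct is a hypothetical stand-in for `metav1.Condition`, and the reason strings are assumed values for the constants named in this thread.

```go
package main

import "fmt"

// condition is a hypothetical, trimmed-down stand-in for metav1.Condition.
type condition struct {
	Status string
	Reason string
}

// needsUpdateOld is the guard before the change: only Status is compared.
func needsUpdateOld(old *condition, desired condition) bool {
	return old == nil || old.Status != desired.Status
}

// needsUpdate is the guard after the change: Reason is compared as well.
func needsUpdate(old *condition, desired condition) bool {
	return old == nil || old.Status != desired.Status || old.Reason != desired.Reason
}

func main() {
	// Reason strings are assumptions for illustration.
	stale := &condition{Status: "False", Reason: "EtcdRecoveryJobFailed"}
	healthy := condition{Status: "False", Reason: "AsExpected"}

	// Both conditions use Status=False, so the old guard skips the update
	// and the stale reason stays in place.
	fmt.Println("old guard updates:", needsUpdateOld(stale, healthy))
	fmt.Println("new guard updates:", needsUpdate(stale, healthy))
}
```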

bryan-cox · 2026-05-20T18:37:52Z

+			log.Info("etcd is healthy but EtcdRecoveryActive has stale failure condition, clearing it")
+			if err := r.setEtcdRecoveryCondition(ctx, hcluster, metav1.ConditionFalse, hyperv1.AsExpectedReason, "ETCD cluster is healthy."); err != nil {
+				return nil, err
+			}


[question] This only clears stale conditions with Reason == EtcdRecoveryJobFailedReason. If someone manually deletes the failed recovery Job while recovery is active, the condition could be stuck at Status=True/Reason=AsExpected — that would not be caught here.

Is manual job deletion during active recovery a scenario worth guarding against? If not, a brief comment explaining the scope would be helpful.

Added a scope comment. This is not a scenario worth guarding against: active recovery conditions are managed by handleExistingEtcdRecoveryJob.

bryan-cox · 2026-05-20T18:37:53Z

+				return false, fmt.Errorf("failed to get etcd statefulset: %w", err)
+			}
+		} else if isEtcdStatefulSetHealthy(etcdStatefulSet) {
+			log.Info("etcd recovered despite failed recovery job, cleaning up")


[nit] New messages use "ETCD" (all-caps) while existing code at line 93 uses "Etcd" (title-case). Not blocking, but consider unifying the capitalization.

Done, unified to "Etcd" throughout.

bryan-cox · 2026-05-20T18:37:53Z

+		g.Expect(err).To(HaveOccurred())
+		g.Expect(err.Error()).To(ContainSubstring("failed to get etcd statefulset"))
+	})
+}


[blocking] Missing test: the transient error test only exercises detectAndTriggerEtcdRecovery (no job seeded). Please add a test where a failed job exists AND the StatefulSet Get fails, to verify the new Get in handleExistingEtcdRecoveryJob (line 76) properly returns the error rather than falling through.

Done, added the test case "When failed job exists and StatefulSet Get fails with transient error it should return the error".
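
The behavior under test can be sketched as follows: a missing StatefulSet is treated as "not recovered", while any other Get error is returned to the caller. This is an illustration only. `errNotFound` and `errTransient` are local stand-ins for API errors, and `etcdRecovered` with its callback parameter is a hypothetical shape, not the signature of handleExistingEtcdRecoveryJob.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for a Kubernetes NotFound error and a transient API error.
var (
	errNotFound  = errors.New("statefulset not found")
	errTransient = errors.New("connection refused")
)

// etcdRecovered asks getHealthy for the StatefulSet health.
// A NotFound error means "not recovered" and is swallowed so the caller
// can take the failure path; any other error is propagated.
func etcdRecovered(getHealthy func() (bool, error)) (bool, error) {
	healthy, err := getHealthy()
	if err != nil {
		if errors.Is(err, errNotFound) {
			return false, nil
		}
		return false, fmt.Errorf("failed to get etcd statefulset: %w", err)
	}
	return healthy, nil
}

func main() {
	cases := []struct {
		name string
		get  func() (bool, error)
	}{
		{"healthy", func() (bool, error) { return true, nil }},
		{"missing", func() (bool, error) { return false, errNotFound }},
		{"transient", func() (bool, error) { return false, errTransient }},
	}
	for _, tc := range cases {
		recovered, err := etcdRecovered(tc.get)
		fmt.Printf("%s: recovered=%v err=%v\n", tc.name, recovered, err)
	}
}
```

Returning the transient error lets the controller retry instead of wrongly reporting that manual intervention is required.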

vsolanki12 · 2026-05-26T07:50:22Z

/retest

bryan-cox

/approve

openshift-merge-bot · 2026-06-09T19:23:49Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

bryan-cox · 2026-06-09T19:24:04Z

/lgtm cancel

only meant to put approve

openshift-ci · 2026-06-09T19:24:05Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, vsolanki12

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [bryan-cox]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

vsolanki12 · 2026-06-10T07:07:07Z

/retest

sdminonne

@vsolanki12 thanks for this. I added some comments, mainly on testing.
Mind following up?

sdminonne · 2026-06-10T09:34:32Z

+		{
+			name:            "When etcd is healthy and stale EtcdRecoveryJobFailed condition exists it should clear the condition",
+			conditions:      []metav1.Condition{staleCondition},
+			objects:         append(healthyEtcdPods(), healthyStatefulSet),


healthyStatefulSet needs to be initialized for every test case (as healthyEtcdPods does)

Done, converted healthyStatefulSet, unhealthyStatefulSet, staleCondition, and failedJob to closures that return fresh instances per test, matching the healthyEtcdPods() pattern.

sdminonne · 2026-06-10T09:34:51Z

+		{
+			name:            "When etcd is healthy and no EtcdRecoveryActive condition exists it should not add one",
+			conditions:      []metav1.Condition{},
+			objects:         append(healthyEtcdPods(), healthyStatefulSet),


healthyStatefulSet needs to be initialized for every test case (as healthyEtcdPods does)

Done, all mutable objects are now closures.

sdminonne · 2026-06-10T09:34:57Z

+		{
+			name:            "When etcd pods have restarted but recovered it should clear the stale condition",
+			conditions:      []metav1.Condition{staleCondition},
+			objects:         append(recoveredEtcdPods(), healthyStatefulSet),


healthyStatefulSet needs to be initialized for every test case (as healthyEtcdPods does)

Done, all mutable objects are now closures.

sdminonne · 2026-06-10T09:35:41Z

+		{
+			name:            "When failed job exists and etcd is still unhealthy it should keep the failure condition",
+			conditions:      []metav1.Condition{staleCondition},
+			objects:         append(healthyEtcdPods(), unhealthyStatefulSet, failedJob),


Same for unhealthyStatefulSet and failedJob...
TL;DR: don't reuse mutable objects across unit tests

Done, converted all shared mutable objects (healthyStatefulSet, unhealthyStatefulSet, staleCondition, failedJob) to closures returning fresh instances per test invocation.
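
A minimal sketch of why the fixtures were turned into closures: a shared pointer carries mutations from one test case into the next, while a factory hands each case a fresh object. The `statefulSet` type, `sharedHealthy`, and `mutate` are hypothetical stand-ins; only the `healthyStatefulSet` name is taken from the review thread, and its body here is invented for illustration.

```go
package main

import "fmt"

// statefulSet is a hypothetical stand-in for the real fixture type.
type statefulSet struct {
	Name          string
	ReadyReplicas int32
}

// sharedHealthy is the risky pattern: every test case gets the same pointer.
var sharedHealthy = &statefulSet{Name: "etcd", ReadyReplicas: 3}

// healthyStatefulSet is the safer pattern: each call returns a new object.
func healthyStatefulSet() *statefulSet {
	return &statefulSet{Name: "etcd", ReadyReplicas: 3}
}

// mutate simulates code under test (or a fake client) changing the object.
func mutate(s *statefulSet) {
	s.ReadyReplicas = 1
}

func main() {
	// Shared fixture: the first "test" changes what the second one sees.
	mutate(sharedHealthy)
	fmt.Println("shared fixture after mutation:", sharedHealthy.ReadyReplicas)

	// Factory: the mutation stays local to the instance that was changed.
	first := healthyStatefulSet()
	mutate(first)
	second := healthyStatefulSet()
	fmt.Println("fresh fixture after mutation elsewhere:", second.ReadyReplicas)
}
```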

sdminonne · 2026-06-10T09:38:54Z

+			expectedReason:  hyperv1.AsExpectedReason,
+			conditionExists: true,
+		},
+		{


If I'm not wrong we may need a test

"When failed job exists and etcd statefulset does not exist"

Done, added this test case. It verifies the NotFound path in handleExistingEtcdRecoveryJob: when the StatefulSet doesn't exist, the code falls through and sets EtcdRecoveryJobFailedReason.

…when etcd is healthy

When the etcd recovery job fails but etcd self-heals, the EtcdRecoveryJobFailed condition was never cleared. This caused the OpenShift Console to display a stale error message even when the cluster was fully healthy.

This fix adds two checks:
- When a failed recovery job exists but etcd StatefulSet is fully available (3/3), clean up the job and clear the condition. Transient API errors are propagated instead of silently falling through to the failure path.
- When no failing etcd pods exist and etcd is healthy, clear any stale EtcdRecoveryJobFailed condition.

Signed-off-by: Vimal Solanki <vsolanki@redhat.com>

sdminonne · 2026-06-10T14:54:36Z

/lgtm

openshift-merge-bot · 2026-06-10T14:54:55Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

vsolanki12 · 2026-06-10T17:20:28Z

/test e2e-aks

vsolanki12 · 2026-06-11T01:38:35Z

/test e2e-aks

openshift-ci · 2026-06-11T02:16:18Z

@vsolanki12: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aks-4-22	`c5ada28`	link	true	`/test e2e-aks-4-22`
ci/prow/e2e-azure-self-managed	`c5ada28`	link	true	`/test e2e-azure-self-managed`
ci/prow/e2e-aws-4-22	`c5ada28`	link	true	`/test e2e-aws-4-22`
ci/prow/e2e-aws	`c5ada28`	link	true	`/test e2e-aws`
ci/prow/e2e-aks	`c5ada28`	link	true	`/test e2e-aks`

hypershift-jira-solve-ci · 2026-06-11T03:47:14Z

The analysis is complete. Here is the report:

Test Failure Analysis Complete

Job Information

Prow Job: pull-ci-openshift-hypershift-main-e2e-aks
Build ID: 2064884986166644736
Target: e2e-aks
PR: OCPBUGS-84577: clear stale EtcdRecoveryActive failure condition when etcd is healthy #8406
Status: failure
Failure Reason: executing_graph:step_failed:importing_release

Test Failure Analysis

Error

could not run steps: step [release:n4minor] failed: failed to import release
4.19.0-0.ci-2026-06-03-210413 to tag release:n4minor: failed to reimport the tag
ci-op-1vfbgkdg/stable-n4minor:machine-config-operator: unable to import tag with
message Internal error occurred: dockerimage.image.openshift.io
"quay.io/openshift/ci@sha256:90428a..." not found — timed out after (6) imports

(+ 2 additional release import failures for n3minor and n2minor)

Summary

The e2e-aks job failed during CI infrastructure setup, before any test code from PR #8406 was executed. The ci-operator was unable to import three older OCP release payloads (n2minor/4.21, n3minor/4.20, n4minor/4.19) because specific container images within those releases had been garbage-collected from the quay.io/openshift/ci registry and were no longer available. This is a known CI infrastructure flake caused by stale release payloads referencing expired image digests — it is completely unrelated to the PR's code changes.

Root Cause

The failure is a CI infrastructure issue — specifically, stale release payload image references. The job requires importing multiple older OCP release streams (n1minor through n4minor) for multi-version upgrade testing. Three of these imports failed:

n4minor (OCP 4.19.0-0.ci-2026-06-03-210413): The machine-config-operator image at digest sha256:90428a82... was not found on either quay.io/openshift/ci or quay-proxy.ci.openshift.org/openshift/ci. This release payload is 8 days old (created June 3).
n3minor (OCP 4.20.0-0.ci-2026-06-07-204143): The hypershift image at digest sha256:ef1b3047... was not found. This release payload is 4 days old (created June 7).
n2minor (OCP 4.21.0-0.ci-2026-06-08-003009): The agent-installer-ui image at digest sha256:abea17a3... was not found. This release payload is 3 days old (created June 8).

Each import was retried 6 times by ci-operator before timing out. The images were garbage-collected from the container registries because the release payloads they belong to have aged beyond the registry's retention window. The CI infrastructure resolves the "latest" payload for each stream, but if that payload's constituent images have been pruned, the import fails.

No test code was ever executed. The e2e-aks multi-stage test step never started — it was blocked by the release import failures. The JUnit XML confirms only image build and release import steps ran; no test steps appear. This failure is entirely unrelated to the PR's changes (clearing stale EtcdRecoveryActive failure conditions).

Recommendations

Retest the PR — This is a transient CI infrastructure failure. A simple /retest on the PR should resolve it, as ci-operator will re-resolve the "latest" release payloads to newer versions whose images are still available in the registry.
No code changes needed — The PR's code (OCPBUGS-84577: clearing stale EtcdRecoveryActive failure condition) was never exercised by this job run. The failure is not a signal about the PR's quality.
If retests continue to fail — Escalate to the CI infrastructure team (TRT / DPTP) as it may indicate a broader registry GC or image mirroring issue affecting the quay.io/openshift/ci and quay-proxy.ci.openshift.org registries.

Evidence

Evidence	Detail
Failure type	CI infrastructure — release payload import failure (not a test failure)
Failure reason	`executing_graph:step_failed:importing_release`
Failed steps	`[release:n4minor]`, `[release:n3minor]`, `[release:n2minor]` — all release import steps
n4minor image missing	`quay.io/openshift/ci@sha256:90428a82...` (machine-config-operator from OCP 4.19, created June 3)
n3minor image missing	`quay.io/openshift/ci@sha256:ef1b3047...` (hypershift from OCP 4.20, created June 7)
n2minor image missing	`quay.io/openshift/ci@sha256:abea17a3...` (agent-installer-ui from OCP 4.21, created June 8)
Retry attempts	6 per image (all timed out)
Test steps executed	None — `e2e-aks` test step was never reached
PR code exercised	No — failure occurred during pre-test setup
JUnit XML	26 steps total, 3 failures — all in release import steps

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 4, 2026

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 4, 2026

openshift-ci Bot added the do-not-merge/needs-area label May 4, 2026

vsolanki12 changed the title ~~fix(etcd-recovery): OCPBUGS-84577: clear stale EtcdRecoveryActive failure condition when etcd is healthy~~ OCPBUGS-84577: clear stale EtcdRecoveryActive failure condition when etcd is healthy May 4, 2026

coderabbitai Bot reviewed May 4, 2026

Comment thread hypershift-operator/controllers/hostedcluster/etcd_recovery.go Outdated

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 4, 2026

vsolanki12 force-pushed the OCPBUGS-84577-etcd-recovery-stale-condition branch from d28aec7 to a9c44fb Compare May 4, 2026 13:57

vsolanki12 force-pushed the OCPBUGS-84577-etcd-recovery-stale-condition branch 2 times, most recently from 9462d10 to 5873d33 Compare May 4, 2026 14:07

openshift-ci Bot requested review from bryan-cox and sdminonne May 5, 2026 05:34

vsolanki12 marked this pull request as ready for review May 5, 2026 09:42

openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2026

openshift-ci Bot requested a review from Nirshal May 5, 2026 09:42

coderabbitai Bot reviewed May 5, 2026

bryan-cox reviewed May 20, 2026

vsolanki12 force-pushed the OCPBUGS-84577-etcd-recovery-stale-condition branch 2 times, most recently from 73b3599 to b6be386 Compare May 22, 2026 08:54

hypershift-jira-solve-ci Bot mentioned this pull request May 22, 2026

OCPBUGS-77856: fix: use NodePort for HCP router Service on non-cloud platforms #8439

Closed

4 tasks

vsolanki12 force-pushed the OCPBUGS-84577-etcd-recovery-stale-condition branch from b6be386 to 8118f5c Compare May 26, 2026 02:59

vsolanki12 force-pushed the OCPBUGS-84577-etcd-recovery-stale-condition branch from 8118f5c to 9938b50 Compare May 26, 2026 10:29

vsolanki12 force-pushed the OCPBUGS-84577-etcd-recovery-stale-condition branch from 9938b50 to 5cd40fe Compare June 3, 2026 11:45

openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 3, 2026

vsolanki12 force-pushed the OCPBUGS-84577-etcd-recovery-stale-condition branch from 5cd40fe to dc67571 Compare June 3, 2026 11:48

openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 3, 2026

bryan-cox approved these changes Jun 9, 2026

openshift-ci Bot assigned bryan-cox Jun 9, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 9, 2026

openshift-ci Bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed lgtm Indicates that a PR is ready to be merged. labels Jun 9, 2026

sdminonne suggested changes Jun 10, 2026

vsolanki12 force-pushed the OCPBUGS-84577-etcd-recovery-stale-condition branch from dc67571 to c5ada28 Compare June 10, 2026 10:32

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 10, 2026