OCPBUGS-84577: clear stale EtcdRecoveryActive failure condition when etcd is healthy#8406
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
@vsolanki12: This pull request references Jira Issue OCPBUGS-84577, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughWhen an existing etcd recovery Job finishes unsuccessfully, the reconciler now fetches the etcd StatefulSet; if the StatefulSet reports ReadyReplicas==3 and AvailableReplicas==3 it deletes the recovery Job and related objects, clears the EtcdRecoveryActive condition (Status=False, Reason=AsExpectedReason) on the HostedCluster, updates HostedCluster status, and returns without marking manual-intervention. If the StatefulSet is not fully ready, the reconciler retains the prior failure handling (EtcdRecoveryJobFailedReason). Separately, when no failing etcd pod is detected and the StatefulSet is fully available, the reconciler clears a stale EtcdRecoveryActive failure condition only if its prior Reason was EtcdRecoveryJobFailedReason, updating status. A unit test verifies these behaviors. Sequence DiagramsequenceDiagram
participant Reconciler
participant ETCDJob as ETCD Recovery Job
participant ETCDSet as ETCD StatefulSet
participant HostedCluster as HostedCluster Status
Reconciler->>ETCDJob: Check if Job exists and failed
alt Job Failed
Reconciler->>ETCDSet: Fetch StatefulSet readiness
alt ReadyReplicas==3 && AvailableReplicas==3
Reconciler->>ETCDJob: Delete recovery Job and objects
Reconciler->>HostedCluster: Set EtcdRecoveryActive=False (Reason: AsExpectedReason)
Reconciler->>HostedCluster: Update status
else StatefulSet Not Fully Ready
Reconciler->>HostedCluster: Set EtcdRecoveryActive=True/Failed (Reason: EtcdRecoveryJobFailedReason)
Reconciler->>HostedCluster: Update status
end
else No Failing Pod Detected
Reconciler->>ETCDSet: Check if fully available
alt StatefulSet Fully Available
Reconciler->>HostedCluster: If prior Reason==EtcdRecoveryJobFailedReason, clear EtcdRecoveryActive (Status=False, Reason=AsExpectedReason)
Reconciler->>HostedCluster: Update status
end
end
🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8406 +/- ##
==========================================
+ Coverage 41.27% 41.58% +0.31%
==========================================
Files 755 758 +3
Lines 93446 93868 +422
==========================================
+ Hits 38566 39032 +466
+ Misses 52148 52083 -65
- Partials 2732 2753 +21
... and 31 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go (1)
6622-6627: ⚡ Quick winAssert the failed recovery job is gone in the cleanup case.
The recovered-after-failed-job case only checks the condition reason. If cleanup regresses and the stale
Jobstays behind, this test still passes even though one of the main behaviors in this PR is broken.Suggested assertion
testCases := []struct { name string objects []crclient.Object conditions []metav1.Condition expectedReason string conditionExists bool + expectJobDeleted bool }{ { name: "When failed job exists but etcd recovered it should cleanup job and clear condition", conditions: []metav1.Condition{staleCondition}, objects: append(healthyEtcdPods(), healthyStatefulSet, failedJob), expectedReason: hyperv1.AsExpectedReason, conditionExists: true, + expectJobDeleted: true, }, } @@ if tc.conditionExists { g.Expect(condition).ToNot(BeNil()) g.Expect(condition.Reason).To(Equal(tc.expectedReason)) } else { g.Expect(condition).To(BeNil()) } + + if tc.expectJobDeleted { + job := etcdrecoverymanifests.EtcdRecoveryJob(hcpNS) + err := client.Get(t.Context(), crclient.ObjectKeyFromObject(job), job) + g.Expect(errors2.IsNotFound(err)).To(BeTrue()) + } }) } }Also applies to: 6677-6686
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go` around lines 6622 - 6627, The test cases that use the failedJob fixture (the case with name "When failed job exists but etcd recovered it should cleanup job and clear condition" and the similar case around lines 6677-6686) only assert condition reasons; add an assertion after reconciliation that the failed Job (failedJob) has been removed—e.g., attempt to Get the Job by failedJob.Name in the test namespace and assert the client returns NotFound or that listing Jobs returns zero matching entries. Locate the assertions around expectedReason/conditionExists in hostedcluster_controller_test.go and add the cleanup check for failedJob for both test cases so a lingering Job causes the test to fail.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In
`@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go`:
- Around line 6622-6627: The test cases that use the failedJob fixture (the case
with name "When failed job exists but etcd recovered it should cleanup job and
clear condition" and the similar case around lines 6677-6686) only assert
condition reasons; add an assertion after reconciliation that the failed Job
(failedJob) has been removed—e.g., attempt to Get the Job by failedJob.Name in
the test namespace and assert the client returns NotFound or that listing Jobs
returns zero matching entries. Locate the assertions around
expectedReason/conditionExists in hostedcluster_controller_test.go and add the
cleanup check for failedJob for both test cases so a lingering Job causes the
test to fail.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: d69011b7-ad73-45d5-aa19-6f7fc30bdb73
📒 Files selected for processing (4)
control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.gohypershift-operator/controllers/hostedcluster/etcd_recovery.gohypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go
|
@vsolanki12: This pull request references Jira Issue OCPBUGS-84577, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
d28aec7 to
a9c44fb
Compare
|
@vsolanki12: This pull request references Jira Issue OCPBUGS-84577, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
9462d10 to
5873d33
Compare
|
/auto-cc |
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go (1)
6661-6666: ⚡ Quick winAssert recovery Job deletion in the “failed job but etcd recovered” case.
The test validates condition reason changes, but it doesn’t verify that cleanup actually removed the failed Job. Adding that assertion would directly protect the main behavior this PR introduces.
Suggested test assertion
condition := meta.FindStatusCondition(updatedHC.Status.Conditions, string(hyperv1.EtcdRecoveryActive)) if tc.conditionExists { g.Expect(condition).ToNot(BeNil()) g.Expect(condition.Reason).To(Equal(tc.expectedReason)) } else { g.Expect(condition).To(BeNil()) } + + if tc.name == "When failed job exists but etcd recovered it should cleanup job and clear condition" { + job := etcdrecoverymanifests.EtcdRecoveryJob(hcpNS) + err := client.Get(t.Context(), crclient.ObjectKeyFromObject(job), job) + g.Expect(errors2.IsNotFound(err)).To(BeTrue(), "expected failed recovery job to be deleted") + } }) } }Also applies to: 6719-6725
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go` around lines 6661 - 6666, Add assertions to the two test cases that simulate "failed job but etcd recovered" to ensure the failed Job resource was actually deleted: after invoking the reconcile/test execution for the case where objects include failedJob (referencing the test case with name "When failed job exists but etcd recovered it should cleanup job and clear condition" and the analogous case at the other location), query the fake client for the Job (the same failedJob name/namespace used in the test) and assert it returns NotFound; use the test's existing fake client/helper methods used elsewhere in hostedcluster_controller_test.go so the assertion verifies cleanup of failedJob in addition to the condition change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In
`@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go`:
- Around line 6661-6666: Add assertions to the two test cases that simulate
"failed job but etcd recovered" to ensure the failed Job resource was actually
deleted: after invoking the reconcile/test execution for the case where objects
include failedJob (referencing the test case with name "When failed job exists
but etcd recovered it should cleanup job and clear condition" and the analogous
case at the other location), query the fake client for the Job (the same
failedJob name/namespace used in the test) and assert it returns NotFound; use
the test's existing fake client/helper methods used elsewhere in
hostedcluster_controller_test.go so the assertion verifies cleanup of failedJob
in addition to the condition change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 1c4bd90e-22a0-4d2f-8a27-8b46bb7a4b92
📒 Files selected for processing (2)
hypershift-operator/controllers/hostedcluster/etcd_recovery.gohypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go
bryan-cox
left a comment
There was a problem hiding this comment.
Staff-Level Review
The fix correctly addresses the reported problem — stale EtcdRecoveryJobFailed condition persisting after etcd self-heals. The approach is architecturally sound, isEtcdStatefulSetHealthy is a good extraction, and the metric integration is correct. See inline comments for items to address before merge.
| return false, err | ||
| } | ||
| return true, nil | ||
| } |
There was a problem hiding this comment.
[blocking] The fallthrough to this existing setEtcdRecoveryCondition call is the "etcd is still unhealthy" case, but it is easy to miss after the new health-check block above. A short comment would help:
// Etcd is still unhealthy after the failed recovery job; report manual intervention needed.There was a problem hiding this comment.
Thank you @bryan-cox,
Done, added the comment.
|
|
||
| oldCondition := meta.FindStatusCondition(hcluster.Status.Conditions, string(hyperv1.EtcdRecoveryActive)) | ||
| if oldCondition == nil || oldCondition.Status != condition.Status { | ||
| if oldCondition == nil || oldCondition.Status != condition.Status || oldCondition.Reason != condition.Reason { |
There was a problem hiding this comment.
[blocking] This guard change is necessary for the fix (both old and new condition use Status=False), but the intent is non-obvious. Please add a comment explaining why Reason is now compared:
// Update the condition if the status or reason changed. The reason check
// is needed to transition from EtcdRecoveryJobFailed -> AsExpected when
// etcd self-heals (both use Status=False).There was a problem hiding this comment.
Thank you @bryan-cox,
Done, added the comment.
| log.Info("etcd is healthy but EtcdRecoveryActive has stale failure condition, clearing it") | ||
| if err := r.setEtcdRecoveryCondition(ctx, hcluster, metav1.ConditionFalse, hyperv1.AsExpectedReason, "ETCD cluster is healthy."); err != nil { | ||
| return nil, err | ||
| } |
There was a problem hiding this comment.
[question] This only clears stale conditions with Reason == EtcdRecoveryJobFailedReason. If someone manually deletes the failed recovery Job while recovery is active, the condition could be stuck at Status=True/Reason=AsExpected — that would not be caught here.
Is manual job deletion during active recovery a scenario worth guarding against? If not, a brief comment explaining the scope would be helpful.
There was a problem hiding this comment.
Added scope comment. Not a scenario worth guarding against active recovery conditions are managed by handleExistingEtcdRecoveryJob.
| return false, fmt.Errorf("failed to get etcd statefulset: %w", err) | ||
| } | ||
| } else if isEtcdStatefulSetHealthy(etcdStatefulSet) { | ||
| log.Info("etcd recovered despite failed recovery job, cleaning up") |
There was a problem hiding this comment.
[nit] New messages use "ETCD" (all-caps) while existing code at line 93 uses "Etcd" (title-case). Not blocking, but consider unifying the capitalization.
There was a problem hiding this comment.
Done, unified to "Etcd" throughout.
| g.Expect(err).To(HaveOccurred()) | ||
| g.Expect(err.Error()).To(ContainSubstring("failed to get etcd statefulset")) | ||
| }) | ||
| } |
There was a problem hiding this comment.
[blocking] Missing test: the transient error test only exercises detectAndTriggerEtcdRecovery (no job seeded). Please add a test where a failed job exists AND the StatefulSet Get fails, to verify the new Get in handleExistingEtcdRecoveryJob (line 76) properly returns the error rather than falling through.
There was a problem hiding this comment.
Done, added when failed job exists and StatefulSet Get fails with transient error it should return the error.
73b3599 to
b6be386
Compare
b6be386 to
8118f5c
Compare
|
/retest |
8118f5c to
9938b50
Compare
9938b50 to
5cd40fe
Compare
5cd40fe to
dc67571
Compare
|
Scheduling tests matching the |
|
/lgtm cancel only meant to put approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bryan-cox, vsolanki12 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
sdminonne
left a comment
There was a problem hiding this comment.
@vsolanki12 thanks for this. I addes some comments mainly in testing.
Mind follow-up?
| { | ||
| name: "When etcd is healthy and stale EtcdRecoveryJobFailed condition exists it should clear the condition", | ||
| conditions: []metav1.Condition{staleCondition}, | ||
| objects: append(healthyEtcdPods(), healthyStatefulSet), |
There was a problem hiding this comment.
healthyStatefulSet need to be initialized at every test instance (as healthyEtcdPods is doing)
There was a problem hiding this comment.
Done, converted healthyStatefulSet, unhealthyStatefulSet, staleCondition, and failedJob to closures that return fresh instances per test, matching the healthyEtcdPods() pattern.
| { | ||
| name: "When etcd is healthy and no EtcdRecoveryActive condition exists it should not add one", | ||
| conditions: []metav1.Condition{}, | ||
| objects: append(healthyEtcdPods(), healthyStatefulSet), |
There was a problem hiding this comment.
healthyStatefulSet need to be initialized at every test instance (as healthyEtcdPods is doing)
There was a problem hiding this comment.
Done, all mutable objects are now closures.
| { | ||
| name: "When etcd pods have restarted but recovered it should clear the stale condition", | ||
| conditions: []metav1.Condition{staleCondition}, | ||
| objects: append(recoveredEtcdPods(), healthyStatefulSet), |
There was a problem hiding this comment.
healthyStatefulSet need to be initialized at every test instance (as healthyEtcdPods is doing)
There was a problem hiding this comment.
Done, all mutable objects are now closures.
| { | ||
| name: "When failed job exists and etcd is still unhealthy it should keep the failure condition", | ||
| conditions: []metav1.Condition{staleCondition}, | ||
| objects: append(healthyEtcdPods(), unhealthyStatefulSet, failedJob), |
There was a problem hiding this comment.
Same for unhealthyStatefulSet and failedJob...
TL/DR: don't reuse mutable objects across unit tests
There was a problem hiding this comment.
Done, converted all shared mutable objects (healthyStatefulSet, unhealthyStatefulSet, staleCondition, failedJob) to closures returning fresh instances per test invocation.
| expectedReason: hyperv1.AsExpectedReason, | ||
| conditionExists: true, | ||
| }, | ||
| { |
There was a problem hiding this comment.
If I'm not wrong we may need a test
"When failed job exists and etcd statefulset does not exist"
There was a problem hiding this comment.
Done, added this test case. It verifies the NotFound path in handleExistingEtcdRecoveryJob when the StatefulSet doesn't exist, it falls through to set EtcdRecoveryJobFailedReason.
…when etcd is healthy When the etcd recovery job fails but etcd self-heals, the EtcdRecoveryJobFailed condition was never cleared. This caused the OpenShift Console to display a stale error message even when the cluster was fully healthy. This fix adds two checks: - When a failed recovery job exists but etcd StatefulSet is fully available (3/3), clean up the job and clear the condition. Transient API errors are propagated instead of silently falling through to the failure path. - When no failing etcd pods exist and etcd is healthy, clear any stale EtcdRecoveryJobFailed condition. Signed-off-by: Vimal Solanki <vsolanki@redhat.com>
dc67571 to
c5ada28
Compare
|
/lgtm |
|
Scheduling tests matching the |
|
/test e2e-aks |
1 similar comment
|
/test e2e-aks |
|
@vsolanki12: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
The analysis is complete. Here is the report: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe e2e-aks job failed during CI infrastructure setup, before any test code from PR #8406 was executed. The ci-operator was unable to import three older OCP release payloads (n2minor/4.21, n3minor/4.20, n4minor/4.19) because specific container images within those releases had been garbage-collected from the quay.io/openshift/ci registry and were no longer available. This is a known CI infrastructure flake caused by stale release payloads referencing expired image digests — it is completely unrelated to the PR's code changes. Root CauseThe failure is a CI infrastructure issue — specifically, stale release payload image references. The job requires importing multiple older OCP release streams (n1minor through n4minor) for multi-version upgrade testing. Three of these imports failed:
Each import was retried 6 times by ci-operator before timing out. The images were garbage-collected from the container registries because the release payloads they belong to have aged beyond the registry's retention window. The CI infrastructure resolves the "latest" payload for each stream, but if that payload's constituent images have been pruned, the import fails. No test code was ever executed. The Recommendations
Evidence
|
What this PR does / why we need it:
When the etcd recovery job fails but etcd self-heals, the
EtcdRecoveryJobFailedcondition was never cleared. This caused the OpenShift Console to display a stale error message ("Error in Etcd Recovery job: the Etcd cluster requires manual intervention.") on the HostedCluster overview page, even when the cluster was fully healthy (Available=True,Degraded=False,EtcdAvailable=True).This fix adds two checks in
reconcileETCDMemberRecovery:EtcdRecoveryJobFailedcondition.Which issue(s) this PR fixes:
Fixes https://issues.redhat.com/browse/OCPBUGS-84577
Special notes for your reviewer:
ENABLE_ETCD_RECOVERYenv var and only applies to managed, highly-available etcd clusters.etcd_manual_intervention_requiredmetric (which reads this condition) will also correctly reset.Checklist:
Summary by CodeRabbit
Bug Fixes
Tests