CNTRLPLANE-3145: refactor(hostedcluster): segregate reconcile loop into error-collecting blocks#7908
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
📝 WalkthroughWalkthroughThis pull request introduces a phased, modular refactoring of the HostedCluster reconciliation loop. It adds a design document analyzing reconciliation segregation, restructures the controller into nine sequential phases with independent error aggregation, and introduces multiple helper functions to isolate functionality. A signature change removes the Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
|
/test e2e-aws |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/design/hostedcluster-reconcile-segregation-analysis.md`:
- Around line 112-190: The markdown fenced block containing the ASCII diagram
(the block that begins with
"+-----------------------------------------------------+" and includes "CRITICAL
PREREQUISITES (must succeed first)") needs a language label to satisfy
markdownlint: change the opening fence from ``` to ```text so the diagram is
fenced as a text block; update the single fenced block in the file (the ASCII
diagram between the backticks) accordingly.
In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`:
- Around line 1726-1728: The code currently checks and then removes the
HostedClusterRestoredFromBackupAnnotation from hcluster before writing the
durable status (ReconciliationSucceeded/HostedClusterRestoredFromBackup
condition), which can lose the trigger if the status update fails; change the
flow so you do not consume/remove HostedClusterRestoredFromBackupAnnotation
until after the status write is confirmed: first set the
HostedClusterRestoredFromBackup condition on the HostedCluster status and
perform the status update (updateStatus on hcluster), retrying on conflict as
needed, and only after the status update succeeds remove the
HostedClusterRestoredFromBackupAnnotation (or perform the annotation removal in
a separate patch/update with proper conflict handling) so the reconcile will
retry if the status write failed.
- Around line 1456-1483: If reconcileCoreHCPChain failed and hcp is nil, phase‑8
helpers will dereference hcp and panic; guard the entire phase‑8 block by
checking if hcp == nil and, if so, append a recoverable error to componentErrs
(e.g. fmt.Errorf("skipping phase 8: HostedControlPlane is nil due to earlier
error")) and skip calling reconcileKubeconfigAndPasswordSync,
reconcileOperatorDeployments, reconcileRBACAndPolicies,
reconcilePlatformSpecific, and reconcileAuxiliary; otherwise run the existing
calls as before.
- Around line 2161-2174: The status fields holding secret references
(hcluster.Status.CustomKubeconfig and hcluster.Status.KubeadminPassword) are
only being cleared in memory; after deleting the Secrets you must also persist
those changes to the API by clearing the fields on the HostedCluster status and
calling the Status().Update (or Client.Status().Update) to save them. Modify the
branch that deletes the custom kubeconfig (and the other branch mentioned around
the KubeadminPassword) to set hcluster.Status.CustomKubeconfig = nil and/or
hcluster.Status.KubeadminPassword = nil as appropriate and then call
r.Status().Update(ctx, hcluster) (handling and returning any error) so the API
no longer holds dangling secret refs; use the existing DeleteIfNeeded flow and
ensure both branches behave the same way.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 33f6ef88-9d24-48cc-8609-666d2cca5d82
📒 Files selected for processing (3)
docs/design/hostedcluster-reconcile-segregation-analysis.mdhypershift-operator/controllers/hostedcluster/hostedcluster_controller.gohypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: muraee The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
was there jira bug we can ref reporting the scenario where this was being problematic for managed? |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
04ede4b to
168e67a
Compare
168e67a to
15e6a0c
Compare
15e6a0c to
e7bc83c
Compare
e7bc83c to
5bc8ca5
Compare
5bc8ca5 to
922ef75
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #7908 +/- ##
==========================================
+ Coverage 41.54% 41.86% +0.32%
==========================================
Files 758 759 +1
Lines 93838 94022 +184
==========================================
+ Hits 38986 39365 +379
+ Misses 52107 51924 -183
+ Partials 2745 2733 -12
... and 1 file with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
2d2292e to
4a89726
Compare
csrwng
left a comment
There was a problem hiding this comment.
Ship observability first, then change behavior
The refactoring approach is sound — categorizing operations as critical/non-critical and collecting errors instead of short-circuiting is a clear improvement. However, this PR changes the control flow of the most critical reconciler in the system, and the new failure modes (parallel error paths, reordered operations) are exactly the kind that don't surface in unit tests or standard e2e — they manifest under partial failures in production.
Step 0: Observability before behavior change
Before changing any error handling or ordering, ship a PR that adds structured logging or metrics to the existing sequential reconcile loop that records:
- Which operations fail and how often
- What downstream operations would have continued running under the new model
- Whether the current ordering assumptions actually matter in practice
This "dry-run" data from a production release cycle would validate the critical vs. non-critical categorization with real failure data, rather than guessing which operations are safe to unblock. It also gives us a baseline to compare against after the behavior change lands.
Then: incremental rollout
This PR bundles three distinct changes into one shot, which creates a large blast radius:
- Extracting inline blocks into named methods (pure refactor)
- Introducing
reconcileReportand wiring it up (new framework) - Reclassifying operations as non-critical and reordering them (behavior change)
Suggested split:
PR 1 — Extract methods (zero behavior change). Move inline code blocks into named methods, keeping the exact same sequential short-circuit order. This is safe to review, easy to verify (identical behavior), and reduces the diff for subsequent PRs.
PR 2 — Introduce reconcileReport, classify everything as critical. Wire up the report framework but keep all operations as critical so behavior is identical to today — every error still blocks downstream work. This validates the framework without changing semantics.
PR 3+ — Reclassify operations as nonCritical one group at a time. Move SSH key sync, audit webhook, etc. to nonCritical incrementally, with production validation between each change. Each PR is small, reviewable, and independently revertable.
Other notes
- Ordering changes need per-operation justification: The PR reorders several operations (e.g., CLI Secrets moved from first to last, RestoredFromBackup shifted relative to pull secret sync). Each change should have a brief rationale explaining why nothing downstream depends on the old position.
- Feature gate: Consider a flag (env var or annotation) to switch between old and new reconcile paths during the rollout period, so issues can be mitigated without a rollback.
4a89726 to
857c6b5
Compare
857c6b5 to
087c0cf
Compare
087c0cf to
ed33ab2
Compare
|
/test e2e-aws |
e03d791 to
6328533
Compare
6328533 to
db0fee8
Compare
…ng blocks The reconcile() method executes ~50 sequential operations where every error causes an early return, short-circuiting all remaining work. An unrelated failure (e.g., missing SSH key secret) prevents critical operations like deploying the CPO or reconciling the HCP object. This refactoring: - Extracts 12 inline blocks into named methods - Groups operations into phased error-collecting blocks - Aggregates all errors with utilerrors.NewAggregate at the end - Introduce reconcileReport struct that classifies reconcile operations as critical (blocks Phase 8) or non-critical (error-collecting, never blocks). Replace the sequential error chain where any failure short-circuits the entire loop with structured error collection and blocking rules. After this change, failures in one phase no longer block unrelated phases. For example, a missing SSH key no longer prevents CPO deployment or HCP object creation. Includes the analysis document and integration tests that verify non-blocking behavior across phases. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rror-collection framework Replace the ad-hoc early-return pull-secret recovery path (PR openshift#8352) with the error-collection framework. Instead of inline HCP reconciliation when GetPullSecretBytes fails, the reconciliation now flows through the framework where PullSecretSync captures the error as critical and CoreHCPChain reconciles the HCP with full cert resolution. Key changes: - Move GetPullSecretBytes, CPO image/label resolution, and namespace reconciliation into a single report.execute("CPOImageAndNamespace") block. This prevents namespace PSA label downgrades when CPO labels are unavailable. - Make DetermineHostedClusterPayloadArch and lookupReleaseImage non-fatal so reconciliation continues to the framework. - Make cpoSupportsKASCustomKubeconfig status check unconditional — all supported CPO versions expose custom kubeconfig. - Wrap releaseImageVersion parsing in report.execute(critical) to block OperatorDeployments and RBACAndPolicies on failure instead of hard-returning. - Extract reconcileControlPlaneNamespace into its own method. - Update pull-secret-missing tests with valid fixtures (NonePlatform, Route, valid UUID) so reconciliation reaches the framework. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
db0fee8 to
e2adc8a
Compare
|
/test e2e-aws |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/test e2e-aws |
2 similar comments
|
/test e2e-aws |
|
/test e2e-aws |
|
@muraee: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Now I have the complete picture. Here is the final report: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThis is a CI infrastructure failure unrelated to PR #7908. The e2e-aws job never reached any test execution — it failed during the ci-operator release payload import phase. Three out of five N-minor release payloads (n2minor/4.21, n3minor/4.20, n4minor/4.19) could not be imported because specific container images within those payloads were no longer available in the Root CauseThe ci-operator for this HyperShift e2e-aws job resolves multiple "N-minor" release payloads (n1minor through n4minor) representing previous OCP minor versions, which are used for cross-version upgrade testing. These payloads are snapshots of CI-built images stored in Three of the five release payloads contained image references that were no longer available:
The ci-operator attempted 6 reimport retries for each failing tag before timing out. Since these are required dependency steps in the execution graph, the entire job was aborted before any multi-stage test steps (pre/test/post phases) could begin. The two successfully imported payloads — This is a transient infrastructure issue and is completely unrelated to the PR code changes (refactoring the hostedcluster reconcile loop). Recommendations
Evidence
|
Summary
reconcile()in the HostedCluster controller to use categorized error handling with critical/non-critical operations instead of sequential short-circuiting. Previously, any single failure among ~50 operations would block all subsequent work — e.g., a missing SSH key secret prevented CPO deployment and HCP creation.reconcileReportstruct (reconcile_report.go) that classifies operations as critical (blocks downstream Phase 8) or nonCritical (errors collected, never blocks). When critical operations fail, Phase 8 components are automatically skipped with clear reporting of what failed and what was blocked.reconcileOperatorDeployments,reconcileRBACAndPolicies,reconcileKubeconfigAndPasswordSync,reconcileAuxiliary,reconcilePlatformSpecific) that collect errors independently.ReconciliationSucceededcondition now reflects the structured report: when critical failures exist, the condition message surfaces which operations failed and which were blocked (e.g.,critical failures: [PullSecretSync]; blocked operations: [OperatorDeployments, RBACAndPolicies, ...]).Key changes
Error categorization
Phase structure
Condition reporting
The
ReconciliationSucceededcondition now reflects the structured error report:critical failures: [PullSecretSync]; blocked operations: [KubeconfigAndPasswordSync, OperatorDeployments, RBACAndPolicies, PlatformOIDCAndCSI, MonitoringAndCLISecrets]Structured error aggregation
When critical failures exist,
aggregate()returns only critical errors with blocked operation list — non-critical errors are suppressed since the user should fix the critical issue first:When no critical failures exist, all errors are returned as-is.
reconcileReport API
Two public methods on the report:
execute(name, category, func() error)— always runs the operation and records the resultexecuteOrBlock(name, func() error)— automatically checkshasCriticalFailure()and either runs the operation or records it as blockedAnalysis
See
docs/design/hostedcluster-reconcile-segregation-analysis.mdfor the full design.Test plan
go test -count=1 -race ./hypershift-operator/controllers/hostedcluster/)make lintpasses with 0 issuesreconcileReportmethods (TestReconcileReport,TestConditionMessage,TestAggregate,TestExecuteOrBlock)TestReconcileKubeconfigAndPasswordSync_*,TestReconcileRBACAndPolicies_*)