CNTRLPLANE-3145: refactor(hostedcluster): segregate reconcile loop into error-collecting blocks #7908

Open
muraee wants to merge 2 commits into openshift:main from muraee:refactor/hostedcluster-reconcile-error-collecting

muraee commented Mar 10, 2026
Contributor

Summary

  • Refactors reconcile() in the HostedCluster controller to use categorized error handling with critical/non-critical operations instead of sequential short-circuiting. Previously, any single failure among ~50 operations would block all subsequent work — e.g., a missing SSH key secret prevented CPO deployment and HCP creation.
  • Introduces a reconcileReport struct (reconcile_report.go) that classifies operations as critical (blocks downstream Phase 8) or nonCritical (errors collected, never blocks). When critical operations fail, Phase 8 components are automatically skipped with clear reporting of what failed and what was blocked.
  • Extracts inline code blocks into named methods and introduces wrapper methods (reconcileOperatorDeployments, reconcileRBACAndPolicies, reconcileKubeconfigAndPasswordSync, reconcileAuxiliary, reconcilePlatformSpecific) that collect errors independently.
  • The ReconciliationSucceeded condition now reflects the structured report: when critical failures exist, the condition message surfaces which operations failed and which were blocked (e.g., critical failures: [PullSecretSync]; blocked operations: [OperatorDeployments, RBACAndPolicies, ...]).

Key changes

Error categorization

| Category | Behavior | Operations |
| --- | --- | --- |
| critical | Failures block Phase 8 components | PlatformCredentials, PullSecretSync, SecretEncryptionSync, CoreHCPChain |
| nonCritical | Errors collected, never blocks | SSHKeySync, AuditWebhookSync, AdditionalTrustBundle, all Phase 8 groups |

Phase structure

| Phase | Behavior | Operations |
| --- | --- | --- |
| 0–5 | Short-circuit (prerequisites) | HCP get, deletion, platform defaults, status, finalizers, namespace, platform |
| 6a | Critical sync (error-collecting) | PlatformCredentials, PullSecretSync, SecretEncryptionSync |
| 6b | Non-critical sync (error-collecting, never blocked) | RestoredFromBackup, AuditWebhookSync, SSHKeySync, AdditionalTrustBundle, SA signing key, etcd MTLS, ETCDMemberRecovery, GlobalConfigSync |
| 7 | Core HCP chain (always runs regardless of 6a) | HCP object → CAPI InfraCR → CAPI Cluster |
| 8 | Components (blocked if any critical failure) | KubeconfigAndPasswordSync, OperatorDeployments, RBACAndPolicies, PlatformOIDCAndCSI, MonitoringAndCLISecrets |

Condition reporting

The ReconciliationSucceeded condition now reflects the structured error report:

  • When critical failures exist, the condition message includes which operations failed and which were blocked
  • When only non-critical failures exist, the condition reports the aggregate error as before
  • Example condition message: critical failures: [PullSecretSync]; blocked operations: [KubeconfigAndPasswordSync, OperatorDeployments, RBACAndPolicies, PlatformOIDCAndCSI, MonitoringAndCLISecrets]

Structured error aggregation

When critical failures exist, aggregate() returns only the critical errors together with the list of blocked operations. Non-critical errors are suppressed because the user should fix the critical issue first:

critical error: failed to get pull secret...; blocked operations: [KubeconfigAndPasswordSync, OperatorDeployments, RBACAndPolicies, PlatformOIDCAndCSI, MonitoringAndCLISecrets]

When no critical failures exist, all errors are returned as-is.
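
The aggregation rule above could be sketched roughly as follows. This is an illustration only, not the code in reconcile_report.go: the opResult type, its fields, and the exampleResults helper are assumptions made for the sketch, and the standard library's errors.Join stands in for whatever aggregation helper the PR actually uses.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// opResult is a hypothetical record of one reconcile operation's outcome.
type opResult struct {
	name     string
	critical bool
	err      error // nil on success
	blocked  bool  // true if the operation was skipped
}

// aggregate returns only the critical errors plus the blocked list when any
// critical operation failed; otherwise it joins every collected error.
func aggregate(results []opResult) error {
	var criticalErrs, allErrs []error
	var blocked []string
	for _, r := range results {
		switch {
		case r.blocked:
			blocked = append(blocked, r.name)
		case r.err != nil:
			allErrs = append(allErrs, r.err)
			if r.critical {
				criticalErrs = append(criticalErrs, r.err)
			}
		}
	}
	if len(criticalErrs) > 0 {
		return fmt.Errorf("critical error: %w; blocked operations: [%s]",
			errors.Join(criticalErrs...), strings.Join(blocked, ", "))
	}
	return errors.Join(allErrs...) // nil when nothing failed
}

// exampleResults builds sample data: one critical failure, one non-critical
// failure, and two operations that were blocked as a consequence.
func exampleResults() []opResult {
	return []opResult{
		{name: "PullSecretSync", critical: true, err: errors.New("failed to get pull secret")},
		{name: "SSHKeySync", err: errors.New("ssh key secret not found")},
		{name: "OperatorDeployments", blocked: true},
		{name: "RBACAndPolicies", blocked: true},
	}
}

func main() {
	fmt.Println(aggregate(exampleResults()))
}
```

In this sketch the SSH key error is dropped from the returned message whenever the pull secret failure is present, which mirrors the suppression described above.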

reconcileReport API

Two methods on the report (both unexported, so internal to the package):

  • execute(name, category, func() error) — always runs the operation and records the result
  • executeOrBlock(name, func() error) — automatically checks hasCriticalFailure() and either runs the operation or records it as blocked
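
A minimal sketch of how these two methods might behave, assuming a simplified report that only tracks names and errors. The field names and the succeed and failWith helpers are hypothetical; the real struct in reconcile_report.go may differ.

```go
package main

import (
	"errors"
	"fmt"
)

type category int

const (
	critical category = iota
	nonCritical
)

// reconcileReport is an illustrative stand-in, not the PR's struct.
type reconcileReport struct {
	criticalFailed []string
	blocked        []string
	errs           []error
}

func (r *reconcileReport) hasCriticalFailure() bool { return len(r.criticalFailed) > 0 }

// execute always runs fn and records any failure under the given category.
func (r *reconcileReport) execute(name string, cat category, fn func() error) {
	if err := fn(); err != nil {
		r.errs = append(r.errs, fmt.Errorf("%s: %w", name, err))
		if cat == critical {
			r.criticalFailed = append(r.criticalFailed, name)
		}
	}
}

// executeOrBlock runs fn as a non-critical operation unless a critical
// failure was already recorded, in which case it records name as blocked.
func (r *reconcileReport) executeOrBlock(name string, fn func() error) {
	if r.hasCriticalFailure() {
		r.blocked = append(r.blocked, name)
		return
	}
	r.execute(name, nonCritical, fn)
}

// succeed and failWith are hypothetical helpers for the demo.
func succeed() error { return nil }

func failWith(msg string) func() error {
	return func() error { return errors.New(msg) }
}

func main() {
	report := &reconcileReport{}
	report.execute("SSHKeySync", nonCritical, failWith("secret not found"))
	report.execute("PullSecretSync", critical, failWith("secret not found"))
	report.execute("CoreHCPChain", critical, succeed)
	report.executeOrBlock("OperatorDeployments", succeed)
	fmt.Println(report.criticalFailed, report.blocked)
}
```

Here the SSH key failure is recorded but blocks nothing, while the pull secret failure causes OperatorDeployments to be recorded as blocked instead of run.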

Analysis

See docs/design/hostedcluster-reconcile-segregation-analysis.md for the full design.

Test plan

  • All existing unit tests pass (go test -count=1 -race ./hypershift-operator/controllers/hostedcluster/)
  • make lint passes with 0 issues
  • New unit tests for reconcileReport methods (TestReconcileReport, TestConditionMessage, TestAggregate, TestExecuteOrBlock)
  • New unit tests for wrapper method isolation (TestReconcileKubeconfigAndPasswordSync_*, TestReconcileRBACAndPolicies_*)
  • New integration tests verifying blocking behavior:
    • Phase 6a critical failure → Phase 8 blocked, Phase 7 still runs
    • Phase 7 HCP creation failure → Phase 8 blocked
    • Phase 6b non-critical failure → nothing blocked

@openshift-ci-robot


Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

coderabbitai Bot commented Mar 10, 2026
Contributor
Walkthrough

This pull request introduces a phased, modular refactoring of the HostedCluster reconciliation loop. It adds a design document analyzing reconciliation segregation, restructures the controller into nine sequential phases with independent error aggregation, and introduces multiple helper functions to isolate functionality. A signature change removes the defaultIngressDomain parameter from reconcileControlPlaneOperator, and new tests validate partial progress when operations fail.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Design Documentation: docs/design/hostedcluster-reconcile-segregation-analysis.md | New design document detailing reconciliation segregation analysis, including operation map splits (Pre-requisite, Part One, Part Two), dependency graphs, identified blocking issues, and impact assessment showing partial progress scenarios. |
| Controller Refactoring: hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go | Major restructuring into 9 phases: initialization, pre-deletion propagation, deletion handling, conversion fixes, status updates, prerequisites, and three independent phase blocks. Introduces 15+ new modular helper functions (e.g., reconcileCoreHCPChain, reconcileOperatorDeployments, reconcilePlatformCredentialsWithStatus), changes reconcileControlPlaneOperator signature (removes defaultIngressDomain parameter), and implements aggregated error collection across independent syncs. |
| Test Coverage: hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go | Adds three new tests validating resilient reconciliation: kubeconfig sync failure with continued kubeadmin-password sync, RBAC failure with continued Prometheus RBAC creation, and Phase 6 SSH key failure with continued Phase 7–8 completion. Includes rbacv1 import for RBAC assertions. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | Three new tests lack context timeouts, have incomplete coverage for failure scenarios, and include inconsistent assertion messages. | Add context timeouts using context.WithTimeout(), mock dependencies to force failures in the PKI RBAC test, and add meaningful messages to all assertions. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Stable And Deterministic Test Names | ✅ Passed | Pull request uses Go's standard testing package (func TestXxx) rather than Ginkgo, so the Ginkgo test title stability check does not apply. Test function names are descriptive and static with no dynamic values. |
| Title check | ✅ Passed | The title accurately describes the primary refactoring: segregating the reconcile loop into error-collecting blocks with clearer phase separation. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


muraee commented Mar 10, 2026
Contributor Author

/test e2e-aws

coderabbitai Bot left a comment

Contributor

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/design/hostedcluster-reconcile-segregation-analysis.md`:
- Around line 112-190: The markdown fenced block containing the ASCII diagram
(the block that begins with
"+-----------------------------------------------------+" and includes "CRITICAL
PREREQUISITES (must succeed first)") needs a language label to satisfy
markdownlint: change the opening fence from ``` to ```text so the diagram is
fenced as a text block; update the single fenced block in the file (the ASCII
diagram between the backticks) accordingly.

In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`:
- Around line 1726-1728: The code currently checks and then removes the
HostedClusterRestoredFromBackupAnnotation from hcluster before writing the
durable status (ReconciliationSucceeded/HostedClusterRestoredFromBackup
condition), which can lose the trigger if the status update fails; change the
flow so you do not consume/remove HostedClusterRestoredFromBackupAnnotation
until after the status write is confirmed: first set the
HostedClusterRestoredFromBackup condition on the HostedCluster status and
perform the status update (updateStatus on hcluster), retrying on conflict as
needed, and only after the status update succeeds remove the
HostedClusterRestoredFromBackupAnnotation (or perform the annotation removal in
a separate patch/update with proper conflict handling) so the reconcile will
retry if the status write failed.
- Around line 1456-1483: If reconcileCoreHCPChain failed and hcp is nil, phase‑8
helpers will dereference hcp and panic; guard the entire phase‑8 block by
checking if hcp == nil and, if so, append a recoverable error to componentErrs
(e.g. fmt.Errorf("skipping phase 8: HostedControlPlane is nil due to earlier
error")) and skip calling reconcileKubeconfigAndPasswordSync,
reconcileOperatorDeployments, reconcileRBACAndPolicies,
reconcilePlatformSpecific, and reconcileAuxiliary; otherwise run the existing
calls as before.
- Around line 2161-2174: The status fields holding secret references
(hcluster.Status.CustomKubeconfig and hcluster.Status.KubeadminPassword) are
only being cleared in memory; after deleting the Secrets you must also persist
those changes to the API by clearing the fields on the HostedCluster status and
calling the Status().Update (or Client.Status().Update) to save them. Modify the
branch that deletes the custom kubeconfig (and the other branch mentioned around
the KubeadminPassword) to set hcluster.Status.CustomKubeconfig = nil and/or
hcluster.Status.KubeadminPassword = nil as appropriate and then call
r.Status().Update(ctx, hcluster) (handling and returning any error) so the API
no longer holds dangling secret refs; use the existing DeleteIfNeeded flow and
ensure both branches behave the same way.
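
The ordering suggested in the restored-from-backup comment above (persist the status first, remove the annotation only afterwards) might look roughly like this. It is a sketch under stated assumptions: the hostedCluster type is a reduced stand-in for the real API type, the annotation key is a placeholder, and updateStatus is an injected function rather than a real client call.

```go
package main

import (
	"errors"
	"fmt"
)

// restoredAnnotation is a placeholder key, not the real annotation value.
const restoredAnnotation = "example.io/restored-from-backup"

// hostedCluster is a reduced stand-in for the real API type.
type hostedCluster struct {
	Annotations       map[string]string
	RestoredCondition bool
}

// okUpdate and failingUpdate are hypothetical status writers for the demo.
func okUpdate(*hostedCluster) error      { return nil }
func failingUpdate(*hostedCluster) error { return errors.New("conflict") }

// recordRestore sets the condition and persists it before consuming the
// annotation, so a failed status write leaves the trigger in place and the
// next reconcile retries.
func recordRestore(hc *hostedCluster, updateStatus func(*hostedCluster) error) error {
	if _, ok := hc.Annotations[restoredAnnotation]; !ok {
		return nil
	}
	hc.RestoredCondition = true
	if err := updateStatus(hc); err != nil {
		return fmt.Errorf("persisting restore condition: %w", err)
	}
	delete(hc.Annotations, restoredAnnotation)
	return nil
}

func main() {
	hc := &hostedCluster{Annotations: map[string]string{restoredAnnotation: "true"}}
	fmt.Println(recordRestore(hc, failingUpdate), len(hc.Annotations))
	fmt.Println(recordRestore(hc, okUpdate), len(hc.Annotations))
}
```

In the real controller the annotation removal would itself be an API write that needs conflict handling, which this sketch leaves out.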

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 33f6ef88-9d24-48cc-8609-666d2cca5d82

📥 Commits

Reviewing files that changed from the base of the PR and between cc479bc and 04ede4b.

📒 Files selected for processing (3)
  • docs/design/hostedcluster-reconcile-segregation-analysis.md
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go

Comment thread docs/design/hostedcluster-reconcile-segregation-analysis.md
Comment thread hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go Outdated
@openshift-ci openshift-ci Bot requested review from enxebre and sjenning March 10, 2026 16:26
@openshift-ci openshift-ci Bot added the area/documentation Indicates the PR includes changes for documentation label Mar 10, 2026
openshift-ci Bot commented Mar 10, 2026
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: muraee

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-area labels Mar 10, 2026
Comment thread docs/design/hostedcluster-reconcile-segregation-analysis.md Outdated
Comment thread docs/design/hostedcluster-reconcile-segregation-analysis.md Outdated
enxebre commented Mar 12, 2026
Member

Was there a Jira bug we can reference that reports the scenario where this was problematic for managed?

Comment thread docs/design/hostedcluster-reconcile-segregation-analysis.md Outdated
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 15, 2026
@openshift-merge-robot

Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch from 04ede4b to 168e67a Compare March 31, 2026 11:16
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 31, 2026
@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch from 168e67a to 15e6a0c Compare March 31, 2026 11:23
@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch from 15e6a0c to e7bc83c Compare March 31, 2026 11:27
@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch from e7bc83c to 5bc8ca5 Compare March 31, 2026 11:40
@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch from 5bc8ca5 to 922ef75 Compare March 31, 2026 11:47
codecov Bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 51.36187% with 375 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.86%. Comparing base (4755e9c) to head (e2adc8a).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...trollers/hostedcluster/hostedcluster_controller.go | 45.50% | 326 Missing and 44 partials ⚠️ |
| hypershift-operator/main.go | 0.00% | 2 Missing ⚠️ |
| support/util/util.go | 60.00% | 1 Missing and 1 partial ⚠️ |
| ...rollers/hostedcluster/internal/platform/aws/aws.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7908      +/-   ##
==========================================
+ Coverage   41.54%   41.86%   +0.32%     
==========================================
  Files         758      759       +1     
  Lines       93838    94022     +184     
==========================================
+ Hits        38986    39365     +379     
+ Misses      52107    51924     -183     
+ Partials     2745     2733      -12     
| Files with missing lines | Coverage Δ |
| --- | --- |
| ...ator/controllers/hostedcluster/reconcile_report.go | 100.00% <100.00%> (ø) |
| ...rollers/hostedcluster/internal/platform/aws/aws.go | 14.09% <0.00%> (ø) |
| hypershift-operator/main.go | 0.00% <0.00%> (ø) |
| support/util/util.go | 39.71% <60.00%> (+0.16%) ⬆️ |
| ...trollers/hostedcluster/hostedcluster_controller.go | 51.68% <45.50%> (+5.79%) ⬆️ |

... and 1 file with indirect coverage changes

| Flag | Coverage Δ |
| --- | --- |
| cmd-support | 34.96% <60.00%> (+<0.01%) ⬆️ |
| cpo-hostedcontrolplane | 43.59% <ø> (ø) |
| cpo-other | 43.17% <ø> (ø) |
| hypershift-operator | 52.76% <51.30%> (+1.14%) ⬆️ |
| other | 31.56% <ø> (ø) |


@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch 2 times, most recently from 2d2292e to 4a89726 Compare May 5, 2026 16:41

csrwng left a comment

Contributor

Ship observability first, then change behavior

The refactoring approach is sound — categorizing operations as critical/non-critical and collecting errors instead of short-circuiting is a clear improvement. However, this PR changes the control flow of the most critical reconciler in the system, and the new failure modes (parallel error paths, reordered operations) are exactly the kind that don't surface in unit tests or standard e2e — they manifest under partial failures in production.

Step 0: Observability before behavior change

Before changing any error handling or ordering, ship a PR that adds structured logging or metrics to the existing sequential reconcile loop that records:

  • Which operations fail and how often
  • What downstream operations would have continued running under the new model
  • Whether the current ordering assumptions actually matter in practice

This "dry-run" data from a production release cycle would validate the critical vs. non-critical categorization with real failure data, rather than guessing which operations are safe to unblock. It also gives us a baseline to compare against after the behavior change lands.
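
One way such a dry run might be approximated is sketched below: the loop keeps today's short-circuit behavior, but on the first failure it also reports which skipped operations would still have been attempted under the proposed classification. The operation type, its critical and blockable flags, and the helper functions are hypothetical, and the sketch ignores that later operations could fail on their own.

```go
package main

import (
	"errors"
	"fmt"
)

// operation is a hypothetical description of one reconcile step.
type operation struct {
	name      string
	critical  bool // proposed classification under the new model
	blockable bool // proposed Phase 8 style step, skipped after a critical failure
	run       func() error
}

func succeed() error { return nil }

func failWith(msg string) func() error {
	return func() error { return errors.New(msg) }
}

// reconcileSequential keeps the existing short-circuit behavior. On the first
// failure it also returns the operations that are skipped today but would
// still have been attempted under the proposed model.
func reconcileSequential(ops []operation) (wouldHaveRun []string, err error) {
	for i, op := range ops {
		if runErr := op.run(); runErr != nil {
			for _, rest := range ops[i+1:] {
				if op.critical && rest.blockable {
					continue // would be blocked under the new model too
				}
				wouldHaveRun = append(wouldHaveRun, rest.name)
			}
			return wouldHaveRun, fmt.Errorf("%s: %w", op.name, runErr)
		}
	}
	return nil, nil
}

func main() {
	ops := []operation{
		{name: "SSHKeySync", run: failWith("ssh key secret not found")},
		{name: "PullSecretSync", critical: true, run: succeed},
		{name: "OperatorDeployments", blockable: true, run: succeed},
	}
	skipped, err := reconcileSequential(ops)
	fmt.Printf("error: %v; would have continued: %v\n", err, skipped)
}
```

The returned list could be fed to a structured log line or a counter metric, so the classification can be checked against real failure data before any behavior changes.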

Then: incremental rollout

This PR bundles three distinct changes into one shot, which creates a large blast radius:

  1. Extracting inline blocks into named methods (pure refactor)
  2. Introducing reconcileReport and wiring it up (new framework)
  3. Reclassifying operations as non-critical and reordering them (behavior change)

Suggested split:

PR 1 — Extract methods (zero behavior change). Move inline code blocks into named methods, keeping the exact same sequential short-circuit order. This is safe to review, easy to verify (identical behavior), and reduces the diff for subsequent PRs.

PR 2 — Introduce reconcileReport, classify everything as critical. Wire up the report framework but keep all operations as critical so behavior is identical to today — every error still blocks downstream work. This validates the framework without changing semantics.

PR 3+ — Reclassify operations as nonCritical one group at a time. Move SSH key sync, audit webhook, etc. to nonCritical incrementally, with production validation between each change. Each PR is small, reviewable, and independently revertable.

Other notes

  • Ordering changes need per-operation justification: The PR reorders several operations (e.g., CLI Secrets moved from first to last, RestoredFromBackup shifted relative to pull secret sync). Each change should have a brief rationale explaining why nothing downstream depends on the old position.
  • Feature gate: Consider a flag (env var or annotation) to switch between old and new reconcile paths during the rollout period, so issues can be mitigated without a rollback.
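
As a rough illustration of the feature-gate idea, an environment variable could select between the two reconcile paths. The variable name below is invented for this sketch and is not an existing HyperShift setting; the lookup function is injected so the choice can be tested without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
)

// legacyReconcileEnv is a made-up variable name for this sketch.
const legacyReconcileEnv = "EXAMPLE_LEGACY_RECONCILE"

// selectReconcile returns the legacy path when the gate is set to "true" and
// the error-collecting path otherwise.
func selectReconcile(getenv func(string) string, legacy, collecting func() error) func() error {
	if getenv(legacyReconcileEnv) == "true" {
		return legacy
	}
	return collecting
}

func main() {
	legacy := func() error { fmt.Println("sequential short-circuit path"); return nil }
	collecting := func() error { fmt.Println("error-collecting path"); return nil }
	if err := selectReconcile(os.Getenv, legacy, collecting)(); err != nil {
		fmt.Println("reconcile failed:", err)
	}
}
```

An annotation-based gate would follow the same shape, with the lookup reading from the HostedCluster object instead of the environment.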

@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch from 4a89726 to 857c6b5 Compare May 28, 2026 16:22
@openshift-ci openshift-ci Bot added the area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release label May 28, 2026
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-7908 May 28, 2026 16:32 Inactive
Comment thread support/config/constants.go Outdated
@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch from 857c6b5 to 087c0cf Compare May 29, 2026 12:51
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-7908 May 29, 2026 12:53 Inactive
@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch from 087c0cf to ed33ab2 Compare June 8, 2026 16:05
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-7908 June 8, 2026 16:07 Inactive
muraee commented Jun 9, 2026
Contributor Author

/test e2e-aws

@github-actions github-actions Bot temporarily deployed to docs-preview/pr-7908 June 9, 2026 16:38 Inactive
@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch 2 times, most recently from e03d791 to 6328533 Compare June 10, 2026 10:24
@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 10, 2026
@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch from 6328533 to db0fee8 Compare June 10, 2026 10:37
muraee and others added 2 commits June 10, 2026 12:43
…ng blocks

The reconcile() method executes ~50 sequential operations where every
error causes an early return, short-circuiting all remaining work. An
unrelated failure (e.g., missing SSH key secret) prevents critical
operations like deploying the CPO or reconciling the HCP object.

This refactoring:

- Extracts 12 inline blocks into named methods
- Groups operations into phased error-collecting blocks
- Aggregates all errors with utilerrors.NewAggregate at the end
- Introduce reconcileReport struct that classifies reconcile operations as
critical (blocks Phase 8) or non-critical (error-collecting, never blocks).
Replace the sequential error chain where any failure short-circuits the
entire loop with structured error collection and blocking rules.

After this change, failures in one phase no longer block unrelated
phases. For example, a missing SSH key no longer prevents CPO deployment
or HCP object creation.

Includes the analysis document and integration tests that verify
non-blocking behavior across phases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rror-collection framework

Replace the ad-hoc early-return pull-secret recovery path (PR openshift#8352) with
the error-collection framework. Instead of inline HCP reconciliation when
GetPullSecretBytes fails, the reconciliation now flows through the framework
where PullSecretSync captures the error as critical and CoreHCPChain
reconciles the HCP with full cert resolution.

Key changes:
- Move GetPullSecretBytes, CPO image/label resolution, and namespace
  reconciliation into a single report.execute("CPOImageAndNamespace")
  block. This prevents namespace PSA label downgrades when CPO labels
  are unavailable.
- Make DetermineHostedClusterPayloadArch and lookupReleaseImage non-fatal
  so reconciliation continues to the framework.
- Make cpoSupportsKASCustomKubeconfig status check unconditional — all
  supported CPO versions expose custom kubeconfig.
- Wrap releaseImageVersion parsing in report.execute(critical) to block
  OperatorDeployments and RBACAndPolicies on failure instead of hard-returning.
- Extract reconcileControlPlaneNamespace into its own method.
- Update pull-secret-missing tests with valid fixtures (NonePlatform, Route,
  valid UUID) so reconciliation reaches the framework.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@muraee muraee force-pushed the refactor/hostedcluster-reconcile-error-collecting branch from db0fee8 to e2adc8a Compare June 10, 2026 10:45
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 10, 2026
muraee commented Jun 10, 2026
Contributor Author

/test e2e-aws

@hypershift-jira-solve-ci


AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2064660417933742080 | Cost: $1.5865967499999996 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

muraee commented Jun 10, 2026
Contributor Author

/test e2e-aws

muraee commented Jun 11, 2026
Contributor Author

/test e2e-aws

muraee commented Jun 11, 2026
Contributor Author

/test e2e-aws

openshift-ci Bot commented Jun 11, 2026
Contributor

@muraee: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/docs-preview | 6805539 | link | false | /test docs-preview |
| ci/prow/verify-workflows | 6805539 | link | true | /test verify-workflows |
| ci/prow/e2e-aws | e2adc8a | link | true | /test e2e-aws |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

hypershift-jira-solve-ci Bot commented Jun 11, 2026

Now I have the complete picture. Here is the final report:

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-aws
  • Build ID: 2065020173840027648
  • Target: e2e-aws
  • PR: #7908 CNTRLPLANE-3145: refactor(hostedcluster): segregate reconcile loop into error-collecting blocks
  • Failure Reason: executing_graph:step_failed:importing_release

Test Failure Analysis

Error

could not run steps: step [release:n3minor] failed: failed to import release
4.20.0-0.ci-2026-06-07-204143 to tag release:n3minor: failed to reimport the tag
ci-op-nk6yp9sw/stable-n3minor:hypershift: unable to import tag ... with message
Internal error occurred: dockerimage.image.openshift.io
"quay.io/openshift/ci@sha256:ef1b3047fb8915cf4bfd3e7a08ed90a1312b8399a152d8acf69767055a2a446d"
not found ... timed out waiting for the condition

(plus 2 additional release import failures: n2minor and n4minor)

Summary

This is a CI infrastructure failure unrelated to PR #7908. The e2e-aws job never reached any test execution — it failed during the ci-operator release payload import phase. Three out of five N-minor release payloads (n2minor/4.21, n3minor/4.20, n4minor/4.19) could not be imported because specific container images within those payloads were no longer available in the quay.io/openshift/ci and quay-proxy.ci.openshift.org/openshift/ci registries. The affected payloads were 3–8 days old at the time of the job run and their images had likely been garbage-collected. No code from the PR was tested. Retrying (/retest) should resolve this by picking up fresher release payloads.

Root Cause

The ci-operator for this HyperShift e2e-aws job resolves multiple "N-minor" release payloads (n1minor through n4minor) representing previous OCP minor versions, which are used for cross-version upgrade testing. These payloads are snapshots of CI-built images stored in quay.io/openshift/ci.

Three of the five release payloads contained image references that were no longer available:

  1. n2minor (4.21.0-0.ci-2026-06-08-003009, 3 days old): agent-installer-ui image sha256:abea17a3... not found
  2. n3minor (4.20.0-0.ci-2026-06-07-204143, 4 days old): hypershift image sha256:ef1b3047... not found
  3. n4minor (4.19.0-0.ci-2026-06-03-210413, 8 days old): machine-config-operator image sha256:90428a82... not found

The ci-operator attempted 6 reimport retries for each failing tag before timing out. Since these are required dependency steps in the execution graph, the entire job was aborted before any multi-stage test steps (pre/test/post phases) could begin.

The two successfully imported payloads — initial (5.0.0, 1 day old) and n1minor (4.22.0, 4 days old) — had all their images still available, supporting the hypothesis that the failures are due to image garbage collection of stale CI payload images on the registry side.

This is a transient infrastructure issue and is completely unrelated to the PR code changes (refactoring the hostedcluster reconcile loop).

Recommendations
  1. Retest the job — Run /retest or /test e2e-aws on the PR. The ci-operator will resolve fresh latest payloads for each N-minor stream, picking up newer images that are still available in the registry.
  2. No code changes needed — This failure is entirely a CI infrastructure issue (stale release payload image references). The PR code was never tested.
  3. If retest fails again — Check the OpenShift CI status page and #forum-ocp-crt / #announce-testplatform Slack channels for known registry or image-mirroring issues. File a bug against DPTP if the problem persists across multiple retests.

Evidence

| Evidence | Detail |
| --- | --- |
| Failure stage | Release payload import (pre-test infrastructure), not test execution |
| Failure reason | executing_graph:step_failed:importing_release |
| Failed step: [release:n2minor] | Image agent-installer-ui (sha256:abea17a3...) not found in quay.io/openshift/ci; payload 4.21.0-0.ci-2026-06-08-003009 (3 days old) |
| Failed step: [release:n3minor] | Image hypershift (sha256:ef1b3047...) not found in quay.io/openshift/ci; payload 4.20.0-0.ci-2026-06-07-204143 (4 days old) |
| Failed step: [release:n4minor] | Image machine-config-operator (sha256:90428a82...) not found in quay.io/openshift/ci; payload 4.19.0-0.ci-2026-06-03-210413 (8 days old) |
| Passed step: [release:initial] | 5.0.0-0.ci-2026-06-10-000905 (1 day old), imported successfully |
| Passed step: [release:n1minor] | 4.22.0-0.ci-2026-06-07-214855 (4 days old), imported successfully |
| Test steps executed | None; job aborted before any e2e-aws multi-stage test steps ran |
| JUnit XML | junit_operator.xml: 25 tests, 3 failures; all 3 are release import steps |
| PR code relevance | None; failure is in CI infrastructure, not in PR code |

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
  • area/documentation: Indicates the PR includes changes for documentation
  • area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
  • area/platform/aws: PR/issue for AWS (AWSPlatform) platform
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.

6 participants