Skip to content

ROSAENG-8224: feat(operator): add --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions#8689

Open
Ajpantuso wants to merge 1 commit into
openshift:mainfrom
Ajpantuso:apantuso/rosaeng-8224
Open

ROSAENG-8224: feat(operator): add --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions#8689
Ajpantuso wants to merge 1 commit into
openshift:mainfrom
Ajpantuso:apantuso/rosaeng-8224

Conversation

@Ajpantuso

@Ajpantuso Ajpantuso commented Jun 5, 2026

Copy link
Copy Markdown

What this PR does / why we need it:

HCP namespace egress NetworkPolicies (private-router, management-kas) use /32 CIDRs
derived from the management cluster's KAS endpoint IPs to block HCP pods from reaching
the hosting KAS. These IPs rotate during rolling restarts, causing a burst of NetworkPolicy
UPDATEs that can trigger OVN port-group reconciliation races and drop traffic to HCP router
pods for 20-30+ minutes.

Adds a --hcp-egress-block-cidrs repeatable flag to the HyperShift Operator. When
supplied, the egress exception list in both policies is built from the provided stable CIDR
blocks (e.g. the MC machine network CIDR) instead of the live default/kubernetes Endpoints
object. This eliminates NetworkPolicy churn during KAS rollouts and avoids the OVN race.

The flag is wired end-to-end: hypershift install --hcp-egress-block-cidrs=<cidr> sets it
on the operator Deployment. Values are validated as valid CIDRs at startup. May be specified
multiple times.

Default behaviour (flag absent) is unchanged: KAS endpoint IPs continue to be discovered at
reconcile time from the default/kubernetes Endpoints object.

This PR will remain in draft until live testing is completed to ensure no regressions.

Which issue(s) this PR fixes:

Fixes ROSAENG-8224

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features
    • Added a repeatable CLI flag --hcp-egress-block-cidrs to supply one or more CIDRs (validated at startup). Provided CIDRs are used for HCP namespace egress policies instead of dynamically discovered endpoints—reducing policy churn during control-plane restarts—and are propagated into the operator deployment and install manifests.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 5, 2026
@openshift-ci

openshift-ci Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 5, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 5, 2026

Copy link
Copy Markdown

@Ajpantuso: This pull request references ROSAENG-8224 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Adds a --use-mc-machine-network-for-kas-block flag to the HyperShift Operator.

When enabled, the operator reads the management cluster's machine network
CIDR(s) from kube-system/cluster-config-v1 (the install-config key) and
uses them as the Except entries in the egress IPBlock rules of the
private-router and management-kas NetworkPolicies, instead of the
per-pod /32 CIDRs derived from the default/kubernetes Endpoints object.

If the ConfigMap is absent or unparseable, the operator falls back to the
existing kasEndpointsToCIDRs() behavior and logs a warning.

The flag is wired end-to-end: hypershift install --use-mc-machine-network-for-kas-block
sets it on the operator Deployment.

This PR will remain in draft until live testing is completed to ensure no regressions.

Which issue(s) this PR fixes:

Fixes ROSAENG-8224

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds support for statically configuring HCP egress block CIDRs to replace dynamic KAS endpoint derivation in network policies. Configuration flows through the install pipeline and operator startup with CLI flag validation, then into the HostedClusterReconciler. Network policy reconciliation is refactored to compute the kasBlock CIDR list once per reconcile and pass it through the policy chain, eliminating per-endpoint /32 derivation and reducing policy churn during KAS rolling restarts.

Sequence Diagram(s)

sequenceDiagram
  participant UserCLI as User CLI
  participant OperatorMain as Operator main
  participant Reconciler as HostedClusterReconciler
  participant K8sAPI as Kubernetes API
  UserCLI->>OperatorMain: provide --hcp-egress-block-cidrs values
  OperatorMain->>OperatorMain: validate CIDRs
  OperatorMain->>Reconciler: set HCPEgressBlockCIDRs
  Reconciler->>Reconciler: kasBlockCIDRs(kubernetesEndpoint) -> kasBlock
  Reconciler->>K8sAPI: apply NetworkPolicies using kasBlock in IPBlock.Except
Loading

Suggested reviewers

  • sjenning
  • sdminonne
🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Ipv6 And Disconnected Network Test Compatibility ⚠️ Warning Two new unit tests contain hardcoded IPv4 addresses (10.0.0.1, 10.100.0.0/16, 10.128.0.0/14, 172.30.0.0/16) and CIDRs without IPv6 equivalents, incompatible with IPv6-only disconnected environments. Update tests to use dynamic IP family detection or add IPv6 equivalents; use correctCIDRFamily() helper to select appropriate CIDRs for the cluster IP family.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the primary change: adding a --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names are stable and deterministic. New tests use static string literals with no dynamic values like pod names, IPs, or timestamps.
Test Structure And Quality ✅ Passed New tests follow all quality requirements: single responsibility per test, proper setup/cleanup with fake clients, meaningful assertion messages, and consistent with existing patterns.
Topology-Aware Scheduling Compatibility ✅ Passed PR adds only networking configuration (HCPEgressBlockCIDRs field and CLI flag). No scheduling constraints, affinity rules, replicas, node selectors, tolerations, or topology settings modified.
No-Weak-Crypto ✅ Passed PR contains no weak crypto usage: no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, no custom crypto, no insecure secret comparisons. Changes add CIDR field strings and net.ParseCIDR() validation only.
Container-Privileges ✅ Passed PR adds network policy CIDR filtering flag with no container privilege escalation or host access settings.
No-Sensitive-Data-In-Logs ✅ Passed No sensitive data (passwords, tokens, API keys, PII, session IDs) are logged. CIDRs logged in line 228 of main.go for validation errors are public network ranges, not sensitive.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added the area/cli Indicates the PR includes changes for CLI label Jun 5, 2026
@openshift-ci

openshift-ci Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Ajpantuso
Once this PR has been reviewed and has the lgtm label, please assign csrwng for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels Jun 5, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/network_policies_test.go (1)

1697-1707: ⚡ Quick win

Assert management pod CIDR is still included in Except.

Line 1701 validates the feature-toggle CIDRs, but it doesn’t confirm the existing management cluster pod CIDR remains present. Add one assertion so this test also catches regressions where new KAS block CIDRs replace (instead of extend) existing exceptions.

Suggested test hardening
 			exceptCIDRs := collectExceptCIDRs(privateRouterPolicy)
+			g.Expect(exceptCIDRs).To(ContainElement("10.128.0.0/14"), "private-router Except list should retain management cluster pod CIDR")
 			for _, cidr := range tc.expectedExceptCIDRs {
 				g.Expect(exceptCIDRs).To(ContainElement(cidr), "private-router Except list should contain %s", cidr)
 			}

Based on learnings: "Unit test any code changes and additions and include e2e tests when changes impact consumer behaviour."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go`
around lines 1697 - 1707, The test currently checks feature-toggle CIDRs but
doesn't assert the management cluster pod CIDR is still present; after
collecting exceptCIDRs (from collectExceptCIDRs(privateRouterPolicy)) add an
assertion that the known management pod CIDR (e.g., managementPodCIDR or the
variable used elsewhere in this test file) is contained in exceptCIDRs using
g.Expect(exceptCIDRs).To(ContainElement(managementPodCIDR)) to catch regressions
where exceptions get replaced instead of extended; place this assertion near the
existing checks for tc.expectedExceptCIDRs and tc.expectCIDRNotPresent so it
runs for each test case referencing createdNetworkPolicies["private-router"] and
privateRouterPolicy.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go`:
- Around line 1697-1707: The test currently checks feature-toggle CIDRs but
doesn't assert the management cluster pod CIDR is still present; after
collecting exceptCIDRs (from collectExceptCIDRs(privateRouterPolicy)) add an
assertion that the known management pod CIDR (e.g., managementPodCIDR or the
variable used elsewhere in this test file) is contained in exceptCIDRs using
g.Expect(exceptCIDRs).To(ContainElement(managementPodCIDR)) to catch regressions
where exceptions get replaced instead of extended; place this assertion near the
existing checks for tc.expectedExceptCIDRs and tc.expectCIDRNotPresent so it
runs for each test case referencing createdNetworkPolicies["private-router"] and
privateRouterPolicy.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: eadd8666-ca8d-475c-8303-c5739a6cf3c8

📥 Commits

Reviewing files that changed from the base of the PR and between f13c62d and 4e64951.

📒 Files selected for processing (6)
  • cmd/install/assets/hypershift_operator.go
  • cmd/install/install.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • hypershift-operator/controllers/hostedcluster/network_policies.go
  • hypershift-operator/controllers/hostedcluster/network_policies_test.go
  • hypershift-operator/main.go

@codecov

codecov Bot commented Jun 5, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 64.70588% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.55%. Comparing base (4755e9c) to head (2b4c57d).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
hypershift-operator/main.go 0.00% 7 Missing ⚠️
cmd/install/assets/hypershift_operator.go 0.00% 2 Missing and 1 partial ⚠️
...ator/controllers/hostedcluster/network_policies.go 90.90% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8689   +/-   ##
=======================================
  Coverage   41.54%   41.55%           
=======================================
  Files         758      758           
  Lines       93838    93862   +24     
=======================================
+ Hits        38986    39000   +14     
- Misses      52107    52116    +9     
- Partials     2745     2746    +1     
Files with missing lines Coverage Δ
cmd/install/install.go 63.13% <100.00%> (+0.06%) ⬆️
...trollers/hostedcluster/hostedcluster_controller.go 45.89% <ø> (ø)
...ator/controllers/hostedcluster/network_policies.go 77.47% <90.90%> (+0.27%) ⬆️
cmd/install/assets/hypershift_operator.go 48.03% <0.00%> (-0.08%) ⬇️
hypershift-operator/main.go 0.00% <0.00%> (ø)
Flag Coverage Δ
cmd-support 34.96% <40.00%> (+<0.01%) ⬆️
cpo-hostedcontrolplane 43.59% <ø> (ø)
cpo-other 43.17% <ø> (ø)
hypershift-operator 51.63% <68.96%> (+<0.01%) ⬆️
other 31.56% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Ajpantuso Ajpantuso force-pushed the apantuso/rosaeng-8224 branch 2 times, most recently from 3b45a90 to 8df0806 Compare June 5, 2026 21:10
@openshift-ci openshift-ci Bot added area/platform/azure PR/issue for Azure (AzurePlatform) platform area/testing Indicates the PR includes changes for e2e testing labels Jun 5, 2026
@Ajpantuso Ajpantuso force-pushed the apantuso/rosaeng-8224 branch 4 times, most recently from 65a6dad to 0f14878 Compare June 7, 2026 17:24
@Ajpantuso Ajpantuso changed the title ROSAENG-8224: feat(operator): use MC machine network CIDR to block HCP egress to management KAS ROSAENG-8224: feat(operator): add --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions Jun 7, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
hypershift-operator/controllers/hostedcluster/network_policies.go (1)

345-346: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle IPv4/IPv6 exceptions with separate IPBlock CIDRs.

In both reconcilePrivateRouterNetworkPolicy and reconcileManagementKASNetworkPolicy, the egress rule uses IPBlock{CIDR:"0.0.0.0/0", Except: exceptions} where exceptions := append(kasBlockExceptions, clusterNetworks...). kasBlockExceptions comes from HCPEgressBlockCIDRs directly when set (no IPv4/IPv6 filtering) or from kasEndpointsToCIDRs (hardcoded to /32), and clusterNetworks comes from managementClusterNetwork.Spec.ClusterNetwork (also unfiltered). If any IPv6 CIDRs are present in either list, the resulting NetworkPolicy will be rejected because ipBlock.except must be a strict subset of—and in the same IP family as—ipBlock.cidr.

Also applies to: 851-852, 943-955

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/hostedcluster/network_policies.go` around
lines 345 - 346, The egress IPBlock uses a single CIDR ("0.0.0.0/0") while
combining exceptions that may include IPv6 addresses, which will be rejected; in
reconcilePrivateRouterNetworkPolicy and reconcileManagementKASNetworkPolicy (and
the similar blocks around the other occurrences) split egress rules by IP
family: build two IPBlock rules—one with CIDR "0.0.0.0/0" and Except filtered to
only IPv4 CIDRs, and one with CIDR "::/0" and Except filtered to only IPv6
CIDRs; when assembling exceptions (from kasBlockExceptions, kasEndpointsToCIDRs
results, and clusterNetworks) filter each list by IP family before appending to
the matching IPBlock so ipBlock.except is a strict subset and same family as
ipBlock.cidr.
🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/network_policies.go (1)

73-76: ⚡ Quick win

Skip default/kubernetes Endpoints lookup when static CIDRs are configured.

Static mode still hard-fails on Endpoints Get. This keeps an unnecessary dependency/path even when HCPEgressBlockCIDRs is already set.

Suggested refactor
- kubernetesEndpoint := &corev1.Endpoints{ObjectMeta: metav1.ObjectMeta{Name: "kubernetes", Namespace: "default"}}
- if err := r.Get(ctx, client.ObjectKeyFromObject(kubernetesEndpoint), kubernetesEndpoint); err != nil {
- 	return fmt.Errorf("failed to get management cluster network config: %w", err)
- }
- kasBlock := r.kasBlockCIDRs(kubernetesEndpoint)
+ var kasBlock []string
+ if len(r.HCPEgressBlockCIDRs) > 0 {
+ 	kasBlock = append([]string(nil), r.HCPEgressBlockCIDRs...)
+ } else {
+ 	kubernetesEndpoint := &corev1.Endpoints{ObjectMeta: metav1.ObjectMeta{Name: "kubernetes", Namespace: "default"}}
+ 	if err := r.Get(ctx, client.ObjectKeyFromObject(kubernetesEndpoint), kubernetesEndpoint); err != nil {
+ 		return fmt.Errorf("failed to get management cluster kubernetes endpoint: %w", err)
+ 	}
+ 	kasBlock = kasEndpointsToCIDRs(kubernetesEndpoint)
+ }

Also applies to: 78-83, 950-955

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/hostedcluster/network_policies.go` around
lines 73 - 76, The code currently always does a Get on the Endpoints object
kubernetesEndpoint (via r.Get(..., kubernetesEndpoint)) which causes a hard
failure even when static CIDRs are configured; update the logic in
network_policies.go to first check the static CIDR configuration
(HCPEgressBlockCIDRs) and skip the endpoints lookup entirely when static CIDRs
are present — i.e., guard the r.Get call with an if that returns or continues
when HCPEgressBlockCIDRs is non-empty so kubernetesEndpoint and the r.Get call
are only executed when no static CIDRs are configured (apply the same guard to
the other occurrences around lines indicated: the block at 78-83 and 950-955).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/install/install.go`:
- Line 443: Add CIDR validation for the --hcp-egress-block-cidrs option by
implementing a helper (e.g.,
validateHCPEgressBlockCIDRs(opts.HCPEgressBlockCIDRs) bool or error) that
iterates over Options.HCPEgressBlockCIDRs and uses net.ParseCIDR to ensure each
entry is a valid CIDR, returning a descriptive error for the first invalid
value; call this helper from Options.Validate() so installation fails fast on
invalid input, and add import "net" at the top of the file; ensure the
Validate() error message references the invalid CIDR and the flag name for
clarity.

In `@hypershift-operator/controllers/hostedcluster/network_policies.go`:
- Line 338: The code appends clusterNetworks onto kasBlockExceptions which may
share a backing array returned directly by kasBlockCIDRs
(r.HCPEgressBlockCIDRs), causing race/mutation issues; to fix, make a defensive
copy of kasBlockExceptions before appending (e.g. start with a new slice created
from kasBlockExceptions or append onto a nil copy) when building exceptions in
the code path that uses the kasBlockExceptions variable (the occurrence with:
exceptions := append(kasBlockExceptions, clusterNetworks...) and the similar
sites around the kasBlockCIDRs usage), and apply the same defensive-copy pattern
at the other occurrences mentioned (around lines 844 and 950-953) so we never
append into the original shared slice.

---

Outside diff comments:
In `@hypershift-operator/controllers/hostedcluster/network_policies.go`:
- Around line 345-346: The egress IPBlock uses a single CIDR ("0.0.0.0/0") while
combining exceptions that may include IPv6 addresses, which will be rejected; in
reconcilePrivateRouterNetworkPolicy and reconcileManagementKASNetworkPolicy (and
the similar blocks around the other occurrences) split egress rules by IP
family: build two IPBlock rules—one with CIDR "0.0.0.0/0" and Except filtered to
only IPv4 CIDRs, and one with CIDR "::/0" and Except filtered to only IPv6
CIDRs; when assembling exceptions (from kasBlockExceptions, kasEndpointsToCIDRs
results, and clusterNetworks) filter each list by IP family before appending to
the matching IPBlock so ipBlock.except is a strict subset and same family as
ipBlock.cidr.

---

Nitpick comments:
In `@hypershift-operator/controllers/hostedcluster/network_policies.go`:
- Around line 73-76: The code currently always does a Get on the Endpoints
object kubernetesEndpoint (via r.Get(..., kubernetesEndpoint)) which causes a
hard failure even when static CIDRs are configured; update the logic in
network_policies.go to first check the static CIDR configuration
(HCPEgressBlockCIDRs) and skip the endpoints lookup entirely when static CIDRs
are present — i.e., guard the r.Get call with an if that returns or continues
when HCPEgressBlockCIDRs is non-empty so kubernetesEndpoint and the r.Get call
are only executed when no static CIDRs are configured (apply the same guard to
the other occurrences around lines indicated: the block at 78-83 and 950-955).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: f093ff4c-fa3a-486c-9e3f-711a006a912e

📥 Commits

Reviewing files that changed from the base of the PR and between 2827d9d and c1597c3.

📒 Files selected for processing (6)
  • cmd/install/assets/hypershift_operator.go
  • cmd/install/install.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • hypershift-operator/controllers/hostedcluster/network_policies.go
  • hypershift-operator/controllers/hostedcluster/network_policies_test.go
  • hypershift-operator/main.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/install/assets/hypershift_operator.go

Comment thread cmd/install/install.go
Comment thread hypershift-operator/controllers/hostedcluster/network_policies.go
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 8, 2026
@openshift-ci openshift-ci Bot requested review from clebs and jparrill June 8, 2026 13:16
ScaleFromZeroSecret *corev1.Secret
ScaleFromZeroSecretKey string
ScaleFromZeroProvider string
HCPEgressBlockCIDRs []string

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall in a previous commit you had this automatically computed from data in the cluster. Was that not successful, or rather, did some context mark it undesirable?

Imo, we should just have HyperShift compute this by default and use it vs making it required input.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously I was adding a command line flag which overrode the default behavior to determine the machine CIDR from the install config. While that works it's pretty specific to OCP, effectively hard codes the desired CIDR blocks, and is brittle with regards to the install config structure.

This is more generic, extensible, and less susceptible to subtle failures based on undeclared schema changes.

That said if there is a more reliable way to determine the machine CIDR then determining it at runtime would be fine.

If by default you mean default for rosa or whatever combination of providers/configs == rosa in HyperShift that's also fine, but I'm cautious about polluting broader default configs here. As discussed the reason for doing this is very specific to OCP based MCs which is increasingly becoming the minority deployment config for us.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, wasn't thinking about the non-OCP world. Can we safely infer the machine CIDR of any Kubernetes cluster where the operator is running via some means?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do some research and get back

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 10, 2026
@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jun 10, 2026

Copy link
Copy Markdown

Now I have a complete understanding. Both PRs modify the same struct in hostedcluster_controller.go and the same functions in network_policies.go. The conflict is clear.

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

tide: Not mergeable. PR has a merge conflict.
GitHub merge state: CONFLICTING (mergeable: false, mergeStateStatus: DIRTY)

Summary

This is not a test failure — all CI test jobs (images, security, verify-deps, unit tests, lint, verify, etc.) passed successfully. The sole failure is the tide merge controller reporting that PR #8689 has a git merge conflict with the current main branch, preventing it from being merged. The conflict was introduced when PR #8537 ("CNTRLPLANE-3250,CNTRLPLANE-430: API-driven Azure topology and private connectivity (Phase 1)") was merged to main on 2026-06-10T10:24:42Z — after PR #8689 was created on 2026-06-05. Both PRs modify the same HostedClusterReconciler struct and network_policies.go functions.

Root Cause

PR #8537 was merged to main on 2026-06-10 and introduced changes to two files that PR #8689 also modifies, creating textual git conflicts:

  1. hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go — Both PRs add new fields to the HostedClusterReconciler struct at the same location (after OpenShiftTrustedCAFilePath string, line ~213). PR CNTRLPLANE-3250,CNTRLPLANE-430: API-driven Azure topology and private connectivity (Phase 1) #8537 added ProbeSharedIngressEndpoint, while PR ROSAENG-8224: feat(operator): add --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions #8689 adds HCPEgressBlockCIDRs. Since both insertions target the same anchor point, git cannot auto-merge them.

  2. hypershift-operator/controllers/hostedcluster/network_policies.go — PR CNTRLPLANE-3250,CNTRLPLANE-430: API-driven Azure topology and private connectivity (Phase 1) #8537 changed the reconcileNetworkPolicies function around line 80 (replacing sharedingress.UseSharedIngress() with netutil.UseSharedIngressHC(hcluster) and adding a new import). PR ROSAENG-8224: feat(operator): add --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions #8689 also modifies reconcileNetworkPolicies in the same region (adding the kasBlock variable computation and changing function call signatures). The overlapping hunks in the same function cause a merge conflict.

The PR's base SHA (a101e6697a) predates the merge of PR #8537, so the branch is now stale relative to main.

Recommendations
  1. Rebase PR ROSAENG-8224: feat(operator): add --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions #8689 onto current main — The PR author needs to rebase or merge main into the feature branch to resolve the conflicts:

    git fetch upstream main
    git rebase upstream/main
    # Resolve conflicts in hostedcluster_controller.go and network_policies.go
    git push --force-with-lease
  2. Conflict resolution guidance for hostedcluster_controller.go — Keep both new struct fields: ProbeSharedIngressEndpoint (from CNTRLPLANE-3250,CNTRLPLANE-430: API-driven Azure topology and private connectivity (Phase 1) #8537) and HCPEgressBlockCIDRs (from ROSAENG-8224: feat(operator): add --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions #8689).

  3. Conflict resolution guidance for network_policies.go — Keep the netutil.UseSharedIngressHC(hcluster) change from CNTRLPLANE-3250,CNTRLPLANE-430: API-driven Azure topology and private connectivity (Phase 1) #8537 and the kasBlock / kasBlockCIDRs refactoring from ROSAENG-8224: feat(operator): add --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions #8689. Both changes are in the same reconcileNetworkPolicies function but address different concerns (shared ingress vs. egress block CIDRs).

  4. Re-run CI after rebase — Once conflicts are resolved and pushed, Prow will automatically re-trigger CI checks including tide.

Evidence
Evidence Detail
Tide status Not mergeable. PR has a merge conflict.
GitHub merge state mergeable: false, mergeStateStatus: DIRTY, rebaseable: false
Conflicting PR #8537 "CNTRLPLANE-3250,CNTRLPLANE-430: API-driven Azure topology and private connectivity (Phase 1)" merged 2026-06-10T10:24:42Z
Conflict file 1 hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go — both PRs add fields to HostedClusterReconciler struct at same location (+123/-5 from #8537, +7/-0 from #8689)
Conflict file 2 hypershift-operator/controllers/hostedcluster/network_policies.go — both PRs modify reconcileNetworkPolicies() function (+2/-1 from #8537, changes to function signatures and logic from #8689)
PR #8689 base SHA a101e6697a (predates #8537 merge)
All CI checks 26 passed, 0 failed (only tide reports error)

…kPolicy egress exceptions

HCP namespace egress NetworkPolicies (private-router, management-kas) use
/32 CIDRs derived from the management cluster's KAS endpoint IPs to block
HCP pods from reaching the hosting KAS. These IPs rotate during rolling
restarts, causing a burst of NetworkPolicy UPDATEs that can trigger OVN
port-group reconciliation races and drop traffic to HCP router pods.

Introduce --hcp-egress-block-cidrs, a repeatable flag that accepts stable
CIDR blocks (e.g. the MC machine network) as a static replacement for the
dynamic KAS endpoint /32s. When supplied, the egress exception list in both
policies is built from the static CIDRs instead of the live Endpoints
object, eliminating churn during KAS rollouts.

Default behaviour (flag absent) is unchanged: KAS endpoint IPs continue to
be discovered at reconcile time from the default/kubernetes Endpoints object.

Signed-off-by: Andrew Pantuso <apantuso@redhat.com>
@Ajpantuso Ajpantuso force-pushed the apantuso/rosaeng-8224 branch from 6352ee3 to 2b4c57d Compare June 10, 2026 13:43
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 10, 2026
@openshift-ci

openshift-ci Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

@Ajpantuso: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cli Indicates the PR includes changes for CLI area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/azure PR/issue for Azure (AzurePlatform) platform area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants