ROSAENG-8224: feat(operator): add --hcp-egress-block-cidrs flag to stabilize NetworkPolicy egress exceptions#8689
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
@Ajpantuso: This pull request references ROSAENG-8224 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds support for statically configuring HCP egress block CIDRs to replace dynamic KAS endpoint derivation in network policies. Configuration flows through the install pipeline and operator startup with CLI flag validation, then into the HostedClusterReconciler. Network policy reconciliation is refactored to compute the Sequence Diagram(s)sequenceDiagram
participant UserCLI as User CLI
participant OperatorMain as Operator main
participant Reconciler as HostedClusterReconciler
participant K8sAPI as Kubernetes API
UserCLI->>OperatorMain: provide --hcp-egress-block-cidrs values
OperatorMain->>OperatorMain: validate CIDRs
OperatorMain->>Reconciler: set HCPEgressBlockCIDRs
Reconciler->>Reconciler: kasBlockCIDRs(kubernetesEndpoint) -> kasBlock
Reconciler->>K8sAPI: apply NetworkPolicies using kasBlock in IPBlock.Except
Suggested reviewers
🚥 Pre-merge checks | ✅ 10 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Ajpantuso The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/network_policies_test.go (1)
1697-1707: ⚡ Quick winAssert management pod CIDR is still included in
Except.Line 1701 validates the feature-toggle CIDRs, but it doesn’t confirm the existing management cluster pod CIDR remains present. Add one assertion so this test also catches regressions where new KAS block CIDRs replace (instead of extend) existing exceptions.
Suggested test hardening
exceptCIDRs := collectExceptCIDRs(privateRouterPolicy) + g.Expect(exceptCIDRs).To(ContainElement("10.128.0.0/14"), "private-router Except list should retain management cluster pod CIDR") for _, cidr := range tc.expectedExceptCIDRs { g.Expect(exceptCIDRs).To(ContainElement(cidr), "private-router Except list should contain %s", cidr) }Based on learnings: "Unit test any code changes and additions and include e2e tests when changes impact consumer behaviour."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go` around lines 1697 - 1707, The test currently checks feature-toggle CIDRs but doesn't assert the management cluster pod CIDR is still present; after collecting exceptCIDRs (from collectExceptCIDRs(privateRouterPolicy)) add an assertion that the known management pod CIDR (e.g., managementPodCIDR or the variable used elsewhere in this test file) is contained in exceptCIDRs using g.Expect(exceptCIDRs).To(ContainElement(managementPodCIDR)) to catch regressions where exceptions get replaced instead of extended; place this assertion near the existing checks for tc.expectedExceptCIDRs and tc.expectCIDRNotPresent so it runs for each test case referencing createdNetworkPolicies["private-router"] and privateRouterPolicy.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go`:
- Around line 1697-1707: The test currently checks feature-toggle CIDRs but
doesn't assert the management cluster pod CIDR is still present; after
collecting exceptCIDRs (from collectExceptCIDRs(privateRouterPolicy)) add an
assertion that the known management pod CIDR (e.g., managementPodCIDR or the
variable used elsewhere in this test file) is contained in exceptCIDRs using
g.Expect(exceptCIDRs).To(ContainElement(managementPodCIDR)) to catch regressions
where exceptions get replaced instead of extended; place this assertion near the
existing checks for tc.expectedExceptCIDRs and tc.expectCIDRNotPresent so it
runs for each test case referencing createdNetworkPolicies["private-router"] and
privateRouterPolicy.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: eadd8666-ca8d-475c-8303-c5739a6cf3c8
📒 Files selected for processing (6)
cmd/install/assets/hypershift_operator.gocmd/install/install.gohypershift-operator/controllers/hostedcluster/hostedcluster_controller.gohypershift-operator/controllers/hostedcluster/network_policies.gohypershift-operator/controllers/hostedcluster/network_policies_test.gohypershift-operator/main.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #8689 +/- ##
=======================================
Coverage 41.54% 41.55%
=======================================
Files 758 758
Lines 93838 93862 +24
=======================================
+ Hits 38986 39000 +14
- Misses 52107 52116 +9
- Partials 2745 2746 +1
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
3b45a90 to
8df0806
Compare
65a6dad to
0f14878
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
hypershift-operator/controllers/hostedcluster/network_policies.go (1)
345-346:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHandle IPv4/IPv6 exceptions with separate IPBlock CIDRs.
In both
reconcilePrivateRouterNetworkPolicyandreconcileManagementKASNetworkPolicy, the egress rule usesIPBlock{CIDR:"0.0.0.0/0", Except: exceptions}whereexceptions := append(kasBlockExceptions, clusterNetworks...).kasBlockExceptionscomes fromHCPEgressBlockCIDRsdirectly when set (no IPv4/IPv6 filtering) or fromkasEndpointsToCIDRs(hardcoded to/32), andclusterNetworkscomes frommanagementClusterNetwork.Spec.ClusterNetwork(also unfiltered). If any IPv6 CIDRs are present in either list, the resulting NetworkPolicy will be rejected becauseipBlock.exceptmust be a strict subset of—and in the same IP family as—ipBlock.cidr.Also applies to: 851-852, 943-955
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/hostedcluster/network_policies.go` around lines 345 - 346, The egress IPBlock uses a single CIDR ("0.0.0.0/0") while combining exceptions that may include IPv6 addresses, which will be rejected; in reconcilePrivateRouterNetworkPolicy and reconcileManagementKASNetworkPolicy (and the similar blocks around the other occurrences) split egress rules by IP family: build two IPBlock rules—one with CIDR "0.0.0.0/0" and Except filtered to only IPv4 CIDRs, and one with CIDR "::/0" and Except filtered to only IPv6 CIDRs; when assembling exceptions (from kasBlockExceptions, kasEndpointsToCIDRs results, and clusterNetworks) filter each list by IP family before appending to the matching IPBlock so ipBlock.except is a strict subset and same family as ipBlock.cidr.
🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/network_policies.go (1)
73-76: ⚡ Quick winSkip
default/kubernetesEndpoints lookup when static CIDRs are configured.Static mode still hard-fails on Endpoints
Get. This keeps an unnecessary dependency/path even whenHCPEgressBlockCIDRsis already set.Suggested refactor
- kubernetesEndpoint := &corev1.Endpoints{ObjectMeta: metav1.ObjectMeta{Name: "kubernetes", Namespace: "default"}} - if err := r.Get(ctx, client.ObjectKeyFromObject(kubernetesEndpoint), kubernetesEndpoint); err != nil { - return fmt.Errorf("failed to get management cluster network config: %w", err) - } - kasBlock := r.kasBlockCIDRs(kubernetesEndpoint) + var kasBlock []string + if len(r.HCPEgressBlockCIDRs) > 0 { + kasBlock = append([]string(nil), r.HCPEgressBlockCIDRs...) + } else { + kubernetesEndpoint := &corev1.Endpoints{ObjectMeta: metav1.ObjectMeta{Name: "kubernetes", Namespace: "default"}} + if err := r.Get(ctx, client.ObjectKeyFromObject(kubernetesEndpoint), kubernetesEndpoint); err != nil { + return fmt.Errorf("failed to get management cluster kubernetes endpoint: %w", err) + } + kasBlock = kasEndpointsToCIDRs(kubernetesEndpoint) + }Also applies to: 78-83, 950-955
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/hostedcluster/network_policies.go` around lines 73 - 76, The code currently always does a Get on the Endpoints object kubernetesEndpoint (via r.Get(..., kubernetesEndpoint)) which causes a hard failure even when static CIDRs are configured; update the logic in network_policies.go to first check the static CIDR configuration (HCPEgressBlockCIDRs) and skip the endpoints lookup entirely when static CIDRs are present — i.e., guard the r.Get call with an if that returns or continues when HCPEgressBlockCIDRs is non-empty so kubernetesEndpoint and the r.Get call are only executed when no static CIDRs are configured (apply the same guard to the other occurrences around lines indicated: the block at 78-83 and 950-955).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/install/install.go`:
- Line 443: Add CIDR validation for the --hcp-egress-block-cidrs option by
implementing a helper (e.g.,
validateHCPEgressBlockCIDRs(opts.HCPEgressBlockCIDRs) bool or error) that
iterates over Options.HCPEgressBlockCIDRs and uses net.ParseCIDR to ensure each
entry is a valid CIDR, returning a descriptive error for the first invalid
value; call this helper from Options.Validate() so installation fails fast on
invalid input, and add import "net" at the top of the file; ensure the
Validate() error message references the invalid CIDR and the flag name for
clarity.
In `@hypershift-operator/controllers/hostedcluster/network_policies.go`:
- Line 338: The code appends clusterNetworks onto kasBlockExceptions which may
share a backing array returned directly by kasBlockCIDRs
(r.HCPEgressBlockCIDRs), causing race/mutation issues; to fix, make a defensive
copy of kasBlockExceptions before appending (e.g. start with a new slice created
from kasBlockExceptions or append onto a nil copy) when building exceptions in
the code path that uses the kasBlockExceptions variable (the occurrence with:
exceptions := append(kasBlockExceptions, clusterNetworks...) and the similar
sites around the kasBlockCIDRs usage), and apply the same defensive-copy pattern
at the other occurrences mentioned (around lines 844 and 950-953) so we never
append into the original shared slice.
---
Outside diff comments:
In `@hypershift-operator/controllers/hostedcluster/network_policies.go`:
- Around line 345-346: The egress IPBlock uses a single CIDR ("0.0.0.0/0") while
combining exceptions that may include IPv6 addresses, which will be rejected; in
reconcilePrivateRouterNetworkPolicy and reconcileManagementKASNetworkPolicy (and
the similar blocks around the other occurrences) split egress rules by IP
family: build two IPBlock rules—one with CIDR "0.0.0.0/0" and Except filtered to
only IPv4 CIDRs, and one with CIDR "::/0" and Except filtered to only IPv6
CIDRs; when assembling exceptions (from kasBlockExceptions, kasEndpointsToCIDRs
results, and clusterNetworks) filter each list by IP family before appending to
the matching IPBlock so ipBlock.except is a strict subset and same family as
ipBlock.cidr.
---
Nitpick comments:
In `@hypershift-operator/controllers/hostedcluster/network_policies.go`:
- Around line 73-76: The code currently always does a Get on the Endpoints
object kubernetesEndpoint (via r.Get(..., kubernetesEndpoint)) which causes a
hard failure even when static CIDRs are configured; update the logic in
network_policies.go to first check the static CIDR configuration
(HCPEgressBlockCIDRs) and skip the endpoints lookup entirely when static CIDRs
are present — i.e., guard the r.Get call with an if that returns or continues
when HCPEgressBlockCIDRs is non-empty so kubernetesEndpoint and the r.Get call
are only executed when no static CIDRs are configured (apply the same guard to
the other occurrences around lines indicated: the block at 78-83 and 950-955).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: f093ff4c-fa3a-486c-9e3f-711a006a912e
📒 Files selected for processing (6)
cmd/install/assets/hypershift_operator.gocmd/install/install.gohypershift-operator/controllers/hostedcluster/hostedcluster_controller.gohypershift-operator/controllers/hostedcluster/network_policies.gohypershift-operator/controllers/hostedcluster/network_policies_test.gohypershift-operator/main.go
🚧 Files skipped from review as they are similar to previous changes (1)
- cmd/install/assets/hypershift_operator.go
3e92625 to
6352ee3
Compare
| ScaleFromZeroSecret *corev1.Secret | ||
| ScaleFromZeroSecretKey string | ||
| ScaleFromZeroProvider string | ||
| HCPEgressBlockCIDRs []string |
There was a problem hiding this comment.
I recall in a previous commit you had this automatically computed from data in the cluster. Was that not successful, or rather, did some context mark it undesirable?
Imo, we should just have HyperShift compute this by default and use it vs making it required input.
There was a problem hiding this comment.
Previously I was adding a command line flag which overrode the default behavior to determine the machine CIDR from the install config. While that works it's pretty specific to OCP, effectively hard codes the desired CIDR blocks, and is brittle with regards to the install config structure.
This is more generic, extensible, and less susceptible to subtle failures based on undeclared schema changes.
That said if there is a more reliable way to determine the machine CIDR then determining it at runtime would be fine.
If by default you mean default for rosa or whatever combination of providers/configs == rosa in HyperShift that's also fine, but I'm cautious about polluting broader default configs here. As discussed the reason for doing this is very specific to OCP based MCs which is increasingly becoming the minority deployment config for us.
There was a problem hiding this comment.
Good point, wasn't thinking about the non-OCP world. Can we safely infer the machine CIDR of any Kubernetes cluster where the operator is running via some means?
There was a problem hiding this comment.
I'll do some research and get back
|
Now I have a complete understanding. Both PRs modify the same struct in Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThis is not a test failure — all CI test jobs (images, security, verify-deps, unit tests, lint, verify, etc.) passed successfully. The sole failure is the tide merge controller reporting that PR #8689 has a git merge conflict with the current Root CausePR #8537 was merged to
The PR's base SHA ( Recommendations
Evidence
|
…kPolicy egress exceptions HCP namespace egress NetworkPolicies (private-router, management-kas) use /32 CIDRs derived from the management cluster's KAS endpoint IPs to block HCP pods from reaching the hosting KAS. These IPs rotate during rolling restarts, causing a burst of NetworkPolicy UPDATEs that can trigger OVN port-group reconciliation races and drop traffic to HCP router pods. Introduce --hcp-egress-block-cidrs, a repeatable flag that accepts stable CIDR blocks (e.g. the MC machine network) as a static replacement for the dynamic KAS endpoint /32s. When supplied, the egress exception list in both policies is built from the static CIDRs instead of the live Endpoints object, eliminating churn during KAS rollouts. Default behaviour (flag absent) is unchanged: KAS endpoint IPs continue to be discovered at reconcile time from the default/kubernetes Endpoints object. Signed-off-by: Andrew Pantuso <apantuso@redhat.com>
6352ee3 to
2b4c57d
Compare
|
@Ajpantuso: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What this PR does / why we need it:
HCP namespace egress NetworkPolicies (
private-router,management-kas) use/32CIDRsderived from the management cluster's KAS endpoint IPs to block HCP pods from reaching
the hosting KAS. These IPs rotate during rolling restarts, causing a burst of NetworkPolicy
UPDATEs that can trigger OVN port-group reconciliation races and drop traffic to HCP router
pods for 20-30+ minutes.
Adds a
--hcp-egress-block-cidrsrepeatable flag to the HyperShift Operator. Whensupplied, the egress exception list in both policies is built from the provided stable CIDR
blocks (e.g. the MC machine network CIDR) instead of the live
default/kubernetesEndpointsobject. This eliminates NetworkPolicy churn during KAS rollouts and avoids the OVN race.
The flag is wired end-to-end:
hypershift install --hcp-egress-block-cidrs=<cidr>sets iton the operator Deployment. Values are validated as valid CIDRs at startup. May be specified
multiple times.
Default behaviour (flag absent) is unchanged: KAS endpoint IPs continue to be discovered at
reconcile time from the
default/kubernetesEndpoints object.This PR will remain in draft until live testing is completed to ensure no regressions.
Which issue(s) this PR fixes:
Fixes ROSAENG-8224
Special notes for your reviewer:
Checklist:
Summary by CodeRabbit