
NVIDIA-882: DPU host: add configurable mgmt-port-resource-count#3024

Open
tsorya wants to merge 1 commit into openshift:master from tsorya:pr-2997-remote

Conversation

@tsorya

@tsorya tsorya commented Jun 9, 2026

Contributor

Summary

  • Add configurable mgmt-port-resource-count for dpu-host and smart-nic modes. Previously both modes hardcoded the management port resource request/limit to '1'. Now both use a configurable count (defaulting to 1) read from the hardware-offload-config ConfigMap.
  • This is required for UDN (User Defined Networks) support on DPU host nodes, where OVN-Kubernetes must own the full management port SR-IOV resource pool to manage VFs for UDN traffic.
  • The count defaults to 1 when mgmt-port-resource-name is set, and can be overridden via the mgmt-port-resource-count ConfigMap key.
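
The default-and-override behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `resolveMgmtPortCount` and the resource name `example.com/mgmt-vf` are hypothetical, and the ConfigMap data is modeled as a plain `map[string]string`. Rejecting non-integer and non-positive values follows the validation described in the review walkthrough.

```go
package main

import (
	"fmt"
	"strconv"
)

// resolveMgmtPortCount is a hypothetical helper (not part of the PR).
// It returns the management port resource name and the count to request.
// The count defaults to 1 when a resource name is set and is overridden
// by an explicit, valid mgmt-port-resource-count value.
func resolveMgmtPortCount(data map[string]string) (string, int64, error) {
	name := data["mgmt-port-resource-name"]

	var count int64
	if name != "" {
		count = 1 // default when only the resource name is configured
	}

	raw, ok := data["mgmt-port-resource-count"]
	if !ok {
		return name, count, nil
	}

	parsed, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return "", 0, fmt.Errorf("invalid mgmt-port-resource-count value %q: %w", raw, err)
	}
	if parsed <= 0 {
		return "", 0, fmt.Errorf("invalid mgmt-port-resource-count value %q: must be > 0", raw)
	}
	return name, parsed, nil
}

func main() {
	// example.com/mgmt-vf is a placeholder resource name for illustration.
	name, count, err := resolveMgmtPortCount(map[string]string{
		"mgmt-port-resource-name":  "example.com/mgmt-vf",
		"mgmt-port-resource-count": "8",
	})
	fmt.Println(name, count, err)
}
```

With only `mgmt-port-resource-name` present, the sketch returns a count of 1; a value such as `0` or `abc` for the count yields an error instead of a silent fallback.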

Jira: https://redhat.atlassian.net/browse/NVIDIA-882

Test plan

  • New TestDpuHostModeResourceCount tests pass (validates template rendering with counts 8, 16, and default for both dpu-host and smart-nic)
  • Existing TestOVNKubernetesNodeModeTemplates tests pass
  • go vet clean
  • Verify on a DPU host cluster that the ovnkube-node pod gets correct resource requests when mgmt-port-resource-count is set in the ConfigMap
  • Verify default behavior (count=1) when only mgmt-port-resource-name is set without explicit count

Summary by CodeRabbit

  • New Features
    • Management port resource counts in OVN Kubernetes deployments are now configurable, allowing customization for DPU host and smart NIC modes instead of using fixed hardcoded values.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fa178998-ec3e-43f0-a575-a4c6b5f8d6c1

📥 Commits

Reviewing files that changed from the base of the PR and between f65f95e and 39e5e78.

📒 Files selected for processing (5)
  • bindata/network/ovn-kubernetes/managed/ovnkube-node.yaml
  • bindata/network/ovn-kubernetes/self-hosted/ovnkube-node.yaml
  • pkg/bootstrap/types.go
  • pkg/network/ovn_kubernetes.go
  • pkg/network/ovn_kubernetes_dpu_host_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/bootstrap/types.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • bindata/network/ovn-kubernetes/managed/ovnkube-node.yaml
  • pkg/network/ovn_kubernetes_dpu_host_test.go
  • pkg/network/ovn_kubernetes.go
  • bindata/network/ovn-kubernetes/self-hosted/ovnkube-node.yaml

Walkthrough

Reads an optional mgmt-port-resource-count from hardware-offload-config (default 1 when a resource name exists), adds MgmtPortResourceCount to bootstrap/render data, updates managed/self-hosted ovnkube-node templates to use that count for MgmtPortResourceName when OVN_NODE_MODE is dpu-host or smart-nic, and adds tests to verify rendered DaemonSet resources.

Changes

Management Port Resource Count Configuration

Layer / File(s) Summary
Bootstrap struct and ConfigMap parsing
pkg/bootstrap/types.go, pkg/network/ovn_kubernetes.go
Adds MgmtPortResourceCount to OVNConfigBoostrapResult; bootstrapOVNConfig reads mgmt-port-resource-name and defaults count to 1 if present, and parses/validates optional mgmt-port-resource-count (errors on non-integer or <=0).
Template rendering data flow
pkg/network/ovn_kubernetes.go
renderOVNKubernetes adds MgmtPortResourceCount to the template render context.
YAML templates mode-specific resource handling
bindata/network/ovn-kubernetes/managed/ovnkube-node.yaml, bindata/network/ovn-kubernetes/self-hosted/ovnkube-node.yaml
Both managed and self-hosted templates set ovnkube-controller's resources.requests and resources.limits for MgmtPortResourceName to {{ .MgmtPortResourceCount }} when the name is present and mode is dpu-host or smart-nic; full mode omits the resource.
Template rendering validation test
pkg/network/ovn_kubernetes_dpu_host_test.go
Adds TestDpuHostModeResourceCount that renders templates for different modes and asserts the ovnkube-controller container's resource requests/limits reflect the configured MgmtPortResourceCount (and absence when appropriate).
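
The mode-specific template handling summarized above might look roughly like the sketch below. It is an assumption about the shape of the logic, not an excerpt from the real `ovnkube-node.yaml` templates: the `renderData` struct, the `OVNNodeMode` field name, the `cpu: 10m` request, and the resource name `example.com/mgmt-vf` are all placeholders chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderData holds only the fields this sketch needs; the operator's
// real render context carries many more values.
type renderData struct {
	OVNNodeMode           string
	MgmtPortResourceName  string
	MgmtPortResourceCount int64
}

// resourcesTmpl adds the management port resource to requests and limits
// only when a name is set and the mode is dpu-host or smart-nic.
const resourcesTmpl = `resources:
  requests:
    cpu: 10m
{{- if and .MgmtPortResourceName (or (eq .OVNNodeMode "dpu-host") (eq .OVNNodeMode "smart-nic")) }}
    {{ .MgmtPortResourceName }}: '{{ .MgmtPortResourceCount }}'
  limits:
    {{ .MgmtPortResourceName }}: '{{ .MgmtPortResourceCount }}'
{{- end }}
`

// renderResources executes the template against the given data.
func renderResources(d renderData) (string, error) {
	tmpl, err := template.New("resources").Parse(resourcesTmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, d); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	for _, mode := range []string{"dpu-host", "smart-nic", "full"} {
		out, err := renderResources(renderData{
			OVNNodeMode:           mode,
			MgmtPortResourceName:  "example.com/mgmt-vf",
			MgmtPortResourceCount: 8,
		})
		if err != nil {
			fmt.Println("render error:", err)
			return
		}
		fmt.Printf("--- %s ---\n%s", mode, out)
	}
}
```

In this sketch, `full` mode renders no management port entry at all, which is the property the new test is described as asserting.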

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (14 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately summarizes the main change: adding configurable management port resource count for DPU host mode, which is the core objective of the PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | Test uses standard Go testing with static test names from literal strings only. Ginkgo check applies to Ginkgo tests (It, Describe, Context), which aren't used here. |
| Test Structure And Quality | ✅ Passed | The PR adds a standard Go test (testing.T), not Ginkgo tests. The check specifies Ginkgo requirements (It blocks, BeforeEach/AfterEach), which don't apply here. |
| Microshift Test Compatibility | ✅ Passed | No Ginkgo e2e tests were added. The new test (TestDpuHostModeResourceCount) is a standard Go unit test using testing.T, not a Ginkgo test, so this check is not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | The PR adds only unit tests (TestDpuHostModeResourceCount using standard Go testing and t.Run), not Ginkgo e2e tests. SNO compatibility check applies only to Ginkgo e2e tests. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR only configures SR-IOV resource allocation (mgmt-port-resource-count) in DaemonSet containers; introduces no scheduling constraints that would break SNO, TNF, TNA, or HyperShift topologies. |
| Ote Binary Stdout Contract | ✅ Passed | PR introduces no process-level stdout writes. Changes include struct field, template variable assignment, config parsing with klog (stderr), and test function within test blocks. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The PR adds TestDpuHostModeResourceCount, a Go unit test (not Ginkgo e2e). Check applies only to Ginkgo e2e tests, so this is out of scope and passes. |
| No-Weak-Crypto | ✅ Passed | PR adds configurable mgmt-port-resource-count without introducing weak crypto; uses only standard library functions for integer conversion, no secret/token comparisons. |
| Container-Privileges | ✅ Passed | No privileged, hostPID, hostNetwork, hostIPC, SYS_ADMIN, allowPrivilegeEscalation or root-running configurations added; PR only modifies management port resource allocation values. |
| No-Sensitive-Data-In-Logs | ✅ Passed | The logged mgmt-port-resource-count is numeric infrastructure configuration, not sensitive data like passwords, tokens, APIs, or PII. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/Masterminds/semver@v1.5.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/Masterminds/sprig/v3@v3.2.3: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/containernetworking/cni@v0.8.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ghodss/yaml@v1.0.1-0.20190212211648-25d852aebe32: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/go-bindata/go-bindata@v3.1.2+incompatible: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/onsi/gomega@v1.39.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ope

... [truncated 17357 characters] ...

red in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/gengo/v2@v2.0.0-20251215205346-5ee0d033ba5b: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kms@v0.35.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kube-aggregator@v0.35.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/randfill@v1.0.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/structured-merge-diff/v6@v6.3.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n"



@openshift-ci openshift-ci Bot requested review from jcaamano and taanyas June 9, 2026 22:35
@tsorya tsorya changed the title from "DPU host: use configurable mgmt-port-resource-count instead of node query" to "NVIDIA-396: DPU host: add configurable mgmt-port-resource-count" Jun 9, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 9, 2026

Contributor

@tsorya: This pull request references NVIDIA-396 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Add configurable mgmt-port-resource-count for DPU host mode. Previously both dpu-host and smart-nic modes hardcoded the management port resource request/limit to '1'. Now dpu-host mode uses a configurable count (defaulting to 1) while smart-nic continues with hardcoded 1.
  • The count is read from the mgmt-port-resource-count key in the hardware-offload-config ConfigMap. If not specified, it defaults to 1 when mgmt-port-resource-name is set.
  • Split the template conditions so dpu-host and smart-nic modes are handled independently with their respective resource counts.

Test plan

  • New TestDpuHostModeResourceCount tests pass (validates template rendering with counts 8, 16, and default)
  • Existing TestOVNKubernetesNodeModeTemplates tests pass
  • go vet clean
  • Verify on a DPU host cluster that the ovnkube-node pod gets correct resource requests when mgmt-port-resource-count is set in the ConfigMap
  • Verify default behavior (count=1) when only mgmt-port-resource-name is set without explicit count

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 9, 2026

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
pkg/network/ovn_kubernetes.go (1)

1074-1084: ⚡ Quick win

Consider logging a warning when count is set without resource name.

The validation correctly rejects invalid count values, but if mgmt-port-resource-count is configured without mgmt-port-resource-name, the count is validated but never used (templates check MgmtPortResourceName first). Adding a warning log would help admins debug misconfigurations, consistent with similar warnings elsewhere in this function (e.g., lines 1051, 1058, 1065).

📝 Optional: add warning when count is orphaned
 mgmtPortResourceCount, exists := cm.Data["mgmt-port-resource-count"]
 if exists {
+	if ovnConfigResult.MgmtPortResourceName == "" {
+		klog.Warningf("mgmt-port-resource-count is set but mgmt-port-resource-name is not; count will be ignored")
+	}
 	count, err := strconv.ParseInt(mgmtPortResourceCount, 10, 64)
 	if err != nil {
 		return nil, fmt.Errorf("invalid mgmt-port-resource-count value %q: %w", mgmtPortResourceCount, err)
 	}
 	if count <= 0 {
 		return nil, fmt.Errorf("invalid mgmt-port-resource-count value %q: must be > 0", mgmtPortResourceCount)
 	}
 	ovnConfigResult.MgmtPortResourceCount = count
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/network/ovn_kubernetes.go` around lines 1074 - 1084, The code currently
parses mgmt-port-resource-count into ovnConfigResult.MgmtPortResourceCount but
doesn't warn if mgmt-port-resource-name is missing, causing the count to be
ignored by template logic that checks MgmtPortResourceName first; update the
same parsing block that sets ovnConfigResult.MgmtPortResourceCount (referencing
mgmt-port-resource-count, mgmt-port-resource-name and ovnConfigResult) to emit a
warning via the existing logger (same style as the warnings at the earlier
checks around lines handling mgmt-port-resource-name) when
mgmt-port-resource-count is present but the corresponding
mgmt-port-resource-name is empty or unset so admins are alerted to the orphaned
count.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0663a776-bcb8-44c8-9ea1-a629841b0cc8

📥 Commits

Reviewing files that changed from the base of the PR and between 6dc1804 and 755d2bf.

📒 Files selected for processing (5)
  • bindata/network/ovn-kubernetes/managed/ovnkube-node.yaml
  • bindata/network/ovn-kubernetes/self-hosted/ovnkube-node.yaml
  • pkg/bootstrap/types.go
  • pkg/network/ovn_kubernetes.go
  • pkg/network/ovn_kubernetes_dpu_host_test.go

@tsorya tsorya changed the title from "NVIDIA-396: DPU host: add configurable mgmt-port-resource-count" to "NVIDIA-882: DPU host: add configurable mgmt-port-resource-count" Jun 9, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 9, 2026

Contributor

@tsorya: This pull request references NVIDIA-882 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/network/ovn_kubernetes_dpu_host_test.go`:
- Around line 554-558: In the negative branch of the test (the else for
expectResource false) add an assertion to also verify that
ovnkubeController.Resources.Limits does not contain the resource key when
tc.mgmtPortResourceName != ""; locate the block that currently checks
ovnkubeController.Resources.Requests and mirror that check for Resources.Limits
(e.g., look up resourceName in ovnkubeController.Resources.Limits and
g.Expect(found).To(BeFalse(), "resource limit should not be set")) so the
template contract asserts both Requests and Limits are unset.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: be40fc2c-3836-4044-bfe5-959ee9c45cff

📥 Commits

Reviewing files that changed from the base of the PR and between 755d2bf and f65f95e.

📒 Files selected for processing (5)
  • bindata/network/ovn-kubernetes/managed/ovnkube-node.yaml
  • bindata/network/ovn-kubernetes/self-hosted/ovnkube-node.yaml
  • pkg/bootstrap/types.go
  • pkg/network/ovn_kubernetes.go
  • pkg/network/ovn_kubernetes_dpu_host_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/bootstrap/types.go
  • pkg/network/ovn_kubernetes.go

Comment on lines +554 to +558
} else {
	if tc.mgmtPortResourceName != "" {
		_, found := ovnkubeController.Resources.Requests[resourceName]
		g.Expect(found).To(BeFalse(), "resource request should not be set")
	}


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert limits absence in the negative path too.

When expectResource is false, the test only verifies Resources.Requests is unset; it should also verify Resources.Limits is unset to fully lock the template contract.

Suggested patch
 				} else {
 					if tc.mgmtPortResourceName != "" {
 						_, found := ovnkubeController.Resources.Requests[resourceName]
 						g.Expect(found).To(BeFalse(), "resource request should not be set")
+						_, found = ovnkubeController.Resources.Limits[resourceName]
+						g.Expect(found).To(BeFalse(), "resource limit should not be set")
 					}
 				}
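
The intent of the suggested patch, checking both requests and limits in the negative path, can be illustrated in isolation with a small sketch. It deliberately avoids the Kubernetes and Gomega types used in the real test: `resourceList`, `containerResources`, and `assertResourceAbsent` are hypothetical stand-ins, and `example.com/mgmt-vf` is a placeholder resource name.

```go
package main

import "fmt"

// resourceList is a simplified stand-in for a Kubernetes resource list.
type resourceList map[string]string

// containerResources is a simplified stand-in for a container's
// resource requirements.
type containerResources struct {
	Requests resourceList
	Limits   resourceList
}

// assertResourceAbsent returns an error if the named resource appears in
// either the requests or the limits, so a template that sets only one of
// them is still caught.
func assertResourceAbsent(r containerResources, name string) error {
	if _, found := r.Requests[name]; found {
		return fmt.Errorf("resource request %q should not be set", name)
	}
	if _, found := r.Limits[name]; found {
		return fmt.Errorf("resource limit %q should not be set", name)
	}
	return nil
}

func main() {
	// A limit without a matching request would slip past a requests-only check.
	r := containerResources{
		Requests: resourceList{"cpu": "10m"},
		Limits:   resourceList{"example.com/mgmt-vf": "1"},
	}
	fmt.Println(assertResourceAbsent(r, "example.com/mgmt-vf"))
}
```

A requests-only assertion would pass for the value built in `main`, which is the gap the review comment points at.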

…uery

Add a configurable mgmt-port-resource-count key to the
hardware-offload-config ConfigMap for both dpu-host and smart-nic modes.
This allows OVNK to claim the full management port SR-IOV resource pool,
which is required for UDN support on DPU host nodes.

The resource count defaults to 1 when mgmt-port-resource-name is set,
and can be overridden via the mgmt-port-resource-count ConfigMap key.

Signed-off-by: Igal Tsoiref <itsoiref@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@openshift-ci

openshift-ci Bot commented Jun 10, 2026

Contributor

@tsorya: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw | 39e5e78 | link | true | /test e2e-metal-ipi-ovn-dualstack-bgp-local-gw |
| ci/prow/security | 39e5e78 | link | false | /test security |
| ci/prow/e2e-aws-ovn-fdp-qe | 39e5e78 | link | true | /test e2e-aws-ovn-fdp-qe |
| ci/prow/e2e-aws-ovn-rhcos10-techpreview | 39e5e78 | link | false | /test e2e-aws-ovn-rhcos10-techpreview |
| ci/prow/e2e-aws-ovn-upgrade | 39e5e78 | link | true | /test e2e-aws-ovn-upgrade |
| ci/prow/hypershift-e2e-aks | 39e5e78 | link | true | /test hypershift-e2e-aks |
| ci/prow/e2e-aws-ovn-hypershift-conformance | 39e5e78 | link | true | /test e2e-aws-ovn-hypershift-conformance |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@wizhaoredhat

Contributor

LGTM. We also do this in dpu-simulator to support network segmentation (UDN).

@openshift-ci

openshift-ci Bot commented Jun 11, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tsorya, wizhaoredhat
Once this PR has been reviewed and has the lgtm label, please assign knobunc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.


3 participants