
resctrl-mon: add NRI plugin for per-pod resctrl monitoring groups #666

Open
cmcantalupo wants to merge 1 commit into containers:main from cmcantalupo:resctrl-mon

cmcantalupo commented May 4, 2026

Description

Add nri-resctrl-mon, a standalone NRI plugin that creates per-pod resctrl monitoring groups (mon_groups) to support passive monitoring of Application Energy Telemetry (AET) via consumers such as Kepler.

Motivation

Userspace daemon approaches to resctrl mon_group management suffer from a fork race: a container's first threads can execute before the daemon writes their PIDs into the mon_group's tasks file, causing energy attribution gaps. By using the NRI StartContainer hook, this plugin assigns the container's init PID to a mon_group before exec runs and threads fork, eliminating the race entirely.

What's included

  • Plugin source (main.go, plugin.go, resctrl.go, state.go)
  • Unit tests (plugin_test.go, resctrl_test.go)
  • Dockerfile following the nri-memory-qos pattern
  • Helm chart (Chart.yaml, values.yaml, templates, JSON schema)
  • Documentation (new "monitoring" category, plugin docs, Helm docs)
  • Sample configuration

Design decisions

  • Hook selection: PostCreateContainer creates the mon_group directory (assigns the RMID), and StartContainer writes the init PID while the process is paused — before any user threads fork. PostStartContainer is a fallback in case the PID is not available at StartContainer (should not happen on containerd ≥ 2.x).
  • Kernel RMID management: mkdir/rmdir on resctrl delegates RMID lifecycle to the kernel, avoiding userspace exhaustion bugs.
  • Pod-level granularity: All containers in a pod share a single mon_group. The mon_group is created when the first container is created and removed when the last container stops.
  • Crash recovery: Synchronize re-creates mon_groups for running pods on plugin restart and removes orphaned mon_groups left by a previous instance.
  • Minimal privileges: SYS_ADMIN + DAC_OVERRIDE only (no privileged: true). hostPID: true is required so the plugin can write host-namespace PIDs to the resctrl tasks file.

Testing

  • Unit tests cover plugin lifecycle, resctrl operations, namespace/label filtering, multi-container pods, orphan cleanup, path traversal rejection, and invalid UID rejection.
  • Manually validated full pod lifecycle (create → run → delete) on a Clearwater Forest system with k3s, confirming mon_group creation, PID assignment, and cleanup.

Stats

22 files changed, 1958 insertions(+), 1 deletion(-)

kad requested review from askervin and Copilot and removed request for Copilot May 21, 2026 13:36
kad requested review from Copilot and marquiz May 21, 2026 13:36

Copilot AI left a comment


Pull request overview

Adds a new standalone NRI plugin (nri-resctrl-mon) to create per-pod resctrl monitoring groups (mon_groups) and optionally persist .begin/.end counter snapshots under a host directory for passive AET/Kepler-style consumers.

Changes:

  • Introduces the nri-resctrl-mon plugin implementation (resctrl ops, lifecycle hooks, in-memory pod/container tracking, snapshot store) plus unit tests.
  • Adds a Helm chart, sample configuration, and documentation under a new “monitoring” docs category.
  • Wires the plugin into the repository build via the top-level Makefile.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| sample-configs/nri-resctrl-mon.yaml | Adds example YAML configuration for the new plugin. |
| Makefile | Registers nri-resctrl-mon in the build plugin list. |
| docs/monitoring/resctrl-mon.md | New end-user/developer documentation for resctrl-mon behavior and snapshots. |
| docs/monitoring/index.md | Adds a new “monitoring” docs section index. |
| docs/index.md | Links the new monitoring docs section from the main docs index. |
| docs/deployment/helm/resctrl-mon.md | Includes the Helm chart README into the docs site. |
| docs/deployment/helm/index.md | Adds resctrl-mon to the Helm deployment docs index. |
| deployment/helm/resctrl-mon/values.yaml | Default Helm values for the resctrl-mon DaemonSet/chart. |
| deployment/helm/resctrl-mon/values.schema.json | Helm values schema for chart parameter validation. |
| deployment/helm/resctrl-mon/templates/daemonset.yaml | DaemonSet template for running the plugin with required mounts/capabilities. |
| deployment/helm/resctrl-mon/templates/configmap.yaml | ConfigMap template for the plugin configuration file. |
| deployment/helm/resctrl-mon/templates/_helpers.tpl | Shared Helm template helpers (labels/selectors). |
| deployment/helm/resctrl-mon/README.md | Helm chart usage and configuration documentation. |
| deployment/helm/resctrl-mon/Chart.yaml | Helm chart metadata. |
| deployment/helm/resctrl-mon/.helmignore | Helm packaging ignore rules. |
| cmd/plugins/resctrl-mon/state.go | In-memory tracking for per-pod mon_group and container membership. |
| cmd/plugins/resctrl-mon/snapshot.go | Snapshot store implementation (.begin/.end, symlinks, pruning). |
| cmd/plugins/resctrl-mon/snapshot_test.go | Unit tests for snapshot store behavior and pruning. |
| cmd/plugins/resctrl-mon/resctrl.go | Resctrl filesystem operations (create/remove mon_groups, write tasks, read mon_data, cleanup). |
| cmd/plugins/resctrl-mon/resctrl_test.go | Unit tests for resctrl operations and safety validation. |
| cmd/plugins/resctrl-mon/plugin.go | Core NRI hook implementation and configuration parsing/validation. |
| cmd/plugins/resctrl-mon/plugin_test.go | Unit tests for lifecycle behavior, filtering, and snapshot integration. |
| cmd/plugins/resctrl-mon/main.go | Plugin entrypoint, flags, and NRI stub wiring. |
| cmd/plugins/resctrl-mon/Dockerfile | Container image build for the new plugin. |

Comment thread deployment/helm/resctrl-mon/templates/configmap.yaml
Comment thread deployment/helm/resctrl-mon/values.schema.json Outdated
Comment thread deployment/helm/resctrl-mon/README.md Outdated
Comment thread docs/monitoring/resctrl-mon.md Outdated
Comment thread cmd/plugins/resctrl-mon/plugin.go
Comment thread cmd/plugins/resctrl-mon/snapshot.go Outdated
Comment thread cmd/plugins/resctrl-mon/resctrl.go Outdated
cmcantalupo force-pushed the resctrl-mon branch 2 times, most recently from e44434b to d1171cd May 22, 2026 13:48
cmcantalupo requested a review from Copilot May 22, 2026 13:49

Copilot AI left a comment


Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.

Comment thread deployment/helm/resctrl-mon/Chart.yaml
Comment thread deployment/helm/resctrl-mon/templates/daemonset.yaml
Comment thread deployment/helm/resctrl-mon/values.yaml
Comment thread deployment/helm/resctrl-mon/values.yaml
Comment thread deployment/helm/resctrl-mon/values.schema.json
Comment thread cmd/plugins/resctrl-mon/plugin.go Outdated
Add nri-resctrl-mon, a standalone NRI plugin that creates per-pod resctrl
monitoring groups (mon_groups) to support passive monitoring of
Application Energy Telemetry (AET).

The plugin uses the PostCreateContainer hook to assign container PIDs to
mon_groups before exec/fork, eliminating the fork race that plagues
userspace daemon approaches. RMID allocation is delegated to the kernel
via mkdir/rmdir on the resctrl filesystem.

Includes:
- Plugin source (main.go, plugin.go, resctrl.go, state.go)
- Unit tests (plugin_test.go, resctrl_test.go)
- Dockerfile following nri-memory-qos pattern
- Helm chart (Chart.yaml, values.yaml, templates/, schema)
- Documentation (monitoring category, plugin docs, Helm docs)
- Sample configuration

Signed-off-by: Christopher M. Cantalupo <christopher.m.cantalupo@intel.com>
Signed-off-by: Jedrzej Wasiukiewicz <jedrzej.wasiukiewicz@intel.com>