Skip to content

Expose the controller's /metrics endpoint through a dedicated Service#1803

Open
danielorbach wants to merge 7 commits intokagent-dev:mainfrom
danielorbach:feat/helm-controller-metrics
Open

Expose the controller's /metrics endpoint through a dedicated Service#1803
danielorbach wants to merge 7 commits intokagent-dev:mainfrom
danielorbach:feat/helm-controller-metrics

Conversation

@danielorbach
Copy link
Copy Markdown

Operators deploying through this chart can now point Prometheus, VictoriaMetrics, or any other authenticated scraper at the controller's /metrics endpoint without hand-rolling a Service, ClusterRoles, and a binding. The controller binary already understands --metrics-bind-address and --metrics-secure via flags or env vars (the env-var loader at go/core/pkg/app/app.go translates every flag automatically), but the chart had no way to surface the endpoint short of stuffing values into controller.env and writing the missing Service and RBAC by hand. Closes #1369.

The shape mirrors the kmcp sub-chart that already lives in this same umbrella: a separate <fullname>-controller-metrics Service exposing https:8443, two ClusterRoles (metrics-auth-role granting the controller's ServiceAccount the right to call TokenReview and SubjectAccessReview against the API server; metrics-reader granting get on the /metrics non-resource URL), and a ClusterRoleBinding wiring only the auth role to the controller. The reader role is deliberately left unbound — picking a subject (Prometheus, VictoriaMetrics, an OpenTelemetry collector) belongs to the cluster operator, not the chart. This is also what kubebuilder and operator-sdk scaffold by default, for the same reason.

The auth role and its binding are gated additionally on controller.metrics.secureServing because they're no-ops when the authn/authz filter isn't installed. The deployment's existing controller.env pass-through still wins over the chart-supplied defaults (the conditional env vars are emitted before the user-supplied list), so anyone who needs METRICS_BIND_ADDRESS=0 for debugging or --metrics-cert-path wired through volumes/volumeMounts retains the escape hatch.

Designed exclusions: ServiceMonitor is deferred — it pulls in a Prometheus-Operator coupling the rest of the chart has avoided, and it's straightforward to layer on top once the underlying Service exists, gated on controller.metrics.serviceMonitor.enabled. Cert-manager-issued metrics certificates are also out of scope; controller-runtime's self-signed default works for in-cluster scrapers that skip TLS verification or trust the kubelet CA, and the existing cert-path flags remain reachable via controller.env for those who want to wire their own. Default is opt-in (controller.metrics.enabled: false), so existing installations are unchanged.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds optional Helm-chart support for exposing the controller manager’s /metrics endpoint, so operators can scrape it without hand-creating a Service and RBAC resources.

Changes:

  • Added controller.metrics values for enabling metrics exposure, configuring bind address, secure serving, and Service ports.
  • Added a dedicated controller metrics Service plus RBAC templates for metrics auth and read access.
  • Added Helm unit tests covering the new Service, RBAC, and Deployment metrics wiring.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
helm/kagent/values.yaml Adds chart values for controller metrics exposure and service settings.
helm/kagent/tests/controller-metrics-service_test.yaml Adds Helm unittest coverage for the new metrics Service.
helm/kagent/tests/controller-metrics-rbac_test.yaml Adds Helm unittest coverage for metrics RBAC resources.
helm/kagent/tests/controller-deployment_test.yaml Adds Helm unittest coverage for metrics env vars and container port wiring.
helm/kagent/templates/rbac/metrics-reader-clusterrole.yaml Adds unbound ClusterRole granting get on /metrics.
helm/kagent/templates/rbac/metrics-auth-clusterrolebinding.yaml Binds controller SA to the metrics auth role when secure serving is enabled.
helm/kagent/templates/rbac/metrics-auth-clusterrole.yaml Grants TokenReview and SubjectAccessReview permissions for authenticated metrics serving.
helm/kagent/templates/controller-metrics-service.yaml Adds the dedicated Service exposing the controller metrics endpoint.
helm/kagent/templates/controller-deployment.yaml Injects metrics env vars and adds a metrics container port when enabled.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread helm/kagent/templates/controller-deployment.yaml Outdated
Comment thread helm/kagent/templates/controller-deployment.yaml Outdated
@danielorbach danielorbach force-pushed the feat/helm-controller-metrics branch from cfca28e to 43a6b15 Compare May 6, 2026 00:34
Surface the controller manager's /metrics endpoint through a dedicated
Service so cluster scrapers can target it without port-forwarding to
individual pods. The new resource is gated on `controller.metrics.enabled`
and defaults off so existing installations are unchanged.

The Service mirrors the kmcp sub-chart layout (separate metrics Service,
named `https` port) to keep the umbrella chart internally consistent and
to let users author one ServiceMonitor pattern that targets both
controllers. RBAC and Deployment wiring follow in subsequent commits.

Signed-off-by: Daniel Orbach <ddorbach@gmail.com>
The controller manager already wires controller-runtime's
WithAuthenticationAndAuthorization filter on the metrics endpoint when
--metrics-secure is true (the default). For that filter to validate
incoming scrape requests it needs permission to issue TokenReview and
SubjectAccessReview calls; ship those rights as a dedicated auth role
bound to the controller ServiceAccount.

Also ship an unbound `<fullname>-metrics-reader` ClusterRole granting
`get` on the `/metrics` non-resource URL. The chart deliberately leaves
the binding to the cluster operator: which ServiceAccount (Prometheus,
VictoriaMetrics, OpenTelemetry, etc.) is allowed to scrape is a
deployment-time choice, not a chart-time one.

The auth role and its binding are gated additionally on
`controller.metrics.secureServing` since they are no-ops when the filter
is disabled. All resources are gated on `controller.metrics.enabled` and
default off.

Signed-off-by: Daniel Orbach <ddorbach@gmail.com>
Add METRICS_BIND_ADDRESS and METRICS_SECURE env vars and a `metrics`
containerPort to the controller pod, all gated on
`controller.metrics.enabled`. The new env vars are emitted before the
`controller.env` passthrough so user-supplied values continue to win
(later entries take precedence in env arrays), preserving the existing
escape hatch.

The controller binary already understands METRICS_BIND_ADDRESS and
METRICS_SECURE via its env-var loader (see go/core/pkg/app/app.go), so
no controller changes are required.

Signed-off-by: Daniel Orbach <ddorbach@gmail.com>
The earlier commits exposed both `controller.metrics.bindAddress` and
`controller.metrics.service.targetPort` as independent values, so a user
could shift the runtime bind without moving the Service target and end
up with a Service that points at a port the controller binary is not
listening on. Drop the separate `service.targetPort` knob and derive
both the container port and the Service targetPort from `bindAddress`
via `regexFind`, matching the `kmcp` sub-chart's pattern.

Add a test that asserts the derived ports stay in sync when
`bindAddress` is customised, plus a regression test that documents the
`controller.env` escape hatch (user-supplied env vars are emitted after
chart defaults, so the user's value wins at runtime). The escape hatch
still cannot move the Service target; values.yaml now flags this.

Addresses review feedback on kagent-dev#1803.

Signed-off-by: Daniel Orbach <ddorbach@gmail.com>
@danielorbach danielorbach force-pushed the feat/helm-controller-metrics branch from 43a6b15 to 2260bda Compare May 6, 2026 00:42
@danielorbach danielorbach requested a review from Copilot May 6, 2026 00:44
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread helm/kagent/templates/controller-deployment.yaml Outdated
Comment thread helm/kagent/templates/controller-metrics-service.yaml Outdated
Comment thread helm/kagent/templates/controller-deployment.yaml Outdated
Comment thread helm/kagent/tests/controller-deployment_test.yaml
- equal:
path: spec.ports[0].targetPort
value: 9443

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 72aafde: host-qualified and IPv6 cases for the Service targetPort. The bindAddress: "0" and empty-string disable paths are covered by should not render when bindAddress disables metrics and should not render when bindAddress is empty, added in 7405b04.

Comment thread helm/kagent/values.yaml Outdated
`regexFind "[0-9]+"` returns the first digit run, so a perfectly valid
bindAddress like `127.0.0.1:8443` would extract `127` and the rendered
Service would point at the wrong port. Anchor the match to the end of
the string with `[0-9]+$` so every Go-style address form yields the
trailing port: bare `:port`, host-qualified `host:port`, and bracketed
IPv6 `[::1]:port` all parse correctly.

The shared logic lives behind a `kagent.controller.metricsPort` helper
in `_helpers.tpl` so the deployment containerPort and the Service
targetPort cannot fall out of sync if the parsing rule evolves.

Disabling metrics is exclusively the job of
`controller.metrics.enabled=false`; the chart does not engage with the
controller binary's `bindAddress=0` disable sentinel, since wiring two
ways to express the same intent only invites contradictory
configuration.

Addresses review feedback on kagent-dev#1803.

Signed-off-by: Daniel Orbach <ddorbach@gmail.com>
The earlier note suggested keeping `service.port` aligned with a
runtime override of `METRICS_BIND_ADDRESS`, but `service.port` is the
cluster-facing front-door port and has no effect on the Service
backend. The Service `targetPort` and the pod `containerPort` are both
derived from `bindAddress` at template time, so overriding the env var
via `controller.env` cannot move the rendered Service at all — that
operational consequence is what the comment now states, with
`bindAddress` flagged as the only chart-side knob that moves the
listener and the Service together.

Addresses review feedback on kagent-dev#1803.

Signed-off-by: Daniel Orbach <ddorbach@gmail.com>
The `controller.metrics.bindAddress` value mirrors the controller's
`--metrics-bind-address` flag, and that flag's documented contract
treats `0` as "disable the metrics server" (see
go/core/pkg/app/app.go:145-146). Letting the chart silently render an
invalid `containerPort: 0` when the operator follows that contract is
surprising; treat the binary's disable sentinel as another way of
saying `controller.metrics.enabled=false`.

Wrap the existing `kagent.controller.metricsPort` helper with a new
`kagent.controller.metricsEnabled` helper that returns truthy only
when `enabled` is true *and* the resolved port is a real, non-zero
value. Switch the Service, both ClusterRoles, the ClusterRoleBinding,
and the Deployment env/port wiring over to the new helper so the two
disable signals are equivalent in their effect on rendered manifests.

The escape-hatch behaviour is unchanged: setting
`METRICS_BIND_ADDRESS` via `controller.env` still shifts only the
runtime listener and cannot move the chart-rendered Service.

Addresses review feedback on kagent-dev#1803.

Signed-off-by: Daniel Orbach <ddorbach@gmail.com>
@danielorbach
Copy link
Copy Markdown
Author

danielorbach commented May 6, 2026

Heads-up on a divergence with the kmcp sub-chart that ships in this same umbrella. Two of the latest Copilot review findings exist verbatim in kmcp's metrics templates today and have been fixed here:

  • regexFind "[0-9]+" extracting the wrong digit run for host-qualified bind addresses (so 127.0.0.1:8443 resolves to 127) — see #discussion_r3192371162 and #discussion_r3192371168, fixed in 72aafde.
  • bindAddress: "0" rendering containerPort: 0 instead of disabling metrics — see #discussion_r3192371148, fixed in 7405b04.

That means kagent and kmcp now diverge on metrics-template behaviour. If convergence matters more than the bug-fixes, two paths are open:

  1. Keep this branch as-is and ship a follow-up PR porting the fixes to kmcp.
  2. Drop the divergent commits from this PR (72aafde and 7405b04, committed as standalones precisely so they are easy to revert) and send the fixes against kmcp first.

Happy to do either — just say the word.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Helm Chart: Add support for enabling metrics endpoint and optional ServiceMonitor

2 participants