Expose the controller's /metrics endpoint through a dedicated Service#1803
Expose the controller's /metrics endpoint through a dedicated Service#1803danielorbach wants to merge 7 commits intokagent-dev:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds optional Helm-chart support for exposing the controller manager’s /metrics endpoint, so operators can scrape it without hand-creating a Service and RBAC resources.
Changes:
- Added
controller.metricsvalues for enabling metrics exposure, configuring bind address, secure serving, and Service ports. - Added a dedicated controller metrics Service plus RBAC templates for metrics auth and read access.
- Added Helm unit tests covering the new Service, RBAC, and Deployment metrics wiring.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
helm/kagent/values.yaml |
Adds chart values for controller metrics exposure and service settings. |
helm/kagent/tests/controller-metrics-service_test.yaml |
Adds Helm unittest coverage for the new metrics Service. |
helm/kagent/tests/controller-metrics-rbac_test.yaml |
Adds Helm unittest coverage for metrics RBAC resources. |
helm/kagent/tests/controller-deployment_test.yaml |
Adds Helm unittest coverage for metrics env vars and container port wiring. |
helm/kagent/templates/rbac/metrics-reader-clusterrole.yaml |
Adds unbound ClusterRole granting get on /metrics. |
helm/kagent/templates/rbac/metrics-auth-clusterrolebinding.yaml |
Binds controller SA to the metrics auth role when secure serving is enabled. |
helm/kagent/templates/rbac/metrics-auth-clusterrole.yaml |
Grants TokenReview and SubjectAccessReview permissions for authenticated metrics serving. |
helm/kagent/templates/controller-metrics-service.yaml |
Adds the dedicated Service exposing the controller metrics endpoint. |
helm/kagent/templates/controller-deployment.yaml |
Injects metrics env vars and adds a metrics container port when enabled. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
cfca28e to
43a6b15
Compare
Surface the controller manager's /metrics endpoint through a dedicated Service so cluster scrapers can target it without port-forwarding to individual pods. The new resource is gated on `controller.metrics.enabled` and defaults off so existing installations are unchanged. The Service mirrors the kmcp sub-chart layout (separate metrics Service, named `https` port) to keep the umbrella chart internally consistent and to let users author one ServiceMonitor pattern that targets both controllers. RBAC and Deployment wiring follow in subsequent commits. Signed-off-by: Daniel Orbach <ddorbach@gmail.com>
The controller manager already wires controller-runtime's WithAuthenticationAndAuthorization filter on the metrics endpoint when --metrics-secure is true (the default). For that filter to validate incoming scrape requests it needs permission to issue TokenReview and SubjectAccessReview calls; ship those rights as a dedicated auth role bound to the controller ServiceAccount. Also ship an unbound `<fullname>-metrics-reader` ClusterRole granting `get` on the `/metrics` non-resource URL. The chart deliberately leaves the binding to the cluster operator: which ServiceAccount (Prometheus, VictoriaMetrics, OpenTelemetry, etc.) is allowed to scrape is a deployment-time choice, not a chart-time one. The auth role and its binding are gated additionally on `controller.metrics.secureServing` since they are no-ops when the filter is disabled. All resources are gated on `controller.metrics.enabled` and default off. Signed-off-by: Daniel Orbach <ddorbach@gmail.com>
Add METRICS_BIND_ADDRESS and METRICS_SECURE env vars and a `metrics` containerPort to the controller pod, all gated on `controller.metrics.enabled`. The new env vars are emitted before the `controller.env` passthrough so user-supplied values continue to win (later entries take precedence in env arrays), preserving the existing escape hatch. The controller binary already understands METRICS_BIND_ADDRESS and METRICS_SECURE via its env-var loader (see go/core/pkg/app/app.go), so no controller changes are required. Signed-off-by: Daniel Orbach <ddorbach@gmail.com>
The earlier commits exposed both `controller.metrics.bindAddress` and `controller.metrics.service.targetPort` as independent values, so a user could shift the runtime bind without moving the Service target and end up with a Service that points at a port the controller binary is not listening on. Drop the separate `service.targetPort` knob and derive both the container port and the Service targetPort from `bindAddress` via `regexFind`, matching the `kmcp` sub-chart's pattern. Add a test that asserts the derived ports stay in sync when `bindAddress` is customised, plus a regression test that documents the `controller.env` escape hatch (user-supplied env vars are emitted after chart defaults, so the user's value wins at runtime). The escape hatch still cannot move the Service target; values.yaml now flags this. Addresses review feedback on kagent-dev#1803. Signed-off-by: Daniel Orbach <ddorbach@gmail.com>
43a6b15 to
2260bda
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| - equal: | ||
| path: spec.ports[0].targetPort | ||
| value: 9443 | ||
|
|
There was a problem hiding this comment.
`regexFind "[0-9]+"` returns the first digit run, so a perfectly valid bindAddress like `127.0.0.1:8443` would extract `127` and the rendered Service would point at the wrong port. Anchor the match to the end of the string with `[0-9]+$` so every Go-style address form yields the trailing port: bare `:port`, host-qualified `host:port`, and bracketed IPv6 `[::1]:port` all parse correctly. The shared logic lives behind a `kagent.controller.metricsPort` helper in `_helpers.tpl` so the deployment containerPort and the Service targetPort cannot fall out of sync if the parsing rule evolves. Disabling metrics is exclusively the job of `controller.metrics.enabled=false`; the chart does not engage with the controller binary's `bindAddress=0` disable sentinel, since wiring two ways to express the same intent only invites contradictory configuration. Addresses review feedback on kagent-dev#1803. Signed-off-by: Daniel Orbach <ddorbach@gmail.com>
The earlier note suggested keeping `service.port` aligned with a runtime override of `METRICS_BIND_ADDRESS`, but `service.port` is the cluster-facing front-door port and has no effect on the Service backend. The Service `targetPort` and the pod `containerPort` are both derived from `bindAddress` at template time, so overriding the env var via `controller.env` cannot move the rendered Service at all — that operational consequence is what the comment now states, with `bindAddress` flagged as the only chart-side knob that moves the listener and the Service together. Addresses review feedback on kagent-dev#1803. Signed-off-by: Daniel Orbach <ddorbach@gmail.com>
The `controller.metrics.bindAddress` value mirrors the controller's `--metrics-bind-address` flag, and that flag's documented contract treats `0` as "disable the metrics server" (see go/core/pkg/app/app.go:145-146). Letting the chart silently render an invalid `containerPort: 0` when the operator follows that contract is surprising; treat the binary's disable sentinel as another way of saying `controller.metrics.enabled=false`. Wrap the existing `kagent.controller.metricsPort` helper with a new `kagent.controller.metricsEnabled` helper that returns truthy only when `enabled` is true *and* the resolved port is a real, non-zero value. Switch the Service, both ClusterRoles, the ClusterRoleBinding, and the Deployment env/port wiring over to the new helper so the two disable signals are equivalent in their effect on rendered manifests. The escape-hatch behaviour is unchanged: setting `METRICS_BIND_ADDRESS` via `controller.env` still shifts only the runtime listener and cannot move the chart-rendered Service. Addresses review feedback on kagent-dev#1803. Signed-off-by: Daniel Orbach <ddorbach@gmail.com>
|
Heads-up on a divergence with the
That means kagent and kmcp now diverge on metrics-template behaviour. If convergence matters more than the bug-fixes, two paths are open:
Happy to do either — just say the word. |
Operators deploying through this chart can now point Prometheus, VictoriaMetrics, or any other authenticated scraper at the controller's
/metricsendpoint without hand-rolling a Service, ClusterRoles, and a binding. The controller binary already understands--metrics-bind-addressand--metrics-securevia flags or env vars (the env-var loader atgo/core/pkg/app/app.gotranslates every flag automatically), but the chart had no way to surface the endpoint short of stuffing values intocontroller.envand writing the missing Service and RBAC by hand. Closes #1369.The shape mirrors the
kmcpsub-chart that already lives in this same umbrella: a separate<fullname>-controller-metricsService exposinghttps:8443, two ClusterRoles (metrics-auth-rolegranting the controller's ServiceAccount the right to callTokenReviewandSubjectAccessReviewagainst the API server;metrics-readergrantinggeton the/metricsnon-resource URL), and a ClusterRoleBinding wiring only the auth role to the controller. The reader role is deliberately left unbound — picking a subject (Prometheus, VictoriaMetrics, an OpenTelemetry collector) belongs to the cluster operator, not the chart. This is also whatkubebuilderandoperator-sdkscaffold by default, for the same reason.The auth role and its binding are gated additionally on
controller.metrics.secureServingbecause they're no-ops when the authn/authz filter isn't installed. The deployment's existingcontroller.envpass-through still wins over the chart-supplied defaults (the conditional env vars are emitted before the user-supplied list), so anyone who needsMETRICS_BIND_ADDRESS=0for debugging or--metrics-cert-pathwired throughvolumes/volumeMountsretains the escape hatch.Designed exclusions:
ServiceMonitoris deferred — it pulls in a Prometheus-Operator coupling the rest of the chart has avoided, and it's straightforward to layer on top once the underlying Service exists, gated oncontroller.metrics.serviceMonitor.enabled. Cert-manager-issued metrics certificates are also out of scope; controller-runtime's self-signed default works for in-cluster scrapers that skip TLS verification or trust the kubelet CA, and the existing cert-path flags remain reachable viacontroller.envfor those who want to wire their own. Default is opt-in (controller.metrics.enabled: false), so existing installations are unchanged.