CP-41221: add chart-wide env injection via defaults.env #832

Open: evan-cz wants to merge 1 commit into develop from CP-41221

@evan-cz evan-cz commented May 29, 2026

The chart has no way to inject environment variables into the containers
it manages. Users who need to set chart-wide env — most commonly
HTTP_PROXY / HTTPS_PROXY / NO_PROXY for clusters behind a corporate
egress proxy, but also any other binary-level configuration — have to
fork the chart or maintain external overlays that patch every
Deployment, Job, DaemonSet, and sidecar by hand.

Implementation Approach:

The fix is a generic `defaults.env` value: a list of EnvVar entries that
gets merged into every chart-managed container's env block. Every binary
the chart deploys resolves its proxy through Go's
`http.ProxyFromEnvironment`, so the standard proxy env vars are sufficient
for the proxy use case, and the same field covers any other chart-wide env
need without a dedicated API per use case.
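
As a rough illustration of the proxy use case, a values override might look like the sketch below. Only the `defaults.env` key and the variable names come from this description; the proxy hostname, the CIDRs, and the API-server IP are placeholders invented for the example and must be replaced with cluster-specific values.

```yaml
# Values override (sketch). Hostname, CIDRs, and IPs are placeholders, not chart defaults.
defaults:
  env:
    - name: HTTP_PROXY
      value: "http://proxy.example.internal:3128"
    - name: HTTPS_PROXY
      value: "http://proxy.example.internal:3128"
    # NO_PROXY keeps in-cluster and metadata traffic off the proxy.
    # Cluster-specific entries (assumed values): pod CIDR, service CIDR, kube-apiserver IP.
    # Generic entries: in-cluster Service DNS suffixes and the link-local metadata IP.
    - name: NO_PROXY
      value: "localhost,127.0.0.1,10.0.0.0/14,10.4.0.0/20,10.4.0.1,.svc,.cluster.local,169.254.169.254"
```

A file like this would normally be passed to Helm with `-f` alongside the user's existing values.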

A new `cloudzero-agent.generateEnv` helper in `helm/templates/_helpers.tpl`
performs the merge. It follows the same precedence pattern as the existing
`generateLabels` and `generateAnnotations` helpers: sources are passed as an
ordered list, a later source overrides an earlier one when both define the
same `name`, the result is emitted as an `env:` block, and nothing renders
when the merged result is empty. Output order follows first appearance: an
overridden entry keeps the position where its `name` was first seen, and
only its contents change. The mechanisms differ underneath:
`generateLabels`/`generateAnnotations` merge dicts via `mergeOverwrite`,
while `generateEnv` merges lists keyed by `name`, so the helpers are
analogous in precedence but not in implementation.
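
The merge rule is easier to see on a small worked example. The sketch below is illustrative only: `lowPrioritySource`, `highPrioritySource`, and `merged` are labels made up for this example rather than chart values, and `LOG_LEVEL` is a hypothetical variable.

```yaml
# Illustration only: the three top-level keys are labels, not chart values.
lowPrioritySource:
  - name: LOG_LEVEL          # hypothetical variable
    value: "info"
  - name: HTTPS_PROXY
    value: "http://proxy-a.example.internal:3128"
highPrioritySource:
  - name: HTTPS_PROXY        # same name: overrides the entry above
    value: "http://proxy-b.example.internal:3128"
  - name: POD_IP             # new name: appended after existing entries
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
merged:
  - name: LOG_LEVEL
    value: "info"
  - name: HTTPS_PROXY        # keeps its first-seen position, takes the later contents
    value: "http://proxy-b.example.internal:3128"
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
```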

Precedence (lowest → highest priority) at every container's `generateEnv`
call site:

  1. `.Values.defaults.env`: chart-wide user-supplied defaults; lowest
     priority.
  2. Component-specific user env (e.g. `.Values.server.env` on the
     Prometheus container): wins over the chart-wide defaults.
  3. Chart-emitted helper output (e.g. `validatorEnv`).
  4. Chart-emitted hardcoded literals (`SERVER_PORT`, `NODE_NAME`,
     `HOSTNAME` fieldRefs): highest priority. These are load-bearing
     for chart correctness and must not be overridable from values.

To override a chart-emitted entry on a single component (for example, to
tweak `K8S_NAMESPACE` on the Prometheus server only), use that
component's env value rather than `defaults.env`.
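
A values sketch of how the tiers interact, under the precedence listed above. `LOG_LEVEL` is a hypothetical variable chosen only to show a collision between the two user-facing tiers, and the `SERVER_PORT` value is arbitrary.

```yaml
# Values sketch: LOG_LEVEL is hypothetical and only demonstrates the tiers.
defaults:
  env:
    - name: LOG_LEVEL      # tier 1: applies to every chart-managed container
      value: "info"
    - name: SERVER_PORT    # collides with a chart-emitted literal (tier 4), so wherever
      value: "9999"        # the chart emits SERVER_PORT itself, the chart's value renders
server:
  env:
    - name: LOG_LEVEL      # tier 2: wins over defaults.env on the Prometheus container only
      value: "debug"
```

With these values, the Prometheus container would be expected to see `LOG_LEVEL=debug` while every other chart-managed container sees `LOG_LEVEL=info`.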

Functional Requirements:

1. A user must be able to set environment variables on every container
   the chart manages with a single values change.

   Added `defaults.env` (typed as a list of K8s EnvVar entries) to
   `helm/values.yaml` and `helm/values.schema.yaml`. Every container in
   `helm/templates/*.yaml` now calls `generateEnv` with
   `.Values.defaults.env` as the LOWEST-priority source — aggregator
   collector and shipper, agent prometheus-server, agent alloy, both
   validator init containers, the configmap-reload sidecars, the agent
   daemonset's config-subst init and prometheus-server, the webhook
   server, the backfill init-scrape, the config-loader run-validator,
   helmless, and init-cert. The kube-state-metrics subchart is
   intentionally untouched (no outbound traffic; users can set
   `kubeStateMetrics.env` directly if needed).

2. Chart-emitted env entries (`validatorEnv`, hardcoded `SERVER_PORT` /
   `NODE_NAME` / `HOSTNAME`) must continue to render and win over
   `defaults.env` on name collision, and `.Values.server.env` must
   continue to apply to the Prometheus container as a middle-tier
   override.

   `generateEnv` takes a list of env-entry lists and merges them in
   order, with later sources overriding earlier sources on `name`
   collision. Each call site places `.Values.defaults.env` first
   (lowest priority), then `.Values.server.env` where applicable
   (middle), then chart-emitted helpers / literals last (highest).

3. The values schema must enforce the shape of `defaults.env` at
   chart-render time, and the enforcement must be regression-tested.

   `defaults.env` is typed in `helm/values.schema.yaml` as a list whose
   entries use the K8s `io.k8s.api.core.v1.EnvVar` `$ref`, so malformed
   entries (wrong type, missing `name`, bad `valueFrom`) are rejected
   before the template engine sees them. Added
   `tests/helm/schema/defaults.env.valid.pass.yaml` and
   `tests/helm/schema/defaults.env.invalid.fail.yaml` so that
   `make helm-test-schema` checks both paths: the valid fixture is a
   typical list using the `value`, `secretKeyRef`, and `fieldRef` shapes,
   and the invalid fixture sets `env: "not-an-array"` (a string where a
   list is required).

4. The `values.yaml` comment for `defaults.env` must accurately describe
   what the field does and what `NO_PROXY` needs to cover.

   The comment states the precedence rule (`defaults.env` is lowest;
   chart-emitted entries win), explains how to override a chart-emitted
   entry on a specific component, and includes a worked `NO_PROXY` example
   with the explanation placed above the entry. It documents the
   cluster-specific entries the user must supply (pod CIDR, service
   CIDR, kube-apiserver IP) on top of the standard in-cluster Service
   DNS suffixes and cloud-provider instance metadata IPs.

5. The helper's merge contract must be unit-tested.

   `helm/tests/defaults_env_test.yaml` covers:

   - propagation to every chart-managed container;
   - `valueFrom` preservation as a deep dict;
   - the empty-input case (no `env:` block rendered);
   - the precedence rules: `defaults.env` does NOT override chart-emitted
     `SERVER_PORT` or `validatorEnv`; `.Values.server.env` DOES override
     `defaults.env` on the Prometheus container; first-seen position is
     preserved when a later source overrides by name;
   - the backfill CronJob's
     `spec.jobTemplate.spec.template.spec.containers[0].env` path on
     documentIndex 0 alongside the Job's flat path on documentIndex 1.
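
For readers unfamiliar with helm-unittest, a case of the kind described in requirement 5 might be sketched as follows. This is not the contents of `helm/tests/defaults_env_test.yaml`: the template file name is a placeholder, the proxy URL is invented, and the chart may need further required values set before it renders.

```yaml
# helm-unittest sketch. The template path is a placeholder; the chart may also
# need additional required values set before it renders.
suite: defaults.env propagation (sketch)
templates:
  - templates/backfill-job.yaml   # hypothetical file name
tests:
  - it: renders defaults.env on the backfill CronJob and Job
    set:
      defaults:
        env:
          - name: HTTPS_PROXY
            value: http://proxy.example.internal:3128
    asserts:
      # CronJob: the container list sits under spec.jobTemplate.
      - contains:
          path: spec.jobTemplate.spec.template.spec.containers[0].env
          content:
            name: HTTPS_PROXY
            value: http://proxy.example.internal:3128
        documentIndex: 0
      # Job: the flat pod-template path.
      - contains:
          path: spec.template.spec.containers[0].env
          content:
            name: HTTPS_PROXY
            value: http://proxy.example.internal:3128
        documentIndex: 1
```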

Validation:

- `make helm-test` clean (569/569 helm-unittest cases pass plus the new
  schema validation tests; `helm-test-template` regenerates the goldens
  with the new env blocks; `helm-lint` clean).
- Deployed to a GKE cluster with a default-deny egress NetworkPolicy
  applied to the target namespace, using an HTTP proxy as the only
  allowed egress route. Confirmed every chart-managed container picks up
  the env vars from `defaults.env` via a `kubectl get pod -o jsonpath`
  sweep across all chart containers and both validator init containers.
  Confirmed the proxy's access log shows only the intended outbound
  destination (api.cloudzero.com) and no in-cluster Service hostnames or
  cloud metadata endpoints. Confirmed the config-loader job reaches
  `api.cloudzero.com` and receives HTTP 403 from the upstream (the expected
  result for a fake API key).
- Two pre-existing inconsistencies the refactor incidentally fixes:
  `validatorEnv` is now always emitted on the Prometheus server containers
  (was previously skipped when `.Values.server.env` was unset, leaving the
  validator lifecycle hooks without `K8S_NAMESPACE` / `K8S_POD_NAME`); and
  `.Values.server.env` can now override `defaults.env` on a per-container
  basis instead of blindly appending.
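
The schema fixtures exercised by `make helm-test-schema` are described under requirement 3 but not shown. A sketch of what the two files could contain is below, written as two YAML documents for brevity. The secret name, key, and variable names are hypothetical, and nesting the fixtures under `defaults:` is an assumption.

```yaml
# Sketch of a passing fixture (e.g. defaults.env.valid.pass.yaml); names are hypothetical.
defaults:
  env:
    - name: HTTPS_PROXY                 # plain `value` shape
      value: "http://proxy.example.internal:3128"
    - name: PROXY_PASSWORD              # `secretKeyRef` shape
      valueFrom:
        secretKeyRef:
          name: proxy-credentials
          key: password
    - name: POD_IP                      # `fieldRef` shape
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
---
# Sketch of a failing fixture (e.g. defaults.env.invalid.fail.yaml):
# a string where the schema requires a list.
defaults:
  env: "not-an-array"
```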

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@evan-cz evan-cz requested a review from a team as a code owner May 29, 2026 03:16