fix(observability): v0.12.1 — CPU mCPU relabel + memory chart labels#20

Merged: mguptahub merged 1 commit into main from arun/v0.12.1-cpu-mcpu-memory-labels on Apr 27, 2026

@eyriehq-bot eyriehq-bot Bot commented Apr 27, 2026

Summary

MGupta deployed v0.12.0 to iwdev and surfaced two issues against the per-pod chart grid:

Issue 1 — CPU values 1000x too large

v0.12.0's max(Value) fix correctly suppressed argMax-induced spike noise (the iwdev average dropped from 84.735 to 22.710), but the unit label was still wrong. Cross-checking against kubectl top pods -n infrawatch for the same iwagent pod and window shows that the displayed rate values match millicores 1:1. In other words, the v0.12.0 simulation's estimate of 0.022–0.035 cores was correct as the pod's real usage, but the chart printed the millicore-scale number (22.710) under a "cores" label, so the reading was overstated by 1000x.

Most likely root cause (TODO to verify upstream): the kubelet-stats pipeline on iwdev appears to emit k8s.pod.cpu.time deltas in milliseconds rather than seconds. Specifically:

  • pod uses ~0.022 cores
  • in 60s of cumulative growth, delta ≈ 0.022 × 60 = 1.32 cpu-seconds = 1320 cpu-ms
  • divided by 60s bucket → 22 — exactly what we display
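
The arithmetic above can be reproduced with a minimal sketch. `ratePerSecond` is a hypothetical helper written for this illustration, not the backend's `rate_per_pod_from_cumulative`, and the 0.022-core usage and 60s bucket are the example figures from the bullets, not measured data.

```javascript
// Hypothetical helper: per-second rate from a cumulative-counter delta.
function ratePerSecond(delta, bucketSeconds) {
  return delta / bucketSeconds;
}

const trueCores = 0.022;   // example usage taken from the bullets above
const bucketSeconds = 60;  // chart bucket width

const deltaCpuSeconds = trueCores * bucketSeconds; // about 1.32 cpu-seconds
const deltaCpuMillis = deltaCpuSeconds * 1000;     // about 1320 cpu-ms

// If the pipeline emitted seconds, the rate would read as cores.
const rateIfSeconds = ratePerSecond(deltaCpuSeconds, bucketSeconds);
// If it emits milliseconds (the suspected case), the same division yields millicores.
const rateIfMillis = ratePerSecond(deltaCpuMillis, bucketSeconds);

console.log(rateIfSeconds.toFixed(3)); // "0.022"
console.log(rateIfMillis.toFixed(1));  // "22.0"
```

The same division produces either reading; only the unit of the incoming delta decides whether the result is cores or millicores, which is why relabelling is sufficient for this cycle.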

Safe fix this cycle (per brief): relabel chart unit to mCPU so the chart frame matches the data the user already sees side-by-side with kubectl top. Backend rate calc is unchanged — only labels + display formatting move.

Issue 2 — Memory chart label leakage

  • Y-axis was rendering the literal `"B/KB/MB/GB"` string (the unit-suffix list, not a label) → changed to "Utilization" (label describes what the chart shows; tick formatter already auto-scales).
  • Header next to avg readout was echoing the same literal as a trailing unit → removed. Auto-scaled values ("31.2M", "22 mCPU") are self-contained.

Changes

| File | Change |
| --- | --- |
| `metrics/components/ServiceChart.jsx` | `resolveUnitLabel` → mCPU / Utilization. `formatMetric` for CPU → mCPU-friendly (one decimal non-compact, integer compact, "<1" floor). Header trailing-unit span removed. Tooltip: CPU appends "mCPU", memory shows auto-scaled value alone. |
| `metrics/components/MetricsTab.jsx` | Page summary unit "cores" → "mCPU". |
| `metrics/router.py` | Per-resource Pod CPU cards (CLUSTER + POD) unit "cores" → "mCPU". v0.12.1 docstring on `_query_pod_cpu_time` recording unit-trace finding + TODO. |
| `_shared/otel_queries.py` | Same TODO on `explorer_cpu_per_pod_query`. |
| `metrics/components/ux.test.js` | `resolveUnitLabel` + `formatMetric` tests rewritten for new labels and millicore formatter. |
| `metrics/plugin.json` | 1.3.0 → 1.3.1. |
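
For readers who do not have the diff open, the formatting rules listed for ServiceChart.jsx could look roughly like the sketch below. It is not the code from this PR: `formatCpuMillicores` and `unitLabelFor` are hypothetical names, and three behaviours are assumptions made for illustration (compact mode rounds to the nearest integer, the "<1" floor applies in both modes, and the returned string carries no unit suffix).

```javascript
// Hypothetical sketch of the described CPU formatter; not the real formatMetric.
function formatCpuMillicores(value, { compact = false } = {}) {
  // Assumption: positive sub-millicore readings collapse to "<1" in both modes.
  if (value > 0 && value < 1) return "<1";
  // Assumption: compact rounds to nearest; non-compact keeps one decimal.
  return compact ? String(Math.round(value)) : value.toFixed(1);
}

// Hypothetical stand-in for the label mapping described for resolveUnitLabel.
function unitLabelFor(metric) {
  return metric === "cpu" ? "mCPU" : "Utilization";
}

console.log(`avg ${formatCpuMillicores(22.71)} ${unitLabelFor("cpu")}`); // "avg 22.7 mCPU"
console.log(formatCpuMillicores(22.2, { compact: true }));               // "22"
console.log(formatCpuMillicores(0.4));                                   // "<1"
```

Keeping the unit out of the formatter is one possible design; it lets the header and tooltip decide whether to append "mCPU", which matches the convention described above where memory shows its auto-scaled value alone.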

Before / after (iwdev, iwagent pod)

| | v0.12.0 deployed | v0.12.1 (this PR) |
| --- | --- | --- |
| Y-axis label | cores | mCPU |
| Header avg | avg 22.710 cores | avg 22.7 mCPU |
| Memory Y-axis label | B/KB/MB/GB | Utilization |
| Memory header avg | avg 31.2M B/KB/MB/GB | avg 31.2M |
| `kubectl top pods -n infrawatch` | ~22m | ~22m |

Numbers now match kubectl top 1:1.

Test plan

  • react-scripts test src/plugins/metrics/components/ux.test.js — 148/148 passing
  • On merge: GitHub Release on plugins repo to trigger image build
  • OSS submodule bump (companion PR on infrawatchlabs/infrawatch)
  • Visual verify on iwdev once helm chart picks up the new image: chart Y-axis shows "mCPU" + "Utilization", header has no trailing literal, average matches kubectl top

Follow-up (not in this PR)

Trace k8s.pod.cpu.time from kubelet → kubeletstats receiver → ClickHouse and confirm whether the upstream unit is genuinely milliseconds, or whether some helm transform is applying a 1000x scale. If we can fix it at source, push the divisor into rate_per_pod_from_cumulative and revert the chart unit to "cores".
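
As a rough illustration of what "push the divisor into the rate calc" could mean, the sketch below applies a unit divisor before dividing by the bucket width. It only mirrors the arithmetic: `coresFromCumulativeDelta` and `unitDivisor` are hypothetical names, the real `rate_per_pod_from_cumulative` lives in the backend and is not reproduced here, and the value 1000 is correct only if the upstream unit is confirmed to be milliseconds.

```javascript
// Hypothetical sketch: convert a cumulative CPU-time delta to cores.
// unitDivisor = 1000 assumes the delta is in cpu-milliseconds (unverified upstream);
// unitDivisor = 1 would apply if the delta is already in cpu-seconds.
function coresFromCumulativeDelta(delta, bucketSeconds, unitDivisor = 1) {
  if (bucketSeconds <= 0) {
    throw new RangeError("bucketSeconds must be positive");
  }
  return delta / unitDivisor / bucketSeconds;
}

// Using the figures from the Summary: 1320 cpu-ms over a 60s bucket.
const cores = coresFromCumulativeDelta(1320, 60, 1000);
console.log(cores.toFixed(3)); // "0.022"
```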

MGupta deployed v0.12.0 to iwdev and surfaced two issues:

ISSUE 1 — CPU values are 1000x too large
  v0.12.0's ``max(Value)`` fix correctly suppressed argMax-induced spike noise
  (iwdev average dropped from 84.735 to 22.710), but the unit label was still
  wrong: 22.710 matches ``kubectl top``'s millicore reading 1:1 for the same
  pod, NOT cores. Empirically the kubelet-stats pipeline on iwdev appears to
  emit ``k8s.pod.cpu.time`` deltas in milliseconds rather than seconds, so the
  rate calc produces millicores. Tracing the upstream unit + pushing a divisor
  into ``rate_per_pod_from_cumulative`` is the proper fix; for now we relabel
  the chart unit to **mCPU** so the chart frame matches the data the user sees
  next to ``kubectl top``. Affects:
    - ServiceChart.jsx: resolveUnitLabel("cpu") → "mCPU"
    - ServiceChart.jsx: formatMetric for CPU now mCPU-friendly (one decimal
      non-compact, integer compact, "<1" floor)
    - MetricsTab.jsx page summary: "cores" → "mCPU"
    - router.py per-resource Pod CPU cards: unit "cores" → "mCPU"
  Backend rate computation is unchanged — only labels + display formatting.
  TODO documented in router.py + otel_queries.py: trace upstream kubelet-stats
  unit and push divisor into rate calc once confirmed.

ISSUE 2 — Memory chart label leakage
  ``resolveUnitLabel("memory")`` was returning the literal unit-suffix list
  "B/KB/MB/GB" — the format string itself, not a chart label. The Y-axis
  rendered that string verbatim, and the chart header next to the avg readout
  echoed it again ("avg 31.2M B/KB/MB/GB"). Fixes:
    - Y-axis label → "Utilization" (label describes WHAT the chart shows;
      the auto-scaled tick formatter carries the unit on each tick).
    - Header trailing-unit ``<span>`` removed for both metrics — auto-scaled
      values ("31.2M", "22 mCPU") are self-contained.
    - Tooltip mirrors header convention: CPU appends "mCPU", memory shows
      the auto-scaled value alone.

Tests: 148/148 passing (resolveUnitLabel + formatMetric tests rewritten for
the new mCPU/Utilization labels and millicore-friendly CPU formatter).

Bumps metrics plugin to 1.3.1.
@mguptahub mguptahub merged commit 61eca0a into main Apr 27, 2026
1 check failed
@mguptahub mguptahub deleted the arun/v0.12.1-cpu-mcpu-memory-labels branch April 27, 2026 09:05