feat(nns): add maturity modulation freshness metrics#10261
feat(nns): add maturity modulation freshness metrics#10261jasonz-dfinity wants to merge 1 commit into
Conversation
Add two Prometheus gauges to the NNS Governance /metrics endpoint so monitoring can alert on stale maturity modulation and on growing gaps in the ICP/XDR rate history that drives it: * governance_maturity_modulation_updated_at_timestamp_seconds * governance_icp_xdr_price_history_missing_days_in_window Both gauges are skipped entirely while their backing state is None so a freshly-installed canister does not trip alerts before the initial ~30-minute backfill completes.
There was a problem hiding this comment.
Pull request overview
This PR adds Prometheus “freshness” metrics to the NNS Governance /metrics endpoint to support alerting on stale maturity modulation computation and gaps in the ICP/XDR price history used to compute it.
Changes:
- Added two new gauges to
encode_metrics: maturity modulation “updated at” (as a Unix timestamp in seconds) and missing ICP/XDR rate days over the last 365-day window. - Re-exported
ONE_DAY_SECONDSfromtimer_tasksfor use in metrics encoding. - Added unit test coverage for
NonevsSomestate behavior and window filtering, and documented the addition in the unreleased changelog.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
rs/nns/governance/unreleased_changelog.md |
Documents the new /metrics gauges for freshness monitoring. |
rs/nns/governance/src/timer_tasks/mod.rs |
Re-exports ONE_DAY_SECONDS from the ICP/XDR rate update task module. |
rs/nns/governance/src/lib.rs |
Encodes the two new freshness gauges into the Governance metrics output. |
rs/nns/governance/src/governance/tests/mod.rs |
Adds a unit test validating emission/omission and correct window counting. |
Comments suppressed due to low confidence (1)
rs/nns/governance/src/lib.rs:719
updated_at_days_since_epoch * ONE_DAY_SECONDScan overflowu64before being cast tof64, which would silently wrap in release builds and produce an incorrect timestamp metric. Consider avoiding integer multiplication overflow by converting operands tof64before multiplying, or usingchecked_mul/saturating_muland skipping/clamping the metric on overflow.
w.encode_gauge(
"governance_maturity_modulation_updated_at_timestamp_seconds",
(updated_at_days_since_epoch * ONE_DAY_SECONDS) as f64,
"Timestamp (seconds since the Unix epoch) of the day for which the cached maturity modulation was last computed.",
)?;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| use storage::VOTING_POWER_SNAPSHOTS; | ||
| use timer_tasks::encode_timer_task_metrics; | ||
| use timer_tasks::{ONE_DAY_SECONDS, encode_timer_task_metrics}; | ||
|
|
| pub(crate) use update_icp_xdr_rate_related_data::{ | ||
| MATURITY_MODULATION_MAX_PERMYRIAD_MISSION_70, MATURITY_MODULATION_MIN_PERMYRIAD_MISSION_70, | ||
| ONE_DAY_SECONDS, |
| test_utils::{MockEnvironment, StubCMC, StubIcpLedger}, | ||
| }; | ||
|
|
||
| const ONE_DAY_SECONDS: u64 = 86_400; | ||
|
|
There was a problem hiding this comment.
This pull request changes code owned by the Governance team. Therefore, make sure that
you have considered the following (for Governance-owned code):
-
Update
unreleased_changelog.md(if there are behavior changes, even if they are
non-breaking). -
Are there BREAKING changes?
-
Is a data migration needed?
-
Security review?
How to Satisfy This Automatic Review
-
Go to the bottom of the pull request page.
-
Look for where it says this bot is requesting changes.
-
Click the three dots to the right.
-
Select "Dismiss review".
-
In the text entry box, respond to each of the numbered items in the previous
section, declare one of the following:
-
Done.
-
$REASON_WHY_NO_NEED. E.g. for
unreleased_changelog.md, "No
canister behavior changes.", or for item 2, "Existing APIs
behave as before.".
Brief Guide to "Externally Visible" Changes
"Externally visible behavior change" is very often due to some NEW canister API.
Changes to EXISTING APIs are more likely to be "breaking".
If these changes are breaking, make sure that clients know how to migrate, how to
maintain their continuity of operations.
If your changes are behind a feature flag, then, do NOT add entrie(s) to
unreleased_changelog.md in this PR! But rather, add entrie(s) later, in the PR
that enables these changes in production.
Reference(s)
For a more comprehensive checklist, see here.
GOVERNANCE_CHECKLIST_REMINDER_DEDUP
Summary
Adds two Prometheus gauges to NNS Governance
/metricsso monitoring can alert on stale maturity modulation and on growing gaps in the ICP/XDR rate history that drives it.governance_maturity_modulation_updated_at_timestamp_seconds— drivesIC_NNS_StaleMaturityModulation. Sourced fromheap_data.maturity_modulation.updated_at_days_since_epoch * 86_400sotime() - Xworks in PromQL.governance_icp_xdr_price_history_missing_days_in_window— drivesIC_NNS_MissingIcpXdrRatesHistory. Count of days in[today-364, today]with no entry inheap_data.icp_price_history.icp_xdr_rates.Both gauges are skipped entirely while the backing state is
None, so a freshly-installed canister does not trip alerts before the initial ~30-minute backfill completes. After PR #10210 added LOCF, isolated gaps no longer block modulation updates; these metrics let us alert before gaps grow large enough to compromise the 7-day or 365-day window.Sequencing
Must merge and roll out to mainnet NNS Governance before the k8s alerts go live, otherwise the alerts will fire on missing metrics. The k8s alert change is blocked on this one rolling out.
Testing
test_metrics_maturity_modulation_and_icp_price_historycovering bothSomeandNonestates forheap_data.maturity_modulationandheap_data.icp_price_history, including a stale entry outside the 365-day window that must not count towarddays_present.