metrics: show error time in changefeed error details panel#5086
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces an error_time label to the changefeed_error_info Prometheus metric to track the occurrence time of changefeed errors and warnings. The changes include updating the metric definition, adding a time normalization helper, and modifying Grafana dashboards to display the new label. Feedback was provided regarding the potential for high cardinality issues when using timestamps as Prometheus labels, which could negatively impact monitoring system performance.
| func normalizeChangefeedErrorMetricTime(errorTime time.Time) string { | ||
| // Keep the label stable across nodes with different local time zones while remaining | ||
| // directly readable in Grafana's table view. | ||
| if errorTime.IsZero() { | ||
| return "" | ||
| } | ||
| return errorTime.UTC().Format(time.RFC3339) | ||
| } |
There was a problem hiding this comment.
Using a timestamp as a Prometheus label is generally discouraged because it can lead to high cardinality issues. Every unique timestamp creates a new time series in the Prometheus TSDB. If errors occur frequently or changefeeds flap, this could significantly increase the memory and storage usage of the monitoring system.
Since message is already a label in this metric, adding error_time further exacerbates the risk. Consider if this information could be provided as the gauge value instead. If the goal is to show it in a Grafana table alongside other labels, you might explore using Grafana's ability to join metrics or formatting the gauge value as a date in the dashboard, although this can be more complex to implement than using labels.
References
- Prometheus labels should have low cardinality to avoid performance issues in the monitoring stack. Timestamps are high-cardinality data.
📝 WalkthroughWalkthroughThis PR adds error occurrence time tracking to changefeed error metrics. The ChangesChangefeed Error Time Tracking
Sequence DiagramsequenceDiagram
participant RunningError
participant getChangefeedErrorMetricLabels
participant normalizeChangefeedErrorMetricTime
participant changefeedErrorMetricLabels
RunningError->>getChangefeedErrorMetricLabels: provides Time field
getChangefeedErrorMetricLabels->>normalizeChangefeedErrorMetricTime: passes runningErr.Time
normalizeChangefeedErrorMetricTime->>getChangefeedErrorMetricLabels: returns RFC3339 string or empty
getChangefeedErrorMetricLabels->>changefeedErrorMetricLabels: populates errorTime label
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/test all |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: lidezhu, wk989898 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
In response to a cherrypick label: new pull request created to branch |
What problem does this PR solve?
Issue Number: close #5085
What is changed and how it works?
ticdc_owner_changefeed_error_infometric as anerror_timelabel.error_timein theChangefeed Error DetailsGrafana panel for both standard and next-generation dashboards.Check List
Tests
Questions
Will it cause performance regression or break compatibility?
No expected performance regression. This extends the current error-info metric with one stable label for the active error occurrence time. Existing selectors remain valid, while consumers that depend on the exact label set should account for the added
error_timelabel.Do you need to update user documentation, design documentation or monitoring documentation?
No separate documentation update is needed. The affected Grafana dashboard definitions are updated in this PR.
Release note
Summary by CodeRabbit
Release Notes
New Features
Tests