
metrics: show error time in changefeed error details panel #5086

Merged
ti-chi-bot[bot] merged 1 commit into pingcap:master from wlwilliamx:codex/changefeed-error-time-panel
May 18, 2026

Conversation

@wlwilliamx
Collaborator

@wlwilliamx wlwilliamx commented May 18, 2026

What problem does this PR solve?

Issue Number: close #5085

What is changed and how it works?

  • Expose the current changefeed error occurrence time through the ticdc_owner_changefeed_error_info metric as an error_time label.
  • Render error_time in the Changefeed Error Details Grafana panel for both standard and next-generation dashboards.
  • Format non-zero error times as UTC RFC3339 strings and leave missing historical timestamps blank.
  • Add unit coverage for the new metric-label formatting behavior.

Check List

Tests

  • Unit test
  • Manual test

Questions

Will it cause performance regression or break compatibility?

No expected performance regression. This extends the current error-info metric with one stable label for the active error occurrence time. Existing selectors remain valid, while consumers that depend on the exact label set should account for the added error_time label.

Do you need to update user documentation, design documentation or monitoring documentation?

No separate documentation update is needed. The affected Grafana dashboard definitions are updated in this PR.

Release note

None

Summary by CodeRabbit

Release Notes

  • New Features

    • Changefeed error metrics now capture the time errors occurred (UTC format) for improved troubleshooting and error tracking.
    • Updated monitoring dashboards to display error timestamps alongside error details.
  • Tests

    • Added unit tests for error timestamp handling and metric label validation.


@ti-chi-bot ti-chi-bot Bot added the release-note-none and size/L labels May 18, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds an error_time label to the ticdc_owner_changefeed_error_info Prometheus metric to record when changefeed errors and warnings occurred. The changes update the metric definition, add a time normalization helper, and modify the Grafana dashboards to display the new label. The review raises one concern: using timestamps as Prometheus labels can cause high-cardinality issues, which could degrade the performance of the monitoring system.

Comment thread coordinator/helper.go
Comment on lines +89 to +96
func normalizeChangefeedErrorMetricTime(errorTime time.Time) string {
	// Keep the label stable across nodes with different local time zones while remaining
	// directly readable in Grafana's table view.
	if errorTime.IsZero() {
		return ""
	}
	return errorTime.UTC().Format(time.RFC3339)
}

medium

Using a timestamp as a Prometheus label is generally discouraged because it can lead to high cardinality issues. Every unique timestamp creates a new time series in the Prometheus TSDB. If errors occur frequently or changefeeds flap, this could significantly increase the memory and storage usage of the monitoring system.

Since message is already a label in this metric, adding error_time further exacerbates the risk. Consider if this information could be provided as the gauge value instead. If the goal is to show it in a Grafana table alongside other labels, you might explore using Grafana's ability to join metrics or formatting the gauge value as a date in the dashboard, although this can be more complex to implement than using labels.

References
  1. Prometheus labels should have low cardinality to avoid performance issues in the monitoring stack. Timestamps are high-cardinality data.

@coderabbitai
Contributor

coderabbitai Bot commented May 18, 2026

📝 Walkthrough


This PR adds error occurrence time tracking to changefeed error metrics. The ticdc_owner_changefeed_error_info Prometheus metric gains an error_time label, the coordinator extracts and normalizes error timestamps to UTC RFC3339 strings, and three Grafana dashboards are updated to display the new column in error details tables.

Changes

Changefeed Error Time Tracking

| Layer / File(s) | Summary |
| --- | --- |
| Metric definition with error_time label (pkg/metrics/changefeed.go) | ChangefeedErrorInfoGauge is extended to include error_time as a label alongside keyspace, changefeed, state, code, and message, with updated help text documenting the timestamp field. |
| Error time collection and normalization (coordinator/helper.go) | changefeedErrorMetricLabels struct adds an errorTime field, normalizeChangefeedErrorMetricTime helper formats non-zero time.Time values as UTC RFC3339 strings (empty string for zero), and getChangefeedErrorMetricLabels populates the new label from runningErr.Time. |
| Testing error time normalization (coordinator/helper_test.go) | Two unit tests verify that getChangefeedErrorMetricLabels includes UTC-normalized error timestamps and that zero-valued timestamps are rendered as empty strings. |
| Grafana dashboard panel updates (metrics/grafana/ticdc_new_arch.json, metrics/nextgengrafana/ticdc_new_arch_next_gen.json, metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json) | Three dashboard configurations add error_time column definitions with 180-pixel width overrides, update PromQL max by (...) aggregations to include error_time, and reorder table column mappings to place the timestamp between state and code columns. |
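
The dashboard queries need error_time inside the `max by (...)` clause because a PromQL `by` aggregation keeps only the listed labels. The following is a rough simulation of that grouping rule in plain Go, not PromQL itself and not code from the PR; the `sample` type, the `maxBy` helper, and the sample data are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sample is a simplified stand-in for one Prometheus series sample.
type sample struct {
	labels map[string]string
	value  float64
}

// maxBy mimics the grouping behavior of PromQL `max by (...)`: each output
// group is identified only by the listed labels and carries the maximum value.
func maxBy(samples []sample, by ...string) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range samples {
		parts := make([]string, 0, len(by))
		for _, name := range by {
			parts = append(parts, name+"="+s.labels[name])
		}
		key := strings.Join(parts, ",")
		if cur, ok := out[key]; !ok || s.value > cur {
			out[key] = s.value
		}
	}
	return out
}

func main() {
	samples := []sample{{
		labels: map[string]string{
			"changefeed": "cf-1",
			"state":      "warning",
			"error_time": "2026-05-18T11:10:00Z",
		},
		value: 1,
	}}

	// Compare grouping without and with error_time in the by-clause.
	for _, by := range [][]string{
		{"changefeed", "state"},
		{"changefeed", "state", "error_time"},
	} {
		groups := maxBy(samples, by...)
		keys := make([]string, 0, len(groups))
		for k := range groups {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		fmt.Println(keys)
	}
}
```

With the first label list the timestamp is absent from the result, so a table panel would have nothing to put in the new column. With the second it survives the aggregation.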

Sequence Diagram

sequenceDiagram
  participant RunningError
  participant getChangefeedErrorMetricLabels
  participant normalizeChangefeedErrorMetricTime
  participant changefeedErrorMetricLabels
  RunningError->>getChangefeedErrorMetricLabels: provides Time field
  getChangefeedErrorMetricLabels->>normalizeChangefeedErrorMetricTime: passes runningErr.Time
  normalizeChangefeedErrorMetricTime->>getChangefeedErrorMetricLabels: returns RFC3339 string or empty
  getChangefeedErrorMetricLabels->>changefeedErrorMetricLabels: populates errorTime label

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • pingcap/ticdc#4499: Introduces the changefeed error metrics foundation that this PR extends with error occurrence time tracking and label normalization.

Suggested labels

lgtm, approved, size/L, release-note

Suggested reviewers

  • wk989898
  • lidezhu
  • tenfyzhong

Poem

🐰 A timestamp hops into the error flow,
UTC-normalized, gentle and slow,
Grafana panels now show when things break—
Debugging's faster with this time-aware shake! 🕐✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly and specifically describes the main change: adding error time visibility to the changefeed error details panel through metrics. |
| Linked Issues check | ✅ Passed | All code changes directly address the objective from issue #5085: exposing error occurrence time via the metric label and rendering it in Grafana dashboards. |
| Out of Scope Changes check | ✅ Passed | All changes are in scope. The PR modifies only the metric definition, related helper functions, tests, and Grafana dashboard configurations to expose and display error time. |
| Description check | ✅ Passed | PR description is mostly complete with issue reference, clear explanation of changes, test confirmation, and compatibility assessment. |


@wlwilliamx
Collaborator Author

/test all

@ti-chi-bot ti-chi-bot Bot added the needs-1-more-lgtm and approved labels May 18, 2026
@ti-chi-bot ti-chi-bot Bot added the lgtm label May 18, 2026
@ti-chi-bot

ti-chi-bot Bot commented May 18, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lidezhu, wk989898

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot removed the needs-1-more-lgtm label May 18, 2026
@ti-chi-bot

ti-chi-bot Bot commented May 18, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-05-18 11:10:00.284470228 +0000 UTC m=+176129.788600904: ☑️ agreed by wk989898.
  • 2026-05-18 11:39:43.32588438 +0000 UTC m=+177912.830015046: ☑️ agreed by lidezhu.

@ti-chi-bot ti-chi-bot Bot merged commit 7b68b70 into pingcap:master May 18, 2026
26 checks passed
@wlwilliamx wlwilliamx added the needs-cherry-pick-release-8.5 label May 20, 2026
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #5104.
But this PR has conflicts, please resolve them!


Labels

approved, lgtm, needs-cherry-pick-release-8.5, release-note-none, size/L


Development

Successfully merging this pull request may close these issues.

Changefeed Error Details panel should show error occurrence time

4 participants