minor: Add logs and metrics observability to the segment upgrade path that fires while using concurrent locking #19651

Open
capistrant wants to merge 1 commit into apache:master from capistrant:segment-upgrade-metrics-realtime

@capistrant

Contributor

Description

I am investigating potential issues where concurrent append and replace actions are not properly accepted and reflected on realtime ingestion tasks, which should upgrade their segments to reflect the concurrent append that happened on an interval they are ingesting into. This PR adds logs and metrics to the related code to give observability into the process and to smoke out whether there are indeed issues. The biggest risk of these new metrics and logs is probably chattiness, since they scale with the volume of upgraded segments that may exist under a cluster's normal operations.

New Metrics

| Metric | Description | Dimensions | Normal value |
|---|---|---|---|
| `ingest/segmentUpgrade/count` | Number of pending segments that a concurrent replace (for example, compaction) upgraded to a new version and asked the supervisor to have running tasks announce under the new version. Emitted by the replace task only when streaming ingestion is running concurrently with replace on the same interval. | dataSource, taskId, taskType, groupId, tags | 0 unless concurrent append and replace is in use. |
| `ingest/segmentUpgrade/notified` | Number of upgraded pending segments the supervisor successfully routed to at least one running task. Compare with `ingest/segmentUpgrade/count`: count should equal notified + unmatched over the same period and dataSource. | supervisorId, dataSource, stream, tags | Equal to count in a healthy cluster. |
| `ingest/segmentUpgrade/unmatched` | Number of upgraded pending segments the supervisor could not route to any running task. These are not announced under the new version until handoff, so the corresponding data may be briefly missing from queries. | supervisorId, dataSource, stream, tags | 0. A non-zero value indicates a lost upgrade announcement. |
| `ingest/segmentUpgrade/sendFailed` | Number of upgrade requests that matched a running task but failed to reach it over the wire after retries. | supervisorId, dataSource, stream, taskId, tags | 0 |
| `ingest/segmentUpgrade/announced` | Number of upgraded segments a task announced under the new version. Emitted once per task, so it scales with the replica count. Do not subtract it directly from `ingest/segmentUpgrade/count`, which is per-segment rather than per-replica. | dataSource, taskId, taskType, groupId, tags | Greater than 0 while concurrent replace occurs. |
| `ingest/segmentUpgrade/skipped` | Number of upgrade requests a task received but did not announce. The reason dimension is one of unknownBase (the request reached the wrong task), noSink (the base sink is no longer present), or dropping (the base sink is handing off, which is benign and covered by the durable publish path). | dataSource, taskId, taskType, groupId, tags, reason | 0, excluding reason=dropping. |
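
To make the expected relationships between these metrics concrete, here is a minimal sketch of a health check an operator might run over metric totals. It is not part of this PR. It assumes the totals have already been summed over the same period and dataSource by whatever metrics store is in use; `UpgradeTotals`, `find_upgrade_anomalies`, and `BENIGN_SKIP_REASONS` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Per the metric table, reason=dropping is benign; other reasons are not.
BENIGN_SKIP_REASONS = {"dropping"}


@dataclass
class UpgradeTotals:
    """Hypothetical container for ingest/segmentUpgrade/* totals.

    Assumes the caller summed each metric over the same period and
    dataSource before building this object.
    """
    count: int = 0          # ingest/segmentUpgrade/count
    notified: int = 0       # ingest/segmentUpgrade/notified
    unmatched: int = 0      # ingest/segmentUpgrade/unmatched
    send_failed: int = 0    # ingest/segmentUpgrade/sendFailed
    # ingest/segmentUpgrade/skipped, keyed by the reason dimension
    skipped_by_reason: Dict[str, int] = field(default_factory=dict)


def find_upgrade_anomalies(totals: UpgradeTotals) -> List[str]:
    """Return human-readable descriptions of values outside the normal range."""
    problems: List[str] = []
    # count should equal notified + unmatched over the same period.
    if totals.count != totals.notified + totals.unmatched:
        problems.append(
            f"count ({totals.count}) != notified ({totals.notified}) "
            f"+ unmatched ({totals.unmatched})"
        )
    # Any unmatched segment indicates a lost upgrade announcement.
    if totals.unmatched > 0:
        problems.append(f"{totals.unmatched} upgraded segment(s) matched no running task")
    if totals.send_failed > 0:
        problems.append(f"{totals.send_failed} upgrade request(s) failed to reach a task")
    # Skips are expected to be 0 except for the benign 'dropping' reason.
    for reason, n in sorted(totals.skipped_by_reason.items()):
        if n > 0 and reason not in BENIGN_SKIP_REASONS:
            problems.append(f"{n} upgrade request(s) skipped with reason={reason}")
    return problems
```

The check deliberately leaves `ingest/segmentUpgrade/announced` out of the arithmetic, since that metric is emitted per task and scales with the replica count rather than with the number of segments.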

Release note

TBD


Key changed/added classes in this PR

TBD


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

…ncurrent append and replace ingestion

mets checkpoint

doc and test

tidy up

self review

@FrankChen021 FrankChen021 left a comment

Member

I have reviewed the code for correctness, edge cases, concurrency/lifecycle, security, data loss, API compatibility, and missing-test risks; no high-confidence issues found.

Reviewed 11 of 11 changed files.


This is an automated review by Codex GPT-5.5
