minor: Add logs and metrics observability to the segment upgrade path that fires while using concurrent locking by capistrant · Pull Request #19651 · apache/druid

capistrant · 2026-07-02T20:00:36Z

Description

I am investigating potential issues where concurrent append and replace actions are not properly accepted and reflected on realtime ingestion tasks, which should upgrade their segments to reflect the concurrent append that happened on an interval they are ingesting into. This PR adds logs and metrics to the related code to give observability into the process and to reveal whether there are indeed issues. The biggest risk of these new metrics and logs is probably chattiness, since they scale with the volume of upgraded segments that may exist under a cluster's normal operations.

New Metrics

| Metric | Description | Dimensions | Normal value |
| --- | --- | --- | --- |
| `ingest/segmentUpgrade/count` | Number of pending segments that a concurrent replace (for example, compaction) upgraded to a new version and asked the supervisor to have running tasks announce under the new version. Emitted by the replace task only when streaming ingestion is running concurrently with replace on the same interval. | `dataSource`, `taskId`, `taskType`, `groupId`, `tags` | 0 unless concurrent append and replace is in use. |
| `ingest/segmentUpgrade/notified` | Number of upgraded pending segments the supervisor successfully routed to at least one running task. Compare with `ingest/segmentUpgrade/count`: `count` should equal `notified` + `unmatched` over the same period and `dataSource`. | `supervisorId`, `dataSource`, `stream`, `tags` | Equal to `count` in a healthy cluster. |
| `ingest/segmentUpgrade/unmatched` | Number of upgraded pending segments the supervisor could not route to any running task. These are not announced under the new version until handoff, so the corresponding data may be briefly missing from queries. | `supervisorId`, `dataSource`, `stream`, `tags` | 0. A non-zero value indicates a lost upgrade announcement. |
| `ingest/segmentUpgrade/sendFailed` | Number of upgrade requests that matched a running task but failed to reach it over the wire after retries. | `supervisorId`, `dataSource`, `stream`, `taskId`, `tags` | 0 |
| `ingest/segmentUpgrade/announced` | Number of upgraded segments a task announced under the new version. Emitted once per task, so it scales with the replica count. Do not subtract it directly from `ingest/segmentUpgrade/count`, which is per-segment rather than per-replica. | `dataSource`, `taskId`, `taskType`, `groupId`, `tags` | Greater than 0 while concurrent replace occurs. |
| `ingest/segmentUpgrade/skipped` | Number of upgrade requests a task received but did not announce. The `reason` dimension is one of `unknownBase` (the request reached the wrong task), `noSink` (the base sink is no longer present), or `dropping` (the base sink is handing off, which is benign and covered by the durable publish path). | `dataSource`, `taskId`, `taskType`, `groupId`, `tags`, `reason` | 0, excluding `reason=dropping`. |
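
The "Normal value" column implies a few checks an operator could run against these metrics. The following is a minimal sketch of those checks, not code from this PR. It assumes the metric values have already been summed per `dataSource` over the same time window by whatever metrics backend is in use; the `UpgradeTotals` class and `find_anomalies` function are hypothetical names introduced only for illustration. Because `count` is emitted by the replace task and `notified`/`unmatched` by the supervisor, a short window may show a transient mismatch, so a real alert would likely want a tolerance or a longer window.

```python
from dataclasses import dataclass, field


@dataclass
class UpgradeTotals:
    """Summed ingest/segmentUpgrade/* values for one dataSource and window.

    Hypothetical container: the field names mirror the metric suffixes,
    but how the sums are obtained is left to the metrics backend.
    """
    count: int = 0
    notified: int = 0
    unmatched: int = 0
    send_failed: int = 0
    # Totals for ingest/segmentUpgrade/skipped, keyed by the `reason` dimension.
    skipped: dict = field(default_factory=dict)


def find_anomalies(totals: UpgradeTotals) -> list:
    """Return human-readable descriptions of values outside the normal range."""
    problems = []

    # count should equal notified + unmatched over the same period.
    if totals.count != totals.notified + totals.unmatched:
        problems.append(
            f"count={totals.count} != notified+unmatched="
            f"{totals.notified + totals.unmatched}"
        )

    # A non-zero unmatched value indicates a lost upgrade announcement.
    if totals.unmatched > 0:
        problems.append(f"unmatched={totals.unmatched}")

    # sendFailed is expected to be 0.
    if totals.send_failed > 0:
        problems.append(f"sendFailed={totals.send_failed}")

    # skipped is expected to be 0 once the benign reason=dropping is excluded.
    non_benign = sum(v for reason, v in totals.skipped.items() if reason != "dropping")
    if non_benign > 0:
        problems.append(f"skipped (excluding dropping)={non_benign}")

    return problems
```

`ingest/segmentUpgrade/announced` is deliberately left out of the comparison, since it is emitted once per task and scales with the replica count rather than the segment count.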

Release note

TBD

Key changed/added classes in this PR

TBD

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
a release note entry in the PR description.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added or updated version, license, or notice information in licenses.yaml
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
added integration tests.
been tested in a test Druid cluster.

FrankChen021

I have reviewed the code for correctness, edge cases, concurrency/lifecycle, security, data loss, API compatibility, and missing-test risks; no high-confidence issues found.

Reviewed 11 of 11 changed files.

This is an automated review by Codex GPT-5.5

Add logs and metrics observability to the segment upgrade path for co…

58381a5

…ncurrent append and replace ingestion mets checkpoint doc and test tidy up self review

github-actions Bot added Area - Documentation Area - Metrics/Event Emitting Area - Ingestion labels Jul 2, 2026

FrankChen021 reviewed Jul 3, 2026

capistrant wants to merge 1 commit into apache:master from capistrant:segment-upgrade-metrics-realtime
