Skip to content

[ISSUE #10580] Fix static topic epoch comparator overflow#10581

Open
Aias00 wants to merge 5 commits into
apache:developfrom
Aias00:fix/static-topic-epoch-comparator
Open

[ISSUE #10580] Fix static topic epoch comparator overflow#10581
Aias00 wants to merge 5 commits into
apache:developfrom
Aias00:fix/static-topic-epoch-comparator

Conversation

@Aias00

@Aias00 Aias00 commented Jul 2, 2026

Copy link
Copy Markdown

What changed

Fix static topic epoch ordering to avoid integer overflow when comparing mapping epoch values.

Two comparators previously subtracted long epoch values and cast the result to int:

(int) (o2.getValue().getEpoch() - o1.getValue().getEpoch())

and:

(int) (o2.getEpoch() - o1.getEpoch())

When the epoch difference exceeds Integer.MAX_VALUE, the comparator can overflow and process stale mapping metadata before newer metadata.

This PR replaces subtraction-based comparisons with Long.compare(...) in:

  • ClientMetadata.topicRouteData2EndpointsForStaticTopic(...)
  • TopicQueueMappingUtils.checkAndBuildMappingItems(...)

Fixes #10580

Why

Static topic mapping epochs fence old metadata. Ordering by epoch must remain correct even when epoch gaps are larger than Integer.MAX_VALUE, otherwise route building or duplicate queue replacement can prefer stale mapping data.

Tests

Added regression coverage for large epoch gaps:

  • ClientMetadataTest: verifies routing resolves to the latest epoch broker mapping
  • TopicQueueMappingUtilsTest: verifies duplicate global queue replacement keeps the latest epoch mapping

Verified locally:

mvn -q -pl remoting -DskipTests=false -Dtest=ClientMetadataTest,TopicQueueMappingUtilsTest -Djacoco.skip=true test
mvn -q -pl remoting -DskipTests compile -Dspotbugs.skip=true -Dcheckstyle.skip=true
mvn -q -DskipTests compile -Dspotbugs.skip=true -Dcheckstyle.skip=true
git diff --check

Notes

The change keeps the existing ordering semantics: higher epoch first. It only replaces unsafe subtraction with Long.compare(...).

Static topic routing and mapping validation sorted epochs by subtracting long values and casting the result to int. Large epoch gaps can overflow the comparator result and allow stale mapping metadata to be processed before newer metadata.

Constraint: Preserve existing ordering semantics: higher epoch first

Rejected: Keep subtraction comparator | unsafe for long epoch deltas greater than Integer.MAX_VALUE

Confidence: high

Scope-risk: narrow

Tested: mvn -q -pl remoting -DskipTests=false -Dtest=ClientMetadataTest,TopicQueueMappingUtilsTest -Djacoco.skip=true test

Tested: mvn -q -pl remoting -DskipTests compile -Dspotbugs.skip=true -Dcheckstyle.skip=true

Tested: mvn -q -DskipTests compile -Dspotbugs.skip=true -Dcheckstyle.skip=true

Tested: git diff --check

Related: apache#10580
Copilot AI review requested due to automatic review settings July 2, 2026 13:52

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.

Fixes static-topic epoch ordering by replacing subtraction-based long comparisons (that could overflow when cast to int) with Long.compare(...), and adds regression tests to cover large epoch gaps.

Changes:

  • Replace overflow-prone epoch comparator implementations with Long.compare(...) (higher epoch first).
  • Add regression tests to ensure the latest-epoch mapping wins when epochs differ by more than Integer.MAX_VALUE.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
remoting/src/main/java/org/apache/rocketmq/remoting/rpc/ClientMetadata.java Uses Long.compare to correctly sort mapping infos by epoch without overflow.
remoting/src/main/java/org/apache/rocketmq/remoting/protocol/statictopic/TopicQueueMappingUtils.java Uses Long.compare to correctly sort mapping details by epoch without overflow.
remoting/src/test/java/org/apache/rocketmq/remoting/rpc/ClientMetadataTest.java Adds regression test for routing selection with very large epoch gaps.
remoting/src/test/java/org/apache/rocketmq/remoting/protocol/statictopic/TopicQueueMappingUtilsTest.java Adds regression test ensuring duplicate replacement keeps the latest-epoch mapping.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread remoting/src/main/java/org/apache/rocketmq/remoting/rpc/ClientMetadata.java Outdated
Aias00 added 3 commits July 2, 2026 22:22
The ClientMetadata regression test now uses LinkedHashMap so the old epoch mapping is processed before the new epoch mapping deterministically. This makes the overflow scenario reproducible under the previous subtraction comparator instead of depending on HashMap iteration order.

Constraint: Address PR review feedback without changing production semantics

Rejected: Keep HashMap in regression setup | iteration order is unspecified and can hide the old bug

Confidence: high

Scope-risk: narrow

Tested: mvn -q -pl remoting -DskipTests=false -Dtest=ClientMetadataTest,TopicQueueMappingUtilsTest -Djacoco.skip=true test

Tested: mvn -q -pl remoting -DskipTests compile -Dspotbugs.skip=true -Dcheckstyle.skip=true

Tested: git diff --check

Related: apache#10580
Use Comparator.comparingLong(...).reversed() for static topic epoch sorting so the higher-epoch-first order is explicit while keeping the overflow-safe comparison introduced for apache#10580.

Constraint: Preserve the current higher-epoch-first semantics

Rejected: Keep argument-swapped Long.compare | correct but less self-descriptive per PR review

Confidence: high

Scope-risk: narrow

Tested: mvn -q -pl remoting -DskipTests=false -Dtest=ClientMetadataTest,TopicQueueMappingUtilsTest -Djacoco.skip=true test

Tested: mvn -q -pl remoting -DskipTests compile -Dspotbugs.skip=true -Dcheckstyle.skip=true

Tested: git diff --check

Related: apache#10580
ClientMetadata tracked maxTotalNumOfEpoch but never updated it, so route construction could use a larger totalQueues value from stale lower-epoch metadata. Update the tracked epoch when a newer mapping is seen and only compare totalQueues within the latest epoch.

Constraint: Stay within static-topic epoch routing behavior touched by apache#10580

Rejected: Leave maxTotalNumOfEpoch unchanged | stale epoch metadata can create extra logical queues

Confidence: high

Scope-risk: narrow

Tested: mvn -q -pl remoting -DskipTests=false -Dtest=ClientMetadataTest,TopicQueueMappingUtilsTest -Djacoco.skip=true test

Tested: mvn -q -pl remoting -DskipTests compile -Dspotbugs.skip=true -Dcheckstyle.skip=true

Tested: mvn -q -DskipTests compile -Dspotbugs.skip=true -Dcheckstyle.skip=true

Tested: git diff --check

Related: apache#10580
@codecov-commenter

codecov-commenter commented Jul 2, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 48.19%. Comparing base (8242c1e) to head (20696af).

Files with missing lines Patch % Lines
...g/apache/rocketmq/remoting/rpc/ClientMetadata.java 83.33% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             develop   #10581      +/-   ##
=============================================
- Coverage      48.26%   48.19%   -0.07%     
+ Complexity     13433    13426       -7     
=============================================
  Files           1378     1378              
  Lines         100817   100821       +4     
  Branches       13040    13041       +1     
=============================================
- Hits           48660    48593      -67     
- Misses         46211    46266      +55     
- Partials        5946     5962      +16     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Codecov reported partial patch coverage for the ClientMetadata maxTotalNums branch added in the static topic epoch fix. The existing tests covered lower-epoch entries being ignored but did not exercise the same-epoch path that expands totalQueues.

Add a focused regression test with two mappings in the latest epoch and different totalQueues so the same-epoch maxTotalNums branch is covered.

Constraint: Keep the production fix unchanged; this only addresses missing branch coverage

Confidence: high

Scope-risk: narrow

Tested: mvn -pl remoting -Dtest=ClientMetadataTest,TopicQueueMappingUtilsTest test -Dspotbugs.skip=true -Dcheckstyle.skip=true

Tested: git diff --check
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Static topic routing may use stale epoch due to comparator overflow

3 participants