[ISSUE #10580] Fix static topic epoch comparator overflow#10581
Conversation
Static topic routing and mapping validation sorted epochs by subtracting long values and casting the result to int. Large epoch gaps can overflow the comparator result and allow stale mapping metadata to be processed before newer metadata. Constraint: Preserve existing ordering semantics: higher epoch first Rejected: Keep subtraction comparator | unsafe for long epoch deltas greater than Integer.MAX_VALUE Confidence: high Scope-risk: narrow Tested: mvn -q -pl remoting -DskipTests=false -Dtest=ClientMetadataTest,TopicQueueMappingUtilsTest -Djacoco.skip=true test Tested: mvn -q -pl remoting -DskipTests compile -Dspotbugs.skip=true -Dcheckstyle.skip=true Tested: mvn -q -DskipTests compile -Dspotbugs.skip=true -Dcheckstyle.skip=true Tested: git diff --check Related: apache#10580
There was a problem hiding this comment.
Pull request overview
Note
Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.
Fixes static-topic epoch ordering by replacing subtraction-based long comparisons (that could overflow when cast to int) with Long.compare(...), and adds regression tests to cover large epoch gaps.
Changes:
- Replace overflow-prone epoch comparator implementations with
Long.compare(...)(higher epoch first). - Add regression tests to ensure the latest-epoch mapping wins when epochs differ by more than
Integer.MAX_VALUE.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| remoting/src/main/java/org/apache/rocketmq/remoting/rpc/ClientMetadata.java | Uses Long.compare to correctly sort mapping infos by epoch without overflow. |
| remoting/src/main/java/org/apache/rocketmq/remoting/protocol/statictopic/TopicQueueMappingUtils.java | Uses Long.compare to correctly sort mapping details by epoch without overflow. |
| remoting/src/test/java/org/apache/rocketmq/remoting/rpc/ClientMetadataTest.java | Adds regression test for routing selection with very large epoch gaps. |
| remoting/src/test/java/org/apache/rocketmq/remoting/protocol/statictopic/TopicQueueMappingUtilsTest.java | Adds regression test ensuring duplicate replacement keeps the latest-epoch mapping. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The ClientMetadata regression test now uses LinkedHashMap so the old epoch mapping is processed before the new epoch mapping deterministically. This makes the overflow scenario reproducible under the previous subtraction comparator instead of depending on HashMap iteration order. Constraint: Address PR review feedback without changing production semantics Rejected: Keep HashMap in regression setup | iteration order is unspecified and can hide the old bug Confidence: high Scope-risk: narrow Tested: mvn -q -pl remoting -DskipTests=false -Dtest=ClientMetadataTest,TopicQueueMappingUtilsTest -Djacoco.skip=true test Tested: mvn -q -pl remoting -DskipTests compile -Dspotbugs.skip=true -Dcheckstyle.skip=true Tested: git diff --check Related: apache#10580
Use Comparator.comparingLong(...).reversed() for static topic epoch sorting so the higher-epoch-first order is explicit while keeping the overflow-safe comparison introduced for apache#10580. Constraint: Preserve the current higher-epoch-first semantics Rejected: Keep argument-swapped Long.compare | correct but less self-descriptive per PR review Confidence: high Scope-risk: narrow Tested: mvn -q -pl remoting -DskipTests=false -Dtest=ClientMetadataTest,TopicQueueMappingUtilsTest -Djacoco.skip=true test Tested: mvn -q -pl remoting -DskipTests compile -Dspotbugs.skip=true -Dcheckstyle.skip=true Tested: git diff --check Related: apache#10580
ClientMetadata tracked maxTotalNumOfEpoch but never updated it, so route construction could use a larger totalQueues value from stale lower-epoch metadata. Update the tracked epoch when a newer mapping is seen and only compare totalQueues within the latest epoch. Constraint: Stay within static-topic epoch routing behavior touched by apache#10580 Rejected: Leave maxTotalNumOfEpoch unchanged | stale epoch metadata can create extra logical queues Confidence: high Scope-risk: narrow Tested: mvn -q -pl remoting -DskipTests=false -Dtest=ClientMetadataTest,TopicQueueMappingUtilsTest -Djacoco.skip=true test Tested: mvn -q -pl remoting -DskipTests compile -Dspotbugs.skip=true -Dcheckstyle.skip=true Tested: mvn -q -DskipTests compile -Dspotbugs.skip=true -Dcheckstyle.skip=true Tested: git diff --check Related: apache#10580
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #10581 +/- ##
=============================================
- Coverage 48.26% 48.19% -0.07%
+ Complexity 13433 13426 -7
=============================================
Files 1378 1378
Lines 100817 100821 +4
Branches 13040 13041 +1
=============================================
- Hits 48660 48593 -67
- Misses 46211 46266 +55
- Partials 5946 5962 +16 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Codecov reported partial patch coverage for the ClientMetadata maxTotalNums branch added in the static topic epoch fix. The existing tests covered lower-epoch entries being ignored but did not exercise the same-epoch path that expands totalQueues. Add a focused regression test with two mappings in the latest epoch and different totalQueues so the same-epoch maxTotalNums branch is covered. Constraint: Keep the production fix unchanged; this only addresses missing branch coverage Confidence: high Scope-risk: narrow Tested: mvn -pl remoting -Dtest=ClientMetadataTest,TopicQueueMappingUtilsTest test -Dspotbugs.skip=true -Dcheckstyle.skip=true Tested: git diff --check
What changed
Fix static topic epoch ordering to avoid integer overflow when comparing mapping epoch values.
Two comparators previously subtracted
longepoch values and cast the result toint:and:
When the epoch difference exceeds
Integer.MAX_VALUE, the comparator can overflow and process stale mapping metadata before newer metadata.This PR replaces subtraction-based comparisons with
Long.compare(...)in:ClientMetadata.topicRouteData2EndpointsForStaticTopic(...)TopicQueueMappingUtils.checkAndBuildMappingItems(...)Fixes #10580
Why
Static topic mapping epochs fence old metadata. Ordering by epoch must remain correct even when epoch gaps are larger than
Integer.MAX_VALUE, otherwise route building or duplicate queue replacement can prefer stale mapping data.Tests
Added regression coverage for large epoch gaps:
ClientMetadataTest: verifies routing resolves to the latest epoch broker mappingTopicQueueMappingUtilsTest: verifies duplicate global queue replacement keeps the latest epoch mappingVerified locally:
mvn -q -pl remoting -DskipTests=false -Dtest=ClientMetadataTest,TopicQueueMappingUtilsTest -Djacoco.skip=true test mvn -q -pl remoting -DskipTests compile -Dspotbugs.skip=true -Dcheckstyle.skip=true mvn -q -DskipTests compile -Dspotbugs.skip=true -Dcheckstyle.skip=true git diff --checkNotes
The change keeps the existing ordering semantics: higher epoch first. It only replaces unsafe subtraction with
Long.compare(...).