Skip to content

[ISSUE #10575] Fix race condition between scanResponseTable and processResponseCommand#10576

Open
wang-jiahua wants to merge 1 commit into
apache:developfrom
wang-jiahua:fix/response-table-race-condition
Open

[ISSUE #10575] Fix race condition between scanResponseTable and processResponseCommand#10576
wang-jiahua wants to merge 1 commit into
apache:developfrom
wang-jiahua:fix/response-table-race-condition

Conversation

@wang-jiahua

Copy link
Copy Markdown
Contributor

Which Issue(s) This PR Fixes

Fixes #10575

Brief Description

processResponseCommand used responseTable.get(opaque) + responseTable.remove(opaque) (non-atomic) to retrieve ResponseFuture. scanResponseTable used iterator.remove(). If the response arrived between scanResponseTable's remove and processResponseCommand's get, the response was logged as not matched any request and silently dropped.

Fix: Use ConcurrentHashMap.remove(key) (atomic) in both methods. Either processResponseCommand gets the future (success callback) or scanResponseTable gets it (timeout callback), but never both and never neither.

How Did You Implement It

processResponseCommand: Changed responseTable.get(opaque) + responseTable.remove(opaque) to a single responseTable.remove(opaque) call.

scanResponseTable: Changed it.remove() to responseTable.remove(next.getKey()) and added a null check on the return value.

How to Verify It

Reproduction: 256 threads, 50ms timeout, 1MB commitlog, single send thread pool. In 2 minutes, 776 not matched any request warnings appeared in client remoting.log — all with code=0 (SEND_OK) and valid msgId/queueOffset. After applying this fix, the warnings should disappear.

Existing tests: NettyRemotingAbstractTest (5 tests) all pass.

Copilot AI review requested due to automatic review settings July 2, 2026 09:36

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

… processResponseCommand

processResponseCommand used get+remove (non-atomic) to retrieve ResponseFuture
from responseTable. scanResponseTable used iterator.remove(). If the response
arrived between scanResponseTable's remove and processResponseCommand's get,
the response would be logged as 'not matched any request' and silently dropped.
The callback would fire with timeout instead of success, even though the broker
had successfully processed the request.

Fix: Use ConcurrentHashMap.remove(key) in both methods. This is atomic:
either processResponseCommand gets the future (and executes success callback),
or scanResponseTable gets it (and executes timeout callback), but never both
and never neither.
@wang-jiahua wang-jiahua force-pushed the fix/response-table-race-condition branch from 3c89a5a to c100a62 Compare July 2, 2026 09:39
@codecov-commenter

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 48.18%. Comparing base (8242c1e) to head (c100a62).

Files with missing lines Patch % Lines
...rocketmq/remoting/netty/NettyRemotingAbstract.java 83.33% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             develop   #10576      +/-   ##
=============================================
- Coverage      48.26%   48.18%   -0.09%     
+ Complexity     13433    13415      -18     
=============================================
  Files           1378     1378              
  Lines         100817   100817              
  Branches       13040    13041       +1     
=============================================
- Hits           48660    48579      -81     
- Misses         46211    46274      +63     
- Partials        5946     5964      +18     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Race condition between scanResponseTable and processResponseCommand causes response loss

3 participants