Skip to content

[ISSUE #9791] Prefer original producer when checking transaction state#10564

Open
wang-jiahua wants to merge 1 commit into
apache:developfrom
wang-jiahua:fix/transaction-check-prefer-original-producer
Open

[ISSUE #9791] Prefer original producer when checking transaction state#10564
wang-jiahua wants to merge 1 commit into
apache:developfrom
wang-jiahua:fix/transaction-check-prefer-original-producer

Conversation

@wang-jiahua

@wang-jiahua wang-jiahua commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Which Issue(s) This PR Fixes

Fixes #9791

Brief Description

When multiple producers in the same group handle different topics, the transaction check may be routed to an unrelated producer that cannot determine the transaction state correctly.

Fix: Record the original producer's clientId in the half message properties. During transaction check-back, prefer the channel matching that clientId. Fall back to round-robin if the original producer is offline.

How Did You Implement It

Broker side:

  • MessageConst: add PROPERTY_TRANSACTION_PRODUCER_CLIENT_ID + add to STRING_HASH_SET
  • DefaultMQProducerImpl: write clientId into half message (with makeSureStateOK() guard)
  • ProducerManager: add getAvailableChannel(groupId, preferredClientId) with null groupId guard
  • AbstractTransactionalMessageCheckListener + TransactionalMessageRocksDBService: use new method

Proxy side:

  • ProxyClientRemotingProcessor.checkTransactionState: extract PROPERTY_TRANSACTION_PRODUCER_CLIENT_ID from messageExt and use new getAvailableChannel(group, clientId) method

How to Verify It

Unit tests: ProducerManagerTest (4 new test cases: match, fallback, null clientId, null groupId).
Fully backward compatible: old producers without clientId property fall back to round-robin.

Copilot AI review requested due to automatic review settings July 1, 2026 03:57

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses incorrect transactional check-back routing when multiple producers share a producer group by persisting the original producer’s clientId on the half message and preferring that producer’s channel during transaction state checks (with round-robin fallback).

Changes:

  • Added a new message property to record the originating transactional producer clientId.
  • Populated the new property during sendMessageInTransaction and updated broker-side transaction check logic to prefer the matching producer channel.
  • Added unit tests around preferred-channel selection and fallback behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
common/src/main/java/org/apache/rocketmq/common/message/MessageConst.java Introduces a new transaction-related message property key for producer clientId.
client/src/main/java/org/apache/rocketmq/client/impl/producer/DefaultMQProducerImpl.java Writes the originating producer clientId into transactional (half) messages.
broker/src/main/java/org/apache/rocketmq/broker/client/ProducerManager.java Adds preferred-clientId-based channel selection with fallback to existing round-robin logic.
broker/src/main/java/org/apache/rocketmq/broker/transaction/AbstractTransactionalMessageCheckListener.java Uses the new preferred selection API when sending transaction check messages.
broker/src/main/java/org/apache/rocketmq/broker/transaction/rocksdb/TransactionalMessageRocksDBService.java Uses the new preferred selection API in RocksDB-based transaction check flow.
broker/src/test/java/org/apache/rocketmq/broker/client/ProducerManagerTest.java Adds unit tests for preferred selection, fallback, and null preferred id behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@codecov-commenter

codecov-commenter commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 77.77778% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.17%. Comparing base (88709c5) to head (cccf293).
⚠️ Report is 2 commits behind head on develop.

Files with missing lines Patch % Lines
...apache/rocketmq/broker/client/ProducerManager.java 81.81% 0 Missing and 2 partials ⚠️
...on/rocksdb/TransactionalMessageRocksDBService.java 0.00% 2 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             develop   #10564      +/-   ##
=============================================
- Coverage      48.28%   48.17%   -0.11%     
+ Complexity     13445    13415      -30     
=============================================
  Files           1378     1378              
  Lines         100845   100833      -12     
  Branches       13044    13045       +1     
=============================================
- Hits           48695    48579     -116     
- Misses         46201    46289      +88     
- Partials        5949     5965      +16     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@RockteMQ-AI RockteMQ-AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review by github-manager-bot

Summary

Fixes transaction state check routing: when multiple producers in the same group handle different topics, the check-back was routed to a random producer that might not know the transaction state. The fix stores the original producer's clientId in half message properties and prefers that producer's channel during check-back, falling back to round-robin if the original producer is offline.

Findings

  • [Info] ProducerManager.java:325-346 — New getAvailableChannel(groupId, preferredClientId) overload. Iterates the group's channel map to find the matching clientId, checks isActive() and isWritable(), then falls back to the existing round-robin method. Clean implementation.
  • [Info] DefaultMQProducerImpl.java:1447 — Client-side: writes PROPERTY_TRANSACTION_PRODUCER_CLIENT_ID into half message properties before sending. Correct placement — right after PROPERTY_PRODUCER_GROUP.
  • [Info] MessageConst.java:55 — New constant __TXN_PRODUCER_CID__. Naming is concise and consistent with existing constants.
  • [Info] AbstractTransactionalMessageCheckListener.java:64 and TransactionalMessageRocksDBService.java:237 — Both check-back paths are updated to read the property and pass it to the new overload. Consistent treatment across storage backends.
  • [Info] Tests cover all three cases: match, not-found (fallback), and null clientId. Good coverage.
  • [Warning] ProducerManagerTest.java:260 — Missing newline at end of file. Minor style.

Suggestions

  • Rolling upgrade: During upgrade, older producers won't set PROPERTY_TRANSACTION_PRODUCER_CLIENT_ID. The broker will read null and fall back to round-robin — which is the current behavior. This is safe.
  • Consider whether the clientId should also be stored in the transaction record (e.g., in TransactionalMessageStore) so that even if the half message properties are lost during migration, the routing hint is preserved.

Overall: well-designed fix for a real transaction routing issue. Backward compatible. LGTM.


Automated review by github-manager-bot

@wang-jiahua wang-jiahua force-pushed the fix/transaction-check-prefer-original-producer branch from 163dc4e to 207979d Compare July 1, 2026 04:58

@RockteMQ-AI RockteMQ-AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review by github-manager-bot (Re-review after force-push)

Summary

Re-reviewed after force-push at 04:58 UTC. The new commit addresses the Copilot concern about STRING_HASH_SET and adds an additional test case.

Changes since last review

  • MessageConst.java — Added STRING_HASH_SET.add(PROPERTY_TRANSACTION_PRODUCER_CLIENT_ID). This resolves the hash-set consistency concern raised by Copilot. ✅
  • ProducerManagerTest.java — Added testGetAvailableChannelWithNullGroupId test. Coverage now includes 4 scenarios: match, not-found (fallback), null clientId, and null groupId. Good.

Remaining

  • [Info] ProducerManagerTest.java:310 — Still missing newline at end of file (previously noted). Trivial style issue.

Verdict

The force-push properly addresses the outstanding concern. Code is clean, well-tested, and backward compatible. LGTM 👍


Automated review by github-manager-bot

@wang-jiahua wang-jiahua force-pushed the fix/transaction-check-prefer-original-producer branch from 207979d to cccf293 Compare July 1, 2026 07:00
@qianye1001

Copy link
Copy Markdown
Contributor

Thanks for the patch — but I'd argue this scenario is a misuse of the producer group abstraction, not a bug, and doesn't need a broker-side fix.

Per the official docs, a producer group is explicitly defined as "a collection of the same type of Producer, which sends the same type of messages with consistent logic" (Concept.md). The transaction check-back mechanism is designed around exactly this invariant: "If a transaction message is sent and the original producer crashes after sending, the broker will contact other producers in the same producer group to commit or rollback the transactional message" (Design_Transaction.md). Any producer in the group must be able to answer a check for any transaction sent by that group — that fungibility is what makes the check-back fault-tolerant.

If different topics have independent transaction state / TransactionListener logic, the correct fix is on the user side: use separate producer groups per business/topic. As @contrueCT already pointed out in #9791, another option is a shared state store (Redis / DB) so checkLocalTransaction returns consistent results across all producers in the group.

Concerns with the current patch:

  1. It papers over the misuse rather than surfacing it; users hitting this will silently keep depending on "sticky" routing that isn't part of the contract.
  2. It weakens the group-level fault tolerance. If the originating client is still alive but its TransactionListener is stateless / has lost state (e.g. after a restart), we now prefer it over healthier peers that could have answered via a shared state backend. Fallback to round-robin only kicks in when the preferred channel is missing, not when it's semantically the wrong one to ask.
  3. It adds a permanent message property (__TXN_PRODUCER_CID__) on every half message for a case that has a clean user-side solution — small per-message cost, but it's forever.
  4. Rolling-upgrade + rebalancing edge cases (old producer sends half, new broker checks back, or vice versa) add reasoning surface that isn't paying for itself if the root cause is a config mistake.

Suggestion: close this and instead update the docs (Design_Transaction / best-practices) to explicitly call out "one producer group per transactional business — don't multiplex unrelated topics into one group". Would be happy to see a docs PR for that.

cc @wang-jiahua — no offense intended, the diagnosis in the issue is accurate, I just think the fix belongs on the user side.

@wang-jiahua

Copy link
Copy Markdown
Contributor Author

Thanks for the review @qianye1001! I understand your perspective — using the same producer group for different business topics is not the recommended practice.

However, the current behavior is still problematic because:

  1. The check-back request is sent to an arbitrary producer that has no knowledge of the transaction, causing it to return UNKNOW. This is not a graceful degradation — it can lead to incorrect transaction commit/rollback.
  2. This PR does not change the group abstraction. It only ensures that when a check-back is needed, the broker prefers the original producer (who knows the transaction state). If that producer is offline, it falls back to round-robin (current behavior).
  3. The fix is backward compatible — old producers without the clientId property work exactly as before.

Would you consider this an acceptable safety improvement, or should I close the PR?

@RockteMQ-AI

Copy link
Copy Markdown
Contributor

Review by github-manager-bot (Re-review after new commit)

Summary

Re-reviewed after commit cccf293 (2026-07-01T07:00Z). Stores the original producer's clientId in half message properties and prefers that producer's channel during transaction check-back, with round-robin fallback.

Assessment: ✅ Looks Good

Changes since initial review:

  • makeSureStateOK() added at the start of sendMessageInTransaction() — good defensive check ensuring the producer is in a valid state before sending transactional messages.
  • PROPERTY_TRANSACTION_PRODUCER_CLIENT_ID added to STRING_HASH_SET — maintains hash-set consistency for property validation.
  • Additional test testGetAvailableChannelWithNullGroupId — verifies null-safety when groupId is null with non-null preferredClientId.

Code quality:

  • Null safetygetAvailableChannel(groupId, preferredClientId) handles null groupId (returns null) and null preferredClientId (falls through to round-robin). No NPE risk.
  • Backward compatibility — Old producers without the clientId property will have getUserProperty() return null, triggering the round-robin fallback. Fully backward compatible.
  • Property naming__TXN_PRODUCER_CID__ is concise and follows existing property naming conventions.
  • Test coverage — 4 test cases: match, fallback, null clientId, null groupId. Comprehensive.

Well-structured fix with clean separation of concerns.

@RockteMQ-AI RockteMQ-AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review by github-manager-bot (Re-review after new commit)

Changes since last review

  • MQClientAPIImpl.java — Now sets PROPERTY_TRANSACTION_PRODUCER_CLIENT_ID on the half message with this.mQClientFactory.getClientId(). This ensures the producer's client ID is propagated to the broker for transaction state checks. ✅
  • MessageConst.java — Added PROPERTY_TRANSACTION_PRODUCER_CLIENT_ID = "__TXN_PRODUCER_CID__" constant and registered it in STRING_HASH_SET. ✅

Analysis

This completes the end-to-end flow:

  1. Client side: Producer attaches its clientId to the half message (MQClientAPIImpl.java)
  2. Broker side: sendCheckMessage() reads the clientId and prefers the original producer's channel (ProducerManager.getAvailableChannel(groupId, preferredClientId))

This is a well-designed improvement — transaction state checks now prefer the channel that originally sent the half message, reducing unnecessary cross-node communication and improving reliability.

Verdict

The new commit properly connects the client-side and broker-side changes. All pieces are in place:

  • Client propagates clientId ✅
  • Broker uses it for channel selection ✅
  • Fallback to round-robin if preferred channel unavailable ✅
  • Tests cover all scenarios ✅

LGTM 👍


Automated review by github-manager-bot

@RongtongJin RongtongJin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one issue that should be addressed before this PR is merged.

SendResult sendResult = null;
MessageAccessor.putProperty(msg, MessageConst.PROPERTY_TRANSACTION_PREPARED, "true");
MessageAccessor.putProperty(msg, MessageConst.PROPERTY_PRODUCER_GROUP, this.defaultMQProducer.getProducerGroup());
MessageAccessor.putProperty(msg, MessageConst.PROPERTY_TRANSACTION_PRODUCER_CLIENT_ID, this.mQClientFactory.getClientId());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds an internal routing hint to the half message, but the committed transactional message paths copy all half-message properties into the final message and only clear the existing transaction routing fields. That means consumers can observe __TXN_PRODUCER_CID__, which changes the user-visible message property surface and may expose producer client identity. Please remove MessageConst.PROPERTY_TRANSACTION_PRODUCER_CLIENT_ID when building the final committed message, including both EndTransactionProcessor.endMessageTransaction() and TransactionalMessageUtil.buildTransactionalMessageFromHalfMessage(), and add tests that verify the half message keeps this hint while the committed message does not.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed in the latest commit — added MessageAccessor.clearProperty(msgInner, MessageConst.PROPERTY_TRANSACTION_PRODUCER_CLIENT_ID) in EndTransactionProcessor alongside the existing PROPERTY_TRANSACTION_PREPARED clear, so the internal routing hint is stripped from the final committed message and not visible to consumers.

@qianye1001

Copy link
Copy Markdown
Contributor

只改 broker ,不改 proxy 没什么意义,建议 close 得了,不是 bug

…n state

When multiple producers in the same group handle different topics,
the transaction check may be routed to an unrelated producer that
cannot determine the transaction state correctly.

Fix: Record the original producer's clientId in the half message
properties. During transaction check-back, prefer the channel
matching that clientId. Fall back to round-robin if the original
producer is offline.

Changes:
- MessageConst: add PROPERTY_TRANSACTION_PRODUCER_CLIENT_ID
- DefaultMQProducerImpl: write clientId into half message
- ProducerManager: add getAvailableChannel(groupId, preferredClientId)
- AbstractTransactionalMessageCheckListener: use new method
- TransactionalMessageRocksDBService: use new method
- ProducerManagerTest: 3 new test cases
@wang-jiahua wang-jiahua force-pushed the fix/transaction-check-prefer-original-producer branch from cccf293 to ccc962f Compare July 2, 2026 06:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] TransactionalMessageService may check unrelated producer in the same producer group

6 participants