proxy-client: fix TSan data race in clientDestroy#286
Conversation
clientDestroy() reads m_context.connection to decide whether to use MP_LOG vs KJ_LOG, but Connection::~Connection() can set it to null concurrently from the event loop thread (via the disconnect_cb sync cleanup callback) while the destructor runs on an async cleanup thread, causing a TSan-reported data race. The race is exposed by the test added in commit 90be835 ("test: regression for ~ProxyClient destroy after peer disconnect"). The KJ_LOG fallback was only needed before commit 315ff53 ("refactor: Add ProxyContext EventLoop* member"), when logging required going through connection->m_loop. Since that commit, m_context.loop is a direct EventLoopRef that is always valid regardless of whether m_context.connection is null. The KJ_LOG branch is now dead code, so drop it and the connection check entirely. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please copy-paste ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
gen.cpp uses IWYU pragma: keep for include which appears to be necessary for newer versions of capnproto (1.4.0 but not 1.1.0)
|
Added 1 commits 25bb3e6 -> 0963634 ( Updated 0963634 -> 039e5ac ( |
|
Concept ACK, |
There was a problem hiding this comment.
ACK 039e5ac
Built and ran mptest with -fsanitize=thread, all 11 tests pass cleanly.
The fix makes sense to me. Instead of adding synchronization for what constitutes a log line, we drop the branch and unconditionally log via MP_LOG.
|
Thanks for the reviews. I also went back and tried to debug why CI did not catch the TSAN error. It turns out to be because an interaction between #273 and #218. The new test added in #273 does not trigger the TSAN error deterministically. When it is run standalone without #218 there's usually no error, although I did see a TSAN error once running the test maybe 10 times. But when #273 and #218 are combined, the test triggers TSAN failures most of the time. This seems to happen because #218 changes |
|
Started to merge this but I saw some more IWYU errors locally that seem better to fix. I think the errors don't happen in CI because nixpkgs are older and use an older version of capnproto. I also ran IWYU locally before my last push, but didn't see these errors because I was using a combined branch with #287 with example files removed. Errors look like: Updated 039e5ac -> 90982f7 ( |
#273 added a new test which exposed a longstanding race condition in
clientDestroylogging code, causing the sanitize CI job to now fail with a TSAN error. Fix the race by simplifying the logging code.Details about the fix are in the commit message. TSAN error looked like
https://github.com/bitcoin-core/libmultiprocess/actions/runs/26851297744/job/79183796152