Skip to content

proxy-client: fix TSan data race in clientDestroy#286

Open
ryanofsky wants to merge 2 commits into
bitcoin-core:masterfrom
ryanofsky:pr/drace
Open

proxy-client: fix TSan data race in clientDestroy#286
ryanofsky wants to merge 2 commits into
bitcoin-core:masterfrom
ryanofsky:pr/drace

Conversation

@ryanofsky
Copy link
Copy Markdown
Collaborator

@ryanofsky ryanofsky commented Jun 3, 2026

#273 added a new test which exposed a longstanding race condition in clientDestroy logging code, causing the sanitize CI job to now fail with a TSAN error. Fix the race by simplifying the logging code.

Details about the fix are in the commit message. TSAN error looked like

[ TEST ] test.cpp:437: Destroying ProxyClient<> with destroy method after peer disconnect
...
  Write of size 8 at 0x721400005020 by thread T24 (mutexes: write M0):
    #0 std::__1::__function::__func<mp::ProxyClientBase<mp::test::messages::FooCallback, mp::test::FooCallback>::ProxyClientBase(mp::test::messages::FooCallback::Client, mp::Connection*, bool)::{lambda()#1}, void ()>::operator()() /home/runner/work/libmultiprocess/libmultiprocess/include/mp/proxy-io.h:522 (mptest+0x242b20)
    #1 void mp::Unlock<mp::Lock, std::__1::function<void ()>&>(mp::Lock&, std::__1::function<void ()>&) /nix/store/6z6xj37ljxwrwd5w3rs0zyfkrxd48lzl-libcxx-21.1.2-dev/include/c++/v1/__functional/function.h:274 (mptest+0x35582d)
    #2 mp::Connection::~Connection() /home/runner/work/libmultiprocess/libmultiprocess/src/mp/proxy.cpp:150 (mptest+0x34e596)
    ...
    #11 mp::test::TestSetup::TestSetup(bool)::{lambda()#1}::operator()() const /home/runner/work/libmultiprocess/libmultiprocess/test/mp/test/test.cpp:112 (mptest+0x1504fe)

  Previous read of size 8 at 0x721400005020 by thread T26:
    #0 void mp::clientDestroy<mp::ProxyClient<mp::test::messages::FooCallback> >(mp::ProxyClient<mp::test::messages::FooCallback>&) /home/runner/work/libmultiprocess/libmultiprocess/include/mp/proxy-types.h:647 (mptest+0x342d2a)
     ...
    #21 std::__1::__function::__func<mp::ProxyServerBase<mp::test::messages::FooInterface, mp::test::FooImplementation>::~ProxyServerBase()::{lambda()#1}, void ()>::operator()() /nix/store/6z6xj37ljxwrwd5w3rs0zyfkrxd48lzl-libcxx-21.1.2-dev/include/c++/v1/__functional/function.h:174 (mptest+0x347c76)
     ...
    #23 void* std::__1::__thread_proxy[abi:ne210101]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, mp::EventLoop::startAsyncThread()::$_0> >(void*) /home/runner/work/libmultiprocess/libmultiprocess/src/mp/proxy.cpp:309 (mptest+0x353ef2)

  Thread T24 (tid=10032, running) created by main thread at:
    ...
    #4 mp::test::TestCase437::run() /home/runner/work/libmultiprocess/libmultiprocess/test/mp/test/test.cpp:449 (mptest+0x1459a7)

https://github.com/bitcoin-core/libmultiprocess/actions/runs/26851297744/job/79183796152

clientDestroy() reads m_context.connection to decide whether to use
MP_LOG vs KJ_LOG, but Connection::~Connection() can set it to null
concurrently from the event loop thread (via the disconnect_cb sync
cleanup callback) while the destructor runs on an async cleanup thread,
causing a TSan-reported data race. The race is exposed by the test added
in commit 90be835 ("test: regression for ~ProxyClient destroy after peer
disconnect").

The KJ_LOG fallback was only needed before commit 315ff53 ("refactor:
Add ProxyContext EventLoop* member"), when logging required going
through connection->m_loop. Since that commit, m_context.loop is a
direct EventLoopRef that is always valid regardless of whether
m_context.connection is null. The KJ_LOG branch is now dead code, so
drop it and the connection check entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@DrahtBot
Copy link
Copy Markdown

DrahtBot commented Jun 3, 2026

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type Reviewers
Concept ACK w0xlt
Stale ACK enirox001

If your review is incorrectly listed, please copy-paste <!--meta-tag:bot-skip--> into the comment that the bot should ignore.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #288 (Create support branch for CI scripts, documentation, and examples by ryanofsky)
  • #287 (Split repository into master (library source) and support (CI, docs, examples) branches. by ryanofsky)
  • #269 (proxy: add local connection limit to ListenConnections by enirox001)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

gen.cpp uses IWYU pragma: keep for include which appears to be necessary for
newer versions of capnproto (1.4.0 but not 1.1.0)
@ryanofsky
Copy link
Copy Markdown
Collaborator Author

ryanofsky commented Jun 4, 2026

@w0xlt
Copy link
Copy Markdown

w0xlt commented Jun 4, 2026

Concept ACK,

Copy link
Copy Markdown
Contributor

@enirox001 enirox001 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 039e5ac

Built and ran mptest with -fsanitize=thread, all 11 tests pass cleanly.

The fix makes sense to me. Instead of adding synchronization for what constitutes a log line, we drop the branch and unconditionally log via MP_LOG.

@ryanofsky
Copy link
Copy Markdown
Collaborator Author

Thanks for the reviews. I also went back and tried to debug why CI did not catch the TSAN error. It turns out to be because an interaction between #273 and #218. The new test added in #273 does not trigger the TSAN error deterministically. When it is run standalone without #218 there's usually no error, although I did see a TSAN error once running the test maybe 10 times. But when #273 and #218 are combined, the test triggers TSAN failures most of the time. This seems to happen because #218 changes clientDestroy to call CxxTypeName which can change thread timing.

@ryanofsky
Copy link
Copy Markdown
Collaborator Author

Started to merge this but I saw some more IWYU errors locally that seem better to fix. I think the errors don't happen in CI because nixpkgs are older and use an older version of capnproto. I also ran IWYU locally before my last push, but didn't see these errors because I was using a combined branch with #287 with example files removed. Errors look like:

/home/ryan/mp/build-default/example/calculator.capnp.proxy-client.c++ should add these lines:
#include <capnp/common.h>                    // for MessageSize
#include <string>                            // for basic_string

Updated 039e5ac -> 90982f7 (pr/drace.3 -> pr/drace.4, compare) touching up IWYU commit to fix more local IWYU errors

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants