fix: detect unsupported transport multiaddrs before dialing (issue #391)#1351
Open
paschal533 wants to merge 3 commits into
Open
fix: detect unsupported transport multiaddrs before dialing (issue #391)#1351paschal533 wants to merge 3 commits into
paschal533 wants to merge 3 commits into
Conversation
Add ITransport.can_dial(maddr) with a default of True so existing
transports opt in without changes. TCP.can_dial() returns False for
any multiaddr that lacks a 'tcp' component (e.g. /udp/9090 or
/udp/9090/quic). Swarm.dial_peer() now filters peerstore addresses
through can_dial() before attempting a connection, raising a clear
SwarmException instead of a raw ProtocolLookupError when no supported
address exists.
Also fix TCP._dial_resolved() to catch ProtocolLookupError from
value_for_protocol('tcp') and re-raise as OpenConnectionError,
consistent with how the listen() path already handled it.
Closes libp2p#391
Dockerfile.issue391: builds a Python 3.13-slim image with all deps, runs the five issue libp2p#391 regression tests, then the benchmark. benchmarks/bench_issue391.py: measures the can_dial() filter on three dimensions — synchronous overhead (~30 µs), single transport.dial() with a bad addr (~67 µs), and full dial_peer() before vs after the fix (~620 µs → ~154 µs, 4x speedup, with production retries the gap is larger). All five regression tests pass in Docker: - test_tcp_can_dial_returns_true_for_tcp_addr PASSED - test_tcp_can_dial_returns_false_for_udp_addr PASSED - test_tcp_can_dial_returns_false_for_quic_addr PASSED - test_swarm_skips_unsupported_multiaddrs PASSED - test_swarm_can_dial_filters_unsupported PASSED
- benchmarks/bench_issue391.py: remove unused variables (ruff F841), fix import order (ruff I001), convert plain f-strings, ruff format - libp2p/network/swarm.py: ruff format (no logic changes) - libp2p/transport/tcp/tcp.py: guard against port_str=None from value_for_protocol() to satisfy mypy (int(None) overload error) - libp2p/transport/websocket/transport.py: change can_dial() from async def to def (no awaits inside) so it matches the sync ITransport base signature; update the one call-site to drop the await - tests/core/network/test_swarm.py: use TCP() directly instead of swarm.transport — new_swarm() returns INetworkService which does not expose .transport in its type interface
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #391.
What was happening
When the peerstore contained multiaddrs that the registered transport doesn't handle — for example,
/udp/9090or/udp/9090/quicaddresses when only TCP is registered —dial_peer()would calltransport.dial()for each of those addresses anyway. That eventually reachedTCP._dial_resolved(), which calledmaddr.value_for_protocol("tcp")without guarding against theProtocolLookupErrorit raises on non-TCP multiaddrs. The exception propagated all the way up uncaught, crashing the caller with an error that gives no indication of what actually went wrong.Making it worse: the default
RetryConfighasmax_retries=3andinitial_delay=0.1swithbackoff_multiplier=2.0. So for each unsupported address, the stack would sleep ~0.1s + ~0.2s + ~0.4s = ~700ms before finally raising the wrong exception. A peer with two bad addresses (UDP + QUIC) would waste ~1.4 seconds before crashing.The fix
Three small, self-contained changes:
ITransportgets acan_dial(maddr)method (defaultTrue) so the interface can express whether a transport handles a given multiaddr, without breaking existing custom transports that don't implement it.TCP.can_dial()returnsFalsefor any multiaddr without atcpcomponent. The check is a simple protocol lookup, ~30 µs — negligible compared to the cost of an actual TCP handshake (~1–100 ms).Swarm.dial_peer()filters addresses throughcan_dial()before entering the dial loop. If no address passes, it raisesSwarmException("No supported transport for any address of peer ...")immediately, before touching the transport or sleeping.TCP._dial_resolved()also now wraps thevalue_for_protocol("tcp")call in a try/except and re-raises asOpenConnectionError, consistent with howlisten()already handled it. This is a defence-in-depth change — thecan_dial()filter should prevent us from ever reaching this path with a bad address, but if something slips through it will still produce a clean error.Note: QUIC and WebSocket already had
can_dial()methods on their transport classes, but nothing was calling them. This PR formalises the pattern and wires it intoSwarm.What it looks like now
Before:
After:
Testing
Five new regression tests in
tests/core/network/test_swarm.py, all verified in a clean Docker environment (Python 3.13.14 / Linux):Dockerfile.issue391andbenchmarks/bench_issue391.pyare included so the numbers can be reproduced:Backwards compatibility
ITransport.can_dial()defaults toTrue, so any existing transport that doesn't implement it continues to work exactly as before.