Skip to content

fix: detect unsupported transport multiaddrs before dialing (issue #391)#1351

Open
paschal533 wants to merge 3 commits into
libp2p:mainfrom
paschal533:fix/issue-391-can-dial
Open

fix: detect unsupported transport multiaddrs before dialing (issue #391)#1351
paschal533 wants to merge 3 commits into
libp2p:mainfrom
paschal533:fix/issue-391-can-dial

Conversation

@paschal533

Copy link
Copy Markdown
Contributor

Fixes #391.

What was happening

When the peerstore contained multiaddrs that the registered transport doesn't handle — for example, /udp/9090 or /udp/9090/quic addresses when only TCP is registered — dial_peer() would call transport.dial() for each of those addresses anyway. That eventually reached TCP._dial_resolved(), which called maddr.value_for_protocol("tcp") without guarding against the ProtocolLookupError it raises on non-TCP multiaddrs. The exception propagated all the way up uncaught, crashing the caller with an error that gives no indication of what actually went wrong.

Making it worse: the default RetryConfig has max_retries=3 and initial_delay=0.1s with backoff_multiplier=2.0. So for each unsupported address, the stack would sleep ~0.1s + ~0.2s + ~0.4s = ~700ms before finally raising the wrong exception. A peer with two bad addresses (UDP + QUIC) would waste ~1.4 seconds before crashing.

The fix

Three small, self-contained changes:

ITransport gets a can_dial(maddr) method (default True) so the interface can express whether a transport handles a given multiaddr, without breaking existing custom transports that don't implement it.

TCP.can_dial() returns False for any multiaddr without a tcp component. The check is a simple protocol lookup, ~30 µs — negligible compared to the cost of an actual TCP handshake (~1–100 ms).

Swarm.dial_peer() filters addresses through can_dial() before entering the dial loop. If no address passes, it raises SwarmException("No supported transport for any address of peer ...") immediately, before touching the transport or sleeping.

TCP._dial_resolved() also now wraps the value_for_protocol("tcp") call in a try/except and re-raises as OpenConnectionError, consistent with how listen() already handled it. This is a defence-in-depth change — the can_dial() filter should prevent us from ever reaching this path with a bad address, but if something slips through it will still produce a clean error.

Note: QUIC and WebSocket already had can_dial() methods on their transport classes, but nothing was calling them. This PR formalises the pattern and wires it into Swarm.

What it looks like now

Before:

ProtocolLookupError: <no tcp in multiaddr>   # wrong layer, no context

After:

SwarmException: No supported transport for any address of peer <id>: [/ip4/127.0.0.1/udp/9090]

Testing

Five new regression tests in tests/core/network/test_swarm.py, all verified in a clean Docker environment (Python 3.13.14 / Linux):

test_tcp_can_dial_returns_true_for_tcp_addr   PASSED
test_tcp_can_dial_returns_false_for_udp_addr  PASSED
test_tcp_can_dial_returns_false_for_quic_addr PASSED
test_swarm_skips_unsupported_multiaddrs       PASSED   ← regression test for #391
test_swarm_can_dial_filters_unsupported       PASSED

Dockerfile.issue391 and benchmarks/bench_issue391.py are included so the numbers can be reproduced:

can_dial() guard cost:          ~30 µs/call
dial_peer() before (no filter): ~620 µs/call  (no retry, 2 bad addrs)
dial_peer() after  (filtered):  ~154 µs/call
Speedup (no retry baseline):    4x

With default max_retries=3 the before-fix cost is ~1.4 s of sleep alone,
before raising ProtocolLookupError on a peer that simply has no TCP address.

Backwards compatibility

ITransport.can_dial() defaults to True, so any existing transport that doesn't implement it continues to work exactly as before.

Add ITransport.can_dial(maddr) with a default of True so existing
transports opt in without changes. TCP.can_dial() returns False for
any multiaddr that lacks a 'tcp' component (e.g. /udp/9090 or
/udp/9090/quic). Swarm.dial_peer() now filters peerstore addresses
through can_dial() before attempting a connection, raising a clear
SwarmException instead of a raw ProtocolLookupError when no supported
address exists.

Also fix TCP._dial_resolved() to catch ProtocolLookupError from
value_for_protocol('tcp') and re-raise as OpenConnectionError,
consistent with how the listen() path already handled it.

Closes libp2p#391
Dockerfile.issue391: builds a Python 3.13-slim image with all deps,
runs the five issue libp2p#391 regression tests, then the benchmark.

benchmarks/bench_issue391.py: measures the can_dial() filter on three
dimensions — synchronous overhead (~30 µs), single transport.dial()
with a bad addr (~67 µs), and full dial_peer() before vs after the fix
(~620 µs → ~154 µs, 4x speedup, with production retries the gap is
larger). All five regression tests pass in Docker:
  - test_tcp_can_dial_returns_true_for_tcp_addr  PASSED
  - test_tcp_can_dial_returns_false_for_udp_addr PASSED
  - test_tcp_can_dial_returns_false_for_quic_addr PASSED
  - test_swarm_skips_unsupported_multiaddrs       PASSED
  - test_swarm_can_dial_filters_unsupported       PASSED
- benchmarks/bench_issue391.py: remove unused variables (ruff F841),
  fix import order (ruff I001), convert plain f-strings, ruff format
- libp2p/network/swarm.py: ruff format (no logic changes)
- libp2p/transport/tcp/tcp.py: guard against port_str=None from
  value_for_protocol() to satisfy mypy (int(None) overload error)
- libp2p/transport/websocket/transport.py: change can_dial() from
  async def to def (no awaits inside) so it matches the sync ITransport
  base signature; update the one call-site to drop the await
- tests/core/network/test_swarm.py: use TCP() directly instead of
  swarm.transport — new_swarm() returns INetworkService which does not
  expose .transport in its type interface
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

network/swarm.py and transport/tcp is not smart enough to detect that peerstore multiaddress is not reachable

1 participant