reverse_tunnel drain-aware HCM: observe peer GOAWAY and dial replacement#5
Open
ivpr wants to merge 2 commits into
Open
reverse_tunnel drain-aware HCM: observe peer GOAWAY and dial replacement#5ivpr wants to merge 2 commits into
ivpr wants to merge 2 commits into
Conversation
This was referenced Jun 11, 2026
737bc27 to
605bdd2
Compare
07d7ee8 to
387f079
Compare
605bdd2 to
6083fea
Compare
387f079 to
6efcae6
Compare
6083fea to
f80753a
Compare
6efcae6 to
56926c6
Compare
f80753a to
bd68589
Compare
56926c6 to
4264b32
Compare
ivpr
pushed a commit
that referenced
this pull request
Jun 15, 2026
…envoyproxy#45324) ### Description This PR resolves a use-after-free (UAF) / pure virtual method call crash that occurs during `ConnectivityGrid` teardown when connection pools (HTTP/3, HTTP/2, or HTTP/1) are destroyed while active connection attempts are still ongoing in the background. --- ### 💥 The Bug & Crash Callstack When the `ConnectivityGrid` is deleted, its destructor destroys its underlying connection pools in reverse order. If a pool has active connection attempts, deleting the pool synchronously cancels those attempts. In C++, once a derived class destructor finishes, the object's vtable pointer transitions to point to the base class vtable. Because `destructAllConnections()` is invoked in `~HttpConnPoolImplBase()` (the base class destructor), the connection cancellation callback wrappers execute `onConnectionAttemptFailed()` after the derived pool object has already been destructed. When `onConnectionAttemptFailed()` attempts to write a log trace calling `describePool(attempt->pool())` (which executes `attempt->pool().protocolDescription()`), virtual dispatch invokes a **pure virtual function call** on the base class `HttpConnPoolImplBase`, crashing Envoy instantly: ```log Program received signal SIGSEGV, Segmentation fault. 0x000056453663656e in Envoy::Http::ConnectivityGrid::WrapperCallbacks::onConnectionAttemptFailed() Backtrace: #0 0x000056453663656e in Envoy::Http::ConnectivityGrid::WrapperCallbacks::onConnectionAttemptFailed() #1 0x000056453663606d in Envoy::Http::ConnectivityGrid::WrapperCallbacks::ConnectionAttemptCallbacks::onPoolFailure() #2 0x000056453664e7f1 in Envoy::Http::HttpConnPoolImplBase::onPoolFailure() #3 0x00005645366cb447 in Envoy::ConnectionPool::ConnPoolImplBase::purgePendingStreams() #4 0x00005645366c9d6e in Envoy::ConnectionPool::ConnPoolImplBase::onConnectionEvent() #5 0x000056453666394d in Envoy::Tcp::ActiveTcpClient::onEvent() envoyproxy#6 0x000056453709b0e9 in Envoy::Network::MockConnectionBase::raiseEvent() ... envoyproxy#24 0x00005645366be287 in Envoy::ConnectionPool::ConnPoolImplBase::destructAllConnections() envoyproxy#25 0x00005645366766b8 in Envoy::Http::HttpConnPoolImplBase::~HttpConnPoolImplBase() envoyproxy#26 0x000056453664e70f in Envoy::Http::HttpConnPoolImplMixed::~HttpConnPoolImplMixed() envoyproxy#27 0x000056453664e739 in Envoy::Http::HttpConnPoolImplMixed::~HttpConnPoolImplMixed() envoyproxy#28 0x000056453647099c in std::__1::default_delete<>::operator()() envoyproxy#29 0x000056453647091c in std::__1::unique_ptr<>::reset() envoyproxy#30 0x000056453663a3ad in Envoy::Http::ConnectivityGrid::~ConnectivityGrid() envoyproxy#31 0x000056453663a4f9 in Envoy::Http::ConnectivityGrid::~ConnectivityGrid() envoyproxy#32 0x00005645365de65c in std::__1::default_delete<>::operator()() envoyproxy#33 0x000056453656b55c in std::__1::unique_ptr<>::reset() envoyproxy#34 0x00005645363174fa in ConnectivityGridTest_DestroyGridWithActiveH2Attempts_Test::TestBody() ``` --- ### 🛠️ The Fix Introduced a reloadable feature flag guard `envoy.reloadable_features.conn_pool_grid_early_return_on_teardown` (enabled by default) to shield both H3 and H2/H1 pools during destruction: * **Early Return**: Inside `WrapperCallbacks::onConnectionAttemptFailed()`, we check if `delete_started_` is `true`. If the flag is enabled and the wrapper is already being deleted/torn down, we return immediately, completely skipping the unsafe virtual method lookups on the half-destructed pool object. * **ALPN & Protocol Safety**: This safely protects the system regardless of which pool is being torn down (H3 or H2/H1 mixed pools). --- ### 🧪 Testing & Verification * Added the `DestroyGridWithActiveH2Attempts` unit test to verify that tearing down the `ConnectivityGrid` while an active H2 connection attempt is running does crash without the fix and not with the fix. Risk Level: low, only exposed with log level trace Testing: new unit test Docs Changes: N/A Release Notes: Y Platform Specific Features: N/A Runtime guard: envoy.reloadable_features.conn_pool_grid_early_return_on_teardown Signed-off-by: Dan Zhang <danzh@google.com> Co-authored-by: Dan Zhang <danzh@google.com>
added 2 commits
June 15, 2026 17:40
Add the envoy.upstreams.http.reverse_tunnel extension: a per-cluster upstream (client) codec for reverse-connection clusters that makes graceful drain observable and controllable, built on the Http::ClientCodecFactory seam. - ReverseTunnelUpstreamCodecOptions (typed_extension_protocol_options) doubles as the ClientCodecFactory; it is scoped to HTTP/2 envoy.clusters.reverse_connection clusters and is a no-op otherwise. - Observes a peer GOAWAY (stat) and forwards to the pool so normal drain runs. - Two-phase client GOAWAY (graceful max-stream-id, then final) and a graceful pool drain at the end of the drain window so in-flight requests finish. - UpstreamCodecDrainRegistry (TLS-backed singleton) + admin endpoint /reverse_tunnel/drain_clusters to fan a drain out to worker-thread codecs. Gated by the per-cluster proto field enable_drain_with_goaway; off by default. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Extend the downstream reverse-tunnel drain-aware HCM so a tunnel agent dials a replacement tunnel as soon as a tunnel begins draining, keeping the old tunnel alive for in-flight requests. - DrainAwareServerConnectionCallbacks interposes between the HTTP/2 server codec and the HCM callbacks to observe a peer GOAWAY (the upstream server's drain), which the default server path ignores. - Both the peer-GOAWAY observer and the local drain hook (max_connection_duration / graceful shutdown / listener drain) call markTunnelDrainingAndDialReplacement, which drops the tunnel from tracking and kicks maintenance to dial a replacement immediately; the underlying socket is left alone so in-flight HTTP/2 streams finish, and onDownstreamConnectionClosed() no-ops on the eventual close. - Adds connectionKey()/parent() accessors on the downstream IOHandle. Gated by runtime guard envoy.reloadable_features.reverse_tunnel_drain_with_goaway; off by default. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
bd68589 to
0d47032
Compare
454590d to
ff70787
Compare
a3b5f90 to
9415483
Compare
9415483 to
4835218
Compare
yanavlasov
pushed a commit
to envoyproxy/envoy
that referenced
this pull request
Jun 27, 2026
Adds `Http::ClientCodecFactory`, an optional per-cluster extension point for building the upstream (client) HTTP codec. It is discovered from `typed_extension_protocol_options` and surfaced via a new non-pure `ClusterInfo::upstreamHttpClientCodecFactory()` accessor that defaults to empty. `CodecClientProd` invokes the factory when present, handing it a thunk that builds the stock codec, so an implementation can either decorate the stock codec or return a fresh one. With no factory configured the stock codec is built exactly as before and behavior is unchanged. Motivating use case: reverse tunnels invert the HTTP roles -- the connection is dialed from the downstream to the upstream, but HTTP requests then flow from the upstream to the downstream over that tunnel, so the upstream is the HTTP/2 client and the downstream is the HTTP/2 server. Graceful drain of such a tunnel needs cooperation in both directions and both land on the upstream (client) codec: when the upstream is draining it must signal the downstream to dial a replacement connection while in-flight requests finish; when the downstream is draining it must tell the upstream to stop sending new requests and let the downstream re-establish, again finishing in-flight requests. Neither is expressible with the stock upstream codec, and Envoy currently has no per-cluster hook to substitute the upstream (client) codec (`CodecClientProd` hard-codes it by protocol). This change adds that hook generically; the reverse-tunnel codec that uses it is a follow-up change. Reverse-tunnel implementation that uses this seam (in the ivpr/envoy fork): the upstream drain-aware client codec (ivpr#4) and the downstream drain-aware HCM that dials a replacement tunnel (ivpr#5). ## Reverse-tunnel graceful drain (sequence diagram) ```mermaid sequenceDiagram autonumber participant D as Downstream participant U as Upstream Note over D,U: Reverse tunnel setup - unlike a normal hop, the downstream dials OUT to the upstream D->>U: open TCP connection and initiate reverse tunnel U-->>D: tunnel established Note over D,U: On the established tunnel the upstream is the HTTP/2 client and the downstream is the HTTP/2 server, so requests flow from upstream to downstream U->>D: request over tunnel D-->>U: 200 Note over D,U: Upstream graceful drain U->>D: GOAWAY - drain signal toward downstream D->>D: dial a replacement tunnel to a healthy upstream U->>D: in-flight request continues on the old tunnel D-->>U: 200 Note over D,U: old tunnel closes once idle, new requests use the replacement Note over D,U: Downstream graceful drain D->>U: GOAWAY - tell upstream to stop sending new requests U->>U: drain pool, route new requests to the replacement tunnel U->>D: in-flight request continues on the old tunnel D-->>U: 200 Note over D,U: downstream re-establishes, old tunnel closes once idle ``` Risk Level: Low -- new, opt-in per-cluster extension point. When no factory is configured the stock codec is used and there is no behavior change. Testing: `//test/common/http:codec_client_test` covers the unchanged no-factory path. A dedicated unit test for the factory-injection path is a planned follow-up. Docs Changes: None (the new interface is documented inline in `envoy/http/client_codec_factory.h`). Release Notes: None (extension point only; no user-facing behavior change on its own). Platform Specific Features: None. Runtime guard: None (this PR adds no runtime-guarded behavior). API Considerations: No xDS/proto API changes. Adds one new C++ extension interface (`envoy/http/client_codec_factory.h`) and a non-pure `ClusterInfo::upstreamHttpClientCodecFactory()` accessor that defaults to empty, so existing `ClusterInfo` implementations are unaffected. --------- Signed-off-by: Prasad I V <prasad.iv@databricks.com> Co-authored-by: Prasad I V <prasad.iv@databricks.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
replacement tunnel as soon as a tunnel begins draining, keeping the old tunnel alive for in-flight
requests.
DrainAwareServerConnectionCallbacksinterposes between the HTTP/2 server codec and the HCMcallbacks to observe a peer GOAWAY (the upstream's drain), which the default server path
ignores.
max_connection_duration/ graceful shutdown /listener drain) both call
markTunnelDrainingAndDialReplacement, which drops the tunnel fromtracking and kicks maintenance to dial a replacement; the socket is left alone so in-flight HTTP/2
streams finish.
connectionKey()/parent()accessors on the downstream IOHandle.envoy.reloadable_features.reverse_tunnel_drain_with_goaway; off bydefault.
How it fits (reverse tunnels)
On a reverse tunnel the downstream is the HTTP/2 server; this is the peer of #4. When the
upstream (or the downstream itself, via
max_connection_duration) begins draining, thedownstream dials a replacement tunnel while existing in-flight requests finish.
Stack
Stacked on #4 (the upstream drain-aware codec); both build on the seam in #3.
Test plan
bazel test //test/extensions/filters/network/reverse_tunnel/drain_aware_hcm/...