Skip to content

upstreams/http/reverse_tunnel: drain-aware upstream client codec#4

Open
ivpr wants to merge 3 commits into
per-cluster-upstream-codec-factoryfrom
reverse-tunnel-upstream-drain-codec
Open

upstreams/http/reverse_tunnel: drain-aware upstream client codec#4
ivpr wants to merge 3 commits into
per-cluster-upstream-codec-factoryfrom
reverse-tunnel-upstream-drain-codec

Conversation

@ivpr

@ivpr ivpr commented Jun 11, 2026

Copy link
Copy Markdown
Owner

Summary

  • New extension envoy.upstreams.http.reverse_tunnel: a per-cluster upstream (client) codec for
    reverse-connection clusters, built on the Http::ClientCodecFactory seam from http: add per-cluster upstream client codec factory seam #3.
  • ReverseTunnelUpstreamCodecOptions doubles as the factory; scoped to HTTP/2
    envoy.clusters.reverse_connection clusters and a no-op otherwise.
  • Observes a peer GOAWAY (stat); emits a two-phase client GOAWAY (graceful max-stream-id, then
    final) and drives a graceful pool drain so in-flight requests finish.
  • UpstreamCodecDrainRegistry (TLS-backed singleton) + admin endpoint
    /reverse_tunnel/drain_clusters to fan a drain out to worker-thread codecs.
  • Gated by the per-cluster proto field enable_drain_with_goaway; off by default.

How it fits (reverse tunnels)

On a reverse tunnel the upstream is the HTTP/2 client; this codec lets a draining upstream
emit GOAWAY from the client side and drain its pool gracefully, so in-flight requests finish while
clients move to a healthy upstream.

Stack

Stacked on #3 (the generic codec-factory seam). Paired with #5 (the downstream drain-aware HCM).

Test plan

  • bazel test //test/extensions/upstreams/http/reverse_tunnel:config_test

@ivpr ivpr force-pushed the per-cluster-upstream-codec-factory branch from 7febccb to ced2843 Compare June 15, 2026 04:49
@ivpr ivpr force-pushed the reverse-tunnel-upstream-drain-codec branch from 737bc27 to 605bdd2 Compare June 15, 2026 04:49
@ivpr ivpr force-pushed the per-cluster-upstream-codec-factory branch from ced2843 to f510d90 Compare June 15, 2026 05:18
@ivpr ivpr force-pushed the reverse-tunnel-upstream-drain-codec branch from 605bdd2 to 6083fea Compare June 15, 2026 05:18
@ivpr ivpr force-pushed the per-cluster-upstream-codec-factory branch from f510d90 to 22b605f Compare June 15, 2026 06:19
@ivpr ivpr force-pushed the reverse-tunnel-upstream-drain-codec branch from 6083fea to f80753a Compare June 15, 2026 06:19
@ivpr ivpr force-pushed the per-cluster-upstream-codec-factory branch from 22b605f to dc1fddb Compare June 15, 2026 16:23
@ivpr ivpr force-pushed the reverse-tunnel-upstream-drain-codec branch from f80753a to bd68589 Compare June 15, 2026 16:23
ivpr pushed a commit that referenced this pull request Jun 15, 2026
…envoyproxy#45324)

### Description
This PR resolves a use-after-free (UAF) / pure virtual method call crash
that occurs during `ConnectivityGrid` teardown when connection pools
(HTTP/3, HTTP/2, or HTTP/1) are destroyed while active connection
attempts are still ongoing in the background.

---

### 💥 The Bug & Crash Callstack

When the `ConnectivityGrid` is deleted, its destructor destroys its
underlying connection pools in reverse order. If a pool has active
connection attempts, deleting the pool synchronously cancels those
attempts. In C++, once a derived class destructor finishes, the object's
vtable pointer transitions to point to the base class vtable.

Because `destructAllConnections()` is invoked in
`~HttpConnPoolImplBase()` (the base class destructor), the connection
cancellation callback wrappers execute `onConnectionAttemptFailed()`
after the derived pool object has already been destructed.

When `onConnectionAttemptFailed()` attempts to write a log trace calling
`describePool(attempt->pool())` (which executes
`attempt->pool().protocolDescription()`), virtual dispatch invokes a
**pure virtual function call** on the base class `HttpConnPoolImplBase`,
crashing Envoy instantly:

```log
Program received signal SIGSEGV, Segmentation fault.
0x000056453663656e in Envoy::Http::ConnectivityGrid::WrapperCallbacks::onConnectionAttemptFailed()
Backtrace:
  #0  0x000056453663656e in Envoy::Http::ConnectivityGrid::WrapperCallbacks::onConnectionAttemptFailed()
  #1  0x000056453663606d in Envoy::Http::ConnectivityGrid::WrapperCallbacks::ConnectionAttemptCallbacks::onPoolFailure()
  #2  0x000056453664e7f1 in Envoy::Http::HttpConnPoolImplBase::onPoolFailure()
  #3  0x00005645366cb447 in Envoy::ConnectionPool::ConnPoolImplBase::purgePendingStreams()
  #4  0x00005645366c9d6e in Envoy::ConnectionPool::ConnPoolImplBase::onConnectionEvent()
  #5  0x000056453666394d in Envoy::Tcp::ActiveTcpClient::onEvent()
  envoyproxy#6  0x000056453709b0e9 in Envoy::Network::MockConnectionBase::raiseEvent()
  ...
  envoyproxy#24 0x00005645366be287 in Envoy::ConnectionPool::ConnPoolImplBase::destructAllConnections()
  envoyproxy#25 0x00005645366766b8 in Envoy::Http::HttpConnPoolImplBase::~HttpConnPoolImplBase()
  envoyproxy#26 0x000056453664e70f in Envoy::Http::HttpConnPoolImplMixed::~HttpConnPoolImplMixed()
  envoyproxy#27 0x000056453664e739 in Envoy::Http::HttpConnPoolImplMixed::~HttpConnPoolImplMixed()
  envoyproxy#28 0x000056453647099c in std::__1::default_delete<>::operator()()
  envoyproxy#29 0x000056453647091c in std::__1::unique_ptr<>::reset()
  envoyproxy#30 0x000056453663a3ad in Envoy::Http::ConnectivityGrid::~ConnectivityGrid()
  envoyproxy#31 0x000056453663a4f9 in Envoy::Http::ConnectivityGrid::~ConnectivityGrid()
  envoyproxy#32 0x00005645365de65c in std::__1::default_delete<>::operator()()
  envoyproxy#33 0x000056453656b55c in std::__1::unique_ptr<>::reset()
  envoyproxy#34 0x00005645363174fa in ConnectivityGridTest_DestroyGridWithActiveH2Attempts_Test::TestBody()
```

---

### 🛠️ The Fix

Introduced a reloadable feature flag guard
`envoy.reloadable_features.conn_pool_grid_early_return_on_teardown`
(enabled by default) to shield both H3 and H2/H1 pools during
destruction:
* **Early Return**: Inside
`WrapperCallbacks::onConnectionAttemptFailed()`, we check if
`delete_started_` is `true`. If the flag is enabled and the wrapper is
already being deleted/torn down, we return immediately, completely
skipping the unsafe virtual method lookups on the half-destructed pool
object.
* **ALPN & Protocol Safety**: This safely protects the system regardless
of which pool is being torn down (H3 or H2/H1 mixed pools).

---

### 🧪 Testing & Verification

* Added the `DestroyGridWithActiveH2Attempts` unit test to verify that
tearing down the `ConnectivityGrid` while an active H2 connection
attempt is running does crash without the fix and not with the fix.

Risk Level: low, only exposed with log level trace
Testing: new unit test
Docs Changes: N/A 
Release Notes: Y
Platform Specific Features: N/A
Runtime guard:
envoy.reloadable_features.conn_pool_grid_early_return_on_teardown

Signed-off-by: Dan Zhang <danzh@google.com>
Co-authored-by: Dan Zhang <danzh@google.com>
@ivpr ivpr force-pushed the per-cluster-upstream-codec-factory branch from dc1fddb to eb48580 Compare June 15, 2026 17:42
@ivpr ivpr force-pushed the reverse-tunnel-upstream-drain-codec branch 8 times, most recently from a3b5f90 to 9415483 Compare June 18, 2026 20:56
Prasad I V added 3 commits June 26, 2026 04:20
Add the envoy.upstreams.http.reverse_tunnel extension: a per-cluster upstream
(client) codec for reverse-connection clusters that makes graceful drain
observable and controllable, built on the Http::ClientCodecFactory seam.

- ReverseTunnelUpstreamCodecOptions (typed_extension_protocol_options) doubles
  as the ClientCodecFactory; it is scoped to HTTP/2 envoy.clusters.reverse_connection
  clusters and is a no-op otherwise.
- Observes a peer GOAWAY (stat) and forwards to the pool so normal drain runs.
- Two-phase client GOAWAY (graceful max-stream-id, then final) and a graceful
  pool drain at the end of the drain window so in-flight requests finish.
- UpstreamCodecDrainRegistry (TLS-backed singleton) + admin endpoint
  /reverse_tunnel/drain_clusters to fan a drain out to worker-thread codecs.

Gated by the per-cluster proto field enable_drain_with_goaway; off by default.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Expose the codec factory via ProtocolOptionsConfig::upstreamHttpClientCodecFactory(),
switch createClientCodec to the one-arg (nullptr-to-defer) contract, and return
nullptr in the decline cases instead of calling create_default.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Update the config_test Context helper for the new ClientCodecFactory::Context
options field.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
@ivpr ivpr force-pushed the reverse-tunnel-upstream-drain-codec branch from 9415483 to 4835218 Compare June 26, 2026 04:20
yanavlasov pushed a commit to envoyproxy/envoy that referenced this pull request Jun 27, 2026
Adds `Http::ClientCodecFactory`, an optional per-cluster extension point
for building the upstream
(client) HTTP codec. It is discovered from
`typed_extension_protocol_options` and surfaced via a new
non-pure `ClusterInfo::upstreamHttpClientCodecFactory()` accessor that
defaults to empty.
`CodecClientProd` invokes the factory when present, handing it a thunk
that builds the stock codec,
so an implementation can either decorate the stock codec or return a
fresh one. With no factory
configured the stock codec is built exactly as before and behavior is
unchanged.

Motivating use case: reverse tunnels invert the HTTP roles -- the
connection is dialed from the
downstream to the upstream, but HTTP requests then flow from the
upstream to the downstream over
that tunnel, so the upstream is the HTTP/2 client and the downstream is
the HTTP/2 server. Graceful
drain of such a tunnel needs cooperation in both directions and both
land on the upstream (client)
codec: when the upstream is draining it must signal the downstream to
dial a replacement connection
while in-flight requests finish; when the downstream is draining it must
tell the upstream to stop
sending new requests and let the downstream re-establish, again
finishing in-flight requests.
Neither is expressible with the stock upstream codec, and Envoy
currently has no per-cluster hook to
substitute the upstream (client) codec (`CodecClientProd` hard-codes it
by protocol). This change
adds that hook generically; the reverse-tunnel codec that uses it is a
follow-up change.

Reverse-tunnel implementation that uses this seam (in the ivpr/envoy
fork): the upstream drain-aware client codec
(ivpr#4) and the downstream drain-aware
HCM that dials a replacement tunnel
(ivpr#5).

## Reverse-tunnel graceful drain (sequence diagram)

```mermaid
sequenceDiagram
    autonumber
    participant D as Downstream
    participant U as Upstream

    Note over D,U: Reverse tunnel setup - unlike a normal hop, the downstream dials OUT to the upstream
    D->>U: open TCP connection and initiate reverse tunnel
    U-->>D: tunnel established
    Note over D,U: On the established tunnel the upstream is the HTTP/2 client and the downstream is the HTTP/2 server, so requests flow from upstream to downstream
    U->>D: request over tunnel
    D-->>U: 200

    Note over D,U: Upstream graceful drain
    U->>D: GOAWAY - drain signal toward downstream
    D->>D: dial a replacement tunnel to a healthy upstream
    U->>D: in-flight request continues on the old tunnel
    D-->>U: 200
    Note over D,U: old tunnel closes once idle, new requests use the replacement

    Note over D,U: Downstream graceful drain
    D->>U: GOAWAY - tell upstream to stop sending new requests
    U->>U: drain pool, route new requests to the replacement tunnel
    U->>D: in-flight request continues on the old tunnel
    D-->>U: 200
    Note over D,U: downstream re-establishes, old tunnel closes once idle
```

Risk Level: Low -- new, opt-in per-cluster extension point. When no
factory is configured the stock
codec is used and there is no behavior change.

Testing: `//test/common/http:codec_client_test` covers the unchanged
no-factory path. A dedicated
unit test for the factory-injection path is a planned follow-up.

Docs Changes: None (the new interface is documented inline in
`envoy/http/client_codec_factory.h`).

Release Notes: None (extension point only; no user-facing behavior
change on its own).

Platform Specific Features: None.

Runtime guard: None (this PR adds no runtime-guarded behavior).

API Considerations: No xDS/proto API changes. Adds one new C++ extension
interface
(`envoy/http/client_codec_factory.h`) and a non-pure
`ClusterInfo::upstreamHttpClientCodecFactory()`
accessor that defaults to empty, so existing `ClusterInfo`
implementations are unaffected.

---------

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Co-authored-by: Prasad I V <prasad.iv@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant