reverse_tunnel: drain-aware HCM re-dials on peer GOAWAY#45875
Conversation
On a reverse tunnel the downstream is the HTTP/2 server. When the peer upstream drains it sends GOAWAY, but HCM ignores GOAWAY by default, so the tunnel lingers until the socket closes and the agent does not proactively restore capacity. Make the drain-aware HCM react to a peer GOAWAY by re-dialing a replacement tunnel, so capacity is restored before the draining tunnel closes while its in-flight streams finish. Opt-in via enable_drain_with_goaway on the drain-aware HCM config; default off, so unupgraded or disabled downstreams are unaffected (GOAWAY stays a no-op). Signed-off-by: Prasad I V <prasad.iv@databricks.com>
1c6b4e7 to
b51ce0e
Compare
Add unit coverage for the code introduced by the drain-aware re-dial feature so the touched directories meet their coverage thresholds: - markTunnelDrainingAndDialReplacement() drops the draining tunnel from tracking and arms the retry timer, plus the unknown-key no-op paths in markTunnelDrainingAndDialReplacement()/onDownstreamConnectionClosed(). - DrainAwareHttpConnectionManagerConfig::createCodec() with enable_drain_with_goaway: both the non-reverse-socket case (no wiring) and the reverse-tunnel case that installs the local-drain closure and the GOAWAY-observing callbacks wrapper. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
de1ff26 to
dc1bc84
Compare
- Cover createCodec's defensive nullptr-codec branch with drain_with_goaway enabled, closing the last coverage gap in the drain_aware_hcm directory. - Reflow an over-length test name flagged by clang-format. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
The drain_aware_hcm directory was 0.1% under threshold because every createCodec test overrode createBaseCodec, leaving the real parent-HCM codec construction uncovered. Add a test that builds an actual HTTP/1 base codec and wraps it. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
…rding Close the remaining drain_aware_hcm coverage gap (confirmed via local lcov): - fire the local-drain closure from the reverse-tunnel createCodec test so the wired markTunnelDrainingAndDialReplacement path is exercised; - delegate onMaxStreamsChanged through the GOAWAY-observing callbacks wrapper. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Condense the verbose how-focused comments on the server-connection wrapper and createCodec wiring to concise why-focused ones. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
|
/gemini review |
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
There was a problem hiding this comment.
Code Review
This pull request introduces an opt-in feature (enable_drain_with_goaway) to the reverse-tunnel drain-aware HTTP connection manager. When enabled, a peer-initiated GOAWAY or a local drain event on a reverse tunnel triggers the initiator to immediately dial a replacement tunnel, restoring capacity while the old tunnel gracefully finishes its in-flight streams. The review feedback highlights a potential use-after-free vulnerability in drain_aware_config.cc where a raw pointer is captured by value in a lambda. Capturing the tunnel IOHandle instead and dynamically querying the parent when invoked will prevent a crash if the parent is destroyed before the connection drains.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Address review feedback on a use-after-free risk: the re-dial closure captured the raw parent ReverseConnectionIOHandle by value, which would dangle if the parent were torn down before the tunnel drained. Make the parent's "defensive nullptr" contract real: child tunnel IoHandles now register with the parent and the parent clears their back-pointers on teardown (cleanup()). The closure captures the tunnel IoHandle (owned by the connection, so alive whenever it can fire) and resolves parent() at invocation, skipping the re-dial if the parent is already gone. Adds a test covering parent-teardown detachment. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
basundhara-c
left a comment
There was a problem hiding this comment.
LGTM after fixing CI
|
/retest |
On a reverse tunnel the downstream is the HTTP/2 server. When the peer upstream drains it sends GOAWAY, but HCM ignores GOAWAY by default, so the tunnel lingers until the socket closes and the agent does not proactively restore capacity.
This makes the drain-aware HCM react to a peer GOAWAY by re-dialing a replacement tunnel, so capacity is restored before the draining tunnel closes while its in-flight streams finish. Opt-in via
enable_drain_with_goawayon the drain-aware HCM config; default off, so unupgraded or disabled downstreams are unaffected (a received GOAWAY stays a no-op) -- a safe migration path.Companion to #45874 (the upstream emits the drain GOAWAY); the two are independent and each is harmless alone.
Commit Message: reverse_tunnel: drain-aware HCM re-dials on peer GOAWAY
Additional Description: See above.
Risk Level: Low -- config-gated, default off; unchanged behavior when off.
Testing: unit tests (drain_aware_config_test, drain_aware_server_connection_test); manual end-to-end + mixed-version safety (an old downstream safely ignores the GOAWAY without crashing).
Docs Changes: drain_aware_hcm config field docs.
Release Notes: changelog fragment included.
Platform Specific Features: n/a
Optional Runtime guard: n/a (config-gated)
Optional API Considerations: adds
enable_drain_with_goawaytoDrainAwareHttpConnectionManager.