Skip to content

reverse_tunnel: drain-aware HCM re-dials on peer GOAWAY#45875

Open
ivpr wants to merge 7 commits into
envoyproxy:mainfrom
ivpr:submit/drain-aware-hcm-redial
Open

reverse_tunnel: drain-aware HCM re-dials on peer GOAWAY#45875
ivpr wants to merge 7 commits into
envoyproxy:mainfrom
ivpr:submit/drain-aware-hcm-redial

Conversation

@ivpr

@ivpr ivpr commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

On a reverse tunnel the downstream is the HTTP/2 server. When the peer upstream drains it sends GOAWAY, but HCM ignores GOAWAY by default, so the tunnel lingers until the socket closes and the agent does not proactively restore capacity.

This makes the drain-aware HCM react to a peer GOAWAY by re-dialing a replacement tunnel, so capacity is restored before the draining tunnel closes while its in-flight streams finish. Opt-in via enable_drain_with_goaway on the drain-aware HCM config; default off, so unupgraded or disabled downstreams are unaffected (a received GOAWAY stays a no-op) -- a safe migration path.

Companion to #45874 (the upstream emits the drain GOAWAY); the two are independent and each is harmless alone.

Commit Message: reverse_tunnel: drain-aware HCM re-dials on peer GOAWAY
Additional Description: See above.
Risk Level: Low -- config-gated, default off; unchanged behavior when off.
Testing: unit tests (drain_aware_config_test, drain_aware_server_connection_test); manual end-to-end + mixed-version safety (an old downstream safely ignores the GOAWAY without crashing).
Docs Changes: drain_aware_hcm config field docs.
Release Notes: changelog fragment included.
Platform Specific Features: n/a
Optional Runtime guard: n/a (config-gated)
Optional API Considerations: adds enable_drain_with_goaway to DrainAwareHttpConnectionManager.

@repokitteh-read-only

Copy link
Copy Markdown

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #45875 was opened by ivpr.

see: more, trace.

On a reverse tunnel the downstream is the HTTP/2 server. When the peer upstream
drains it sends GOAWAY, but HCM ignores GOAWAY by default, so the tunnel lingers
until the socket closes and the agent does not proactively restore capacity.

Make the drain-aware HCM react to a peer GOAWAY by re-dialing a replacement
tunnel, so capacity is restored before the draining tunnel closes while its
in-flight streams finish. Opt-in via enable_drain_with_goaway on the drain-aware
HCM config; default off, so unupgraded or disabled downstreams are unaffected
(GOAWAY stays a no-op).

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
@ivpr ivpr force-pushed the submit/drain-aware-hcm-redial branch from 1c6b4e7 to b51ce0e Compare June 29, 2026 06:38
Add unit coverage for the code introduced by the drain-aware re-dial
feature so the touched directories meet their coverage thresholds:

- markTunnelDrainingAndDialReplacement() drops the draining tunnel from
  tracking and arms the retry timer, plus the unknown-key no-op paths in
  markTunnelDrainingAndDialReplacement()/onDownstreamConnectionClosed().
- DrainAwareHttpConnectionManagerConfig::createCodec() with
  enable_drain_with_goaway: both the non-reverse-socket case (no wiring)
  and the reverse-tunnel case that installs the local-drain closure and
  the GOAWAY-observing callbacks wrapper.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
@ivpr ivpr force-pushed the submit/drain-aware-hcm-redial branch from de1ff26 to dc1bc84 Compare June 29, 2026 07:19
Prasad I V added 4 commits June 29, 2026 08:30
- Cover createCodec's defensive nullptr-codec branch with drain_with_goaway
  enabled, closing the last coverage gap in the drain_aware_hcm directory.
- Reflow an over-length test name flagged by clang-format.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
The drain_aware_hcm directory was 0.1% under threshold because every
createCodec test overrode createBaseCodec, leaving the real parent-HCM
codec construction uncovered. Add a test that builds an actual HTTP/1
base codec and wraps it.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
…rding

Close the remaining drain_aware_hcm coverage gap (confirmed via local lcov):
- fire the local-drain closure from the reverse-tunnel createCodec test so
  the wired markTunnelDrainingAndDialReplacement path is exercised;
- delegate onMaxStreamsChanged through the GOAWAY-observing callbacks wrapper.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Condense the verbose how-focused comments on the server-connection wrapper
and createCodec wiring to concise why-focused ones.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
@ivpr ivpr marked this pull request as ready for review June 30, 2026 17:14
@ivpr

ivpr commented Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

/gemini review

@repokitteh-read-only

Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #45875 was ready_for_review by ivpr.

see: more, trace.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an opt-in feature (enable_drain_with_goaway) to the reverse-tunnel drain-aware HTTP connection manager. When enabled, a peer-initiated GOAWAY or a local drain event on a reverse tunnel triggers the initiator to immediately dial a replacement tunnel, restoring capacity while the old tunnel gracefully finishes its in-flight streams. The review feedback highlights a potential use-after-free vulnerability in drain_aware_config.cc where a raw pointer is captured by value in a lambda. Capturing the tunnel IOHandle instead and dynamically querying the parent when invoked will prevent a crash if the parent is destroyed before the connection drains.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Address review feedback on a use-after-free risk: the re-dial closure
captured the raw parent ReverseConnectionIOHandle by value, which would
dangle if the parent were torn down before the tunnel drained.

Make the parent's "defensive nullptr" contract real: child tunnel
IoHandles now register with the parent and the parent clears their
back-pointers on teardown (cleanup()). The closure captures the tunnel
IoHandle (owned by the connection, so alive whenever it can fire) and
resolves parent() at invocation, skipping the re-dial if the parent is
already gone. Adds a test covering parent-teardown detachment.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>

@basundhara-c basundhara-c left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after fixing CI

@ivpr

ivpr commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants