Skip to content

reverse_tunnel: add drain-aware upstream client codec#45874

Open
ivpr wants to merge 14 commits into
envoyproxy:mainfrom
ivpr:submit/upstream-drain-codec
Open

reverse_tunnel: add drain-aware upstream client codec#45874
ivpr wants to merge 14 commits into
envoyproxy:mainfrom
ivpr:submit/upstream-drain-codec

Conversation

@ivpr

@ivpr ivpr commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Reverse-connection clusters hold a long-lived multiplexed HTTP/2 tunnel to a peer Envoy, with the cluster on the client side of the tunnel. When such an upstream drains, peers otherwise only learn of it when the TCP connection finally closes -- too late to fail in-flight requests over gracefully.

This adds an opt-in, per-cluster upstream client codec (built on the Http::ClientCodecFactory seam from #45632) that drains a reverse tunnel gracefully so peers can move to a sibling upstream before streams are reset. Opt-in via typed_extension_protocol_options on the envoy.clusters.reverse_connection cluster; default off.

Companion to #45875 (the downstream reacts to the drain GOAWAY by re-dialing); the two are independent and each is harmless alone.

Commit Message: reverse_tunnel: add drain-aware upstream client codec
Additional Description: See above.
Risk Level: Low -- runs only for opted-in reverse_connection clusters; default off; existing clusters unaffected.
Testing: unit tests (config_test); manual end-to-end reverse-tunnel drain validation.
Docs Changes: extension docs for the new options.
Release Notes: changelog fragment included.
Platform Specific Features: n/a
Optional Runtime guard: n/a (config-gated)
Optional API Considerations: new ReverseTunnelUpstreamCodecOptions proto.

@repokitteh-read-only

Copy link
Copy Markdown

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #45874 was opened by ivpr.

see: more, trace.

Reverse-connection clusters hold a long-lived multiplexed HTTP/2 tunnel to a
peer Envoy, with the cluster on the client side. When such an upstream drains,
peers otherwise only notice when the TCP connection finally closes, too late to
fail in-flight requests over gracefully.

Add an opt-in, per-cluster upstream client codec (via the upstream
ClientCodecFactory seam) that drains a reverse tunnel gracefully so peers can
move to a sibling upstream before streams are reset. Opt-in via
typed_extension_protocol_options; default off.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
@ivpr ivpr force-pushed the submit/upstream-drain-codec branch from bd4f5f8 to 78941cb Compare June 29, 2026 06:40
Prasad I V added 10 commits June 29, 2026 08:20
- Move the [#extension:] annotation onto the options message so protodoc
  emits the extension_envoy.upstreams.http.reverse_tunnel doc label that
  cluster.proto and the security-posture docs reference (fixes docs/external).
- Drop a stray "sidecast" comment and reword "GOAWAYs" to satisfy the
  pedantic spell check.
- Add unit coverage for the codec extension to meet the directory threshold:
  createClientCodec's drain-aware HTTP/2 path, the remaining decorator
  forwarding methods, the callbacks settings/max-streams forwarding, and the
  factory's createProtocolOptionsConfig including the admin drain handler.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
protodoc emits the extension_<name> doc anchor from the [#extension:]
annotation in the file-level protodoc-title block (as the canonical
http_protocol_options proto does). Placing it elsewhere left the
extension_envoy.upstreams.http.reverse_tunnel label undefined, failing
the docs/external prechecks that reference it from cluster.proto and the
security-posture pages.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Add coverage for the paths the mock-inner tests cannot reach, closing the
remaining gap in source/extensions/upstreams/http/reverse_tunnel:
- two-phase startGracefulDrain on a real DrainAwareHttp2ClientConnection,
  exercising sendGracefulGoAway() and the h2_codec_ branch;
- the registry's empty-key "drain all clusters" fan-out.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
The docs rst (and the extension_<name> anchor) is only generated for
packages listed in //api:v3_protos. The reverse_tunnel upstream codec
package was missing there, so the extension_envoy.upstreams.http.reverse_tunnel
label referenced by cluster.proto and the security-posture pages was
never defined, failing the docs/external prechecks.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
createEmptyProtocolOptionsProto() and createEmptyConfigProto() were the
remaining uncovered lines in source/extensions/upstreams/http/reverse_tunnel
(confirmed via local lcov). Add a test that exercises both.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Use the server's configured drain time (--drain-time-s) as the default
grace window for the /reverse_tunnel/drain_clusters admin trigger instead
of a hardcoded constant, so in-flight requests get the same window to
finish as a normal drain. Still overridable per call via drain_time_ms.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Trim verbose how-focused comments across the upstream codec extension to
concise why-focused ones, and fix a stale drainAll() reference in the
registry docstring.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Add a single info log per cluster config load indicating the drain-aware
upstream codec is configured, complementing the per-connection debug logs.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
sendGracefulGoAway() runs per connection on each drain and duplicates the
info-level startGracefulDrain log; demote the frame-submission detail to
debug to avoid per-connection noise during a drain-all.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
The "scenario 1/2/3" labels referred to an external design doc and aren't
defined anywhere in-tree; reword each test comment to stand on its own.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
@ivpr ivpr marked this pull request as ready for review June 30, 2026 05:33
@repokitteh-read-only

Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @wbpcode
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #45874 was ready_for_review by ivpr.

see: more, trace.

@ivpr

ivpr commented Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an opt-in, drain-aware upstream HTTP/2 client codec for reverse-tunnel clusters, along with a thread-local registry and an admin endpoint to trigger graceful, two-phase draining. The review feedback identifies critical safety issues, including a potential use-after-free during synchronous connection closure in startGracefulDrain and a dangling pointer risk when iterating over snapshotted codecs in drainCluster. Additionally, a minor improvement is suggested to clean up empty map entries in the registry upon codec removal to prevent memory accumulation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread source/extensions/upstreams/http/reverse_tunnel/drain_aware_client_connection.h Outdated
Comment thread source/extensions/upstreams/http/reverse_tunnel/drain_registry.cc Outdated
Comment thread source/extensions/upstreams/http/reverse_tunnel/drain_registry.h
Prasad I V added 2 commits June 30, 2026 05:47
Address review feedback on synchronous-close hazards:
- startGracefulDrain: arm the Phase 2 drain timer before sending the
  Phase 1 graceful GOAWAY, since that send can synchronously close the
  connection and destroy this object; touching members afterward would be
  a use-after-free.
- drainCluster: a codec draining can synchronously unregister itself (and
  erase its now-empty cluster entry), so copy the target cluster keys, then
  re-find the entry and confirm the codec is still registered before each
  drain instead of iterating a held set reference / snapshot of raw pointers.
- registry remove(): drop emptied cluster entries so empty sets do not
  accumulate as dynamic clusters churn.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
The pedantic spell check rejects "re-entrancy"; use the dictionary word
"reentrant" and reflow the comment to stay under the line limit.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
@ivpr

ivpr commented Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

/retest

// Register the admin trigger once (subsequent clusters get addHandler() == false, which is fine).
// The captured registry shared_ptr keeps it alive for the handler's lifetime.
if (auto admin = server_context.admin(); admin.has_value()) {
// Default the rotation grace window to the server's configured drain time (--drain-time-s) so

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like we create the admin handler even when enable_drain_with_goaway=false?

uint64_t drain_time_ms = default_drain_time_ms;
if (auto v = params.getFirstValue("drain_time_ms"); v.has_value()) {
uint64_t parsed = 0;
if (absl::SimpleAtoi(v.value(), &parsed)) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If parse fails, let's return a HTTP BadRequest?

Address review feedback:
- Only create the drain registry and register the /reverse_tunnel/drain_clusters
  admin handler when enable_drain_with_goaway is true; when disabled the codec is
  never installed, so both are pointless. The options object gets a null registry.
- Return 400 Bad Request from the admin handler when drain_time_ms fails to parse,
  instead of silently falling back to the default.

Adds tests for the disabled (no-handler) path and the bad-request path.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants