reverse_tunnel: add drain-aware upstream client codec#45874
Conversation
Reverse-connection clusters hold a long-lived multiplexed HTTP/2 tunnel to a peer Envoy, with the cluster on the client side. When such an upstream drains, peers otherwise only notice when the TCP connection finally closes, too late to fail in-flight requests over gracefully. Add an opt-in, per-cluster upstream client codec (via the upstream ClientCodecFactory seam) that drains a reverse tunnel gracefully so peers can move to a sibling upstream before streams are reset. Opt-in via typed_extension_protocol_options; default off. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
bd4f5f8 to
78941cb
Compare
- Move the [#extension:] annotation onto the options message so protodoc emits the extension_envoy.upstreams.http.reverse_tunnel doc label that cluster.proto and the security-posture docs reference (fixes docs/external). - Drop a stray "sidecast" comment and reword "GOAWAYs" to satisfy the pedantic spell check. - Add unit coverage for the codec extension to meet the directory threshold: createClientCodec's drain-aware HTTP/2 path, the remaining decorator forwarding methods, the callbacks settings/max-streams forwarding, and the factory's createProtocolOptionsConfig including the admin drain handler. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
protodoc emits the extension_<name> doc anchor from the [#extension:] annotation in the file-level protodoc-title block (as the canonical http_protocol_options proto does). Placing it elsewhere left the extension_envoy.upstreams.http.reverse_tunnel label undefined, failing the docs/external prechecks that reference it from cluster.proto and the security-posture pages. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Add coverage for the paths the mock-inner tests cannot reach, closing the remaining gap in source/extensions/upstreams/http/reverse_tunnel: - two-phase startGracefulDrain on a real DrainAwareHttp2ClientConnection, exercising sendGracefulGoAway() and the h2_codec_ branch; - the registry's empty-key "drain all clusters" fan-out. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
The docs rst (and the extension_<name> anchor) is only generated for packages listed in //api:v3_protos. The reverse_tunnel upstream codec package was missing there, so the extension_envoy.upstreams.http.reverse_tunnel label referenced by cluster.proto and the security-posture pages was never defined, failing the docs/external prechecks. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
createEmptyProtocolOptionsProto() and createEmptyConfigProto() were the remaining uncovered lines in source/extensions/upstreams/http/reverse_tunnel (confirmed via local lcov). Add a test that exercises both. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Use the server's configured drain time (--drain-time-s) as the default grace window for the /reverse_tunnel/drain_clusters admin trigger instead of a hardcoded constant, so in-flight requests get the same window to finish as a normal drain. Still overridable per call via drain_time_ms. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Trim verbose how-focused comments across the upstream codec extension to concise why-focused ones, and fix a stale drainAll() reference in the registry docstring. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Add a single info log per cluster config load indicating the drain-aware upstream codec is configured, complementing the per-connection debug logs. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
sendGracefulGoAway() runs per connection on each drain and duplicates the info-level startGracefulDrain log; demote the frame-submission detail to debug to avoid per-connection noise during a drain-all. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
The "scenario 1/2/3" labels referred to an external design doc and aren't defined anywhere in-tree; reword each test comment to stand on its own. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces an opt-in, drain-aware upstream HTTP/2 client codec for reverse-tunnel clusters, along with a thread-local registry and an admin endpoint to trigger graceful, two-phase draining. The review feedback identifies critical safety issues, including a potential use-after-free during synchronous connection closure in startGracefulDrain and a dangling pointer risk when iterating over snapshotted codecs in drainCluster. Additionally, a minor improvement is suggested to clean up empty map entries in the registry upon codec removal to prevent memory accumulation.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Address review feedback on synchronous-close hazards: - startGracefulDrain: arm the Phase 2 drain timer before sending the Phase 1 graceful GOAWAY, since that send can synchronously close the connection and destroy this object; touching members afterward would be a use-after-free. - drainCluster: a codec draining can synchronously unregister itself (and erase its now-empty cluster entry), so copy the target cluster keys, then re-find the entry and confirm the codec is still registered before each drain instead of iterating a held set reference / snapshot of raw pointers. - registry remove(): drop emptied cluster entries so empty sets do not accumulate as dynamic clusters churn. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
The pedantic spell check rejects "re-entrancy"; use the dictionary word "reentrant" and reflow the comment to stay under the line limit. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
|
/retest |
| // Register the admin trigger once (subsequent clusters get addHandler() == false, which is fine). | ||
| // The captured registry shared_ptr keeps it alive for the handler's lifetime. | ||
| if (auto admin = server_context.admin(); admin.has_value()) { | ||
| // Default the rotation grace window to the server's configured drain time (--drain-time-s) so |
There was a problem hiding this comment.
seems like we create the admin handler even when enable_drain_with_goaway=false?
| uint64_t drain_time_ms = default_drain_time_ms; | ||
| if (auto v = params.getFirstValue("drain_time_ms"); v.has_value()) { | ||
| uint64_t parsed = 0; | ||
| if (absl::SimpleAtoi(v.value(), &parsed)) { |
There was a problem hiding this comment.
If parse fails, let's return a HTTP BadRequest?
Address review feedback: - Only create the drain registry and register the /reverse_tunnel/drain_clusters admin handler when enable_drain_with_goaway is true; when disabled the codec is never installed, so both are pointless. The options object gets a null registry. - Return 400 Bad Request from the admin handler when drain_time_ms fails to parse, instead of silently falling back to the default. Adds tests for the disabled (no-handler) path and the bad-request path. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Reverse-connection clusters hold a long-lived multiplexed HTTP/2 tunnel to a peer Envoy, with the cluster on the client side of the tunnel. When such an upstream drains, peers otherwise only learn of it when the TCP connection finally closes -- too late to fail in-flight requests over gracefully.
This adds an opt-in, per-cluster upstream client codec (built on the
Http::ClientCodecFactoryseam from #45632) that drains a reverse tunnel gracefully so peers can move to a sibling upstream before streams are reset. Opt-in viatyped_extension_protocol_optionson theenvoy.clusters.reverse_connectioncluster; default off.Companion to #45875 (the downstream reacts to the drain GOAWAY by re-dialing); the two are independent and each is harmless alone.
Commit Message: reverse_tunnel: add drain-aware upstream client codec
Additional Description: See above.
Risk Level: Low -- runs only for opted-in reverse_connection clusters; default off; existing clusters unaffected.
Testing: unit tests (config_test); manual end-to-end reverse-tunnel drain validation.
Docs Changes: extension docs for the new options.
Release Notes: changelog fragment included.
Platform Specific Features: n/a
Optional Runtime guard: n/a (config-gated)
Optional API Considerations: new
ReverseTunnelUpstreamCodecOptionsproto.