fix(electric-telemetry): bound process_type cardinality for proxy-supplied request ids#4658
fix(electric-telemetry): bound process_type cardinality for proxy-supplied request ids#4658robacourt wants to merge 1 commit into
Conversation
…plied request ids parse_binary_label/1 only stripped the request id from process labels when it was exactly 20 bytes (the length Plug.RequestId self-generates). A proxy- or load-balancer-supplied x-request-id (UUID, Envoy/nginx trace id, etc.) has a different length, so the raw per-request id leaked into process_type, adding a permanent new series to telemetry events like vm.monitor.long_gc and vm.monitor.long_schedule for every distinct id. Keep the fast fixed-width match for the common 20-byte case and add a slow path that splits on the " - " delimiter regardless of request-id length, collapsing all requests to a route back to a single process_type. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4658 +/- ##
==========================================
+ Coverage 60.00% 60.15% +0.15%
==========================================
Files 395 410 +15
Lines 43747 44348 +601
Branches 12580 12580
==========================================
+ Hits 26249 26679 +430
- Misses 17420 17590 +170
- Partials 78 79 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Claude Code ReviewSummaryThis PR fixes a real, unbounded What's Working Well
Issues FoundCritical (Must Fix)None. Important (Should Fix)None. Suggestions (Nice to Have)
Issue ConformanceNo linked issue, but the PR description is unusually thorough and clearly documents the problem, the mechanism, and the fix. The implementation matches the described solution exactly, with no scope creep. (Per project convention, a linked issue would be nice-to-have, but the self-contained description more than compensates here.) Previous Review StatusN/A — first review of this PR. Review iteration: 1 | 2026-06-28 |
| _ -> | ||
| truncate_label(label) |
There was a problem hiding this comment.
I'm wondering if we should have a fixed label, say "Unparseable Request" and to log a warning with the full label, to prevent an unbounded process type in this scenario
Summary
Fixes an unbounded
process_typecardinality leak in telemetry for request-handling processes.Problem
process_typefor request-handling processes is derived from the process label set byElectric.Plug.LabelProcessPlug, which looks like:parse_binary_label/1is meant to strip the request id and query, collapsing this to a boundedGET /v1/shape. But it did so with a fixed-width binary match that assumes the request id is exactly 20 bytes:That holds for ids
Plug.RequestIdself-generates, butPlug.RequestIdreuses an incomingx-request-idheader when a client/proxy supplies one. A UUID (36 chars), an Envoy/nginx id, or an LB trace id is not 20 bytes, so the clause falls through to the generic branch that returns the first 20 bytes of the whole label — i.e.Request <first 12 chars of the id>. Every distinct incoming request id that GCs slowly then adds a permanent new series toprometheus_metrics_distfor events likevm.monitor.long_gcandvm.monitor.long_schedule. Request handlers building large shape responses are exactly the processes that trip those monitors, so the table grows without bound when a fronting proxy injectsx-request-id.Solution
Keep the fast fixed-width binary match for the common 20-byte case (
Plug.RequestId's self-generated ids), and add a slow path that only runs when the id is a different length: it splits on the" - "delimiter regardless of request-id length, then strips the query string. All requests to a route collapse to a singleprocess_type(e.g.GET /v1/shape) regardless of request-id source or length. A"Request "label with no delimiter falls back to truncation.