Skip to content

fix(electric-telemetry): bound process_type cardinality for proxy-supplied request ids#4658

Open
robacourt wants to merge 1 commit into
mainfrom
rob/request-process-type-fix
Open

fix(electric-telemetry): bound process_type cardinality for proxy-supplied request ids#4658
robacourt wants to merge 1 commit into
mainfrom
rob/request-process-type-fix

Conversation

@robacourt

@robacourt robacourt commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes an unbounded process_type cardinality leak in telemetry for request-handling processes.

Problem

process_type for request-handling processes is derived from the process label set by Electric.Plug.LabelProcessPlug, which looks like:

Request F-jPUudNHxbD8lIAABQG - GET /v1/shape?table=users&offset=-1

parse_binary_label/1 is meant to strip the request id and query, collapsing this to a bounded GET /v1/shape. But it did so with a fixed-width binary match that assumes the request id is exactly 20 bytes:

"Request " <> <<_req_id::binary-20, " - ", rest::binary>> -> ...

That holds for ids Plug.RequestId self-generates, but Plug.RequestId reuses an incoming x-request-id header when a client/proxy supplies one. A UUID (36 chars), an Envoy/nginx id, or an LB trace id is not 20 bytes, so the clause falls through to the generic branch that returns the first 20 bytes of the whole label — i.e. Request <first 12 chars of the id>. Every distinct incoming request id that GCs slowly then adds a permanent new series to prometheus_metrics_dist for events like vm.monitor.long_gc and vm.monitor.long_schedule. Request handlers building large shape responses are exactly the processes that trip those monitors, so the table grows without bound when a fronting proxy injects x-request-id.

Solution

Keep the fast fixed-width binary match for the common 20-byte case (Plug.RequestId's self-generated ids), and add a slow path that only runs when the id is a different length: it splits on the " - " delimiter regardless of request-id length, then strips the query string. All requests to a route collapse to a single process_type (e.g. GET /v1/shape) regardless of request-id source or length. A "Request " label with no delimiter falls back to truncation.

…plied request ids

parse_binary_label/1 only stripped the request id from process labels when it
was exactly 20 bytes (the length Plug.RequestId self-generates). A proxy- or
load-balancer-supplied x-request-id (UUID, Envoy/nginx trace id, etc.) has a
different length, so the raw per-request id leaked into process_type, adding a
permanent new series to telemetry events like vm.monitor.long_gc and
vm.monitor.long_schedule for every distinct id.

Keep the fast fixed-width match for the common 20-byte case and add a slow path
that splits on the " - " delimiter regardless of request-id length, collapsing
all requests to a route back to a single process_type.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 28, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.15%. Comparing base (d1db07a) to head (9663568).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4658      +/-   ##
==========================================
+ Coverage   60.00%   60.15%   +0.15%     
==========================================
  Files         395      410      +15     
  Lines       43747    44348     +601     
  Branches    12580    12580              
==========================================
+ Hits        26249    26679     +430     
- Misses      17420    17590     +170     
- Partials       78       79       +1     
Flag Coverage Δ
electric-telemetry 71.38% <100.00%> (?)
elixir 71.38% <100.00%> (?)
packages/agents 72.64% <ø> (ø)
packages/agents-mcp 77.70% <ø> (ø)
packages/agents-mobile 80.67% <ø> (ø)
packages/agents-runtime 83.73% <ø> (+0.01%) ⬆️
packages/agents-server 75.47% <ø> (ø)
packages/agents-server-ui 8.32% <ø> (ø)
packages/electric-ax 51.06% <ø> (ø)
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 91.71% <ø> (-0.12%) ⬇️
packages/y-electric 56.05% <ø> (ø)
typescript 60.00% <ø> (+<0.01%) ⬆️
unit-tests 60.15% <100.00%> (+0.15%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@robacourt robacourt self-assigned this Jun 28, 2026
@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown

Claude Code Review

Summary

This PR fixes a real, unbounded process_type cardinality leak in ElectricTelemetry.Processes.parse_binary_label/1: when a proxy or load balancer supplies an x-request-id of a non-20-byte length (e.g. a 36-char UUID), the fixed-width binary match fell through to the generic truncation branch, leaking a per-request id fragment into process_type and adding a permanent new series for events like vm.monitor.long_gc/long_schedule. The fix is correct, well-targeted, and well-tested.

What's Working Well

  • Accurate root-cause analysis. The <<_req_id::binary-20, " - ", rest::binary>> clause only matched Plug.RequestId's self-generated 20-byte ids; longer/shorter ids fell through to min(20, ...) truncation, which is exactly where per-id cardinality leaked in. The diagnosis matches the code.
  • Performance-conscious design. Keeping the fixed-width fast path for the common case and only falling back to :binary.split/2 for non-default lengths avoids a split allocation on the hot path while still bounding cardinality in all cases.
  • Robust delimiter handling. Splitting on the first " - " correctly handles UUIDs containing internal hyphens (e.g. 123e4567-...-426614174000), since the request id never contains the literal " - " (space-hyphen-space) sequence, and parse_request_label/1 then cuts the query at the first ?. Verified against the label format produced by Electric.Plug.LabelProcessPlug.process_label/1.
  • Good refactor + safe fallback. Extracting parse_request_label/1 and truncate_label/1 removes duplication, and the _ -> truncate_label(label) branch correctly handles a malformed "Request " label with no delimiter (:binary.split returns a single-element list).
  • Changeset present (@core/electric-telemetry, patch), and the three added tests cover the UUID-length, short-id, and no-delimiter cases. The cross-reference comment in LabelProcessPlug ("do not forget to update ... parse_binary_label") is honored.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

  • Test realism note (informational, no change required): Plug.RequestId only reuses an incoming x-request-id when it passes its validity check (byte size in 20..200); shorter incoming ids are rejected and a fresh 20-byte id is generated. So the "Request abc123 - GET /v1/health" (6-byte id) case shouldn't occur in production behind Plug.RequestId. The test is still valuable as a robustness/regression guard and the code handles it correctly — just worth knowing the real motivating case is the longer-than-20-byte id (UUID/trace id), which the other test covers.
  • Pre-existing, minor: parse_request_label/1 returns a sub-binary slice of the (small) process label via :binary.part/3, which retains a reference to the parent label binary. This behavior is unchanged from the original code and the parent is tiny, so there's no meaningful retention concern — flagging only for completeness given the repo's binary-retention guidance.

Issue Conformance

No linked issue, but the PR description is unusually thorough and clearly documents the problem, the mechanism, and the fix. The implementation matches the described solution exactly, with no scope creep. (Per project convention, a linked issue would be nice-to-have, but the self-contained description more than compensates here.)

Previous Review Status

N/A — first review of this PR.


Review iteration: 1 | 2026-06-28

Comment on lines +279 to +280
_ ->
truncate_label(label)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we should have a fixed label, say "Unparseable Request" and to log a warning with the full label, to prevent an unbounded process type in this scenario

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant