perf(precompute): jemalloc allocator + Arc<str> group-key interning (2.2× on KLL window-close) #381

Merged: milindsrivastava1997 merged 1 commit into main from perf/jemalloc-and-key-interning on May 28, 2026

zzylol (Contributor) commented May 28, 2026

What

Two CPU optimizations for the precompute engine hot path, found by profiling the tumbling-window KLL sketch workload (perf -F499, 4000 groups). Builds on the window-close move (#379, now in main).

  1. jemalloc global allocator (tikv-jemallocator, default jemalloc feature, #[global_allocator] in lib.rs). The hot path was dominated by malloc/free churn (43% of CPU) from per-window sketch buffers; jemalloc's per-thread arenas / size-class caching recycle them, dropping allocator cost to ~7%.
  2. Arc<str> group-key interning. group_states is now HashMap<u64, HashMap<Arc<str>, GroupState>>, so the per-sample/per-batch lookup borrows by &str with zero allocation; the key string is allocated once per group. Also removes a redundant per-batch hashmap lookup in process_group_samples.
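A minimal sketch of the interning pattern from item 2, not the engine's actual code. `GroupState` is a stand-in with a single counter, `record_sample` is a hypothetical helper, and the meaning of the outer `u64` key is not stated in this PR, so it is treated as an opaque id. The sketch relies on `Arc<str>: Borrow<str>`, which lets a `HashMap<Arc<str>, _>` be probed with a plain `&str`.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for the engine's per-group state.
#[derive(Default, Debug)]
struct GroupState {
    sample_count: u64,
}

/// Same shape as the map described above. The outer `u64` is treated as an
/// opaque id (assumption); the inner key is the interned group key.
type GroupStates = HashMap<u64, HashMap<Arc<str>, GroupState>>;

/// Records one sample for `group_key` and returns the group's sample count.
fn record_sample(states: &mut GroupStates, outer_id: u64, group_key: &str) -> u64 {
    let groups = states.entry(outer_id).or_default();

    // Fast path: `Arc<str>: Borrow<str>`, so the map can be probed with the
    // borrowed `&str` directly. No key is built, so nothing is allocated.
    if let Some(state) = groups.get_mut(group_key) {
        state.sample_count += 1;
        return state.sample_count;
    }

    // Slow path, taken once per group: allocate the interned key and insert.
    let state = groups.entry(Arc::from(group_key)).or_default();
    state.sample_count += 1;
    state.sample_count
}
```

Calling `record_sample(&mut states, 1, "host=a")` repeatedly allocates the key only on the first call for that group; later calls take the borrowed fast path.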

Why

Profiling showed allocator churn was the #1 CPU cost. jemalloc is the single biggest lever; the key interning removes the remaining per-sample String allocation/clone on the lookup path.
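To make the per-sample allocation concrete, here is a small self-contained sketch that counts allocations on the lookup path. It is an illustration only: `CountingAlloc`, `allocs_during`, `lookup_owned`, and `lookup_interned` are hypothetical names, and the allocator wraps `std::alloc::System` rather than jemalloc so that it needs no external crate. It does use the same `#[global_allocator]` hook that a jemalloc allocator is installed through.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;
use std::collections::HashMap;
use std::sync::Arc;

thread_local! {
    // Per-thread allocation counter; const-initialized and without a
    // destructor, so touching it from inside the allocator is cheap.
    static ALLOCS: Cell<u64> = const { Cell::new(0) };
}

/// Forwards to the system allocator and counts allocations per thread.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.with(|c| c.set(c.get() + 1));
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Number of allocations the current thread performs while running `f`.
fn allocs_during(f: impl FnOnce()) -> u64 {
    let before = ALLOCS.with(|c| c.get());
    f();
    ALLOCS.with(|c| c.get()) - before
}

/// String-keyed lookup: builds an owned `String` for every probe.
fn lookup_owned(map: &mut HashMap<String, u64>, key: &str) {
    *map.entry(key.to_string()).or_insert(0) += 1;
}

/// Interned lookup: probes by `&str`; allocates only when the group is new.
fn lookup_interned(map: &mut HashMap<Arc<str>, u64>, key: &str) {
    if let Some(v) = map.get_mut(key) {
        *v += 1;
        return;
    }
    map.insert(Arc::from(key), 1);
}
```

After each map has seen a key once, wrapping a loop of repeated lookups in `allocs_during` should report zero allocations for `lookup_interned` and a nonzero count for `lookup_owned`, which is the difference the interning change targets.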

Measured

High cardinality, 4000 groups, 2.4M window-closes, mean of 3 reps, relative to the pre-#379 clone + system-malloc baseline:

| Variant | Wall | Throughput | Speedup |
| --- | --- | --- | --- |
| clone + system malloc | 7.95s | 301k closes/s | 1.00× |
| move (#379) + system malloc | 6.33s | 380k closes/s | 1.26× |
| move + jemalloc | 3.71s | 646k closes/s | 2.14× |
| move + jemalloc + Arc<str> | 3.61s | 665k closes/s | 2.20× |
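The columns above are related by simple arithmetic, sketched here as a sanity check. The function names are hypothetical. `implied_speedup` is a back-of-the-envelope estimate that assumes the non-allocator work is unchanged when the allocator's CPU share drops from 43% to ~7%; the PR does not state which variant the profile was taken on, so treat that comparison as indicative only.

```rust
/// Window closes in the benchmark run (the PR quotes a rounded "2.4M").
const WINDOW_CLOSES: f64 = 2.4e6;

/// Throughput in thousands of window closes per second for a given wall time.
fn throughput_kcloses_per_s(wall_s: f64) -> f64 {
    WINDOW_CLOSES / wall_s / 1_000.0
}

/// Speedup of a variant relative to the baseline wall time.
fn speedup(baseline_wall_s: f64, variant_wall_s: f64) -> f64 {
    baseline_wall_s / variant_wall_s
}

/// Estimated speedup when one component's share of CPU time drops from
/// `share_before` to `share_after` and all other work takes the same time.
/// With other work N: T_before = N / (1 - share_before) and
/// T_after = N / (1 - share_after), so the ratio is as below.
fn implied_speedup(share_before: f64, share_after: f64) -> f64 {
    (1.0 - share_after) / (1.0 - share_before)
}
```

`speedup(7.95, 3.61)` is about 2.20, matching the last row. The throughput figures agree with `throughput_kcloses_per_s` only to within about 1k closes/s, presumably because 2.4M is itself rounded. `implied_speedup(0.43, 0.07)` is about 1.63, in the same range as the measured jemalloc step of 6.33s / 3.71s, roughly 1.71.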

Testing

  • All 473 lib tests pass (cargo test --release -p query_engine_rust --lib).
  • Production binary (precompute_engine) builds with jemalloc.

Not in scope (follow-ups)

  • fmod/ceil in sketchlib_kll_update (~17%) — algorithmic KLL level-math in the sketch library.
  • Keyed aggregations (CMS / HydraKLL / MultipleSum) allocate a KeyByLabelValues per sample in extract_aggregated_key_from_series; the single-subpopulation KLL workload doesn't exercise this.

🤖 Generated with Claude Code

…rning

Two CPU optimizations for the precompute engine hot path, motivated by
profiling the tumbling-window KLL sketch workload (perf -F499, 4000 groups).

1. jemalloc global allocator (tikv-jemallocator, default `jemalloc` feature).
   The hot path was dominated by malloc/free churn (43% of CPU) from
   per-window sketch buffers. jemalloc's per-thread arenas / size-class
   caching recycle those buffers, dropping allocator cost to ~7%.

2. Arc<str> group-key interning. group_states is now nested
   HashMap<u64, HashMap<Arc<str>, GroupState>>, so the per-sample/per-batch
   lookup borrows by &str with zero allocation; the key string is allocated
   once per group. Also removes a redundant per-batch hashmap lookup in
   process_group_samples.

Measured (4000 groups, 2.4M window-closes, mean of 3 reps), relative to the
pre-PR clone+system-malloc baseline:
  clone + system malloc        7.95s  301k closes/s  1.00x
  move (into_accumulator)      6.33s  380k closes/s  1.26x
  move + jemalloc              3.71s  646k closes/s  2.14x
  move + jemalloc + Arc<str>   3.61s  665k closes/s  2.20x

All 473 lib tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
zzylol force-pushed the perf/jemalloc-and-key-interning branch from e2258ca to 7affa97 on May 28, 2026 21:01
milindsrivastava1997 merged commit b0b9057 into main on May 28, 2026
8 checks passed
milindsrivastava1997 deleted the perf/jemalloc-and-key-interning branch on May 28, 2026 23:30