perf(precompute): jemalloc allocator + Arc<str> group-key interning (2.2× on KLL window-close) by zzylol · Pull Request #381 · ProjectASAP/ASAPQuery

zzylol · 2026-05-28T20:47:12Z

What

Two CPU optimizations for the precompute engine hot path, found by profiling the tumbling-window KLL sketch workload (perf -F499, 4000 groups). Builds on the window-close move (#379, now in main).

1. jemalloc global allocator (tikv-jemallocator, enabled through the default jemalloc feature, registered with #[global_allocator] in lib.rs). The hot path was dominated by malloc/free churn (43% of CPU) from per-window sketch buffers; jemalloc's per-thread arenas and size-class caching recycle those buffers, dropping allocator cost to ~7%.
2. Arc<str> group-key interning. group_states is now HashMap<u64, HashMap<Arc<str>, GroupState>>, so the per-sample/per-batch lookup borrows by &str with zero allocation, and the key string is allocated once per group. This also removes a redundant per-batch hashmap lookup in process_group_samples.
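
The borrow-by-&str lookup described above can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: the GroupState fields, the record_sample helper, and the name outer_id are hypothetical, and the PR text does not say what the outer u64 key represents. The point it shows is that Arc<str> implements Borrow<str>, so a map keyed by Arc<str> can be probed with a plain &str and an owned key is built only when a group is first seen.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for the engine's per-group state.
#[derive(Debug, Default)]
struct GroupState {
    samples: u64,
}

/// Same shape as the map described in the PR text. What the outer `u64`
/// identifies is not stated there, so it is just called `outer_id` here.
type GroupStates = HashMap<u64, HashMap<Arc<str>, GroupState>>;

/// Hypothetical helper: count one sample for `group_key` under `outer_id`
/// and return the updated sample count for that group.
fn record_sample(states: &mut GroupStates, outer_id: u64, group_key: &str) -> u64 {
    let groups = states.entry(outer_id).or_default();

    // Hot path: `Arc<str>: Borrow<str>`, so `get_mut` hashes and compares the
    // borrowed `&str` directly and no owned key is constructed.
    if let Some(state) = groups.get_mut(group_key) {
        state.samples += 1;
        return state.samples;
    }

    // Cold path: first sample for this group under this outer key. The key
    // string is copied into a new `Arc<str>` here and nowhere else.
    let state = groups.entry(Arc::from(group_key)).or_default();
    state.samples += 1;
    state.samples
}
```

Calling record_sample(&mut states, 1, "host=a") twice allocates the key on the first call only; the second call takes the get_mut branch. In this sketch the key is allocated once per group per outer key; sharing one Arc<str> across outer keys would need a separate interner, which is not shown.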

Why

Profiling showed allocator churn was the #1 CPU cost. jemalloc is the single biggest lever; the key interning removes the remaining per-sample String allocation/clone on the lookup path.

Measured

High cardinality, 4000 groups, 2.4M window-closes, mean of 3 reps, relative to the pre-#379 clone + system-malloc baseline:

Variant	Wall	Throughput	Speedup
clone + system malloc	7.95s	301k closes/s	1.00×
move (#379) + system malloc	6.33s	380k closes/s	1.26×
move + jemalloc	3.71s	646k closes/s	2.14×
move + jemalloc + `Arc<str>`	3.61s	665k closes/s	2.20×
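
As a cross-check on how the table's columns relate, the sketch below recomputes throughput and speedup from the wall times. It assumes throughput is window-closes divided by wall time and speedup is baseline wall time divided by variant wall time; the function and constant names are illustrative. Because 2.4M is itself a rounded figure, the recomputed throughputs should only be expected to agree with the table to within roughly 1k closes/s.

```rust
/// Window-closes per run, as stated above (a rounded figure).
const WINDOW_CLOSES: f64 = 2.4e6;

/// (variant, wall seconds) pairs copied from the table.
const RUNS: [(&str, f64); 4] = [
    ("clone + system malloc", 7.95),
    ("move (#379) + system malloc", 6.33),
    ("move + jemalloc", 3.71),
    ("move + jemalloc + Arc<str>", 3.61),
];

/// Assumed definition: closes per second = total closes / wall time.
fn throughput(closes: f64, wall_secs: f64) -> f64 {
    closes / wall_secs
}

/// Assumed definition: speedup = baseline wall time / variant wall time.
fn speedup(baseline_secs: f64, variant_secs: f64) -> f64 {
    baseline_secs / variant_secs
}

/// Format one line per variant, using the first row as the baseline.
fn report() -> Vec<String> {
    let baseline = RUNS[0].1;
    RUNS.iter()
        .map(|(name, wall)| {
            format!(
                "{name}: {:.0}k closes/s, {:.2}x",
                throughput(WINDOW_CLOSES, *wall) / 1e3,
                speedup(baseline, *wall)
            )
        })
        .collect()
}
```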

Testing

All 473 lib tests pass (cargo test --release -p query_engine_rust --lib).
Production binary (precompute_engine) builds with jemalloc.

Not in scope (follow-ups)

fmod/ceil in sketchlib_kll_update (~17%) — algorithmic KLL level-math in the sketch library.
Keyed aggregations (CMS / HydraKLL / MultipleSum) allocate a KeyByLabelValues per sample in extract_aggregated_key_from_series; the single-subpopulation KLL workload doesn't exercise this.

🤖 Generated with Claude Code

…rning

Two CPU optimizations for the precompute engine hot path, motivated by profiling the tumbling-window KLL sketch workload (perf -F499, 4000 groups).

1. jemalloc global allocator (tikv-jemallocator, default `jemalloc` feature). The hot path was dominated by malloc/free churn (43% of CPU) from per-window sketch buffers. jemalloc's per-thread arenas / size-class caching recycle those buffers, dropping allocator cost to ~7%.

2. Arc<str> group-key interning. group_states is now nested HashMap<u64, HashMap<Arc<str>, GroupState>>, so the per-sample/per-batch lookup borrows by &str with zero allocation; the key string is allocated once per group. Also removes a redundant per-batch hashmap lookup in process_group_samples.

Measured (4000 groups, 2.4M window-closes, mean of 3 reps), relative to the pre-PR clone+system-malloc baseline:

clone + system malloc	7.95s	301k closes/s	1.00x
move (into_accumulator)	6.33s	380k closes/s	1.26x
move + jemalloc	3.71s	646k closes/s	2.14x
move + jemalloc + Arc<str>	3.61s	665k closes/s	2.20x

All 473 lib tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

zzylol requested a review from milindsrivastava1997 May 28, 2026 20:50

zzylol force-pushed the perf/jemalloc-and-key-interning branch from e2258ca to 7affa97 Compare May 28, 2026 21:01

milindsrivastava1997 approved these changes May 28, 2026

milindsrivastava1997 merged commit b0b9057 into main May 28, 2026
8 checks passed

milindsrivastava1997 deleted the perf/jemalloc-and-key-interning branch May 28, 2026 23:30

milindsrivastava1997 merged 1 commit into main from perf/jemalloc-and-key-interning
