A lock-free, per-thread, four-tier pool allocator spanning 1 B to multi-GiB —
bucketed small objects, dedicated mid chunks, munmap-backed large blocks, and a
multi-region huge tier for requests above 32 MiB, all freed through one uniform path. Born for multi-threaded STM-style
transactional workloads but usable as a general-purpose drop-in new /
delete (or C malloc) replacement on macOS, Linux, and Windows (MinGW + MSVC).
Carved out of the KAME measurement framework
and dual-licensed under Apache 2.0 OR GPL-2.0-or-later at your choice — so
it can be embedded into permissive / proprietary projects (Apache 2.0 path) or
linked into GPLv2-only projects such as KAME itself (GPL path).
Its sibling library
kamestm — the
lock-free software transactional memory this allocator was born to serve —
is maintained the same way (dual-licensed, TLA+/GenMC verified) and shares
the atomic_smart_ptr.h lock-free smart-pointer header that ships with
this library.
- Lock-free fast path — TLS freelist pop/push, no atomics on the hot path.
Hot-path TLS is
initial-exec(no__tls_get_addrthunk). - Four allocation tiers, one
free— every pointer is resolved in O(1) by a 2-level radix tree, sofree/realloc/malloc_usable_sizework uniformly across all tiers:- Buckets (1 B .. 32 KiB): 51 size classes. The ALIGN ≥ 1024 tiers are full-usable (per-slot metadata in a chunk-header side array, not a borrow header), so power-of-2 page requests (4/8/16/32 KiB) get an exact, page-aligned slot with 0 % round-up.
- Dedicated chunks (32 KiB .. 4 MiB): one N-unit chunk from a 32 MiB pool region.
- Large mmap (4 MiB .. 32 MiB): one 32-MiB-aligned
mmapper allocation, served warm from the recycle cache,munmap'd on free when not recycled. - Huge mmap (> 32 MiB): a multi-region
mmapper allocation, of which only the head 32-MiB radix slot is registered — safe because the allocation's sole valid pointer resolves to that slot and the tail slots are never standalone lookup targets (and the OS keeps the whole span mapped, so no other allocation can claim a tail slot's VA).munmap'd on free; bypasses the recycle cache (its log index tops out at 32 MiB, so a cached huge block could over-satisfy a smaller huge request and pin its RSS — the reason libc / jemalloc / mimalloc don't pool their huge class). Plain libcmallocis reached only if themmapitself fails.
- Two-level recycle cache for the dedicated-chunk and large-
mmaptiers (32 KiB .. 32 MiB; the > 32 MiB huge tier bypasses it): a per-thread L1 (no atomics, ping-pong absorbed) in front of a global lock-free L2 log-slot cache (no working-set cliff, byte-capped). Reuses warm, resident blocks — skips themmap/madvise/re-fault cost that makes every other allocator fall off a cliff above ~64 KiB. - Aligned allocation served from the pool —
posix_memalign/aligned_alloc/operator new(align_val_t)up to 4 KiB alignment route to the matching bucket (slots are inherently ALIGN-aligned); larger alignments use the 256 KiB-aligned dedicated/mmap tiers. No_aligned_freepairing — freed by the ordinaryfree. - Per-thread DLL chunks — no global allocator lock, no contention until the chunk-claim slow path; cross-thread frees via a holding batch + bit-clear coalescing.
- Lock-free orphan-chunk reclaim (
atomic_shared_ptr) — a chunk left non-empty by an exited owner thread (its slots still draining via cross-thread frees) is pushed onto a per-template lock-free orphan chain instead of being stranded: another thread later adopts it (reuses its free slots) and a sweep pass reclaims it once drained, so churned worker threads don't grow reserved memory. The chain is built on KAME'satomic_smart_ptr(the same lock-free reference-counted smart pointer as the STM) — that refcount is what makes adopt-vs-reclaim safe without hazard pointers, and a re-owned chunk holds a self-referential owner-ref so a concurrent sweeper can't free it mid-reuse. Replaces an earlier ABA-tagged Treiber stack that leaked drained orphans. - Standards-conformant OOM — throwing
operator newruns the installedstd::new_handlerloop then throwsstd::bad_alloc; nothrow / C-API paths returnnullptr+errno = ENOMEM. Nostd::terminateacross the noexcept C boundary. - Bounded VA + prompt RSS — two independent runtime caps:
kame_pool_set_max_bytes()(fresh-region mmap ceiling) andkame_pool_set_large_cache_cap()(the large-recycle cache's resident footprint, split ~half global L2 / ~half aggregate per-thread L1).madvise(MADV_FREE/DONTNEED)on chunk release — default also at thread exit, toggle viakame_pool_set_thread_exit_reclaim(). - Coexists with a foreign allocator on every OS — the pool intercepts
free/deletethe native way per platform so a pointer it allocated is pool-freed no matter which module frees it: ELF strong symbols (Linux), Mach-O__DATA,__interpose(macOS), and a runtime free-family IAT redirect on Windows (§31) — the PE/COFF analogue, scoped to the deallocation family, so KAME allocations use the pool while Qt / CRT allocations stay on the CRT heap and frees are reconciled wherever they happen. - Drop-in
mallocreplacement (macOS dylib, default-on) — when built as a standalone dynamic library (-DKAMEPOOLALLOC_DYLIB) on macOS, the pool also interposes the fullmallocfamily by default, so it works as a realDYLD_INSERT_LIBRARIES=libkamepoolalloc.dylibdrop-in like mimalloc / jemalloc — everymalloc/free/reallocin the host process routes through the pool. Amalloc_sizeco-interpose returns the true capacity of pool pointers, which is what makes this Swift- and Objective-C-safe (the Swift runtime's__StringStorageand ObjC class realization querymalloc_size; an allocator that lies here corrupts them). Soak-validated against Foundation, libswiftCore, CPython, QtCore, and the C++ STL. Opt out with-DKAMEPOOLALLOC_CONSERVATIVE_INTERCEPTfor thefree/realloc-only behaviour. Linux/Windows dylibs keep this behind the explicit-DKAMEPOOLALLOC_FULL_INTERCEPTopt-in pending their own soak. The inline kame.app build (anMH_EXECUTEimage) is unaffected either way — dyld honours__interposeonly fromMH_DYLIB, so only the strong-symboloperator new/deleteoverride is live there, as before. - Verified — TSAN race-free, UBSAN clean (incl.
vptr), ASan clean; the chunk-claim / chunk-recycle protocol is TLA+ model-checked and the large-recycle cache's exclusive-ownership / no-premature-release (UAF / double-free) safety is GenMC (RC11) model-checked. The orphan-chunk reclaim/adopt chain is TLA+ model-checked too (tests/tlaplus/OrphanChain_*.tla, run byrun_orphan_chain.sh) — its owner-free-vs-concurrent-sweeper-pin race is a model-only catch (runtime stress can't reproduce it), kept as a standing regression guard. Builds 64-bit and 32-bit, on macOS / Linux / Windows (MinGW + MSVC).
Production-stable in KAME since 2008 (the lock-free pool joined the
KAME measurement framework — itself in 24/7 research-lab operation since
2002 — as the system allocator that year, and has been the production
allocator on every release since). The Phase 5 / §15–§30 work (2025 – 2026) added the
buddy chunk allocator, the full-usable page-aligned bucket tiers, the
dedicated / large-mmap / huge (> 32 MiB) tiers with a two-level recycle cache,
pool-routed aligned allocation, and standards-conformant OOM.
Targets: macOS and Windows (64-bit) for KAME itself; the standalone library
also builds and is tested on Linux (64-bit and 32-bit). Requires a host with
mmap (or VirtualAlloc) and threads — not an MCU / bare-metal allocator.
Per-toolchain status of the live pool:
| Toolchain | Live pool | Notes |
|---|---|---|
| macOS clang (x86-64 / arm64) | ✅ default | __DATA,__interpose free redirect; dylib build also interposes the full malloc family + malloc_size by default (Swift/ObjC-safe drop-in), opt out with -DKAMEPOOLALLOC_CONSERVATIVE_INTERCEPT; primary target |
| Linux gcc/clang (x86-64, 32-bit) | ✅ default | strong-symbol free redirect; primary CI target |
| Windows MinGW64 + lld | ✅ default | §31 free-family IAT redirect lets the pool coexist with Qt / libc++ (PE/COFF has no cross-module operator new interposition); verified on-target with Qt 6.10.1 |
| Windows MSVC (cl) | ✅ default | runs the full live pool — the _MSC_VER shim in allocator_prv.h bridges the GCC-isms (_Interlocked* atomics, <intrin.h> bit-scan / overflow, static-init constructor hook) and the §31 redirect handles Qt/CRT coexistence; opt OUT with KAME_DISABLE_POOL_MSVC |
Tight alloc/free loop, one slot at a time (tests/bench/bench_loop.c —
malloc(N) → write p[0] → free(p) repeated), median of repeated runs.
kame is LD_PRELOAD'd against the same binary as the others; all are
default-Release builds (no -flto / -march=native — mimalloc and jemalloc
ship the same way). Multi-thread numbers run 4 independent bench_loop
processes in parallel and sum the per-process rates (true intra-process MT
is measured separately by bench_xthread). All tables below are
reproducible via tests/bench/bench_compare.sh (see that script's
--help). The x86-64 reference machine is Ohtaka (ISSP supercomputer,
AMD EPYC bare metal — single-tenant, srun --exclusive); cloud VM numbers
were intentionally dropped because shared-tenant scheduling jitter on a
single allocator could swing 3× between sessions and made any absolute
comparison unreliable.
The four charts above are plotted from the tables that follow
(regenerate with tests/bench/plots/make_plots.py). The story is the
flat red curve: no mmap-per-call cliff at the ≥ 1 MiB tier, and
near-linear scaling to 128 cores.
Apple MacBook Air M3 (arm64, macOS), single thread, M ops/s — kamepoolalloc at
8f2e6980 (word-cache default ON), median of 7 runs via bench_compare.sh
(mimalloc 3.3.2 + jemalloc 5.3.0 via Homebrew). Both M3 tables below were
measured on this commit:
| size | system | mimalloc | jemalloc | kame |
|---|---|---|---|---|
| 64B | 112 | 544 | 133 | 739 |
| 1 KiB | 89 | 485 | 128 | 500 |
| 16 KiB | 95 | 208 | 75 | 573 |
| 64 KiB | 25 | 206 | 19 | 154 |
| 256 KiB | 25 | 202 | 19 | 142 |
| 1 MiB | 25 | 7 | 19 | 130 |
| 4 MiB | 45 | 6 | 19 | 119 |
Apple MacBook Air M3, 4 processes aggregate, M ops/s — median of 7 runs:
| size | system | mimalloc | jemalloc | kame |
|---|---|---|---|---|
| 64B | 420 | 1879 | 477 | 2548 |
| 16 KiB | 337 | 768 | 269 | 2091 |
| 64 KiB | 90 | 781 | 66 | 552 |
| 1 MiB | 91 | 21 | 67 | 493 |
kame now leads every band except 64–256 KiB. The 64 B bucket leads
mimalloc 3.3 — 739 vs 544 M ops/s single-thread, 2548 vs 1879 across 4
processes — after the deallocate / operator new hot/cold splits (lean
always_inline FS=true fast path, no prologue spill), the KameTlsPage
fast-TSD unification (§hot-tls, a single mrs TPIDRRO_EL0 instead of
per-variable _tlv_get_addr), and the §word-cache cold path below. 1 KiB
flipped to a kame lead (500 vs 485; was 428 vs 447). At 16 KiB kame
leads decisively — 573 vs 208 single-thread, 2091 vs 768 4-process — helped
by the void / tail-call FS=false free path. At the 1 MiB large tier kame
leads at every thread count: system is lock-serialised, mimalloc collapses to
~7 M single-thread / ~21 M aggregate, while kame stays memory-warm at
130 / 493 M ops/s. The one band where mimalloc still leads is the
64–256 KiB §19 large tier (kame ~150 vs ~205): there the per-free radix
lookup plus recycle-cache bookkeeping cost more than mimalloc's segment
scheme — an already-known weaker tier, not a regression.
Since 8a4d2622 the owner-side recovery of small fixed-size slots runs
through a claimed-word mask instead of per-slot work: when the owner
freelist runs dry, ONE bitmap CAS claims every zero bit of a word straight
into the chunk's own cells [1] (the mask) / [2] (the word base) — unused
under FS=true and on the same cache line as the freelist head the pop just
touched — and each subsequent lean-cold alloc serves a slot with a ctz +
clear-lowest + one multiply-add. No per-slot next-pointer store, no
dependent block-line load, CAS amortised 1/64; the only bit return is the
thread-exit drain. Hot paths (freelist pop/push) are byte-identical to the
opt-out build (-DKAME_FS_WORDCACHE=0).
Measured deltas on M3 (default ON vs opt-out, interleaved medians of 5):
bench_loop 1 KiB / 16 KiB and STM 3level_mixed N=128 (K=10 and K=0)
unchanged; 64 B +25 % (part of which is the known ±10 % build-to-build
layout swing); SPSC producer/consumer ring +29 % at 4 slots in flight and
+74 % at 32; bench_xthread free-side +10 %. Cascade Lake-SP shows no
STM regression. Zen 2 (Ohtaka) numbers pending machine maintenance — by
construction the design carries none of the per-slot stores that capped the
earlier dependency-cut experiments at ~300 M ops/s on that core.
An independent run of the contention-heavy multi-thread benches from mimalloc-bench (M3, median of 5, mimalloc 3.3.2 / jemalloc 5.3.0). kame is the FS=true word-cache build.
| bench (M ops/s, ↑) | system | mimalloc | jemalloc | kame |
|---|---|---|---|---|
| larson | 21.4 | 108.3 | 108.6 | 113.0 |
| xmalloc-test | 150 | 245 | 118 | 210 |
| rptest | 2.86 | 3.95 | 3.97 | 2.75 |
kame leads larson (the alloc-on-one-thread / free-on-another pattern that
mirrors the STM Snapshot handoff — kame's design target) and clears system /
jemalloc on xmalloc-test. On rptest's mixed cross-thread churn the
segment-based allocators (mimalloc / jemalloc) still lead, and kame sits with
system malloc — the same contention tier flagged as a weaker spot on x86.
Intel Xeon E5-1630 v4 (x86-64, 4-core / 8-thread, Windows), single thread,
M ops/s — kame at 8f2e6980 (post hot/cold-split + 32 KiB bucket LUT +
word-cache), llvm-mingw clang 17, median of 5 via bench_compare.sh. On
Windows the kame column is the direct pool route (bench_loop_pool →
kame_pool_malloc): PE/COFF has no LD_PRELOAD, and llvm-mingw resolves the
plain malloc statically (no IAT entry for the §31 redirect to patch), so —
unlike the macOS/Linux tables — this measures the pool core only, not the
malloc-override layer. mimalloc/jemalloc are not standard on Windows
(columns omitted):
| size | system | kame |
|---|---|---|
| 64B | 37 | 301 |
| 1 KiB | 40 | 223 |
| 16 KiB | 39 | 191 |
| 64 KiB | 9 | 71 |
| 256 KiB | 10 | 68 |
| 1 MiB | ~0 | 64 |
| 4 MiB | ~0 | 35 |
Same host, 4 processes aggregate, M ops/s — median of 5:
| size | system | kame |
|---|---|---|
| 64B | 144 | 1055 |
| 16 KiB | 134 | 687 |
| 64 KiB | 37 | 238 |
| 1 MiB | ~1 | 231 |
The system ~0 at ≥ 1 MiB is the Windows UCRT per-call VirtualAlloc /
VirtualFree cliff (the same shape mimalloc / jemalloc hit at this tier on
Linux): a tight 1 MiB malloc/free loop drops below 0.5 M ops/s, while
kame's two-level recycle cache stays memory-warm at 35–231 M ops/s. The
16 KiB jump (90 → 191 single-thread, 302 → 687 4-process vs the prior
84f97566 numbers) is the full-range bucket LUT + word-cache landing; no
size regressed.
Ohtaka (ISSP supercomputer — AMD EPYC 7702, 128-core / 8-NUMA-node, Linux
4 KiB pages, THP=always), bench_compare.sh, srun --exclusive — kame at
efbe6dcb (allocator code 5e127eb5) for the 1T, 4-process and 128-process
tables below, all measured on the same i8cpu node c15u01n1 (the
alloc_tune_report table further down was not re-run and remains at the
earlier 0e9413a6). mimalloc/jemalloc versions same as the competitive tables:
Architecture note — memory renaming: kamepoolalloc's LIFO freelist
creates a tight store-then-load chain on every alloc/free pair
(free: *p = head; head = p → next malloc: head = *head). On
CPUs with memory renaming — Intel Sunny Cove (Ice Lake, 2019) and
AMD Zen 3 (Milan, 2020) onward, plus Apple A12+ / M-series — the
rename stage forwards the just-stored value to the load via the
register file, dispatch-free. AMD Zen 2 (Ohtaka's EPYC 7702)
predates this optimization and pays store-to-load forwarding
latency (~5 cycles) per hot iteration instead. This is the
structural reason the Ohtaka margins over mimalloc / jemalloc are
smaller than the M3 figures — most visible at the 64 B tier where
the freelist hot path dominates (Ohtaka kame 260 vs mi 331, M3 kame
651 vs mi 503 — a 50 % flip in relative position). Larger tiers
where the freelist is not the bottleneck (16 KiB, 1 MiB+) are not
affected and kame retains its lead on both CPUs. The 64 B gap is
expected to narrow on Zen 3 / Zen 4 (Milan, Genoa, Bergamo) and on
Sapphire Rapids / Granite Rapids Intel parts; Cascade Lake-SP cloud
VMs (e.g. AWS C5 / GCP N2) inherit the same Skylake-derived non-
renaming microarchitecture and behave like Zen 2 in this respect.
| size | system | mimalloc | jemalloc | kame |
|---|---|---|---|---|
| 64B | 212 | 331 | 182 | 260 |
| 1.0KB | 214 | 203 | 160 | 236 |
| 16KB | 69 | 93 | 45 | 168 |
| 64KB | 65 | 95 | 4 | 82 |
| 256KB | 70 | 95 | 4 | 87 |
| 1.0MB | 71 | 5 | 4 | 84 |
| 4.0MB | 67 | 4 | 4 | 62 |
| size | system | mimalloc | jemalloc | kame |
|---|---|---|---|---|
| 64B | 845 | 1261 | 727 | 1045 |
| 16KB | 274 | 381 | 177 | 666 |
| 64KB | 259 | 371 | 18 | 321 |
| 1.0MB | 277 | 20 | 17 | 335 |
| size | system | mimalloc | jemalloc | kame |
|---|---|---|---|---|
| 64B | 22634 | 39572 | 21280 | 29044 |
| 16KB | 7920 | 11808 | 5282 | 21496 |
| 64KB | 8128 | 11866 | 465 | 10352 |
| 1.0MB | 8908 | 566 | 475 | 10703 |
At 16 KiB kame now leads decisively at every thread count — 168 vs 93 M
single-thread, 666 vs 381 at 4 processes, 21496 vs 11808 at 128 (1.8×) —
after the bucket_for_size LUT extension to the full 32 KiB bucketed range
(9b65ce42) plus the force-inlined size→bucket fold (0e43ee82). At
1 MiB / 128 processes kame reaches 10703 M ops/s (19× ahead of
mimalloc 566 M): the two-level recycle cache keeps all 128 cores at
memory-warm speed while mimalloc stalls at its per-call mmap path. At
1 KiB single-thread kame widens its lead over mimalloc (236 vs 203 M, up
from 221) after 5e127eb5 restricted the new_redirected large-branch
unlikely-hint to Apple targets — on linux-x86/clang the hint had cost 1 KiB
−15% and 64 B −8% (same-node interleaved A/B) for a +23% gain only at
16 KiB, which the LUT already dominates. mimalloc still leads at 64 B on
x86-64 — 331 vs 260 M single-thread, 39572 vs 29044 at 128T: unlike
arm64/M3, where the deallocate / operator new hot/cold splits put kame
ahead at 64 B, on x86-64 mimalloc's thread-local bump allocator keeps the
edge on the smallest bucket.
Ohtaka, srun --exclusive, single-binary self-validation via
tests/alloc_tune_report — kame at 0e9413a6:
Aggregate M ops/s (alloc → touch byte 0 → free loop) at 1 / 4 / 16 / 64 /
128 concurrent threads:
| size,tier | 1T | 4T | 16T | 64T | 128T |
|---|---|---|---|---|---|
| 64 B (bucket) | 121 | 482 | 1887 | 6356 | 10596 |
| 1 KiB (bucket) | 95 | 389 | 1543 | 5713 | 7831 |
| 64 KiB (chunk) | 42 | 27 | 116 | 195 | 292 |
| 1 MiB (chunk) | 44 | 31 | 81 | 132 | 543 |
| 8 MiB (large_va) | 27 | 3 | 3 | 3 | 3 |
| 40 MiB (large_va, cached) | 13 | 3 | 3 | 2 | 2 |
The bucket tier reaches 10.6 G ops/s aggregate for 64 B at 128 cores
(88× linear) — vindicating the per-thread DLL + per-thread freelist +
per-thread KameTlsPage design on a heavily-NUMA host. The 1 MiB chunk tier
peaks at 543 M ops/s at 128T — the per-thread L1 recycle cache absorbs
the churn without inter-core contention. The large_va tier (8 MiB+) plateaus
because the benchmark touches byte 0 each cycle, which serialises across
cores in the Linux process-wide mmap_lock page-fault path; real KAME
workloads write the whole buffer right after alloc (memcpy of acquisition
data, FFT in-place) so the one-page fault is amortised inside the buffer
write, not paid per op.
The same alloc_tune_report run self-validated the defaults on this hardware:
LRC_LAZY_INTERVAL_NS = 10 ms: 0.63 % per-thread wallclock pressure — printed "default 10 ms is fine; auto-tune kept it (raise-only)".LRC_K_MAX = 256: matched to 128 cores — printed "default is appropriate".MADV_HUGEPAGE: ineffective on this kernel (sameµs/pageas plain first-touch) becauseTHP=alwaysis already auto-promoting 4 KiB pages to 2 MiB at the kernel level.- TLB shootdown: 22.01× worst-case at 128 threads (1543 µs / 32 MiB
munmap) — bounded, not catastrophic; the warm cache absorbs most syscalls in practice.
kame leads at every size from 64 B through 1 MiB, single- and multi-threaded.
The cliff at 256 KiB is the headline: mimalloc drops from 107 to 3 M ops/s,
jemalloc from 86 to 9 M ops/s — both revert to per-call mmap/munmap
(≈ 3–9 M ops/s flat), while kame's per-thread recycle cache keeps it at
115–135 M ops/s through 1 MiB — a 13–40× lead over the others at this
tier. The system allocator (glibc ptmalloc) is also flat at large sizes
in this particular tight-loop pattern because ptmalloc adaptively raises
its M_MMAP_THRESHOLD when it observes repeated large alloc/free pairs;
in production workloads with mixed sizes, that threshold is not permanently
raised and large allocs revert to per-call mmap, closing the same cliff.
The multi-thread picture is the same shape: kame scales through 4 threads
at every tier while jemalloc collapses at 64 KiB (34 M ops/s) and mimalloc
at 1 MiB (12 M ops/s). These are touch-first-byte benchmarks —
malloc(N), write p[0], free(p) — so only one page fault is issued per
op regardless of N. Workloads that write the whole buffer converge on
memory bandwidth and the per-op allocator cost is amortised.
The point is not the micro-benchmark peak but the flat curve: kame has no
size cliff and no per-thread working-set cliff, so a real mixed workload (small
objects + large arrays/waveforms — KAME's own profile) never hits the
mmap-per-large-alloc wall the others do.
Ohtaka — competitive comparison, mimalloc-bench suite (sys = glibc,
mi3 = mimalloc 3, mi = mimalloc, je = jemalloc, tc = tcmalloc).
Measured on the same EPYC / 128-core / 8-NUMA node as above
(kernel 4.18.0-372.9.1.el8.x86_64, glibc 2.28, clang 18.1.8); kame at 9ecc613d,
mimalloc-bench at 941c265d (allocator versions are whatever that
revision of mimalloc-bench builds via its build-bench-env.sh).
Bench invocations: xmalloc-test -w N -t 5; mstress N 50 50;
rptest N 0 1 2 500 1000 100 8 16000 (random size 8..16000 B, cross-thread
free rate 2, 500 loops, 1000 allocs, 100 ops per iter):
xmalloc-test — cross-thread handoff, 64 B objects, M frees/s (higher=better):
One allocator thread, one free-er thread — the STM Snapshot handoff pattern where a measurement thread allocates a Payload and the UI thread releases it.
| workers | sys | mi3 | mi | je | tc | kame |
|---|---|---|---|---|---|---|
| 1 | 3.1 | 25.7 | 22.6 | 5.1 | 2.4 | 2.9 |
| 4 | 8.1 | 48.5 | 66.0 | 15.5 | 1.0 | 8.8 |
| 16 | 29.0 | 119 | 169 | 63.0 | 0.82 | 35.8 |
| 64 | 15.5 | 206 | 262 | 157 | 0.64 | 60.8 |
| 128 | 7.6 | 201 | 176 | 159 | 0.60 | 47.3 |
At ≥ 4 workers kame pulls ahead of glibc; at 64–128 workers it leads glibc 4–6× (glibc's tcache reclaim serialises) while tcmalloc collapses entirely. kame trails mi / mi3 / je — those are purpose-built for this workload — but the relevant comparison for KAME deployments is against the system allocator.
mstress — random object migration across threads, wall time s (lower=better):
Allocations are passed between threads at random — analogous to STM Transactions that write a new Payload on one core and the old one is released on another.
| threads | sys | mi3 | mi | je | tc | kame |
|---|---|---|---|---|---|---|
| 1 | 0.081 | 0.052 | 0.053 | 0.057 | 0.055 | 0.083 |
| 4 | 0.557 | 0.432 | 0.419 | 0.786 | 0.536 | 0.620 |
| 16 | 2.42 | 1.83 | 1.84 | 2.56 | 2.03 | 2.62 |
| 64 | 7.56 | 3.67 | 3.61 | 8.03 | 6.02 | 6.22 |
| 128 | 13.7 | 5.72 | 4.96 | 14.2 | 9.42 | 9.91 |
kame beats glibc at 64–128 threads (the range KAME actually runs at) and matches it at 1 thread (0.083 vs 0.081 s — margin within run-to-run noise); it trails glibc at 4–16 threads where glibc's per-thread arena is simpler than kame's DLL adoption. kame also beats jemalloc at 64–128T. mi / mi3 lead throughout.
rptest — realistic mixed workload, 8..16000 B random sizes,
cross-thread free, long-lived threads, M ops/s (higher=better) — kame at 9ecc613d:
Generally regarded as the most production-representative scenario in the mimalloc-bench suite — long-lived worker threads (not the worker-respawn pattern of larson), random size distribution spanning small to medium objects, and a fraction of frees crossing thread boundaries.
| threads | sys | mi3 | mi | je | tc | kame |
|---|---|---|---|---|---|---|
| 1 | 10.9 | 16.7 | 14.8 | 16.3 | 18.7 | 11.5 |
| 4 | 2.90 | 7.81 | 9.97 | 6.46 | 6.41 | 3.96 |
| 16 | 2.27 | 4.46 | 6.42 | 2.78 | 2.12 | 2.31 |
| 64 | 0.48 | 1.68 | 2.68 | 0.77 | 0.24 | 0.74 |
| 128 | 0.31 | 1.32 | 2.23 | 0.45 | 0.05 | 0.32 |
kame beats glibc at every thread count and pulls ahead of tcmalloc at 64–128 threads where tc collapses to 0.05–0.24 M. At 64T kame (0.74 M) leads glibc (0.48 M) by 54 % and beats jemalloc (0.77 M) closely, even while mi3 / mi lead overall. mi / mi3 / je lead — purpose-built modern allocators in their target regime — but the relevant comparison for KAME remains the system allocator, and on a realistic mixed workload kame holds the line throughout.
cd kamepoolalloc
qmake kamepoolalloc.pro
make
# produces libkamepoolalloc.dylib (macOS) / libkamepoolalloc.so (Linux)cd tests
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make
ctest --output-on-failurectest runs the C-API conformance test. The repo also ships manual
correctness / perf drivers built alongside it: alloc_stress_test
(adversarial multi-thread, sentinel-checked), alloc_minimal_bench
(single-size hot / fifo loops), and alloc_bucket34_repro. Sanitizer
coverage (TSAN / UBSAN / ASan) is obtained by configuring the build with
the matching -fsanitize= flags.
- MinGW64 + lld (the validated Windows toolchain): builds with the live
pool on by default. The qmake test path inline-compiles
allocator.cppinto each test binary (PE/COFF won't let a test bind to the DLL'soperator new), so the test actually exercises the pool; the §31 free-family redirect installs from the auto-activator to reconcile the cross-module frees. - MSVC (cl): runs the full live pool by default — the
_MSC_VERshim inallocator_prv.hprovides the_Interlocked*atomics, the<intrin.h>bit-scan /_mul_overflowmappings, andNOMINMAX. Opt OUT withKAME_DISABLE_POOL_MSVCto force thestd::allocatorfallback (e.g. when bisecting an allocator-suspected issue). - Env knobs (Windows live pool):
KAME_POOL_WIN_REDIRECT=0disables the §31 free redirect (restores the historical unreconciled behaviour);KAME_POOL_VERBOSE=1prints the patched-slot count at activation.
#include "allocator.h"
int main() {
KamePooledAllocGuard guard; // static-link only — see note below
auto *p = new char[256]; // routed to the pool's 256-B bucket
delete[] p; // back to the per-thread freelist
}KamePooledAllocGuard is only needed when allocator.cpp is compiled
into the binary or a static library (the qmake / inline-compiled
path). When linked as a shared library (-DKAMEPOOLALLOC_DYLIB,
the cmake test scaffold and any standalone packaging), the dylib
auto-activates from a __attribute__((constructor(101))) and the guard
class compiles to an empty stub — write KamePooledAllocGuard guard;
either way for portability, or omit it entirely in dylib-only builds.
Pre-main() allocations (dyld init, static ctors) always stay on
libsystem malloc regardless of mode. For static builds the guard
MUST be the first statement in main() for the pool to be active
throughout the program.
#include "allocator.h"
int main() {
KamePooledAllocGuard guard;
// Cap at 128 MiB: pool will not mmap more than 4 regions
// (32 MiB each, rounded up). Beyond the cap, allocations
// fall back to libsystem std::malloc.
kame_pool_set_max_bytes(128 * 1024 * 1024);
// ... your code ...
fprintf(stderr, "pool reserved: %zu KiB\n",
kame_pool_reserved_bytes() / 1024);
}Pass 0 to disable the cap (default = compile-time ceiling: 100 GiB
on 64-bit, 3 GiB on 32-bit).
When you cannot interpose new/delete (static link, sandbox, FFI),
include <kame_pool.h> and call the pool directly. Pure C linkage,
fully reentrant; pre-activation calls transparently fall through to libc.
#include <kame_pool.h>
void *p = kame_pool_malloc(64);
void *q = kame_pool_aligned_alloc(4096, 1 << 20); // 1 MiB, page-aligned
size_t cap = kame_pool_malloc_usable_size(p); // bucket-rounded size
kame_pool_free(q);
kame_pool_free(p);
kame_pool_stats_t st = { .version = KAME_POOL_STATS_VERSION };
kame_pool_get_stats(&st); // regions / live chunks / claimed unitsQuick index of every public symbol in <kame_pool.h>. All are
pure C linkage, noexcept (KAMEPOOLALLOC_NOEXCEPT), thread-safe, and
transparently fall through to libc on pre-activation / post-teardown.
See the header for full per-function semantics.
Allocation (libc-equivalent surface):
| Symbol | Purpose |
|---|---|
void *kame_pool_malloc(size_t) |
malloc; routes to the bucket/chunk/large-va/huge tier by size |
void *kame_pool_calloc(size_t n, size_t sz) |
calloc; mmap'd tiers come zeroed for free |
void *kame_pool_realloc(void *, size_t) |
realloc; resize-in-place when the bucket fits, else move |
void kame_pool_free(void *) |
free; libc-foreign pointers transparently pass through |
void *kame_pool_aligned_alloc(size_t align, size_t sz) |
C11 aligned_alloc |
int kame_pool_posix_memalign(void **, size_t align, size_t sz) |
POSIX-style aligned malloc |
size_t kame_pool_malloc_usable_size(const void *) |
bucket-rounded usable size |
Runtime caps:
| Symbol | Default | Purpose |
|---|---|---|
void kame_pool_set_max_bytes(size_t) |
100 GiB / 3 GiB 32-bit | upper bound on reserved (region VA) |
size_t kame_pool_get_max_bytes(void) |
— | current cap |
size_t kame_pool_reserved_bytes(void) |
— | bytes of 32-MiB regions currently mapped |
void kame_pool_set_large_cache_cap(size_t) |
≈ 2 GiB total | LRC_MMAP/CHUNK recycle-cache total cap |
size_t kame_pool_get_large_cache_cap(void) |
— | current cap |
Background maintenance (silenceable for realtime work):
| Symbol | Default | Purpose |
|---|---|---|
void kame_pool_set_lazy_drain_interval_ms(unsigned) |
10 ms (auto-tuned at startup) | §28.1 lazy drain interval — bigger = fewer munmap ticks |
unsigned kame_pool_get_lazy_drain_interval_ms(void) |
— | current interval |
void kame_pool_set_thread_exit_reclaim(int) |
on | §21 madvise(MADV_DONTNEED) at worker exit |
void kame_pool_set_realtime_mode(int) |
off | §30 one-shot preset: silences all three of the above |
Observability:
| Symbol | Purpose |
|---|---|
void kame_pool_get_stats(kame_pool_stats_t *) |
snapshot of regions/units/chunks/cache/tier counters; versioned struct (KAME_POOL_STATS_VERSION) |
atomic_smart_ptr.h (an installed public header) provides a lock-free
atomic_shared_ptr<T> (atomic, CAS-able shared owner), local_shared_ptr<T>
(thread-local owner), and local_weak_ptr<T>. It underpins
KAME's STM (kamestm) and the
pool's own orphan-chunk reclaim chain, and is usable on its own. A technique
deep-dive (local + global refcount, the intrusive atomic_countable path, and
a comparison against the libstdc++ / MSVC / libc++ std::atomic<shared_ptr>
implementations) lives in
kamestm/README.md § Lock-free atomic shared pointer.
The control-block layout is chosen at compile time from ref_traits<T>, driven
by an opt-in marker you inherit on T — no other wiring:
Inherit on T |
Alloc | weak_ptr |
Construct with | Use for |
|---|---|---|---|---|
| (nothing — default) | 2× | yes | local_shared_ptr<T>(new T(…)) |
anything |
atomic_emplaced |
1× | yes | make_local_shared<T>(args…) |
weakable + hot |
atomic_strictrefonly |
2× | no | local_shared_ptr<T>(new T(…)) |
small / cold |
atomic_countable |
1× (intrusive) | no | local_shared_ptr<T>(new T(…)) |
hottest |
atomic_weakable is a back-compat alias for atomic_emplaced.
#include "atomic_smart_ptr.h" // on the include path via find_package(kamepoolalloc)
struct Plain { int x; }; // default mode
local_shared_ptr<Plain> a(new Plain{1});
atomic_shared_ptr<Plain> shared; shared.swap(a); // atomic / CAS-able
struct Hot : atomic_emplaced { int x; }; // 1 allocation; weak_ptr OK
auto h = make_local_shared<Hot>(); // emplaced T: NOT local_shared_ptr<Hot>(new Hot)Self-referential intrusive node — a lock-free list/DLL node that embeds an
atomic_shared_ptr<T> link is incomplete at first use, so the marker cannot be
auto-detected. Opt in explicitly (before the first use of the pointer) and give
T the intrusive contract:
template <…> struct force_intrusive_ref<MyNode<…>> : std::true_type {};
struct MyNode {
typedef uintptr_t Refcnt;
atomic<Refcnt> refcnt{1};
atomic_shared_ptr<MyNode> next; // the self-reference
// optional: void atomic_intrusive_dispose() noexcept { … } (else: delete)
};Circular / incomplete template-id member — local_shared_ptr<T> resolves
its control-block category (ref_traits<T>) at member-declaration time, since
the method signatures reference the chosen Ref (unlike std::shared_ptr, which
type-erases the deleter at construction). The sizeof(T) marker probe then
instantiates T. A plain incomplete class (struct N;) soft-fails to the
non-intrusive fallback, but a not-yet-instantiated class template-id —
local_shared_ptr<Holder<A>> as a member of A, where Holder<A> transitively
needs A complete — makes the probe force Holder<A>'s instantiation, and the
failure is a hard error, not soft SFINAE. Opt out of the probe (the type then
takes the default non-intrusive control block, local_weak_ptr<T> still works):
template <class X> struct force_incomplete_ref<Holder<X>> : std::true_type {};
struct A { local_shared_ptr<Holder<A>> m; }; // now compiles, like std::shared_ptrFull trait reference: the USAGE header block + ref_traits /
force_intrusive_ref / force_incomplete_ref in atomic_smart_ptr.h. Working
self-referential examples: tests/atomic_intrusive_dispose_test.cpp and
tests/atomic_intrusive_chain_test.cpp.
Most consumers don't need to touch these — the defaults are picked for a few-
to a few-hundred-core machine. Override via -D… at compile time.
Each LrcKArray is one cache line (or more) on its own — pushes from
different threads to the same idx land on different cache lines as long as
their kame_owner_id() & (LRC_K_MAX − 1) start positions differ. K therefore
sets the upper bound on concurrent pushers to one size class that won't
collide on a cache line.
The default LRC_K_MAX = 256 is the conservative choice for many-core NUMA
servers. Smaller K reduces the static slot-array memory AND the cold pop scan
length (each pop scans up to K cache lines for a fit); larger K spreads
contention further. K must be a power of two.
| Use case | LRC_K_MAX | Static memory (global L2) | Pop cold scan worst |
|---|---|---|---|
| Desktop / few cores | 32 | 10 KiB | 32 lines |
| Single-socket server (~64 cores) | 64 | 20 KiB | 64 lines |
| Multi-socket NUMA (~256 cores) — default | 256 | 80 KiB | 256 lines |
| Huge NUMA (≥ 512 cores or many domains) | 512 | 160 KiB | 512 lines |
K-major (current) over N-major: N-major would compact a band into one cache line (1-line pop scan) but force every concurrent pusher of that size onto a SHARED line — catastrophic cache-line bouncing across NUMA nodes. K-major's per-K cache-line independence keeps inter-NUMA coherence traffic minimal, which is the dominant cost on a 256-core node. Trade-off: pop cold scan is ~K cache-line loads (≈ K × ~50 ns NUMA-remote / ~5 ns hot-cache).
The cache covers [LRC_LO, LRC_HI] at 4 indices per octave (= ~19% per
step). LRC_LO = ALLOC_MIN_CHUNK_SIZE = 256 KiB; LRC_HI = LRC_LO << (LRC_N_MAX / 4). Default LRC_N_MAX = 32 ⇒ LRC_HI = 64 MiB. Sizes above
LRC_HI bypass the cache (the index space would otherwise saturate at the
top slot and over-satisfy smaller huge requests — see §27). Must be a
multiple of 4.
| LRC_N_MAX | LRC_HI | Use case |
|---|---|---|
| 24 | 16 MiB | Constrained RSS; never reuse > 16 MiB |
| 32 (default) | 64 MiB | General — covers typical "large buffer" sizes |
| 40 | 256 MiB | Image / FFT / NN workloads with large buffer reuse |
| 48 | 1 GiB | Massive buffer reuse (HPC) |
Raising LRC_N_MAX also adds 4 more idx slots × LRC_K_MAX (sizeof(atomic) × roundup(...)) per octave to the static slot array.
L1 is per-thread (TLS), no atomics, no false-sharing concern. LRC_K_L1 = 32
(fixed), LRC_N_MAX_L1 = 24 covers idx up to 16 MiB. Per-thread footprint
= LRC_K_L1 × (LRC_N_MAX_L1 + 1) × 8 B ≈ 6.4 KiB. Raise for workloads with
many distinct per-thread sizes.
Hardcoded at 10 ms. Each LRC_MMAP push past this interval evicts one slot
to keep the steady-state cache from growing unboundedly. On platforms where
munmap() is expensive (e.g. very many-core systems with TLB-shootdown
cost), raising this to 100 ms (or higher) trades drain rate for fewer
munmap syscalls. Not currently a -D… knob; trivial to expose if
measurement shows it matters.
The cmake test suite includes a runnable diagnostic that measures the
host's mmap / munmap / madvise costs, sweeps multi-thread munmap
(to expose TLB-shootdown scaling), benchmarks kamepoolalloc throughput at
every tier, and emits a recommendation block. Build + run:
cmake --build <build-dir> --target alloc_tune_report
./<build-dir>/alloc_tune_report [seconds-per-bench, default 2]It is NOT registered as a ctest (long-running, machine-specific output);
intended to be run once per target and the output captured for tuning
discussions. Recommends concrete -D… rebuild flags when it detects
that the defaults are inappropriate (e.g. munmap so expensive that
LRC_LAZY_INTERVAL_NS should be raised, or LRC_K_MAX mis-sized for the
host's core count).
- TLB shootdown on
munmap/madvise(MADV_DONTNEED)scales with core count. The warm cache absorbs most of these — push/pop hit avoids the syscall. - Pool regions (the 32 MiB units holding bucket-tier + dedicated-chunk allocations) are mmap'd push-only — once warmed up there is no munmap on them. The munmap cost is paid only by the §19/§27 large-tier path on cache miss/eviction.
madvise(MADV_HUGEPAGE)is NOT currently requested. If THP is enabled on the system, the kernel may transparently coalesce 2 MiB huge pages anyway; explicitMADV_HUGEPAGEcould reduce page-table footprint and TLB-shootdown cost. Not yet measured.
Dual-licensed under your choice of EITHER:
- Apache License, Version 2.0 — see LICENSE-APACHE-2.0. Best for embedding into permissive / proprietary projects.
- GNU GPL, version 2 of the License, or (at your option) any later version — see LICENSE-GPL-2.0. Best for linking into GPLv2-only projects such as KAME itself.
Pick whichever license suits your downstream project; see LICENSE for the full dual-grant statement.
Copyright (C) 2008-2026 Kentaro Kitagawa <kitag@issp.u-tokyo.ac.jp>, The University of Tokyo, ISSP.
Both license grants explicitly require preservation of the copyright notice and the choice-of-license clause when redistributing this software, in source or binary form.
See design/ for the structural invariant catalogue
(INVARIANTS.md — the blast-radius map for safe
edits) and the §-chapter → subsystem → code navigation map
(SUBSYSTEMS.md). For deeper rationale see
git log kamepoolalloc/allocator.cpp and the §-numbered design notes
(CHUNK_CLAIM_TLA_NOTES.md,
LARGE_RECYCLE_DESIGN.md).
- Runtime memory caps (
kame_pool_set_max_bytesregion ceiling +kame_pool_set_large_cache_caprecycle-cache footprint) - Apache-2.0 relicense
- Explicit C API (
kame_pool_malloc/_free/_calloc/_realloc/_aligned_alloc/_posix_memalignfamily) —<kame_pool.h> -
kame_pool_malloc_usable_size()public - Statistics API (
kame_pool_get_stats— regions, live chunks, units) - Aligned-alloc served from the pool (§17)
- User-installable OOM handler (
std::new_handlerloop +bad_alloc) (§18) - Large-tier
munmapon free — VA returned to the OS (§19) - TSAN / UBSAN (incl.
vptr) / ASan clean; chunk-claim/recycle TLA+ model-checked; large-recycle cache ownership/release GenMC (RC11)-checked (tests/cds/cds_lrc_ownership.c) - 32-bit verified; 16 KiB-page (Apple Silicon) / 64 KiB-page (POWER) page-multiple slot layout (§16)
- Windows live pool, default-on for both MinGW64 + lld and MSVC —
free-family IAT redirect for Qt / libc++ coexistence (§31),
_MSC_VERGCC-ism shim; opt out viaKAME_DISABLE_POOL_MSVC - Experimental contrib adaptors —
std::pmr::memory_resource(covers every PMR-aware container + Boost.Container / Folly / abseil),rclcpp::Allocatorfor ROS 2 RT nodes, andEigen::aligned_allocator- style over-aligned allocator for SIMD/cacheline-aligned buffers — all header-only, all concept-conformance tested viactest. Seecontrib/(also includes an OpenCVcv::MatAllocatorpaste recipe)
- Static-buffer mode (
kame_pool_init(buf, len)) — nommapdependency (toward MCU / bare-metal use) - Multiple heap instances (per-subsystem isolation)
- Unified
KAME_POOL_*env / config surface (hugepages, prewarm, caps) -
fork()safety (malloc_disable/_enable) - musl / uclibc, RISC-V verification
Carved out of KAME and generalised from "KAME-specific small-object pool" to a
four-tier general allocator. Selected milestones (full history in git log):
| § | subject |
|---|---|
| 5l–5u | buddy 32 MiB regions, multi-unit chunks, bucket layout, runtime cap, Apache-2.0 |
| 13 | O(1) p → chunk via 2-level radix tree (retires the O(N) region scan) |
| 13.2/13.3 | per-region metadata in unit 0; push-only region list (retire cap-sized globals) |
| 14 | NUMA-aware claim; opt-in hugepages; kame_pool_get_stats |
| 15 | forward-shift — every chunk's slot region starts on a 256 KiB boundary |
| 16 | full-usable ALIGN ≥ 1024 tiers — 0 % round-up at power-of-2 page sizes |
| 17 | pool-routed posix_memalign / aligned_alloc / operator new(align_val_t) |
| 18 | standards-conformant OOM (new_handler + bad_alloc; nothrow / errno) |
| 19 | large-mmap tier (4–32 MiB) with real munmap on free |
| 20 | fix cross-thread-free vptr-after-release UB (UBSAN) |
| 21–23 | thread-exit reclaim default; recycle cache; IE-TLS hot-path slots |
| 24 | slow_allocate scans DLL freelists across chunks (multi-chunk working sets) |
| 25 | global lock-free log-slot recycle cache (no working-set cliff) |
| 26 | per-thread L1 recycle log + global L2 — MT scaling without the cliff |
| 27 | serve > 32 MiB from the pool (multi-region mmap; huge class bypasses the cache) |
| 28 | K-line recycle cache (K-major, 1:1 size rounding); lazy drain + auto-tune; sharded stats |
| 29 | FS=true freelist pre-fill at chunk claim (cold-path bitmap scan → O(1) pop; auto-prewarm) |
| 30 | kame_pool_set_realtime_mode() — one-call silence of all background maintenance |
| 31 | Windows free-family IAT redirect — pool coexists with Qt / libc++ on PE/COFF |
| 36 / S7 | lock-free orphan-chunk reclaim — atomic_shared_ptr chain (owner-exit push, sweep-reclaim, adopt + chunk self-ref owner-ref) replaces the leak-prone ABA Treiber stack; TLA+-verified, now the default |
Designed and implemented by Kentaro Kitagawa. Phase 4 / Phase 5 refactor work co-authored with Anthropic Claude (Opus 4.5–4.7) under explicit direction; all algorithmic and performance decisions reviewed and verified by the original author.