perf: use jemalloc for make_examples in the runtime image #1087
make_examples spends a large share of time in allocation-heavy pileup and local-realignment work. Preloading jemalloc (vs glibc malloc) measurably reduces its wall-clock; it has no measurable effect on call_variants (TF inference), so the LD_PRELOAD is scoped to the make_examples-family wrappers only. Installed via the distro libjemalloc2 package; the bare soname (libjemalloc.so.2) keeps it architecture-portable.
… pangenome images The runtime image already preloads jemalloc for its make_examples wrappers, but the DeepTrio, DeepSomatic, and pangenome-aware images build FROM ubuntu:22.04 rather than from that image, so they inherited neither the libjemalloc2 install nor the preload and ran make_examples on glibc malloc. Their make_examples does the same allocation-heavy pileup/realignment work, so they benefit from the same preload. This applies the identical pattern to each of the three Dockerfiles: install libjemalloc2 and prepend LD_PRELOAD=libjemalloc.so.2 to that image's make_examples wrapper only (deeptrio/make_examples, make_examples_somatic, make_examples_pangenome_aware_dv). call_variants and the other wrappers are left untouched, matching the runtime image.
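The scoping described above (preload for one command only, every other process untouched) can be sketched in Python. This is an analogue for illustration, not the PR's code: the PR sets `LD_PRELOAD` in the Dockerfile's shell wrappers, and the helper names `with_preload` and `run_with_preload` are hypothetical.

```python
import os
import subprocess


def with_preload(env, lib="libjemalloc.so.2"):
    """Return a copy of env with lib prepended to LD_PRELOAD.

    The caller's mapping is not modified, so only a child process
    started with the returned environment sees the preload.
    """
    child_env = dict(env)
    existing = child_env.get("LD_PRELOAD", "")
    # The dynamic linker accepts colon-separated LD_PRELOAD entries.
    child_env["LD_PRELOAD"] = f"{lib}:{existing}" if existing else lib
    return child_env


def run_with_preload(cmd, lib="libjemalloc.so.2"):
    """Run cmd (a list of arguments) with lib preloaded for that child only."""
    return subprocess.run(cmd, env=with_preload(os.environ, lib), check=False)
```

Using the bare soname rather than an absolute path leaves resolution to the dynamic linker's default search path, which is the portability point made in the PR description.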
|
Hi @tfenne , Thanks for the PR! Since I believe you're already familiar with our process, I'll go ahead and start the review. As a reminder, because of the way our project is set up, we aren't able to merge GitHub PRs directly. If the changes look good, I will commit them, crediting your GitHub username and referencing this PR in the commit description. Please let me know if you have any concerns with this approach. -pichuan |
|
Thanks @pichuan - understood about the contribution/merge process, and thanks for taking a look at this and my other PRs. |
|
I re-benchmarked this PR's changes vs. r1.10 through the standard docker build process, running the baseline and modified versions in the resulting docker containers on the chr20 short-read WGS set on a baseline runtime: 113.1 |
|
Hi Tim (@tfenne) and Pi-Chuan (@pichuan), Since
and then tune from there by playing with the cache size and cleanup as it relates to the number of threads for different sample types. Fortunately, It might be interesting to see how it compares to tcmalloc, as the thread lifespan under some runtime condition might favor that architecture. Again, lots of fun to tune through many options ;) Hope it helps, |
|
Hi @tfenne, I ran benchmarks on the changes from this PR using our internal case studies pipeline (running on The testing methodology and baseline codebase are the same as described in #1086 (comment). Output and Stage Observations:
Runtime summary:The results show consistent runtime improvements across the board for the Runtime Comparison:
|
| uid | stage | head937500229 (mean) | gh1087 (mean) | speedup (sec) | speedup (%) |
|---|---|---|---|---|---|
| wgs | make_examples | 11180.41s (3h 6m 20s) | 9806.62s (2h 43m 26s) | 1373.79s | 12.29% |
| wgs | total | 17412.56s (4h 50m 12s) | 16153.49s (4h 29m 13s) | 1259.07s | 7.23% |
| ont-r104 | make_examples | 13062.19s (3h 37m 42s) | 11527.95s (3h 12m 7s) | 1534.24s | 11.75% |
| ont-r104 | total | 25910.83s (7h 11m 50s) | 24457.72s (6h 47m 37s) | 1453.11s | 5.61% |
| pacbio | make_examples | 8615.63s (2h 23m 35s) | 7671.91s (2h 7m 51s) | 943.72s | 10.95% |
| pacbio | total | 15373.77s (4h 16m 13s) | 14407.26s (4h 0m 7s) | 966.51s | 6.29% |
| hybrid-pacbio-illumina | make_examples | 15513.45s (4h 18m 33s) | 14417.19s (4h 0m 17s) | 1096.26s | 7.07% |
| hybrid-pacbio-illumina | total | 39323.79s (10h 55m 23s) | 38229.74s (10h 37m 9s) | 1094.05s | 2.78% |
| rnaseq | make_examples | 1506.26s (25m 6s) | 1418.62s (23m 38s) | 87.64s | 5.82% |
| rnaseq | total | 1805.56s (30m 5s) | 1718.49s (28m 38s) | 87.07s | 4.82% |
| exome | make_examples | 481.59s (8m 1s) | 458.96s (7m 38s) | 22.63s | 4.70% |
| exome | total | 659.81s (10m 59s) | 637.41s (10m 37s) | 22.40s | 3.39% |
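For reference, the two speedup columns are consistent with the usual definition (seconds saved, and seconds saved as a fraction of the baseline mean). A minimal sketch, assuming that definition; the function name `speedup` is hypothetical and is not taken from the benchmarking pipeline:

```python
def speedup(baseline_s, candidate_s):
    """Return (seconds saved, percent saved relative to the baseline).

    Positive values mean the candidate is faster; negative values mean
    it is slower than the baseline.
    """
    saved = baseline_s - candidate_s
    return saved, 100.0 * saved / baseline_s


# pacbio make_examples means from the table above.
saved, pct = speedup(8615.63, 7671.91)
print(f"{saved:.2f}s saved ({pct:.2f}%)")
```

Because the percentage is relative to the baseline, the `total` rows always show a smaller percentage than the `make_examples` rows: nearly the same number of seconds is divided by a longer runtime.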
gh1087 Runtime Table
| uid | sample | stage | mean_runtime | std_runtime | n_trials | mean_hruntime |
|---|---|---|---|---|---|---|
| exome | HG003 | make_examples | 458.96 | 7.741 | 5 | 7m 38s |
| exome | HG003 | call_variants | 151.24 | 1.731 | 5 | 2m 31s |
| exome | HG003 | postprocess_variants | 27.21 | 0.177 | 5 | 27s |
| exome | HG003 | vcf_stats | 5.53 | 0.038 | 5 | 5s |
| exome | HG003 | total | 637.41 | 9.455 | 5 | 10m 37s |
| hybrid-pacbio-illumina | HG003 | make_examples | 14417.19 | 39.708 | 5 | 4h 0m 17s |
| hybrid-pacbio-illumina | HG003 | call_variants | 23515.27 | 29.91 | 5 | 6h 31m 55s |
| hybrid-pacbio-illumina | HG003 | postprocess_variants | 297.29 | 6.776 | 5 | 4m 57s |
| hybrid-pacbio-illumina | HG003 | vcf_stats | 245.6 | 2.75 | 5 | 4m 5s |
| hybrid-pacbio-illumina | HG003 | total | 38229.74 | 47.021 | 5 | 10h 37m 9s |
| ont-r104 | HG003 | make_examples | 11527.95 | 17.278 | 5 | 3h 12m 7s |
| ont-r104 | HG003 | call_variants | 11905.93 | 120.625 | 5 | 3h 18m 25s |
| ont-r104 | HG003 | postprocess_variants | 1023.84 | 8.715 | 5 | 17m 3s |
| ont-r104 | HG003 | vcf_stats | 359.27 | 1.067 | 5 | 5m 59s |
| ont-r104 | HG003 | total | 24457.72 | 127.981 | 5 | 6h 47m 37s |
| pacbio | HG003 | make_examples | 7671.91 | 22.629 | 5 | 2h 7m 51s |
| pacbio | HG003 | call_variants | 6204.24 | 7.886 | 5 | 1h 43m 24s |
| pacbio | HG003 | postprocess_variants | 531.11 | 8.841 | 5 | 8m 51s |
| pacbio | HG003 | vcf_stats | 282.95 | 1.985 | 5 | 4m 42s |
| pacbio | HG003 | total | 14407.26 | 23.523 | 5 | 4h 0m 7s |
| rnaseq | HG005 | make_examples | 1418.62 | 9.414 | 5 | 23m 38s |
| rnaseq | HG005 | call_variants | 94.95 | 0.119 | 5 | 1m 34s |
| rnaseq | HG005 | postprocess_variants | 204.92 | 0.65 | 5 | 3m 24s |
| rnaseq | HG005 | vcf_stats | 4.94 | 0.04 | 5 | 4s |
| rnaseq | HG005 | total | 1718.49 | 9.864 | 5 | 28m 38s |
| wgs | HG003 | make_examples | 9806.62 | 30.525 | 5 | 2h 43m 26s |
| wgs | HG003 | call_variants | 5935.29 | 192.789 | 5 | 1h 38m 55s |
| wgs | HG003 | postprocess_variants | 411.57 | 8.592 | 5 | 6m 51s |
| wgs | HG003 | vcf_stats | 254.71 | 1.315 | 5 | 4m 14s |
| wgs | HG003 | total | 16153.49 | 201.882 | 5 | 4h 29m 13s |
head937500229 Runtime Table
| uid | sample | stage | mean_runtime | std_runtime | n_trials | mean_hruntime |
|---|---|---|---|---|---|---|
| exome | HG003 | make_examples | 481.59 | 8.884 | 5 | 8m 1s |
| exome | HG003 | call_variants | 150.94 | 0.808 | 5 | 2m 30s |
| exome | HG003 | postprocess_variants | 27.28 | 0.182 | 5 | 27s |
| exome | HG003 | vcf_stats | 5.54 | 0.058 | 5 | 5s |
| exome | HG003 | total | 659.81 | 9.358 | 5 | 10m 59s |
| hybrid-pacbio-illumina | HG003 | make_examples | 15513.45 | 51.654 | 5 | 4h 18m 33s |
| hybrid-pacbio-illumina | HG003 | call_variants | 23520.08 | 29.808 | 5 | 6h 32m 0s |
| hybrid-pacbio-illumina | HG003 | postprocess_variants | 290.26 | 4.803 | 5 | 4m 50s |
| hybrid-pacbio-illumina | HG003 | vcf_stats | 243.47 | 2.018 | 5 | 4m 3s |
| hybrid-pacbio-illumina | HG003 | total | 39323.79 | 67.138 | 5 | 10h 55m 23s |
| ont-r104 | HG003 | make_examples | 13062.19 | 156.502 | 5 | 3h 37m 42s |
| ont-r104 | HG003 | call_variants | 11814.9 | 7.661 | 5 | 3h 16m 54s |
| ont-r104 | HG003 | postprocess_variants | 1033.73 | 8.6 | 5 | 17m 13s |
| ont-r104 | HG003 | vcf_stats | 361.09 | 3.496 | 5 | 6m 1s |
| ont-r104 | HG003 | total | 25910.83 | 158.26 | 5 | 7h 11m 50s |
| pacbio | HG003 | make_examples | 8615.63 | 155.868 | 5 | 2h 23m 35s |
| pacbio | HG003 | call_variants | 6237.02 | 32.756 | 5 | 1h 43m 57s |
| pacbio | HG003 | postprocess_variants | 521.13 | 8.723 | 5 | 8m 41s |
| pacbio | HG003 | vcf_stats | 280.33 | 1.387 | 5 | 4m 40s |
| pacbio | HG003 | total | 15373.77 | 182.451 | 5 | 4h 16m 13s |
| rnaseq | HG005 | make_examples | 1506.26 | 9.424 | 5 | 25m 6s |
| rnaseq | HG005 | call_variants | 94.85 | 0.066 | 5 | 1m 34s |
| rnaseq | HG005 | postprocess_variants | 204.45 | 1.424 | 5 | 3m 24s |
| rnaseq | HG005 | vcf_stats | 4.94 | 0.057 | 5 | 4s |
| rnaseq | HG005 | total | 1805.56 | 10.267 | 5 | 30m 5s |
| wgs | HG003 | make_examples | 11180.41 | 111.767 | 5 | 3h 6m 20s |
| wgs | HG003 | call_variants | 5825.96 | 5.848 | 5 | 1h 37m 5s |
| wgs | HG003 | postprocess_variants | 406.19 | 1.954 | 5 | 6m 46s |
| wgs | HG003 | vcf_stats | 252.8 | 1.256 | 5 | 4m 12s |
| wgs | HG003 | total | 17412.56 | 110.714 | 5 | 4h 50m 12s |
I have created an internal changelist for the team to review. Since it's the weekend, we'll review and submit internally next week.
Out of curiosity, I'll also try to test this on our regular n2-standard-96 setup to see how the speedups hold up there!
|
Hi @tfenne, As a side note (with no action required for this PR), we also have a lesser-known alternative code path: fast-pipeline. Because it runs as a native C++ binary rather than using python shell wrappers, it probably won't benefit from the (For context, the fast pipeline is a stream-based design we created to run make_examples and call_variants concurrently to reduce I/O overhead. It's still experimental and not yet a fully supported path.) |
|
Thanks @pichuan - I'm aware of the fast-pipeline, but didn't want to touch it as it seems like it's probably under much more active development. I'd actually be very interested in the fast-pipeline for CPU, to avoid writing intermediates to disk - if contributions in that area are welcome, and wouldn't trip over in-house work, I'd be interested in helping there too. |
PR 1087 (jemalloc) Runtime Report —
|
| uid | stage | head938924210 (mean) | gh1087 (mean) | speedup (sec) | speedup (%) |
|---|---|---|---|---|---|
| pacbio | make_examples | 2315.63s (38m 35s) | 1990.22s (33m 10s) | 325.41s | 14.05% |
| pacbio | total | 3927.05s (1h 5m 27s) | 3565.25s (59m 25s) | 361.80s | 9.21% |
| ont-r104 | make_examples | 3292.11s (54m 52s) | 2895.39s (48m 15s) | 396.72s | 12.05% |
| ont-r104 | total | 6137.74s (1h 42m 17s) | 5759.00s (1h 35m 59s) | 378.74s | 6.17% |
| wgs | make_examples | 2655.11s (44m 15s) | 2375.42s (39m 35s) | 279.69s | 10.53% |
| wgs | total | 4036.19s (1h 7m 16s) | 3763.43s (1h 2m 43s) | 272.76s | 6.76% |
| hybrid-pacbio-illumina | make_examples | 3404.10s (56m 44s) | 3278.77s (54m 38s) | 125.33s | 3.68% |
| hybrid-pacbio-illumina | total | 7772.25s (2h 9m 32s) | 7904.32s (2h 11m 44s) | -132.07s | -1.70%* |
| exome | make_examples | 190.24s (3m 10s) | 184.19s (3m 4s) | 6.05s | 3.18% |
| exome | total | 255.09s (4m 15s) | 248.78s (4m 8s) | 6.31s | 2.47% |
| rnaseq | make_examples | 436.98s (7m 16s) | 427.93s (7m 7s) | 9.05s | 2.07% |
| rnaseq | total | 527.70s (8m 47s) | 518.68s (8m 38s) | 9.02s | 1.71% |
Note
* The hybrid-pacbio-illumina total shows a slight regression, but this is due to high variance in the call_variants stage (std_dev of 450s for the baseline vs 143s for gh1087). The make_examples stage itself still shows a 3.68% improvement. The call_variants variance is unrelated to jemalloc since it is not preloaded for that stage.
Statistical Significance (Welch's t-test)
With only 5 trials and relatively high variance on n2-standard-96, it's important to check which improvements are statistically significant vs. run-to-run noise.
| uid | diff (sec) | diff (%) | t-stat | df | significant? |
|---|---|---|---|---|---|
| pacbio | 325.41s | 14.05% | 6.06 | 4.7 | ✅ YES (p<0.01) |
| ont-r104 | 396.72s | 12.05% | 14.73 | 7.3 | ✅ YES (p<0.01) |
| wgs | 279.69s | 10.53% | 4.88 | 8.0 | ✅ YES (p<0.01) |
| hybrid-pacbio-illumina | 125.33s | 3.68% | 1.94 | 7.9 | |
| exome | 6.05s | 3.18% | 1.18 | 4.2 | ❌ Not significant |
| rnaseq | 9.05s | 2.07% | 0.89 | 7.8 | ❌ Not significant |
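The t-stat and df columns can be recomputed from the summary statistics in the raw runtime tables. A minimal sketch, assuming `std_runtime` is the sample standard deviation and that df comes from the Welch-Satterthwaite approximation; the function name `welch_t` is hypothetical:

```python
import math


def welch_t(mean1, std1, n1, mean2, std2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Takes per-group mean, sample standard deviation, and trial count,
    so it works directly on the summary rows of the runtime tables.
    """
    v1 = std1 ** 2 / n1  # squared standard error of group 1
    v2 = std2 ** 2 / n2  # squared standard error of group 2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# pacbio make_examples on n2-standard-96: baseline row vs. gh1087 row.
t, df = welch_t(2315.63, 115.204, 5, 1990.22, 34.021, 5)
print(f"t = {t:.2f}, df = {df:.1f}")
```

With the pacbio rows this gives t ≈ 6.06 and df ≈ 4.7, matching the table. The low df reflects how unequal the two variances are: the baseline's standard deviation is more than three times that of gh1087.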
Important
On n2-standard-96, only 3 of 6 datasets (wgs, ont-r104, pacbio) show clearly significant make_examples speedups. The exome and rnaseq improvements are indistinguishable from noise at this sample size.
However, all 6 datasets show improvement in the same direction — if the effect were pure noise, we'd expect some positive and some negative. The c3d-standard-16 results (which have much tighter variance) confirm significance across all 6 datasets (all p<0.01).
Other Stages (Sanity Check)
Runtimes for call_variants, postprocess_variants, and vcf_stats remain virtually identical — most differences are well under 2%, consistent with run-to-run noise. The one outlier is hybrid-pacbio-illumina call_variants (-6.12%), which is explained by the high baseline std_dev (450s) — the baseline happened to have some faster trials in that stage.
Comparison with c3d-standard-16 results
The make_examples speedup percentages on n2-standard-96 are broadly consistent with the c3d-standard-16 results, though the n2 data has higher variance:
| uid | c3d-standard-16 speedup | c3d significant? | n2-standard-96 speedup | n2 significant? |
|---|---|---|---|---|
| pacbio | 10.95% | ✅ p<0.01 | 14.05% | ✅ p<0.01 |
| ont-r104 | 11.75% | ✅ p<0.01 | 12.05% | ✅ p<0.01 |
| wgs | 12.29% | ✅ p<0.01 | 10.53% | ✅ p<0.01 |
| hybrid-pacbio-illumina | 7.07% | ✅ p<0.01 | 3.68% | |
| rnaseq | 5.82% | ✅ p<0.01 | 2.07% | ❌ |
| exome | 4.70% | ✅ p<0.01 | 3.18% | ❌ |
Summary: The benefit is real and clearly significant for long-running workloads (wgs, pacbio, ont — saving 5–7 minutes each). For shorter workloads (exome, rnaseq), the n2 data alone can't confirm a benefit, though the c3d data (with much tighter variance) confirms it across all datasets.
gh1087-head938924210 Runtime Table
| uid | sample | stage | mean_runtime | std_runtime | n_trials | mean_hruntime |
|---|---|---|---|---|---|---|
| exome | HG003 | make_examples | 184.19 | 1.77 | 5 | 3m 4s |
| exome | HG003 | call_variants | 33.54 | 0.216 | 5 | 33s |
| exome | HG003 | postprocess_variants | 31.06 | 0.983 | 5 | 31s |
| exome | HG003 | vcf_stats | 6.16 | 0.072 | 5 | 6s |
| exome | HG003 | total | 248.78 | 2.193 | 5 | 4m 8s |
| hybrid-pacbio-illumina | HG003 | make_examples | 3278.77 | 107.917 | 5 | 54m 38s |
| hybrid-pacbio-illumina | HG003 | call_variants | 4386.22 | 143.326 | 5 | 1h 13m 6s |
| hybrid-pacbio-illumina | HG003 | postprocess_variants | 239.33 | 8.637 | 5 | 3m 59s |
| hybrid-pacbio-illumina | HG003 | vcf_stats | 310.08 | 6.04 | 5 | 5m 10s |
| hybrid-pacbio-illumina | HG003 | total | 7904.32 | 197.081 | 5 | 2h 11m 44s |
| ont-r104 | HG003 | make_examples | 2895.39 | 48.685 | 5 | 48m 15s |
| ont-r104 | HG003 | call_variants | 1970.55 | 36.64 | 5 | 32m 50s |
| ont-r104 | HG003 | postprocess_variants | 893.05 | 12.834 | 5 | 14m 53s |
| ont-r104 | HG003 | vcf_stats | 449.9 | 17.86 | 5 | 7m 29s |
| ont-r104 | HG003 | total | 5759.0 | 72.291 | 5 | 1h 35m 59s |
| pacbio | HG003 | make_examples | 1990.22 | 34.021 | 5 | 33m 10s |
| pacbio | HG003 | call_variants | 1134.72 | 20.244 | 5 | 18m 54s |
| pacbio | HG003 | postprocess_variants | 440.31 | 4.567 | 5 | 7m 20s |
| pacbio | HG003 | vcf_stats | 348.82 | 6.463 | 5 | 5m 48s |
| pacbio | HG003 | total | 3565.25 | 56.336 | 5 | 59m 25s |
| rnaseq | HG005 | make_examples | 427.93 | 14.606 | 5 | 7m 7s |
| rnaseq | HG005 | call_variants | 25.73 | 0.769 | 5 | 25s |
| rnaseq | HG005 | postprocess_variants | 65.01 | 0.535 | 5 | 1m 5s |
| rnaseq | HG005 | vcf_stats | 5.45 | 0.094 | 5 | 5s |
| rnaseq | HG005 | total | 518.68 | 14.794 | 5 | 8m 38s |
| wgs | HG003 | make_examples | 2375.42 | 89.542 | 5 | 39m 35s |
| wgs | HG003 | call_variants | 982.35 | 64.884 | 5 | 16m 22s |
| wgs | HG003 | postprocess_variants | 405.67 | 13.119 | 5 | 6m 45s |
| wgs | HG003 | vcf_stats | 316.51 | 5.106 | 5 | 5m 16s |
| wgs | HG003 | total | 3763.43 | 129.075 | 5 | 1h 2m 43s |
head938924210 Runtime Table
| uid | sample | stage | mean_runtime | std_runtime | n_trials | mean_hruntime |
|---|---|---|---|---|---|---|
| exome | HG003 | make_examples | 190.24 | 11.357 | 5 | 3m 10s |
| exome | HG003 | call_variants | 34.33 | 2.1 | 5 | 34s |
| exome | HG003 | postprocess_variants | 30.52 | 0.091 | 5 | 30s |
| exome | HG003 | vcf_stats | 6.17 | 0.091 | 5 | 6s |
| exome | HG003 | total | 255.09 | 13.253 | 5 | 4m 15s |
| hybrid-pacbio-illumina | HG003 | make_examples | 3404.1 | 96.1 | 5 | 56m 44s |
| hybrid-pacbio-illumina | HG003 | call_variants | 4133.09 | 450.234 | 5 | 1h 8m 53s |
| hybrid-pacbio-illumina | HG003 | postprocess_variants | 235.06 | 8.204 | 5 | 3m 55s |
| hybrid-pacbio-illumina | HG003 | vcf_stats | 311.91 | 9.025 | 5 | 5m 11s |
| hybrid-pacbio-illumina | HG003 | total | 7772.25 | 532.957 | 5 | 2h 9m 32s |
| ont-r104 | HG003 | make_examples | 3292.11 | 35.424 | 5 | 54m 52s |
| ont-r104 | HG003 | call_variants | 1966.44 | 29.279 | 5 | 32m 46s |
| ont-r104 | HG003 | postprocess_variants | 879.19 | 4.702 | 5 | 14m 39s |
| ont-r104 | HG003 | vcf_stats | 446.33 | 3.943 | 5 | 7m 26s |
| ont-r104 | HG003 | total | 6137.74 | 52.714 | 5 | 1h 42m 17s |
| pacbio | HG003 | make_examples | 2315.63 | 115.204 | 5 | 38m 35s |
| pacbio | HG003 | call_variants | 1168.11 | 86.518 | 5 | 19m 28s |
| pacbio | HG003 | postprocess_variants | 443.32 | 8.427 | 5 | 7m 23s |
| pacbio | HG003 | vcf_stats | 349.94 | 12.866 | 5 | 5m 49s |
| pacbio | HG003 | total | 3927.05 | 205.759 | 5 | 1h 5m 27s |
| rnaseq | HG005 | make_examples | 436.98 | 17.427 | 5 | 7m 16s |
| rnaseq | HG005 | call_variants | 25.78 | 1.231 | 5 | 25s |
| rnaseq | HG005 | postprocess_variants | 64.94 | 0.935 | 5 | 1m 4s |
| rnaseq | HG005 | vcf_stats | 5.51 | 0.201 | 5 | 5s |
| rnaseq | HG005 | total | 527.70 | 18.766 | 5 | 8m 47s |
| wgs | HG003 | make_examples | 2655.11 | 91.839 | 5 | 44m 15s |
| wgs | HG003 | call_variants | 975.37 | 82.033 | 5 | 16m 15s |
| wgs | HG003 | postprocess_variants | 405.71 | 10.186 | 5 | 6m 45s |
| wgs | HG003 | vcf_stats | 320.59 | 6.27 | 5 | 5m 20s |
| wgs | HG003 | total | 4036.19 | 178.7 | 5 | 1h 7m 16s |
Hi @tfenne, Contributions (and usage!) on fast-pipeline are VERY welcome. For context: We originally built it for a specific deployment scenario. However, after the development and testing, that scenario was no longer relevant. At this point, I was even considering removing it, since it hasn't been advertised enough to have active users (either internally or externally). If you'd like to both improve AND use the code, that would be great. But no pressure at all :) -pichuan |
|
Hi Tim (@tfenne) and Pi-Chuan (@pichuan), I'm super-excited to see that the Google DeepVariant team partially implemented an initial version of my shared-memory model concept recommendation via the The partial implementation refers to the idea that it currently operates through a few bottlenecks, that can be easily overcome. For example:
I don't have the compute resources to update based on these ideas (and a bit limited on time), but it should be a few minor changes by Tim and the DeepVariant team to implement some of these changes, which should make DeepVariant several orders of magnitude faster. Hope it helps, |
What
Preload jemalloc as the allocator for the `make_examples` family of binaries in the published Docker image, instead of the default glibc `malloc`.

Why
`make_examples` is a CPU-bound stage, and a large share of its work (pileup construction and local realignment) is allocation-heavy, with many short-lived objects across many worker shards. jemalloc handles that allocation pattern substantially better than glibc `malloc`.

The preload is scoped to the make_examples family only. `call_variants` is TensorFlow inference, not allocation-bound, and showed no measurable change with jemalloc, so there's no reason to apply it there.

Impact
~7.3% faster `make_examples` on a 30× WGS HG003 `chr20` run with the production `wgs` config (7-channel model + `--call_small_model_examples`): 231.0s → 214.2s (mean of 2 reps each, same host, back-to-back; c8a.4xlarge / 16 vCPU, 16 shards).

Verified the wrapper actually engages it: the `make_examples` python process launched via the wrapped entrypoint maps `libjemalloc.so.2` and has `LD_PRELOAD=libjemalloc.so.2` in its environment; the unwrapped entrypoint does not.

Changes (1 file)
`Dockerfile`: `libjemalloc2` package in the runtime image; `LD_PRELOAD=libjemalloc.so.2` on the `make_examples`, `multisample_make_examples`, and `make_examples_somatic` wrappers.

The bare soname (`libjemalloc.so.2`, not a hardcoded path) is resolved from the default linker search path, so it stays correct on any future supported architecture.

Correctness
This is an allocator swap only, with no algorithmic or output change. jemalloc is a drop-in `malloc`/`free` replacement; `make_examples` output is bit-identical. No source files are touched.

Notes
I also tested mimalloc and rpmalloc; the former was slower than glibc malloc, while rpmalloc was faster but not as fast as jemalloc.
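The check described under Impact (the process maps `libjemalloc.so.2` and carries `LD_PRELOAD` in its environment) can be reproduced on Linux by reading `/proc/<pid>/maps` and `/proc/<pid>/environ`. A minimal sketch, not the author's actual procedure; the helper names are hypothetical:

```python
import os
import re
from pathlib import Path


def maps_library(maps_text, soname="libjemalloc.so.2"):
    """Return True if a file whose basename starts with soname is mapped."""
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, then an optional pathname.
        fields = line.split(None, 5)
        if len(fields) == 6 and os.path.basename(fields[5]).startswith(soname):
            return True
    return False


def preload_entries(environ_bytes):
    """Return the LD_PRELOAD entries from a NUL-separated environ blob."""
    for item in environ_bytes.split(b"\0"):
        if item.startswith(b"LD_PRELOAD="):
            value = item[len(b"LD_PRELOAD="):].decode()
            # Entries may be separated by colons or spaces.
            return [entry for entry in re.split(r"[ :]+", value) if entry]
    return []


def check_pid(pid):
    """Inspect a live process; requires Linux and permission to read /proc/<pid>."""
    proc = Path("/proc") / str(pid)
    mapped = maps_library((proc / "maps").read_text())
    preload = preload_entries((proc / "environ").read_bytes())
    return mapped, preload
```

Running `check_pid` against the PID of a wrapped `make_examples` process should report the library as mapped and list it in the preload entries, while an unwrapped one should report neither. Checking both matters: `/proc/<pid>/environ` reflects the environment the process started with, whereas the maps entry confirms that the linker actually loaded the library.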