perf: use jemalloc for make_examples in the runtime image by tfenne · Pull Request #1087 · google/deepvariant

tfenne · 2026-06-22T22:56:52Z

What

Preload jemalloc as the allocator for the make_examples family of binaries in the published Docker image, instead of the default glibc malloc.

Why

make_examples is a CPU-bound stage, and a large share of its work — pileup construction and local realignment — is allocation-heavy (many short-lived objects across many worker shards). jemalloc handles that allocation pattern substantially better than glibc malloc.

The preload is scoped to the make_examples family only. call_variants is TensorFlow inference, not allocation-bound, and showed no measurable change with jemalloc, so there's no reason to apply it there.

Impact

~7.3% faster make_examples on a 30× WGS HG003 chr20 run with the production wgs config (7-channel model + --call_small_model_examples): 231.0s → 214.2s (mean of 2 reps each, same host, back-to-back; c8a.4xlarge / 16 vCPU, 16 shards).

Verified the wrapper actually engages it: the make_examples python process launched via the wrapped entrypoint maps libjemalloc.so.2 and has LD_PRELOAD=libjemalloc.so.2 in its environment; the unwrapped entrypoint does not.
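The check described above can be sketched in Python for a Linux host. This is an illustrative helper, not part of the PR: the function names are hypothetical, and it assumes the caller is allowed to read the target process's /proc entries.

```python
def preload_status(maps_text, environ_bytes, soname="libjemalloc.so.2"):
    """Report whether a process maps `soname` and lists it in LD_PRELOAD.

    maps_text:     contents of /proc/<pid>/maps (text)
    environ_bytes: contents of /proc/<pid>/environ (NUL-separated bytes)
    """
    # A mapped shared object shows up as a path at the end of a maps line.
    mapped = any(soname in line for line in maps_text.splitlines())

    # /proc/<pid>/environ is a sequence of NUL-terminated KEY=VALUE entries.
    env = {}
    for entry in environ_bytes.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep:
            env[key.decode(errors="replace")] = value.decode(errors="replace")

    # LD_PRELOAD entries may be separated by colons or spaces, and may be
    # bare sonames or full paths, so compare on the final path component.
    entries = env.get("LD_PRELOAD", "").replace(":", " ").split()
    preloaded = any(e.rsplit("/", 1)[-1] == soname for e in entries)
    return {"mapped": mapped, "preloaded": preloaded}


def preload_status_for_pid(pid):
    """Hypothetical convenience wrapper that reads the live /proc files."""
    with open(f"/proc/{pid}/maps") as f:
        maps_text = f.read()
    with open(f"/proc/{pid}/environ", "rb") as f:
        environ_bytes = f.read()
    return preload_status(maps_text, environ_bytes)
```

Run against the pid of a make_examples worker, the wrapped entrypoint should report both flags as true and the unwrapped one should report both as false.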

Changes (1 file)

Dockerfile:
- install the distro libjemalloc2 package in the runtime image;
- prefix LD_PRELOAD=libjemalloc.so.2 on the make_examples,
  multisample_make_examples, and make_examples_somatic wrappers.

The bare soname (libjemalloc.so.2, not a hardcoded path) is resolved from the default linker search path, so it stays correct on any future supported architecture.
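To illustrate the scoping, here is a minimal Python sketch of the same idea. The PR itself does this in Dockerfile shell wrappers; `env_for` and the launcher pattern are hypothetical, and only the binary names and the soname come from the description above.

```python
import os

# Binary names and soname are taken from the PR description; everything
# else in this sketch is an assumption made for illustration.
JEMALLOC_BINARIES = {
    "make_examples",
    "multisample_make_examples",
    "make_examples_somatic",
}
JEMALLOC_SONAME = "libjemalloc.so.2"


def env_for(binary, base_env=None):
    """Return a child environment, adding LD_PRELOAD only where it is wanted."""
    env = dict(os.environ if base_env is None else base_env)
    if binary in JEMALLOC_BINARIES:
        existing = env.get("LD_PRELOAD")
        # Keep any preload that is already set; jemalloc goes first.
        env["LD_PRELOAD"] = (
            f"{JEMALLOC_SONAME}:{existing}" if existing else JEMALLOC_SONAME
        )
    return env
```

A launcher could then call `subprocess.run([binary, *args], env=env_for(binary))`, so call_variants and the other stages keep the default glibc malloc.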

Correctness

This is an allocator swap only — no algorithmic or output change. jemalloc is a drop-in malloc/free replacement; make_examples output is bit-identical. No source files are touched.

Notes

I also tested mimalloc and rpmalloc; the former was slower than glibc malloc, while rpmalloc was faster but not as fast as jemalloc.

make_examples spends a large share of time in allocation-heavy pileup and local-realignment work. Preloading jemalloc (vs glibc malloc) measurably reduces its wall-clock; it has no measurable effect on call_variants (TF inference), so the LD_PRELOAD is scoped to the make_examples-family wrappers only. Installed via the distro libjemalloc2 package; the bare soname (libjemalloc.so.2) keeps it architecture-portable.

… pangenome images

The runtime image already preloads jemalloc for its make_examples wrappers, but the DeepTrio, DeepSomatic, and pangenome-aware images build FROM ubuntu:22.04 rather than from that image, so they inherited neither the libjemalloc2 install nor the preload and ran make_examples on glibc malloc. Their make_examples does the same allocation-heavy pileup/realignment work, so they benefit from the same preload.

This applies the identical pattern to each of the three Dockerfiles: install libjemalloc2 and prepend LD_PRELOAD=libjemalloc.so.2 to that image's make_examples wrapper only (deeptrio/make_examples, make_examples_somatic, make_examples_pangenome_aware_dv). call_variants and the other wrappers are left untouched, matching the runtime image.

pichuan · 2026-06-24T04:32:22Z

Hi @tfenne ,

Thanks for the PR!

Since I believe you're already familiar with our process, I'll go ahead and start the review.

As a reminder, because of the way our project is set up, we aren't able to merge GitHub PRs directly. If the changes look good, I will commit them, crediting your GitHub username and referencing this PR in the commit description.

Please let me know if you have any concerns with this approach.

-pichuan

tfenne · 2026-06-24T04:49:35Z

Thanks @pichuan - understood about the contribution/merge process, and thanks for taking a look at this and my other PRs.

tfenne · 2026-06-26T15:14:20Z

I re-benchmarked this PR's changes vs. r1.10 through the standard docker build process, running the baseline and modified versions in the resulting docker containers on the chr20 short-read WGS set on a c8a.4xlarge at AWS with 16 cores / 16 shards. In that setup:

baseline runtime: 113.1
pr runtime: 105.7
% change: ~6.5% improvement

pgrosu · 2026-06-26T20:56:16Z

Hi Tim (@tfenne) and Pi-Chuan (@pichuan),

Since make_examples is launched as multiple single-threaded processes (which, unlike the threads of a multi-threaded application, do not share a heap), you can probably optimize it further through the MALLOC_CONF environment variable by turning off or limiting unused resources. A more process-centric starting point might be the following configuration:

export MALLOC_CONF="narenas:1,background_thread:true,tcache:true,tcache_max:65536,dirty_decay_ms:10000,muzzy_decay_ms:5000"

and then tune from there by playing with the cache size and cleanup as it relates to the number of threads for different sample types. Fortunately, jemalloc provides many options to play with.
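For anyone who wants to sweep these settings, a small Python sketch of generating candidate MALLOC_CONF strings follows. The helper names are hypothetical, the option names are copied from the suggestion above, and nothing here checks whether a given jemalloc build accepts each option.

```python
def build_malloc_conf(options):
    """Render a dict of jemalloc options as a MALLOC_CONF string."""
    parts = []
    for key, value in options.items():
        # jemalloc expects lowercase true/false for boolean options.
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}:{value}")
    return ",".join(parts)


def sweep(base, key, values):
    """Yield one MALLOC_CONF string per candidate value of a single option."""
    for value in values:
        yield build_malloc_conf({**base, key: value})


# Starting point copied from the suggestion above.
SUGGESTED = {
    "narenas": 1,
    "background_thread": True,
    "tcache": True,
    "tcache_max": 65536,
    "dirty_decay_ms": 10000,
    "muzzy_decay_ms": 5000,
}
```

Each generated string can be passed to a benchmark run as the MALLOC_CONF entry of the child environment, alongside LD_PRELOAD, so that runs differ in exactly one option.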

It might be interesting to see how it compares to tcmalloc, as the thread lifespan under some runtime conditions might favor that architecture. Again, lots of fun to tune through many options ;)

Hope it helps,
~p

pichuan · 2026-06-27T06:16:47Z

Hi @tfenne,

I ran benchmarks on the changes from this PR using our internal case studies pipeline (running on c3d-standard-16 instances, 5 trials per sample).

The testing methodology and baseline codebase are the same as described in #1086 (comment).

Output and Stage Observations:

- md5sum: The output VCFs and gVCFs from gh1087 have exactly the same md5sum hashes as the head937500229 baseline.
- Other stages: Runtimes for call_variants, postprocess_variants, and vcf_stats remain virtually identical between baseline and PR runs (differences are within standard ~0.5% – 2.0% run-to-run noise). This is expected since LD_PRELOAD is tightly scoped to just the make_examples wrappers.

Runtime summary:

The results show consistent runtime improvements across the board for the make_examples stage (ranging from 4.7% to 12.3% speedup), saving roughly 16 to 26 minutes on the large WGS, PacBio, and ONT runs.

Runtime Comparison: `head937500229` (baseline) vs `gh1087` (PR)

uid	stage	head937500229 (mean)	gh1087 (mean)	speedup (sec)	speedup (%)
wgs	make_examples	11180.41s (3h 6m 20s)	9806.62s (2h 43m 26s)	1373.79s	12.29%
	total	17412.56s (4h 50m 12s)	16153.49s (4h 29m 13s)	1259.07s	7.23%
ont-r104	make_examples	13062.19s (3h 37m 42s)	11527.95s (3h 12m 7s)	1534.24s	11.75%
	total	25910.83s (7h 11m 50s)	24457.72s (6h 47m 37s)	1453.11s	5.61%
pacbio	make_examples	8615.63s (2h 23m 35s)	7671.91s (2h 7m 51s)	943.72s	10.95%
	total	15373.77s (4h 16m 13s)	14407.26s (4h 0m 7s)	966.51s	6.29%
hybrid-pacbio-illumina	make_examples	15513.45s (4h 18m 33s)	14417.19s (4h 0m 17s)	1096.26s	7.07%
	total	39323.79s (10h 55m 23s)	38229.74s (10h 37m 9s)	1094.05s	2.78%
rnaseq	make_examples	1506.26s (25m 6s)	1418.62s (23m 38s)	87.64s	5.82%
	total	1805.56s (30m 5s)	1718.49s (28m 38s)	87.07s	4.82%
exome	make_examples	481.59s (8m 1s)	458.96s (7m 38s)	22.63s	4.70%
	total	659.81s (10m 59s)	637.41s (10m 37s)	22.40s	3.39%
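The derived columns in the table above can be reproduced from the two means. The sketch below assumes the speedup is expressed relative to the baseline and that the human-readable durations truncate fractional seconds, which matches the rows shown; the function names are hypothetical.

```python
def format_hms(seconds):
    """Render seconds as 'Xh Ym Zs', truncating the fractional part."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}h {minutes}m {secs}s"
    if minutes:
        return f"{minutes}m {secs}s"
    return f"{secs}s"


def speedup(baseline, pr):
    """Return (seconds saved, percent saved relative to the baseline)."""
    saved = baseline - pr
    return round(saved, 2), round(100.0 * saved / baseline, 2)


# wgs make_examples row: baseline 11180.41s, PR 9806.62s
print(speedup(11180.41, 9806.62))   # (1373.79, 12.29)
print(format_hms(11180.41))         # 3h 6m 20s
```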

`gh1087` Runtime Table

uid	sample	stage	mean_runtime	std_runtime	n_trials	mean_hruntime
exome	HG003	make_examples	458.96	7.741	5	7m 38s
exome	HG003	call_variants	151.24	1.731	5	2m 31s
exome	HG003	postprocess_variants	27.21	0.177	5	27s
exome	HG003	vcf_stats	5.53	0.038	5	5s
exome	HG003	total	637.41	9.455	5	10m 37s
hybrid-pacbio-illumina	HG003	make_examples	14417.19	39.708	5	4h 17s
hybrid-pacbio-illumina	HG003	call_variants	23515.27	29.91	5	6h 31m 55s
hybrid-pacbio-illumina	HG003	postprocess_variants	297.29	6.776	5	4m 57s
hybrid-pacbio-illumina	HG003	vcf_stats	245.6	2.75	5	4m 5s
hybrid-pacbio-illumina	HG003	total	38229.74	47.021	5	10h 37m 9s
ont-r104	HG003	make_examples	11527.95	17.278	5	3h 12m 7s
ont-r104	HG003	call_variants	11905.93	120.625	5	3h 18m 25s
ont-r104	HG003	postprocess_variants	1023.84	8.715	5	17m 3s
ont-r104	HG003	vcf_stats	359.27	1.067	5	5m 59s
ont-r104	HG003	total	24457.72	127.981	5	6h 47m 37s
pacbio	HG003	make_examples	7671.91	22.629	5	2h 7m 51s
pacbio	HG003	call_variants	6204.24	7.886	5	1h 43m 24s
pacbio	HG003	postprocess_variants	531.11	8.841	5	8m 51s
pacbio	HG003	vcf_stats	282.95	1.985	5	4m 42s
pacbio	HG003	total	14407.26	23.523	5	4h 7s
rnaseq	HG005	make_examples	1418.62	9.414	5	23m 38s
rnaseq	HG005	call_variants	94.95	0.119	5	1m 34s
rnaseq	HG005	postprocess_variants	204.92	0.65	5	3m 24s
rnaseq	HG005	vcf_stats	4.94	0.04	5	4s
rnaseq	HG005	total	1718.49	9.864	5	28m 38s
wgs	HG003	make_examples	9806.62	30.525	5	2h 43m 26s
wgs	HG003	call_variants	5935.29	192.789	5	1h 38m 55s
wgs	HG003	postprocess_variants	411.57	8.592	5	6m 51s
wgs	HG003	vcf_stats	254.71	1.315	5	4m 14s
wgs	HG003	total	16153.49	201.882	5	4h 29m 13s

`head937500229` Runtime Table

uid	sample	stage	mean_runtime	std_runtime	n_trials	mean_hruntime
exome	HG003	make_examples	481.59	8.884	5	8m 1s
exome	HG003	call_variants	150.94	0.808	5	2m 30s
exome	HG003	postprocess_variants	27.28	0.182	5	27s
exome	HG003	vcf_stats	5.54	0.058	5	5s
exome	HG003	total	659.81	9.358	5	10m 59s
hybrid-pacbio-illumina	HG003	make_examples	15513.45	51.654	5	4h 18m 33s
hybrid-pacbio-illumina	HG003	call_variants	23520.08	29.808	5	6h 32m 0s
hybrid-pacbio-illumina	HG003	postprocess_variants	290.26	4.803	5	4m 50s
hybrid-pacbio-illumina	HG003	vcf_stats	243.47	2.018	5	4m 3s
hybrid-pacbio-illumina	HG003	total	39323.79	67.138	5	10h 55m 23s
ont-r104	HG003	make_examples	13062.19	156.502	5	3h 37m 42s
ont-r104	HG003	call_variants	11814.9	7.661	5	3h 16m 54s
ont-r104	HG003	postprocess_variants	1033.73	8.6	5	17m 13s
ont-r104	HG003	vcf_stats	361.09	3.496	5	6m 1s
ont-r104	HG003	total	25910.83	158.26	5	7h 11m 50s
pacbio	HG003	make_examples	8615.63	155.868	5	2h 23m 35s
pacbio	HG003	call_variants	6237.02	32.756	5	1h 43m 57s
pacbio	HG003	postprocess_variants	521.13	8.723	5	8m 41s
pacbio	HG003	vcf_stats	280.33	1.387	5	4m 40s
pacbio	HG003	total	15373.77	182.451	5	4h 16m 13s
rnaseq	HG005	make_examples	1506.26	9.424	5	25m 6s
rnaseq	HG005	call_variants	94.85	0.066	5	1m 34s
rnaseq	HG005	postprocess_variants	204.45	1.424	5	3m 24s
rnaseq	HG005	vcf_stats	4.94	0.057	5	4s
rnaseq	HG005	total	1805.56	10.267	5	30m 5s
wgs	HG003	make_examples	11180.41	111.767	5	3h 6m 20s
wgs	HG003	call_variants	5825.96	5.848	5	1h 37m 5s
wgs	HG003	postprocess_variants	406.19	1.954	5	6m 46s
wgs	HG003	vcf_stats	252.8	1.256	5	4m 12s
wgs	HG003	total	17412.56	110.714	5	4h 50m 12s

I have created an internal changelist for the team to review. Since it's the weekend, we'll review and submit internally next week.
Out of curiosity, I'll also try to test this on our regular n2-standard-96 setup to see how the speedups hold up there!

pichuan · 2026-06-28T03:18:40Z

Hi @tfenne,

As a side note (with no action required for this PR), we also have a lesser-known alternative code path: fast-pipeline.

Because it runs as a native C++ binary rather than using python shell wrappers, it probably won't benefit from the LD_PRELOAD wrappers introduced here. It would be nice to enable jemalloc for it as well, but I'm happy to look into that separately later.

(For context, the fast pipeline is a stream-based design we created to run make_examples and call_variants concurrently to reduce I/O overhead. It's still experimental and not yet a fully supported path.)

tfenne · 2026-06-28T15:43:22Z

Thanks @pichuan - I'm aware of the fast-pipeline, but didn't want to touch it as it seems like it's probably under much more active development. I'd actually be very interested in the fast-pipeline for CPU, to avoid writing intermediates to disk - if contributions in that area are welcome, and wouldn't trip over in-house work, I'd be interested in helping there too.

pichuan · 2026-06-28T15:47:36Z

PR 1087 (jemalloc) Runtime Report — `n2-standard-96`

Setup

Machine type: n2-standard-96
Trials: 5 per sample
Baseline: head938924210 (no jemalloc)
PR build: gh1087-head938924210 (jemalloc preloaded on make_examples)

Runtime Comparison: `head938924210` (baseline) vs `gh1087` (PR)

The make_examples stage shows speedups across all sample types, with the strongest and most statistically significant improvements on the long-running workloads (pacbio, ont-r104, wgs).

uid	stage	head938924210 (mean)	gh1087 (mean)	speedup (sec)	speedup (%)
pacbio	make_examples	2315.63s (38m 35s)	1990.22s (33m 10s)	325.41s	14.05%
	total	3927.05s (1h 5m 27s)	3565.25s (59m 25s)	361.80s	9.21%
ont-r104	make_examples	3292.11s (54m 52s)	2895.39s (48m 15s)	396.72s	12.05%
	total	6137.74s (1h 42m 17s)	5759.00s (1h 35m 59s)	378.74s	6.17%
wgs	make_examples	2655.11s (44m 15s)	2375.42s (39m 35s)	279.69s	10.53%
	total	4036.19s (1h 7m 16s)	3763.43s (1h 2m 43s)	272.76s	6.76%
hybrid-pacbio-illumina	make_examples	3404.10s (56m 44s)	3278.77s (54m 38s)	125.33s	3.68%
	total	7772.25s (2h 9m 32s)	7904.32s (2h 11m 44s)	-132.07s	-1.70%*
exome	make_examples	190.24s (3m 10s)	184.19s (3m 4s)	6.05s	3.18%
	total	255.09s (4m 15s)	248.78s (4m 8s)	6.31s	2.47%
rnaseq	make_examples	436.98s (7m 16s)	427.93s (7m 7s)	9.05s	2.07%
	total	527.70s (8m 47s)	518.68s (8m 38s)	9.02s	1.71%

Note

* The hybrid-pacbio-illumina total shows a slight regression, but this is due to high variance in the call_variants stage (std_dev of 450s for the baseline vs 143s for gh1087). The make_examples stage itself still shows a 3.68% improvement. The call_variants variance is unrelated to jemalloc since it is not preloaded for that stage.

Statistical Significance (Welch's t-test)

With only 5 trials and relatively high variance on n2-standard-96, it's important to check which improvements are statistically significant vs. run-to-run noise.

uid	diff (sec)	diff (%)	t-stat	df	significant?
pacbio	325.41s	14.05%	6.06	4.7	✅ YES (p<0.01)
ont-r104	396.72s	12.05%	14.73	7.3	✅ YES (p<0.01)
wgs	279.69s	10.53%	4.88	8.0	✅ YES (p<0.01)
hybrid-pacbio-illumina	125.33s	3.68%	1.94	7.9	⚠️ Marginal (p<0.05)
exome	6.05s	3.18%	1.18	4.2	❌ Not significant
rnaseq	9.05s	2.07%	0.89	7.8	❌ Not significant
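The t-stat and df columns can be recomputed from the summary statistics in the raw runtime tables. This sketch assumes std_runtime is the sample standard deviation over the 5 trials; turning t and df into a p-value additionally needs a Student's t distribution, which is left out here.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = sd_a ** 2 / n_a  # squared standard error of mean A
    var_b = sd_b ** 2 / n_b  # squared standard error of mean B
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, df


# pacbio make_examples: baseline 2315.63 (sd 115.204), PR 1990.22 (sd 34.021)
t, df = welch_t(2315.63, 115.204, 5, 1990.22, 34.021, 5)
print(f"t={t:.2f}, df={df:.1f}")  # t=6.06, df=4.7
```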

Important

On n2-standard-96, only 3 of 6 datasets (wgs, ont-r104, pacbio) show clearly significant make_examples speedups. The exome and rnaseq improvements are indistinguishable from noise at this sample size.

However, all 6 datasets show improvement in the same direction — if the effect were pure noise, we'd expect some positive and some negative. The c3d-standard-16 results (which have much tighter variance) confirm significance across all 6 datasets (all p<0.01).
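The same-direction argument can be quantified with a simple sign test. The sketch assumes the six datasets are independent and that, under pure noise, each is equally likely to come out faster or slower.

```python
from math import comb


def sign_test_p(n_positive, n_total):
    """One-sided probability of at least n_positive improvements out of
    n_total when each outcome is a fair coin flip."""
    favourable = sum(comb(n_total, k) for k in range(n_positive, n_total + 1))
    return favourable / 2 ** n_total


print(sign_test_p(6, 6))  # 0.015625, i.e. 1 in 64
```

So six improvements out of six would be unlikely under that noise-only assumption, even before looking at the size of each effect.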

Other Stages (Sanity Check)

Runtimes for call_variants, postprocess_variants, and vcf_stats remain virtually identical — most differences are well under 2%, consistent with run-to-run noise. The one outlier is hybrid-pacbio-illumina call_variants (-6.12%), which is explained by the high baseline std_dev (450s) — the baseline happened to have some faster trials in that stage.

Comparison with c3d-standard-16 results

The make_examples speedup percentages on n2-standard-96 are broadly consistent with the c3d-standard-16 results, though the n2 data has higher variance:

uid	c3d-standard-16 speedup	c3d significant?	n2-standard-96 speedup	n2 significant?
pacbio	10.95%	✅ p<0.01	14.05%	✅ p<0.01
ont-r104	11.75%	✅ p<0.01	12.05%	✅ p<0.01
wgs	12.29%	✅ p<0.01	10.53%	✅ p<0.01
hybrid-pacbio-illumina	7.07%	✅ p<0.01	3.68%	⚠️ p<0.05
rnaseq	5.82%	✅ p<0.01	2.07%	❌
exome	4.70%	✅ p<0.01	3.18%	❌

Summary: The benefit is real and clearly significant for long-running workloads (wgs, pacbio, ont — saving 5–7 minutes each). For shorter workloads (exome, rnaseq), the n2 data alone can't confirm a benefit, though the c3d data (with much tighter variance) confirms it across all datasets.

`gh1087-head938924210` Runtime Table

uid	sample	stage	mean_runtime	std_runtime	n_trials	mean_hruntime
exome	HG003	make_examples	184.19	1.77	5	3m 4s
exome	HG003	call_variants	33.54	0.216	5	33s
exome	HG003	postprocess_variants	31.06	0.983	5	31s
exome	HG003	vcf_stats	6.16	0.072	5	6s
exome	HG003	total	248.78	2.193	5	4m 8s
hybrid-pacbio-illumina	HG003	make_examples	3278.77	107.917	5	54m 38s
hybrid-pacbio-illumina	HG003	call_variants	4386.22	143.326	5	1h 13m 6s
hybrid-pacbio-illumina	HG003	postprocess_variants	239.33	8.637	5	3m 59s
hybrid-pacbio-illumina	HG003	vcf_stats	310.08	6.04	5	5m 10s
hybrid-pacbio-illumina	HG003	total	7904.32	197.081	5	2h 11m 44s
ont-r104	HG003	make_examples	2895.39	48.685	5	48m 15s
ont-r104	HG003	call_variants	1970.55	36.64	5	32m 50s
ont-r104	HG003	postprocess_variants	893.05	12.834	5	14m 53s
ont-r104	HG003	vcf_stats	449.9	17.86	5	7m 29s
ont-r104	HG003	total	5759.0	72.291	5	1h 35m 59s
pacbio	HG003	make_examples	1990.22	34.021	5	33m 10s
pacbio	HG003	call_variants	1134.72	20.244	5	18m 54s
pacbio	HG003	postprocess_variants	440.31	4.567	5	7m 20s
pacbio	HG003	vcf_stats	348.82	6.463	5	5m 48s
pacbio	HG003	total	3565.25	56.336	5	59m 25s
rnaseq	HG005	make_examples	427.93	14.606	5	7m 7s
rnaseq	HG005	call_variants	25.73	0.769	5	25s
rnaseq	HG005	postprocess_variants	65.01	0.535	5	1m 5s
rnaseq	HG005	vcf_stats	5.45	0.094	5	5s
rnaseq	HG005	total	518.68	14.794	5	8m 38s
wgs	HG003	make_examples	2375.42	89.542	5	39m 35s
wgs	HG003	call_variants	982.35	64.884	5	16m 22s
wgs	HG003	postprocess_variants	405.67	13.119	5	6m 45s
wgs	HG003	vcf_stats	316.51	5.106	5	5m 16s
wgs	HG003	total	3763.43	129.075	5	1h 2m 43s

`head938924210` Runtime Table

uid	sample	stage	mean_runtime	std_runtime	n_trials	mean_hruntime
exome	HG003	make_examples	190.24	11.357	5	3m 10s
exome	HG003	call_variants	34.33	2.1	5	34s
exome	HG003	postprocess_variants	30.52	0.091	5	30s
exome	HG003	vcf_stats	6.17	0.091	5	6s
exome	HG003	total	255.09	13.253	5	4m 15s
hybrid-pacbio-illumina	HG003	make_examples	3404.1	96.1	5	56m 44s
hybrid-pacbio-illumina	HG003	call_variants	4133.09	450.234	5	1h 8m 53s
hybrid-pacbio-illumina	HG003	postprocess_variants	235.06	8.204	5	3m 55s
hybrid-pacbio-illumina	HG003	vcf_stats	311.91	9.025	5	5m 11s
hybrid-pacbio-illumina	HG003	total	7772.25	532.957	5	2h 9m 32s
ont-r104	HG003	make_examples	3292.11	35.424	5	54m 52s
ont-r104	HG003	call_variants	1966.44	29.279	5	32m 46s
ont-r104	HG003	postprocess_variants	879.19	4.702	5	14m 39s
ont-r104	HG003	vcf_stats	446.33	3.943	5	7m 26s
ont-r104	HG003	total	6137.74	52.714	5	1h 42m 17s
pacbio	HG003	make_examples	2315.63	115.204	5	38m 35s
pacbio	HG003	call_variants	1168.11	86.518	5	19m 28s
pacbio	HG003	postprocess_variants	443.32	8.427	5	7m 23s
pacbio	HG003	vcf_stats	349.94	12.866	5	5m 49s
pacbio	HG003	total	3927.05	205.759	5	1h 5m 27s
rnaseq	HG005	make_examples	436.98	17.427	5	7m 16s
rnaseq	HG005	call_variants	25.78	1.231	5	25s
rnaseq	HG005	postprocess_variants	64.94	0.935	5	1m 4s
rnaseq	HG005	vcf_stats	5.51	0.201	5	5s
rnaseq	HG005	total	527.70	18.766	5	8m 47s
wgs	HG003	make_examples	2655.11	91.839	5	44m 15s
wgs	HG003	call_variants	975.37	82.033	5	16m 15s
wgs	HG003	postprocess_variants	405.71	10.186	5	6m 45s
wgs	HG003	vcf_stats	320.59	6.27	5	5m 20s
wgs	HG003	total	4036.19	178.7	5	1h 7m 16s

pichuan · 2026-06-28T15:57:16Z

> Thanks @pichuan - I'm aware of the fast-pipeline, but didn't want to touch it as it seems like it's probably under much more active development. I'd actually be very interested in the fast-pipeline for CPU, to avoid writing intermediates to disk - if contributions in that area are welcome, and wouldn't trip over in-house work, I'd be interested in helping there too.

Hi @tfenne,

Contributions (and usage!) on fast-pipeline are VERY welcome.

For context: We originally built it for a specific deployment scenario. However, after the development and testing, that scenario was no longer relevant.

At this point, I was even considering removing it, since it hasn't been advertised enough to have active users (either internally or externally). If you'd like to both improve AND use the code, that would be great. But no pressure at all :)

-pichuan

pgrosu · 2026-06-28T22:58:27Z

Hi Tim (@tfenne) and Pi-Chuan (@pichuan),

I'm super-excited to see that the Google DeepVariant team partially implemented an initial version of my shared-memory model concept recommendation via the fast-pipeline, and has experienced significant performance gains! The way the stream component is currently implemented is still CPU-only: stream_examples_kernel.cc and stream_examples_ops.cc are bundled into the shared library deepvariant/examples_from_stream.so (which includes the core functionality of stream_examples.cc), without GPU utilization for I/O (this is aside from other steps in call_variants).

I call the implementation partial because it currently operates through a few bottlenecks that can be easily overcome. For example:

- The CPU-based aligner in make_examples can be augmented with a GPU-based one (e.g. CUDASW++), as it consumes a significant amount of processing time during the make_examples step, as shown in an earlier post of mine.
- The implementation depends too much on (mutex) synchronization, which is not required as candidate examples are relatively independent of each other. Shared memory objects are generated given the number of shards, and are controlled through one mutex per object. make_examples streams examples into the shared memory object for that shard, and once it is full, call_variants is called to process that object. There is an opportunity for an asynchronous approach here, where:

1) Example candidates can be streamed to both memory and disk, where each written object of example candidates has its own lock (in the header of the payload), with a dynamic vector/hashtable tracking the location (disk or shared memory) and state of each object.

2) As shared memory becomes too busy, candidates can continuously be streamed from disk, with call_variants threads able to process concurrently from either location. Lock-free mechanisms and concurrency will get around the current mutex bottlenecks.

3) The implementation should become more example-candidate-focused rather than shard-focused. One should be able to add new shared memory regions on the fly as needed, and reshape the execution flow based on a dynamic thread pool (as dictated by sample type) that modular functions from all three DeepVariant steps pull from, allowing stutter-free continuous execution.
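To make the per-shard handoff described above concrete, here is a toy Python model of a bounded buffer guarded by a single mutex. It is not the fast-pipeline's code: Python threads stand in for the make_examples and call_variants processes, a list stands in for the shared memory object, and all names are hypothetical.

```python
import threading


class ShardBuffer:
    """Bounded per-shard buffer guarded by one mutex (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = []
        self._closed = False
        self._cond = threading.Condition()  # condition variable over a mutex

    def put(self, example):
        # The producer blocks while the buffer is full.
        with self._cond:
            while len(self._items) >= self.capacity:
                self._cond.wait()
            self._items.append(example)
            self._cond.notify_all()

    def close(self):
        # Signal that no more examples will arrive.
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def take_batch(self):
        # The consumer blocks until the buffer is full or the producer is done.
        with self._cond:
            while len(self._items) < self.capacity and not self._closed:
                self._cond.wait()
            batch, self._items = self._items, []
            self._cond.notify_all()
            return batch


def run_shard(examples, capacity):
    """Stream examples through one buffer and return the consumed batches."""
    buf = ShardBuffer(capacity)
    batches = []

    def produce():
        for ex in examples:
            buf.put(ex)
        buf.close()

    def consume():
        while True:
            batch = buf.take_batch()
            if not batch:  # empty batch means closed and drained
                return
            batches.append(batch)

    threads = [threading.Thread(target=produce), threading.Thread(target=consume)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return batches
```

In this model the producer stalls whenever the consumer has not yet drained a full buffer, which is the kind of stutter the asynchronous proposal aims to remove.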

The architecture would benefit from having more granular, equivalent CPU and GPU functions that are used within (and build up) the make_examples, call_variants and postprocess_variants pipeline steps. In this way, execution can be reshaped on the fly, which would make the implementation of DeepVariant much more modular. CPUs provide lots of memory, plus extra capacity through SSD-type disks, while GPUs provide many free processing units. The NVIDIA Unified Virtual Memory (UVM) approach merges the two, so that the pool of processing units and the pool of memory become more unified (even if the host/device mechanism is still required, and optimized via pinned memory). If the CPU cores are loaded beyond a given percentage threshold, work can be shifted to the equivalent GPU functions (in make_examples, call_variants and postprocess_variants) during the execution flow.

I don't have the compute resources to update based on these ideas (and a bit limited on time), but it should be a few minor changes by Tim and the DeepVariant team to implement some of these changes, which should make DeepVariant several orders of magnitude faster.

Hope it helps,
~p

tfenne force-pushed the tf_jemalloc-image branch from f5129f6 to da4f616 on June 23, 2026 04:26

pichuan self-assigned this Jun 24, 2026

pichuan self-requested a review June 26, 2026 16:37

tfenne wants to merge 2 commits into google:r1.10 from tfenne:tf_jemalloc-image
