perf: use jemalloc for make_examples in the runtime image #1087

Open
tfenne wants to merge 2 commits into google:r1.10 from tfenne:tf_jemalloc-image

Conversation

@tfenne

@tfenne tfenne commented Jun 22, 2026

What

Preload jemalloc as the allocator for the
make_examples family of binaries in the published Docker image, instead of the default glibc malloc.

Why

make_examples is a CPU-bound stage, and a large share of its work — pileup construction and local realignment — is allocation-heavy (many short-lived objects across many worker shards). jemalloc handles that allocation pattern substantially better than glibc malloc.

The preload is scoped to the make_examples family only. call_variants is TensorFlow inference, not allocation-bound, and showed no measurable change with jemalloc, so there's no reason to apply it there.

Impact

~7.3% faster make_examples on a 30× WGS HG003 chr20 run with the production wgs config (7-channel model + --call_small_model_examples): 231.0s → 214.2s (mean of 2 reps each, same host, back-to-back; c8a.4xlarge / 16 vCPU, 16 shards).
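The headline percentage follows directly from the two wall-clock means. A minimal sketch of that arithmetic, where the function name `speedup_percent` is illustrative and not something from this PR:

```python
def speedup_percent(baseline_s: float, candidate_s: float) -> float:
    """Percent reduction in wall-clock time relative to the baseline."""
    if baseline_s <= 0:
        raise ValueError("baseline runtime must be positive")
    return (baseline_s - candidate_s) / baseline_s * 100.0


# Means reported above for make_examples on the HG003 chr20 run.
print(f"{speedup_percent(231.0, 214.2):.1f}%")
```

With the two means quoted above this comes out to roughly 7.3%.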

Verified the wrapper actually engages it: the make_examples python process launched via the wrapped entrypoint maps libjemalloc.so.2 and has LD_PRELOAD=libjemalloc.so.2 in its environment; the unwrapped entrypoint does not.
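One way to repeat that check on Linux is to inspect `/proc/<pid>/maps` and `/proc/<pid>/environ` for the running process. The sketch below is an assumption about how such a check could be scripted, not necessarily how it was done for this PR; all helper names are hypothetical.

```python
from pathlib import Path


def maps_mention_library(maps_text: str, needle: str = "libjemalloc") -> bool:
    """Return True if any mapped file path in /proc/<pid>/maps text contains needle."""
    for line in maps_text.splitlines():
        fields = line.split(maxsplit=5)
        # The sixth field, when present, is the path of the mapped file.
        if len(fields) == 6 and needle in fields[5]:
            return True
    return False


def environ_has_preload(environ_bytes: bytes, soname: str = "libjemalloc.so.2") -> bool:
    """Check the NUL-separated /proc/<pid>/environ blob for LD_PRELOAD naming soname."""
    for entry in environ_bytes.split(b"\0"):
        key, _, value = entry.partition(b"=")
        if key == b"LD_PRELOAD" and soname.encode() in value:
            return True
    return False


def process_uses_jemalloc(pid: int) -> bool:
    """Linux-only: inspect a running process (needs permission to read /proc/<pid>)."""
    proc = Path("/proc") / str(pid)
    return maps_mention_library((proc / "maps").read_text()) and environ_has_preload(
        (proc / "environ").read_bytes()
    )
```

Calling `process_uses_jemalloc(pid)` on a make_examples worker started through the wrapped entrypoint should return `True`, and `False` for one started through the unwrapped entrypoint.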

Changes (1 file)

  • Dockerfile:
    • install the distro libjemalloc2 package in the runtime image;
    • prefix LD_PRELOAD=libjemalloc.so.2 on the make_examples,
      multisample_make_examples, and make_examples_somatic wrappers.

The bare soname (libjemalloc.so.2, not a hardcoded path) is resolved by the dynamic linker from its default search path, so it stays correct on any future supported architecture.
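In effect, the wrapper change sets one environment variable for one command. The PR does this in the Dockerfile's shell wrappers, which are not reproduced in this conversation; the Python sketch below only illustrates the same effect, and `with_preload` and `run_preloaded` are hypothetical names.

```python
import os
import subprocess
from typing import Mapping, Sequence


def with_preload(env: Mapping[str, str], soname: str = "libjemalloc.so.2") -> dict:
    """Return a copy of env with soname placed at the front of LD_PRELOAD."""
    new_env = dict(env)
    existing = new_env.get("LD_PRELOAD", "")
    # The dynamic linker accepts a colon- or space-separated list here.
    new_env["LD_PRELOAD"] = f"{soname}:{existing}" if existing else soname
    return new_env


def run_preloaded(cmd: Sequence[str]) -> int:
    """Run cmd with jemalloc preloaded for that child process only."""
    return subprocess.run(list(cmd), env=with_preload(os.environ)).returncode
```

Because the variable is set only in the child's environment, other stages launched without this helper keep the default allocator, which mirrors the scoping described above.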

Correctness

This is an allocator swap only — no algorithmic or output change. jemalloc is a drop-in malloc/free replacement; make_examples output is bit-identical. No source files are touched.

Notes

I also tested mimalloc and rpmalloc; mimalloc was slower than glibc malloc, while rpmalloc was faster than glibc malloc but not as fast as jemalloc.

make_examples spends a large share of time in allocation-heavy pileup and
local-realignment work. Preloading jemalloc (vs glibc malloc) measurably
reduces its wall-clock; it has no measurable effect on call_variants (TF
inference), so the LD_PRELOAD is scoped to the make_examples-family wrappers
only. Installed via the distro libjemalloc2 package; the bare soname
(libjemalloc.so.2) keeps it architecture-portable.
@tfenne tfenne force-pushed the tf_jemalloc-image branch from f5129f6 to da4f616 Compare June 23, 2026 04:26
… pangenome images

The runtime image already preloads jemalloc for its make_examples wrappers, but the DeepTrio, DeepSomatic, and pangenome-aware images build FROM ubuntu:22.04 rather than from that image, so they inherited neither the libjemalloc2 install nor the preload and ran make_examples on glibc malloc. Their make_examples does the same allocation-heavy pileup/realignment work, so they benefit from the same preload.

This applies the identical pattern to each of the three Dockerfiles: install libjemalloc2 and prepend LD_PRELOAD=libjemalloc.so.2 to that image's make_examples wrapper only (deeptrio/make_examples, make_examples_somatic, make_examples_pangenome_aware_dv). call_variants and the other wrappers are left untouched, matching the runtime image.
@pichuan

pichuan commented Jun 24, 2026

Collaborator

Hi @tfenne ,

Thanks for the PR!

Since I believe you're already familiar with our process, I'll go ahead and start the review.

As a reminder, because of the way our project is set up, we aren't able to merge GitHub PRs directly. If the changes look good, I will commit them, crediting your GitHub username and referencing this PR in the commit description.

Please let me know if you have any concerns with this approach.

-pichuan

@pichuan pichuan self-assigned this Jun 24, 2026
@tfenne

tfenne commented Jun 24, 2026

Author

Thanks @pichuan - understood about the contribution/merge process, and thanks for taking a look at this and my other PRs.

@tfenne

tfenne commented Jun 26, 2026

Author

I re-benchmarked this PR's changes vs. r1.10 through the standard docker build process, running the baseline and modified versions in the resulting docker containers on the chr20 short-read WGS set on a c8a.4xlarge at AWS with 16 cores / 16 shards. In that setup:

baseline runtime: 113.1
pr runtime: 105.7
% change: ~7.3% improvement

@pichuan pichuan self-requested a review June 26, 2026 16:37
@pgrosu

pgrosu commented Jun 26, 2026

Hi Tim (@tfenne) and Pi-Chuan (@pichuan),

Since make_examples is launched as multiple single-threaded processes (which, unlike the threads of a multi-threaded application, do not share a heap), you can probably optimize it further through the MALLOC_CONF environment variable by turning off or limiting unused resources. A more process-centric starting point might be the following configuration:

export MALLOC_CONF="narenas:1,background_thread:true,tcache:true,tcache_max:65536,dirty_decay_ms:10000,muzzy_decay_ms:5000"

and then tune from there by playing with the cache size and cleanup as it relates to the number of threads for different sample types. Fortunately, jemalloc provides many options to play with.
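When sweeping several configurations, it can help to build the MALLOC_CONF string from a mapping rather than editing it by hand. This is a small sketch under that assumption; `format_malloc_conf` is a hypothetical helper, the values are copied from the suggestion above and are not tuned results, and the accepted option names depend on the installed jemalloc version.

```python
def format_malloc_conf(options: dict) -> str:
    """Render options as jemalloc's comma-separated key:value string."""

    def render(value) -> str:
        # jemalloc spells booleans as lowercase true/false.
        if isinstance(value, bool):
            return "true" if value else "false"
        return str(value)

    return ",".join(f"{key}:{render(value)}" for key, value in options.items())


# Starting point taken from the suggestion above.
SUGGESTED = {
    "narenas": 1,
    "background_thread": True,
    "tcache": True,
    "tcache_max": 65536,
    "dirty_decay_ms": 10000,
    "muzzy_decay_ms": 5000,
}

print(format_malloc_conf(SUGGESTED))
```

The resulting string can be placed in the environment of the make_examples processes, for example alongside LD_PRELOAD, and individual entries changed between benchmark runs.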

It might be interesting to see how it compares to tcmalloc, as the thread lifespan under some runtime conditions might favor that architecture. Again, lots of fun to tune through many options ;)

Hope it helps,
~p

@pichuan

pichuan commented Jun 27, 2026

Collaborator

Hi @tfenne,

I ran benchmarks on the changes from this PR using our internal case studies pipeline (running on c3d-standard-16 instances, 5 trials per sample).

The testing methodology and baseline codebase are the same as described in #1086 (comment).

Output and Stage Observations:

  • md5sum: The output VCFs and gVCFs from gh1087 have exactly the same md5sum hashes as the head937500229 baseline.
  • Other stages: Runtimes for call_variants, postprocess_variants, and vcf_stats remain virtually identical between baseline and PR runs (differences are within standard ~0.5% – 2.0% run-to-run noise). This is expected since LD_PRELOAD is tightly scoped to just the make_examples wrappers.
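A check like the md5sum comparison above can be scripted with the standard library. This is a minimal sketch, assuming the baseline and PR outputs are available as local files; the helper names are illustrative and the internal pipeline's own tooling is not shown in this thread.

```python
import hashlib


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hex MD5 digest of a file, read in chunks so large VCFs fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_md5(path_a, path_b) -> bool:
    """True when both files hash to the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

Usage would be along the lines of `same_md5(baseline_vcf, pr_vcf)`, with both paths supplied by the caller.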

Runtime summary:

The results show consistent runtime improvements across the board for the make_examples stage (ranging from 4.7% to 12.3% speedup), saving roughly 16 to 26 minutes on the large WGS, PacBio, and ONT runs.

Runtime Comparison: head937500229 (baseline) vs gh1087 (PR)

| uid | stage | head937500229 (mean) | gh1087 (mean) | speedup (sec) | speedup (%) |
| --- | --- | --- | --- | --- | --- |
| wgs | make_examples | 11180.41s (3h 6m 20s) | 9806.62s (2h 43m 26s) | 1373.79s | 12.29% |
| wgs | total | 17412.56s (4h 50m 12s) | 16153.49s (4h 29m 13s) | 1259.07s | 7.23% |
| ont-r104 | make_examples | 13062.19s (3h 37m 42s) | 11527.95s (3h 12m 7s) | 1534.24s | 11.75% |
| ont-r104 | total | 25910.83s (7h 11m 50s) | 24457.72s (6h 47m 37s) | 1453.11s | 5.61% |
| pacbio | make_examples | 8615.63s (2h 23m 35s) | 7671.91s (2h 7m 51s) | 943.72s | 10.95% |
| pacbio | total | 15373.77s (4h 16m 13s) | 14407.26s (4h 0m 7s) | 966.51s | 6.29% |
| hybrid-pacbio-illumina | make_examples | 15513.45s (4h 18m 33s) | 14417.19s (4h 0m 17s) | 1096.26s | 7.07% |
| hybrid-pacbio-illumina | total | 39323.79s (10h 55m 23s) | 38229.74s (10h 37m 9s) | 1094.05s | 2.78% |
| rnaseq | make_examples | 1506.26s (25m 6s) | 1418.62s (23m 38s) | 87.64s | 5.82% |
| rnaseq | total | 1805.56s (30m 5s) | 1718.49s (28m 38s) | 87.07s | 4.82% |
| exome | make_examples | 481.59s (8m 1s) | 458.96s (7m 38s) | 22.63s | 4.70% |
| exome | total | 659.81s (10m 59s) | 637.41s (10m 37s) | 22.40s | 3.39% |

gh1087 Runtime Table

uid sample stage mean_runtime std_runtime n_trials mean_hruntime
exome HG003 make_examples 458.96 7.741 5 7m 38s
exome HG003 call_variants 151.24 1.731 5 2m 31s
exome HG003 postprocess_variants 27.21 0.177 5 27s
exome HG003 vcf_stats 5.53 0.038 5 5s
exome HG003 total 637.41 9.455 5 10m 37s
hybrid-pacbio-illumina HG003 make_examples 14417.19 39.708 5 4h 0m 17s
hybrid-pacbio-illumina HG003 call_variants 23515.27 29.91 5 6h 31m 55s
hybrid-pacbio-illumina HG003 postprocess_variants 297.29 6.776 5 4m 57s
hybrid-pacbio-illumina HG003 vcf_stats 245.6 2.75 5 4m 5s
hybrid-pacbio-illumina HG003 total 38229.74 47.021 5 10h 37m 9s
ont-r104 HG003 make_examples 11527.95 17.278 5 3h 12m 7s
ont-r104 HG003 call_variants 11905.93 120.625 5 3h 18m 25s
ont-r104 HG003 postprocess_variants 1023.84 8.715 5 17m 3s
ont-r104 HG003 vcf_stats 359.27 1.067 5 5m 59s
ont-r104 HG003 total 24457.72 127.981 5 6h 47m 37s
pacbio HG003 make_examples 7671.91 22.629 5 2h 7m 51s
pacbio HG003 call_variants 6204.24 7.886 5 1h 43m 24s
pacbio HG003 postprocess_variants 531.11 8.841 5 8m 51s
pacbio HG003 vcf_stats 282.95 1.985 5 4m 42s
pacbio HG003 total 14407.26 23.523 5 4h 0m 7s
rnaseq HG005 make_examples 1418.62 9.414 5 23m 38s
rnaseq HG005 call_variants 94.95 0.119 5 1m 34s
rnaseq HG005 postprocess_variants 204.92 0.65 5 3m 24s
rnaseq HG005 vcf_stats 4.94 0.04 5 4s
rnaseq HG005 total 1718.49 9.864 5 28m 38s
wgs HG003 make_examples 9806.62 30.525 5 2h 43m 26s
wgs HG003 call_variants 5935.29 192.789 5 1h 38m 55s
wgs HG003 postprocess_variants 411.57 8.592 5 6m 51s
wgs HG003 vcf_stats 254.71 1.315 5 4m 14s
wgs HG003 total 16153.49 201.882 5 4h 29m 13s

head937500229 Runtime Table

uid sample stage mean_runtime std_runtime n_trials mean_hruntime
exome HG003 make_examples 481.59 8.884 5 8m 1s
exome HG003 call_variants 150.94 0.808 5 2m 30s
exome HG003 postprocess_variants 27.28 0.182 5 27s
exome HG003 vcf_stats 5.54 0.058 5 5s
exome HG003 total 659.81 9.358 5 10m 59s
hybrid-pacbio-illumina HG003 make_examples 15513.45 51.654 5 4h 18m 33s
hybrid-pacbio-illumina HG003 call_variants 23520.08 29.808 5 6h 32m 0s
hybrid-pacbio-illumina HG003 postprocess_variants 290.26 4.803 5 4m 50s
hybrid-pacbio-illumina HG003 vcf_stats 243.47 2.018 5 4m 3s
hybrid-pacbio-illumina HG003 total 39323.79 67.138 5 10h 55m 23s
ont-r104 HG003 make_examples 13062.19 156.502 5 3h 37m 42s
ont-r104 HG003 call_variants 11814.9 7.661 5 3h 16m 54s
ont-r104 HG003 postprocess_variants 1033.73 8.6 5 17m 13s
ont-r104 HG003 vcf_stats 361.09 3.496 5 6m 1s
ont-r104 HG003 total 25910.83 158.26 5 7h 11m 50s
pacbio HG003 make_examples 8615.63 155.868 5 2h 23m 35s
pacbio HG003 call_variants 6237.02 32.756 5 1h 43m 57s
pacbio HG003 postprocess_variants 521.13 8.723 5 8m 41s
pacbio HG003 vcf_stats 280.33 1.387 5 4m 40s
pacbio HG003 total 15373.77 182.451 5 4h 16m 13s
rnaseq HG005 make_examples 1506.26 9.424 5 25m 6s
rnaseq HG005 call_variants 94.85 0.066 5 1m 34s
rnaseq HG005 postprocess_variants 204.45 1.424 5 3m 24s
rnaseq HG005 vcf_stats 4.94 0.057 5 4s
rnaseq HG005 total 1805.56 10.267 5 30m 5s
wgs HG003 make_examples 11180.41 111.767 5 3h 6m 20s
wgs HG003 call_variants 5825.96 5.848 5 1h 37m 5s
wgs HG003 postprocess_variants 406.19 1.954 5 6m 46s
wgs HG003 vcf_stats 252.8 1.256 5 4m 12s
wgs HG003 total 17412.56 110.714 5 4h 50m 12s

I have created an internal changelist for the team to review. Since it's the weekend, we'll review and submit internally next week.
Out of curiosity, I'll also try to test this on our regular n2-standard-96 setup to see how the speedups hold up there!

@pichuan

pichuan commented Jun 28, 2026

Collaborator

Hi @tfenne,

As a side note (with no action required for this PR), we also have a lesser-known alternative code path: fast-pipeline.

Because it runs as a native C++ binary rather than using python shell wrappers, it probably won't benefit from the LD_PRELOAD wrappers introduced here. It would be nice to enable jemalloc for it as well, but I'm happy to look into that separately later.

(For context, the fast pipeline is a stream-based design we created to run make_examples and call_variants concurrently to reduce I/O overhead. It's still experimental and not yet a fully supported path.)

@tfenne

tfenne commented Jun 28, 2026

Author

Thanks @pichuan - I'm aware of the fast-pipeline, but didn't want to touch it as it seems like it's probably under much more active development. I'd actually be very interested in the fast-pipeline for CPU, to avoid writing intermediates to disk - if contributions in that area are welcome, and wouldn't trip over in-house work, I'd be interested in helping there too.

@pichuan

pichuan commented Jun 28, 2026

Collaborator

PR 1087 (jemalloc) Runtime Report — n2-standard-96

Setup

  • Machine type: n2-standard-96
  • Trials: 5 per sample
  • Baseline: head938924210 (no jemalloc)
  • PR build: gh1087-head938924210 (jemalloc preloaded on make_examples)

Runtime Comparison: head938924210 (baseline) vs gh1087 (PR)

The make_examples stage shows speedups across all sample types, with the strongest and most statistically significant improvements on the long-running workloads (pacbio, ont-r104, wgs).

| uid | stage | head938924210 (mean) | gh1087 (mean) | speedup (sec) | speedup (%) |
| --- | --- | --- | --- | --- | --- |
| pacbio | make_examples | 2315.63s (38m 35s) | 1990.22s (33m 10s) | 325.41s | 14.05% |
| pacbio | total | 3927.05s (1h 5m 27s) | 3565.25s (59m 25s) | 361.80s | 9.21% |
| ont-r104 | make_examples | 3292.11s (54m 52s) | 2895.39s (48m 15s) | 396.72s | 12.05% |
| ont-r104 | total | 6137.74s (1h 42m 17s) | 5759.00s (1h 35m 59s) | 378.74s | 6.17% |
| wgs | make_examples | 2655.11s (44m 15s) | 2375.42s (39m 35s) | 279.69s | 10.53% |
| wgs | total | 4036.19s (1h 7m 16s) | 3763.43s (1h 2m 43s) | 272.76s | 6.76% |
| hybrid-pacbio-illumina | make_examples | 3404.10s (56m 44s) | 3278.77s (54m 38s) | 125.33s | 3.68% |
| hybrid-pacbio-illumina | total | 7772.25s (2h 9m 32s) | 7904.32s (2h 11m 44s) | -132.07s | -1.70%* |
| exome | make_examples | 190.24s (3m 10s) | 184.19s (3m 4s) | 6.05s | 3.18% |
| exome | total | 255.09s (4m 15s) | 248.78s (4m 8s) | 6.31s | 2.47% |
| rnaseq | make_examples | 436.98s (7m 16s) | 427.93s (7m 7s) | 9.05s | 2.07% |
| rnaseq | total | 527.70s (8m 47s) | 518.68s (8m 38s) | 9.02s | 1.71% |

Note

* The hybrid-pacbio-illumina total shows a slight regression, but this is due to high variance in the call_variants stage (std_dev of 450s for the baseline vs 143s for gh1087). The make_examples stage itself still shows a 3.68% improvement. The call_variants variance is unrelated to jemalloc since it is not preloaded for that stage.

Statistical Significance (Welch's t-test)

With only 5 trials and relatively high variance on n2-standard-96, it's important to check which improvements are statistically significant vs. run-to-run noise.

| uid | diff (sec) | diff (%) | t-stat | df | significant? |
| --- | --- | --- | --- | --- | --- |
| pacbio | 325.41s | 14.05% | 6.06 | 4.7 | ✅ YES (p<0.01) |
| ont-r104 | 396.72s | 12.05% | 14.73 | 7.3 | ✅ YES (p<0.01) |
| wgs | 279.69s | 10.53% | 4.88 | 8.0 | ✅ YES (p<0.01) |
| hybrid-pacbio-illumina | 125.33s | 3.68% | 1.94 | 7.9 | ⚠️ Marginal (p<0.05) |
| exome | 6.05s | 3.18% | 1.18 | 4.2 | ❌ Not significant |
| rnaseq | 9.05s | 2.07% | 0.89 | 7.8 | ❌ Not significant |
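The t-stat and df columns can be recomputed from the per-stage means and standard deviations in the raw runtime tables of this comment. The sketch below assumes `std_runtime` is the sample standard deviation over the 5 trials; `welch_t` is an illustrative helper, and turning the statistic into a p-value would additionally need the t-distribution CDF (for instance from SciPy), which is left out to keep the example dependency-free.

```python
import math


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom from summary stats."""
    var_a = std_a ** 2 / n_a  # squared standard error of mean A
    var_b = std_b ** 2 / n_b  # squared standard error of mean B
    t_stat = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t_stat, df


# pacbio make_examples on n2-standard-96: baseline vs PR build, 5 trials each.
t_stat, df = welch_t(2315.63, 115.204, 5, 1990.22, 34.021, 5)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

With the pacbio inputs this reproduces the t = 6.06, df = 4.7 row of the table; the other rows follow the same way from their make_examples entries.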

Important

On n2-standard-96, only 3 of 6 datasets (wgs, ont-r104, pacbio) show clearly significant make_examples speedups. The exome and rnaseq improvements are indistinguishable from noise at this sample size.

However, all 6 datasets show improvement in the same direction — if the effect were pure noise, we'd expect some positive and some negative. The c3d-standard-16 results (which have much tighter variance) confirm significance across all 6 datasets (all p<0.01).

Other Stages (Sanity Check)

Runtimes for call_variants, postprocess_variants, and vcf_stats remain virtually identical — most differences are well under 2%, consistent with run-to-run noise. The one outlier is hybrid-pacbio-illumina call_variants (-6.12%), which is explained by the high baseline std_dev (450s) — the baseline happened to have some faster trials in that stage.

Comparison with c3d-standard-16 results

The make_examples speedup percentages on n2-standard-96 are broadly consistent with the c3d-standard-16 results, though the n2 data has higher variance:

| uid | c3d-standard-16 speedup | c3d significant? | n2-standard-96 speedup | n2 significant? |
| --- | --- | --- | --- | --- |
| pacbio | 10.95% | ✅ p<0.01 | 14.05% | ✅ p<0.01 |
| ont-r104 | 11.75% | ✅ p<0.01 | 12.05% | ✅ p<0.01 |
| wgs | 12.29% | ✅ p<0.01 | 10.53% | ✅ p<0.01 |
| hybrid-pacbio-illumina | 7.07% | ✅ p<0.01 | 3.68% | ⚠️ p<0.05 |
| rnaseq | 5.82% | ✅ p<0.01 | 2.07% | ❌ Not significant |
| exome | 4.70% | ✅ p<0.01 | 3.18% | ❌ Not significant |

Summary: The benefit is real and clearly significant for long-running workloads (wgs, pacbio, ont — saving 5–7 minutes each). For shorter workloads (exome, rnaseq), the n2 data alone can't confirm a benefit, though the c3d data (with much tighter variance) confirms it across all datasets.

gh1087-head938924210 Runtime Table

uid sample stage mean_runtime std_runtime n_trials mean_hruntime
exome HG003 make_examples 184.19 1.77 5 3m 4s
exome HG003 call_variants 33.54 0.216 5 33s
exome HG003 postprocess_variants 31.06 0.983 5 31s
exome HG003 vcf_stats 6.16 0.072 5 6s
exome HG003 total 248.78 2.193 5 4m 8s
hybrid-pacbio-illumina HG003 make_examples 3278.77 107.917 5 54m 38s
hybrid-pacbio-illumina HG003 call_variants 4386.22 143.326 5 1h 13m 6s
hybrid-pacbio-illumina HG003 postprocess_variants 239.33 8.637 5 3m 59s
hybrid-pacbio-illumina HG003 vcf_stats 310.08 6.04 5 5m 10s
hybrid-pacbio-illumina HG003 total 7904.32 197.081 5 2h 11m 44s
ont-r104 HG003 make_examples 2895.39 48.685 5 48m 15s
ont-r104 HG003 call_variants 1970.55 36.64 5 32m 50s
ont-r104 HG003 postprocess_variants 893.05 12.834 5 14m 53s
ont-r104 HG003 vcf_stats 449.9 17.86 5 7m 29s
ont-r104 HG003 total 5759.0 72.291 5 1h 35m 59s
pacbio HG003 make_examples 1990.22 34.021 5 33m 10s
pacbio HG003 call_variants 1134.72 20.244 5 18m 54s
pacbio HG003 postprocess_variants 440.31 4.567 5 7m 20s
pacbio HG003 vcf_stats 348.82 6.463 5 5m 48s
pacbio HG003 total 3565.25 56.336 5 59m 25s
rnaseq HG005 make_examples 427.93 14.606 5 7m 7s
rnaseq HG005 call_variants 25.73 0.769 5 25s
rnaseq HG005 postprocess_variants 65.01 0.535 5 1m 5s
rnaseq HG005 vcf_stats 5.45 0.094 5 5s
rnaseq HG005 total 518.68 14.794 5 8m 38s
wgs HG003 make_examples 2375.42 89.542 5 39m 35s
wgs HG003 call_variants 982.35 64.884 5 16m 22s
wgs HG003 postprocess_variants 405.67 13.119 5 6m 45s
wgs HG003 vcf_stats 316.51 5.106 5 5m 16s
wgs HG003 total 3763.43 129.075 5 1h 2m 43s

head938924210 Runtime Table

uid sample stage mean_runtime std_runtime n_trials mean_hruntime
exome HG003 make_examples 190.24 11.357 5 3m 10s
exome HG003 call_variants 34.33 2.1 5 34s
exome HG003 postprocess_variants 30.52 0.091 5 30s
exome HG003 vcf_stats 6.17 0.091 5 6s
exome HG003 total 255.09 13.253 5 4m 15s
hybrid-pacbio-illumina HG003 make_examples 3404.1 96.1 5 56m 44s
hybrid-pacbio-illumina HG003 call_variants 4133.09 450.234 5 1h 8m 53s
hybrid-pacbio-illumina HG003 postprocess_variants 235.06 8.204 5 3m 55s
hybrid-pacbio-illumina HG003 vcf_stats 311.91 9.025 5 5m 11s
hybrid-pacbio-illumina HG003 total 7772.25 532.957 5 2h 9m 32s
ont-r104 HG003 make_examples 3292.11 35.424 5 54m 52s
ont-r104 HG003 call_variants 1966.44 29.279 5 32m 46s
ont-r104 HG003 postprocess_variants 879.19 4.702 5 14m 39s
ont-r104 HG003 vcf_stats 446.33 3.943 5 7m 26s
ont-r104 HG003 total 6137.74 52.714 5 1h 42m 17s
pacbio HG003 make_examples 2315.63 115.204 5 38m 35s
pacbio HG003 call_variants 1168.11 86.518 5 19m 28s
pacbio HG003 postprocess_variants 443.32 8.427 5 7m 23s
pacbio HG003 vcf_stats 349.94 12.866 5 5m 49s
pacbio HG003 total 3927.05 205.759 5 1h 5m 27s
rnaseq HG005 make_examples 436.98 17.427 5 7m 16s
rnaseq HG005 call_variants 25.78 1.231 5 25s
rnaseq HG005 postprocess_variants 64.94 0.935 5 1m 4s
rnaseq HG005 vcf_stats 5.51 0.201 5 5s
rnaseq HG005 total 527.70 18.766 5 8m 47s
wgs HG003 make_examples 2655.11 91.839 5 44m 15s
wgs HG003 call_variants 975.37 82.033 5 16m 15s
wgs HG003 postprocess_variants 405.71 10.186 5 6m 45s
wgs HG003 vcf_stats 320.59 6.27 5 5m 20s
wgs HG003 total 4036.19 178.7 5 1h 7m 16s

@pichuan

pichuan commented Jun 28, 2026

Collaborator

> Thanks @pichuan - I'm aware of the fast-pipeline, but didn't want to touch it as it seems like it's probably under much more active development. I'd actually be very interested in the fast-pipeline for CPU, to avoid writing intermediates to disk - if contributions in that area are welcome, and wouldn't trip over in-house work, I'd be interested in helping there too.

Hi @tfenne,

Contributions (and usage!) on fast-pipeline are VERY welcome.

For context: We originally built it for a specific deployment scenario. However, after the development and testing, that scenario was no longer relevant.

At this point, I was even considering removing it, since it hasn't been advertised enough to have active users (either internally or externally). If you'd like to both improve AND use the code, that would be great. But no pressure at all :)

-pichuan

@pgrosu

pgrosu commented Jun 28, 2026

Hi Tim (@tfenne) and Pi-Chuan (@pichuan),

I'm super-excited to see that the Google DeepVariant team partially implemented an initial version of my shared-memory model concept recommendation via the fast-pipeline, and has experienced significant performance gains! The current stream-component functionality is still CPU-only: stream_examples_kernel.cc and stream_examples_ops.cc are bundled into a shared library object, deepvariant/examples_from_stream.so (including the core functionality of stream_examples.cc), without GPU utilization for I/O (this is aside from other steps in call_variants).

I call the implementation partial because it currently operates through a few bottlenecks that can be easily overcome. For example:

  • The CPU-based aligner in make_examples can be augmented with a GPU-based one (e.g. CUDASW++), since alignment consumes a significant amount of processing time during the make_examples step, as shown in an earlier post of mine.

  • The implementation depends too much on (mutex) synchronization, which is not required because candidate examples are relatively independent of each other. Shared memory objects are generated given the number of shards, and are controlled through one mutex per object. make_examples streams examples into the shared memory object for that shard, and once one is full, call_variants is called to process that object. There is an opportunity for an asynchronous approach here:

1) Example candidates can be streamed to both memory and disk, where each written example-candidate object has its own lock (in the header of the payload), and a dynamic vector or hash table tracks each object's location (disk or shared memory) and state.

2) As shared memory becomes too busy, candidates can continuously be streamed from disk, with call_variants threads able to process concurrently from either location. Lock-free mechanisms and concurrency will get around the current mutex bottlenecks.

3) The implementation should become more example-candidate-focused rather than shard-focused. One should be able to add new shared memory regions on the fly as needed, and reshape the execution flow around a dynamic thread pool (sized according to sample type) from which modular functions from all three DeepVariant steps pull work, allowing stutter-free continuous execution.

  • The architecture would benefit from having more granular, equivalent CPU and GPU functions that are used within (and build up) the make_examples, call_variants and postprocess_variants pipeline steps. In this way, execution can be reshaped on the fly, which would make the implementation of DeepVariant much more modular. CPUs provide lots of memory, plus extra memory through SSD-type disks, while GPUs provide many free processing units. The NVIDIA Unified Virtual Memory (UVM) approach merges the two, so that the pool of processing units and the pool of memory become more unified (even if the host/device mechanism is still required, and optimized via pinned memory). If the CPU cores are loaded beyond a given percentage threshold, work can be shifted during execution to the equivalent GPU functions in make_examples, call_variants and postprocess_variants.

I don't have the compute resources to implement these ideas (and am a bit limited on time), but it should take only a few minor changes by Tim and the DeepVariant team to implement some of them, which should make DeepVariant several orders of magnitude faster.

Hope it helps,
~p

3 participants