Add fast scib-metrics benchmarking options for large datasets #295
Merged
Expose subsampling, metric profiles, and optional FAISS neighbors so integration benchmarking stays practical on large objects without sacrificing reproducibility metadata.
Use pandas groupby instead of a null-byte separator in the Nextflow template, remove invalid null schema default for scib_max_cells, and rebuild the Wave container for faiss-cpu.
Record new *_benchmark_info.json paths in stable_path snapshots after SCIBMETRICS_BENCHMARK started publishing benchmark metadata.
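A minimal sketch of how seeded subsampling and its reproducibility metadata might fit together. This is not the code from this PR: the strategy names ("random", "stratified"), the helper names, and the JSON field names are all assumptions made for illustration. Only the parameter names scib_max_cells, scib_subsample_strategy, scib_subsample_seed, and scib_metric_profile come from the PR itself.

```python
import json
import random
from collections import defaultdict


def subsample_indices(labels, max_cells, strategy="stratified", seed=0):
    """Return sorted cell indices, at most max_cells of them.

    Hypothetical strategies:
      "random"     - uniform sample without replacement
      "stratified" - sample each label group in proportion to its size,
                     keeping at least one cell per group
    """
    n = len(labels)
    # No cap set, or object already small enough: keep every cell.
    if max_cells is None or n <= max_cells:
        return list(range(n))

    rng = random.Random(seed)  # fixed seed makes the selection repeatable
    if strategy == "random":
        return sorted(rng.sample(range(n), max_cells))
    if strategy != "stratified":
        raise ValueError(f"unknown strategy: {strategy!r}")

    groups = defaultdict(list)
    for i, lab in enumerate(labels):
        groups[lab].append(i)

    picked = []
    for lab in sorted(groups, key=str):
        members = groups[lab]
        # Proportional share, rounded down, but never zero.
        k = max(1, (len(members) * max_cells) // n)
        picked.extend(rng.sample(members, k))

    # Many tiny groups can push the total over the cap; trim if so.
    if len(picked) > max_cells:
        picked = rng.sample(picked, max_cells)
    return sorted(picked)


def benchmark_info(n_total, indices, strategy, seed, profile):
    """Collect the settings needed to reproduce a run (field names assumed)."""
    return {
        "n_cells_total": n_total,
        "n_cells_used": len(indices),
        "scib_subsample_strategy": strategy,
        "scib_subsample_seed": seed,
        "scib_metric_profile": profile,
    }


if __name__ == "__main__":
    labels = ["a"] * 80 + ["b"] * 20
    idx = subsample_indices(labels, max_cells=10, strategy="stratified", seed=1)
    print(json.dumps(benchmark_info(len(labels), idx, "stratified", 1, "fast"), indent=2))
```

Writing the seed and strategy next to the metrics means a subsampled run can be repeated exactly, which is the point of publishing the metadata file alongside the results.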
Force-pushed from 9a89349 to 9e9f655.
Thread scib tuning params through the workflow call chain so SCIBMETRICS_BENCHMARK receives them explicitly rather than via modules.config.
Drop scib_neighbor_backend and faiss-cpu since the Apptainer image cannot import faiss reliably; scib-metrics uses its default neighbor search instead.
Summary
- Add optional subsampling (scib_max_cells, scib_subsample_strategy, scib_subsample_seed) and metric profiles (scib_metric_profile: fast/full).
- Pass the scib tuning parameters explicitly to SCIBMETRICS_BENCHMARK (not via modules.config ext fields).
- Run metrics in parallel (n_jobs=task.cpus) in SCIBMETRICS_BENCHMARK.
- Publish {method}_benchmark_info.json and document subsampling/profile settings in MultiQC table descriptions.

Defaults
- scib_metric_profile=fast for routine pipeline runs (disables isolated labels, k-means NMI/ARI, and PCR comparison; skips count-matrix rebuild).
- Subsampling happens only when --scib_max_cells is set (recommended ~50k–100k for very large objects).
- Use --scib_metric_profile full for publication-style benchmarking comparable to default scib-metrics.

Test plan
- nf-test test modules/local/scibmetrics/benchmark/tests/main.nf.test
- Run the pipeline with -profile test --scib true and verify combine/integrate/scib_metrics/*_benchmark_info.json and the MultiQC scib-metrics table
- Run with --scib_metric_profile fast and full, with and without --scib_max_cells 100000
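
The fast and full profiles listed under Defaults can be read as a lookup from profile name to a set of on/off switches. The sketch below shows that reading only; it is not the module's actual code. The field names are assumptions, loosely modeled on the option names scib-metrics uses for these metrics, and rebuild_counts is a hypothetical flag standing in for the count-matrix rebuild step.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MetricProfile:
    # Field names are illustrative assumptions, not the module's real API.
    isolated_labels: bool
    nmi_ari_cluster_labels_kmeans: bool
    pcr_comparison: bool
    rebuild_counts: bool  # hypothetical: whether the count matrix is rebuilt


PROFILES = {
    # fast: skip the three expensive metrics and the count-matrix rebuild
    "fast": MetricProfile(
        isolated_labels=False,
        nmi_ari_cluster_labels_kmeans=False,
        pcr_comparison=False,
        rebuild_counts=False,
    ),
    # full: everything on, intended to be comparable to default scib-metrics
    "full": MetricProfile(
        isolated_labels=True,
        nmi_ari_cluster_labels_kmeans=True,
        pcr_comparison=True,
        rebuild_counts=True,
    ),
}


def resolve_profile(name):
    """Return the switches for a profile name, rejecting unknown names early."""
    try:
        return PROFILES[name]
    except KeyError:
        valid = ", ".join(sorted(PROFILES))
        raise ValueError(f"unknown scib_metric_profile {name!r}; expected one of: {valid}")


if __name__ == "__main__":
    for name in ("fast", "full"):
        print(name, asdict(resolve_profile(name)))
```

Keeping the profiles in one table makes it easy to record the resolved switches in the benchmark metadata, so a reader of the MultiQC table can tell which metrics were skipped.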