Add fast scib-metrics benchmarking options for large datasets #295

Merged: nictru merged 6 commits into dev from feat/scib-fast-benchmark on Jun 25, 2026

nictru (Collaborator) commented Jun 16, 2026

Summary

  • Add pipeline parameters for optional stratified subsampling (scib_max_cells, scib_subsample_strategy, scib_subsample_seed) and metric profiles (scib_metric_profile: fast/full).
  • Pass scib tuning options as explicit process inputs to SCIBMETRICS_BENCHMARK (not via modules.config ext fields).
  • Implement subsampling, fast metric selection, and parallel neighbor search (n_jobs=task.cpus) in SCIBMETRICS_BENCHMARK.
  • Publish {method}_benchmark_info.json and document subsampling/profile settings in MultiQC table descriptions.
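
The stratified subsampling step listed above can be sketched as follows. This is a minimal, dependency-free illustration rather than the module's actual code: the function name `stratified_subsample`, the proportional allocation rule, and the use of one stratum key per cell (for example a batch/label pair) are assumptions. The PR itself only states that subsampling is stratified, capped by scib_max_cells, and seeded by scib_subsample_seed.

```python
import random
from collections import defaultdict


def stratified_subsample(strata, max_cells, seed=0):
    """Return sorted cell indices of a proportional stratified subsample.

    strata: one hashable key per cell (hypothetical, e.g. (batch, label)).
    max_cells: target number of cells; None disables subsampling.
    """
    if max_cells is not None and max_cells < 1:
        raise ValueError("max_cells must be a positive integer or None")

    n_cells = len(strata)
    if max_cells is None or n_cells <= max_cells:
        return list(range(n_cells))  # small enough: keep every cell

    # Collect the cell indices that belong to each stratum.
    groups = defaultdict(list)
    for idx, key in enumerate(strata):
        groups[key].append(idx)

    rng = random.Random(seed)  # local RNG so the result is reproducible
    chosen = []
    for members in groups.values():
        # Proportional quota (integer arithmetic), at least one cell per
        # stratum so that rare groups are not dropped entirely.
        quota = max(1, len(members) * max_cells // n_cells)
        chosen.extend(rng.sample(members, quota))
    return sorted(chosen)
```

Because every stratum keeps at least one cell, the result can exceed max_cells by up to the number of strata when many strata are very small. Whether the pipeline makes the same trade-off is not stated in this PR.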

Defaults

  • scib_metric_profile=fast for routine pipeline runs (disables isolated labels, k-means NMI/ARI, and PCR comparison; skips count-matrix rebuild).
  • No subsampling unless --scib_max_cells is set (recommended ~50k–100k for very large objects).
  • Use --scib_metric_profile full for publication-style benchmarking comparable to default scib-metrics.
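
One way to picture the two profiles is as a small mapping from profile name to metric switches. The sketch below is an assumption-laden illustration: the helper name `profile_to_metric_flags` is hypothetical, and the dictionary keys are assumed to mirror field names of the scib-metrics `BioConservation` and `BatchCorrection` settings, which should be checked against the installed scib-metrics version. Only the three metrics named in this PR are toggled.

```python
def profile_to_metric_flags(profile):
    """Map a metric profile name to on/off switches for the affected metrics.

    Returns (bio_flags, batch_flags). Key names are assumed, not confirmed
    by this PR; metrics not listed keep whatever defaults the caller uses.
    """
    if profile not in ("fast", "full"):
        raise ValueError(f"unknown scib_metric_profile: {profile!r}")

    enabled = profile == "full"  # 'fast' switches the costly metrics off
    bio_flags = {
        "isolated_labels": enabled,
        "nmi_ari_cluster_labels_kmeans": enabled,
    }
    batch_flags = {
        "pcr_comparison": enabled,
    }
    return bio_flags, batch_flags
```

If the key names match, the two dictionaries could be unpacked into the corresponding scib-metrics settings objects. The count-matrix rebuild that the fast profile skips is not modelled here.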

Test plan

  • Run module test: nf-test test modules/local/scibmetrics/benchmark/tests/main.nf.test
  • Run pipeline with -profile test --scib true and verify combine/integrate/scib_metrics/*_benchmark_info.json and MultiQC scib-metrics table
  • On a large integrated object, compare runtime and method ranking between --scib_metric_profile fast and full, with and without --scib_max_cells 100000
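
The second check in the test plan could be partly automated with a small script like the one below. It is a sketch under assumptions: the helper name `load_benchmark_info` is hypothetical, the combine/integrate/scib_metrics path is assumed to be relative to the pipeline output directory, and no particular JSON keys are checked because this PR does not list them.

```python
import json
from pathlib import Path

SUFFIX = "_benchmark_info.json"


def load_benchmark_info(outdir):
    """Load every *_benchmark_info.json file, keyed by method name."""
    info_dir = Path(outdir) / "combine" / "integrate" / "scib_metrics"
    files = sorted(info_dir.glob("*" + SUFFIX))
    if not files:
        raise FileNotFoundError(f"no *{SUFFIX} files found under {info_dir}")

    info = {}
    for path in files:
        method = path.name[: -len(SUFFIX)]  # strip the fixed suffix
        with path.open() as handle:
            info[method] = json.load(handle)  # raises on malformed JSON
    return info
```

Calling `load_benchmark_info("results")` after a test run would then fail loudly if no metadata file was published or if one of them is not valid JSON.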

github-actions (bot) commented Jun 16, 2026

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit e0bd9de

✅ 298 tests passed
❔   1 tests had warnings
❗  15 tests had warnings

❗ Test warnings:

  • files_exist - File not found: conf/igenomes.config
  • files_exist - File not found: conf/igenomes_ignored.config
  • readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
  • pipeline_todos - TODO string in README.md: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file.
  • pipeline_todos - TODO string in README.md: Add bibliography of tools and data used in your pipeline
  • pipeline_todos - TODO string in nextflow.config: Optionally, you can add a pipeline-specific nf-core config at https://github.com/nf-core/configs
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in CONTRIBUTING.md: Add any pipeline specific contribution guidelines here, such as coding styles, procedures, checklists etc.
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in nextflow.config: Specify any additional parameters here
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes
  • pipeline_todos - TODO string in base.config: Customise requirements for specific processes.
  • pipeline_todos - TODO string in awsfulltest.yml: You can customise AWS full pipeline tests as required

❔ Tests fixed:

✅ Tests passed:

Run details

  • nf-core/tools version 4.0.2
  • Run at 2026-06-25 05:19:26

nictru added 3 commits June 24, 2026 12:00

  • Expose subsampling, metric profiles, and optional FAISS neighbors so integration benchmarking stays practical on large objects without sacrificing reproducibility metadata.
  • Use pandas groupby instead of a null-byte separator in the Nextflow template, remove invalid null schema default for scib_max_cells, and rebuild the Wave container for faiss-cpu.
  • Record new *_benchmark_info.json paths in stable_path snapshots after SCIBMETRICS_BENCHMARK started publishing benchmark metadata.

nictru force-pushed the feat/scib-fast-benchmark branch from 9a89349 to 9e9f655 on June 24, 2026 10:01

nictru added 3 commits June 24, 2026 12:48

  • Thread scib tuning params through the workflow call chain so SCIBMETRICS_BENCHMARK receives them explicitly rather than via modules.config.
  • Drop scib_neighbor_backend and faiss-cpu since the Apptainer image cannot import faiss reliably; scib-metrics uses its default neighbor search instead.

nictru merged commit dfc1482 into dev on Jun 25, 2026 (56 checks passed)
nictru deleted the feat/scib-fast-benchmark branch on June 25, 2026 06:59