Per-batch clip threshold leaks across batches in `_highly_variable_pearson_residuals` by Arshammik · Pull Request #674 · scverse/rapids-singlecell

Arshammik · 2026-05-23T18:47:02Z

In _highly_variable_pearson_residuals, the default clip threshold is computed inside the per-batch loop guarded by if clip is None. Because the assignment writes back to the clip parameter itself, the guard is only true on the first iteration; every subsequent batch reuses the first batch's threshold instead of computing its own.

Lause, Berens & Kobak (2021) define the clip as sqrt(N) on the cell count entering the residual computation, so with batch_key the correct value is sqrt(n_cells_in_batch). On a 5000/200 split, the 200-cell batch silently inherits sqrt(5000) ≈ 70.7 instead of sqrt(200) ≈ 14.1. Residuals in the small batch that should be clipped are not, its per-gene variance is inflated, and HVG ranking shifts. The output also depends on np.unique's alphabetical ordering of batch labels, so renaming "A" to "B" (and vice versa) on otherwise identical data changes the HVG list — not the kind of reproducibility surprise one wants buried in a preprocessing default.

The fix introduces a loop-local clip_batch so the per-batch default is recomputed each iteration; the user-facing clip parameter is left alone. Two tests cover it. test_pearson_residuals_batch_clip_scoped_per_batch monkeypatches the csc_hvg_res / dense_hvg_res bindings and asserts that two distinct clip values reach the kernel across batches — on the buggy code both calls receive the same value. test_pearson_residuals_batch_order_invariant swaps A↔B on identical data and checks that residual_variances matches bit-for-bit. Both fail on the unpatched code; the rest of the pearson + HVG suite (58 tests) still passes. Verified on an H100.

scanpy ports this routine almost verbatim and carries the same bug — I have a parallel PR open at scverse/scanpy#4138 with the same one-line scoping fix and equivalent CPU-side tests. Happy to land them together, or to hold this one until the scanpy side merges, whichever you prefer.

`_highly_variable_pearson_residuals` reassigned the function-scoped `clip` parameter inside the per-batch loop via `if clip is None: clip = cp.sqrt(n, ...)`. After the first batch, `clip is None` was False, so every subsequent batch reused the first batch's threshold instead of computing its own `sqrt(n_cells_in_batch)` per Lause-Berens-Kobak 2021. For a 5000/200 split with `clip=None`, the small batch silently inherited `sqrt(5000) ≈ 70.7` instead of using its own `sqrt(200) ≈ 14.1`. Residuals in the small batch that should be clipped were not, inflating their per-gene variance and shifting HVG ranking. The result also depended on which batch label was alphabetically first (since `np.unique` sorts), so renaming "A"↔"B" silently changed the HVG output. The fix introduces a loop-local `clip_batch` variable so the per-batch default is recomputed each iteration. The user-facing `clip` parameter is untouched. Two new tests in `tests/test_hvg.py`: - `test_pearson_residuals_batch_clip_scoped_per_batch` monkeypatches the underlying `_pr_cuda.csc_hvg_res` / `dense_hvg_res` bindings and asserts that two distinct clip values reach the kernel for a 5000/200 split. - `test_pearson_residuals_batch_order_invariant` swaps batch labels A↔B on identical data and asserts that `residual_variances` is unchanged. Both fail on the unpatched code and pass on the fix; the full 58-test pearson HVG subset of `tests/test_hvg.py` still passes. scanpy ports the same routine and has the same bug — see scverse/scanpy#4138.

for more information, see https://pre-commit.ci

Intron7 · 2026-05-26T12:05:36Z

Thanks for catching it.
The PR is much larger than the fix needs to be. Please slim it down:

Keep the clip_batch change (3 lines).
Keep test_pearson_residuals_batch_order_invariant — that's the right test, it covers the observable behavior.
Drop test_pearson_residuals_batch_clip_scoped_per_batch.
Drop the multi-line comment and the file:line references in the test docstring.

Will wait on scverse/scanpy#4138 and land them together.

Drop the whitebox test_pearson_residuals_batch_clip_scoped_per_batch (the order-invariant test already covers the observable behavior), the file:line reference docstring on the surviving test, and the inline comment above the clip_batch assignment.

Intron7 · 2026-05-26T18:49:52Z

Also we need a release note

Intron7 · 2026-05-27T15:53:00Z

@coderabbitai review

coderabbitai · 2026-05-27T15:53:05Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai · 2026-05-27T15:57:36Z

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added support for multiple pseudobulk-based distance metrics in the Distance tool.
Bug Fixes
- Fixed per-batch clipping threshold behavior in highly variable genes computation with batch information, where each batch now independently calculates its threshold based on cell count.

Walkthrough

This PR fixes per-batch clip threshold behavior in the Pearson residuals flavor of highly_variable_genes. Each batch now independently computes its clip value from sqrt(n_cells_in_batch) when clip is not specified, rather than reusing a global value. The fix includes a validation test and corresponding release notes.

Changes

Pearson Residuals Per-Batch Clipping

Layer / File(s)	Summary
Batch-aware clipping threshold computation and test `src/rapids_singlecell/preprocessing/_hvg/_pearson_residuals.py`, `tests/test_hvg.py`	Batch-specific clipping value is computed as `sqrt(n_cells)` when `clip` is `None`, validated for non-negativity, and converted to `clip_val` for CUDA kernel arguments. New test `test_pearson_residuals_batch_order_invariant` verifies `residual_variances` are invariant to batch label ordering by comparing results from two `AnnData` objects with swapped batch assignments.
Release notes documentation `docs/release-notes/0.15.2.md`	Release notes for version 0.15.2 document the per-batch clipping threshold fix in `pp.highly_variable_genes(flavor="pearson_residuals", batch_key=...)`.

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: a per-batch clip threshold bug in the Pearson residuals HVG function where the threshold leaks across batches.
Description check	✅ Passed	The description is detailed and directly related to the changeset, explaining the bug mechanism, impact, and how the fix addresses it with supporting test coverage.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

tests/test_hvg.py (1)
400-434: 🏗️ Heavy lift

Add CPU-reference parity and edge-case variants for this new test path.

The new test checks invariance on one generated shape only. Please add numerical parity against a CPU reference implementation and cover edge cases (single row, empty input if supported, and max-size split) for this scenario.

As per coding guidelines, "tests/**/*.py: Test validation: include numerical correctness checks against CPU reference implementations, and cover edge cases (single row, empty input, max-size input) beyond just 'runs without error'."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_hvg.py` around lines 400 - 434, Extend
test_pearson_residuals_batch_order_invariant to include numerical parity against
a CPU reference implementation and add edge-case variants: (1) single-row input,
(2) empty input if the API supports it, and (3) a max-size split (e.g., one
extremely large batch and one tiny batch). For parity, compute
residual_variances using a pure-CPU reference path (dense numpy or scipy sparse
operations replicating rsc.pp.highly_variable_genes with
flavor="pearson_residuals") and assert values match the AnnData results (use
np.testing.assert_allclose with a tight atol); reuse the same call signature
(flavor="pearson_residuals", batch_key="batch", n_top_genes as appropriate,
check_values=False) on the AnnData instances (a1/a2 and the edge-case instances)
and compare their .var["residual_variances"].to_numpy() to the CPU reference
outputs. Ensure each variant constructs batches by label permutations as in the
original test to validate order invariance.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/test_hvg.py`:
- Around line 400-434: Extend test_pearson_residuals_batch_order_invariant to
include numerical parity against a CPU reference implementation and add
edge-case variants: (1) single-row input, (2) empty input if the API supports
it, and (3) a max-size split (e.g., one extremely large batch and one tiny
batch). For parity, compute residual_variances using a pure-CPU reference path
(dense numpy or scipy sparse operations replicating rsc.pp.highly_variable_genes
with flavor="pearson_residuals") and assert values match the AnnData results
(use np.testing.assert_allclose with a tight atol); reuse the same call
signature (flavor="pearson_residuals", batch_key="batch", n_top_genes as
appropriate, check_values=False) on the AnnData instances (a1/a2 and the
edge-case instances) and compare their .var["residual_variances"].to_numpy() to
the CPU reference outputs. Ensure each variant constructs batches by label
permutations as in the original test to validate order invariance.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0c173409-b092-4941-8d03-dffa728c8530

📥 Commits

Reviewing files that changed from the base of the PR and between 32c424a and 3104452.

📒 Files selected for processing (3)

docs/release-notes/0.15.2.md
src/rapids_singlecell/preprocessing/_hvg/_pearson_residuals.py
tests/test_hvg.py

codecov-commenter · 2026-05-27T16:18:25Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.83%. Comparing base (32c424a) to head (3104452).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #674   +/-   ##
=======================================
  Coverage   88.83%   88.83%           
=======================================
  Files         100      100           
  Lines        7615     7616    +1     
=======================================
+ Hits         6765     6766    +1     
  Misses        850      850

Files with missing lines	Coverage Δ
...inglecell/preprocessing/_hvg/_pearson_residuals.py	`94.31% <100.00%> (+0.06%)`	⬆️

Arshammik and others added 3 commits May 23, 2026 13:49

[pre-commit.ci] auto fixes from pre-commit.com hooks

36a1ddd

for more information, see https://pre-commit.ci

Merge branch 'main' into fix/hvg-pearson-clip-per-batch

11bf679

tests/hvg: slim per review

f841432

Drop the whitebox test_pearson_residuals_batch_clip_scoped_per_batch (the order-invariant test already covers the observable behavior), the file:line reference docstring on the surviving test, and the inline comment above the clip_batch assignment.

Intron7 and others added 2 commits May 26, 2026 20:49

Merge branch 'main' into fix/hvg-pearson-clip-per-batch

3d10a23

docs: add release note for scverse#674

5b8d537

Intron7 added the run-gpu-ci label May 27, 2026

Merge branch 'main' into fix/hvg-pearson-clip-per-batch

3104452

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

Intron7 merged commit 55b331f into scverse:main May 27, 2026
17 of 18 checks passed

Arshammik deleted the fix/hvg-pearson-clip-per-batch branch May 27, 2026 16:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Per-batch clip threshold leaks across batches in `_highly_variable_pearson_residuals`#674

Per-batch clip threshold leaks across batches in `_highly_variable_pearson_residuals`#674
Intron7 merged 7 commits into
scverse:mainfrom
Arshammik:fix/hvg-pearson-clip-per-batch

Arshammik commented May 23, 2026

Uh oh!

Intron7 commented May 26, 2026

Uh oh!

Intron7 commented May 26, 2026

Uh oh!

Intron7 commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026

Summary by CodeRabbit

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Uh oh!

codecov-commenter commented May 27, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Arshammik commented May 23, 2026

Uh oh!

Intron7 commented May 26, 2026

Uh oh!

Intron7 commented May 26, 2026

Uh oh!

Intron7 commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026

Summary by CodeRabbit

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov-commenter commented May 27, 2026 •

edited

Loading