Skip to content

Per-batch clip threshold leaks across batches in _highly_variable_pearson_residuals#674

Merged
Intron7 merged 7 commits into
scverse:mainfrom
Arshammik:fix/hvg-pearson-clip-per-batch
May 27, 2026
Merged

Per-batch clip threshold leaks across batches in _highly_variable_pearson_residuals#674
Intron7 merged 7 commits into
scverse:mainfrom
Arshammik:fix/hvg-pearson-clip-per-batch

Conversation

@Arshammik
Copy link
Copy Markdown
Contributor

In _highly_variable_pearson_residuals, the default clip threshold is computed inside the per-batch loop guarded by if clip is None. Because the assignment writes back to the clip parameter itself, the guard is only true on the first iteration; every subsequent batch reuses the first batch's threshold instead of computing its own.

Lause, Berens & Kobak (2021) define the clip as sqrt(N) on the cell count entering the residual computation, so with batch_key the correct value is sqrt(n_cells_in_batch). On a 5000/200 split, the 200-cell batch silently inherits sqrt(5000) ≈ 70.7 instead of sqrt(200) ≈ 14.1. Residuals in the small batch that should be clipped are not, its per-gene variance is inflated, and HVG ranking shifts. The output also depends on np.unique's alphabetical ordering of batch labels, so renaming "A" to "B" (and vice versa) on otherwise identical data changes the HVG list — not the kind of reproducibility surprise one wants buried in a preprocessing default.

The fix introduces a loop-local clip_batch so the per-batch default is recomputed each iteration; the user-facing clip parameter is left alone. Two tests cover it. test_pearson_residuals_batch_clip_scoped_per_batch monkeypatches the csc_hvg_res / dense_hvg_res bindings and asserts that two distinct clip values reach the kernel across batches — on the buggy code both calls receive the same value. test_pearson_residuals_batch_order_invariant swaps AB on identical data and checks that residual_variances matches bit-for-bit. Both fail on the unpatched code; the rest of the pearson + HVG suite (58 tests) still passes. Verified on an H100.

scanpy ports this routine almost verbatim and carries the same bug — I have a parallel PR open at scverse/scanpy#4138 with the same one-line scoping fix and equivalent CPU-side tests. Happy to land them together, or to hold this one until the scanpy side merges, whichever you prefer.

Arshammik and others added 3 commits May 23, 2026 13:49
`_highly_variable_pearson_residuals` reassigned the function-scoped
`clip` parameter inside the per-batch loop via `if clip is None: clip =
cp.sqrt(n, ...)`. After the first batch, `clip is None` was False, so
every subsequent batch reused the first batch's threshold instead of
computing its own `sqrt(n_cells_in_batch)` per Lause-Berens-Kobak 2021.

For a 5000/200 split with `clip=None`, the small batch silently inherited
`sqrt(5000) ≈ 70.7` instead of using its own `sqrt(200) ≈ 14.1`. Residuals
in the small batch that should be clipped were not, inflating their
per-gene variance and shifting HVG ranking. The result also depended on
which batch label was alphabetically first (since `np.unique` sorts), so
renaming "A"↔"B" silently changed the HVG output.

The fix introduces a loop-local `clip_batch` variable so the per-batch
default is recomputed each iteration. The user-facing `clip` parameter
is untouched.

Two new tests in `tests/test_hvg.py`:
- `test_pearson_residuals_batch_clip_scoped_per_batch` monkeypatches the
  underlying `_pr_cuda.csc_hvg_res` / `dense_hvg_res` bindings and
  asserts that two distinct clip values reach the kernel for a 5000/200
  split.
- `test_pearson_residuals_batch_order_invariant` swaps batch labels A↔B
  on identical data and asserts that `residual_variances` is unchanged.

Both fail on the unpatched code and pass on the fix; the full 58-test
pearson HVG subset of `tests/test_hvg.py` still passes.

scanpy ports the same routine and has the same bug — see scverse/scanpy#4138.
@Intron7
Copy link
Copy Markdown
Member

Intron7 commented May 26, 2026

Thanks for catching it.
The PR is much larger than the fix needs to be. Please slim it down:

  • Keep the clip_batch change (3 lines).
  • Keep test_pearson_residuals_batch_order_invariant — that's the right test, it covers the observable behavior.
  • Drop test_pearson_residuals_batch_clip_scoped_per_batch.
  • Drop the multi-line comment and the file:line references in the test docstring.

Will wait on scverse/scanpy#4138 and land them together.

Drop the whitebox test_pearson_residuals_batch_clip_scoped_per_batch (the
order-invariant test already covers the observable behavior), the
file:line reference docstring on the surviving test, and the inline
comment above the clip_batch assignment.
@Intron7
Copy link
Copy Markdown
Member

Intron7 commented May 26, 2026

Also we need a release note

@Intron7
Copy link
Copy Markdown
Member

Intron7 commented May 27, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added support for multiple pseudobulk-based distance metrics in the Distance tool.
  • Bug Fixes

    • Fixed per-batch clipping threshold behavior in highly variable genes computation with batch information, where each batch now independently calculates its threshold based on cell count.

Walkthrough

This PR fixes per-batch clip threshold behavior in the Pearson residuals flavor of highly_variable_genes. Each batch now independently computes its clip value from sqrt(n_cells_in_batch) when clip is not specified, rather than reusing a global value. The fix includes a validation test and corresponding release notes.

Changes

Pearson Residuals Per-Batch Clipping

Layer / File(s) Summary
Batch-aware clipping threshold computation and test
src/rapids_singlecell/preprocessing/_hvg/_pearson_residuals.py, tests/test_hvg.py
Batch-specific clipping value is computed as sqrt(n_cells) when clip is None, validated for non-negativity, and converted to clip_val for CUDA kernel arguments. New test test_pearson_residuals_batch_order_invariant verifies residual_variances are invariant to batch label ordering by comparing results from two AnnData objects with swapped batch assignments.
Release notes documentation
docs/release-notes/0.15.2.md
Release notes for version 0.15.2 document the per-batch clipping threshold fix in pp.highly_variable_genes(flavor="pearson_residuals", batch_key=...).

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: a per-batch clip threshold bug in the Pearson residuals HVG function where the threshold leaks across batches.
Description check ✅ Passed The description is detailed and directly related to the changeset, explaining the bug mechanism, impact, and how the fix addresses it with supporting test coverage.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/test_hvg.py (1)

400-434: 🏗️ Heavy lift

Add CPU-reference parity and edge-case variants for this new test path.

The new test checks invariance on one generated shape only. Please add numerical parity against a CPU reference implementation and cover edge cases (single row, empty input if supported, and max-size split) for this scenario.

As per coding guidelines, "tests/**/*.py: Test validation: include numerical correctness checks against CPU reference implementations, and cover edge cases (single row, empty input, max-size input) beyond just 'runs without error'."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_hvg.py` around lines 400 - 434, Extend
test_pearson_residuals_batch_order_invariant to include numerical parity against
a CPU reference implementation and add edge-case variants: (1) single-row input,
(2) empty input if the API supports it, and (3) a max-size split (e.g., one
extremely large batch and one tiny batch). For parity, compute
residual_variances using a pure-CPU reference path (dense numpy or scipy sparse
operations replicating rsc.pp.highly_variable_genes with
flavor="pearson_residuals") and assert values match the AnnData results (use
np.testing.assert_allclose with a tight atol); reuse the same call signature
(flavor="pearson_residuals", batch_key="batch", n_top_genes as appropriate,
check_values=False) on the AnnData instances (a1/a2 and the edge-case instances)
and compare their .var["residual_variances"].to_numpy() to the CPU reference
outputs. Ensure each variant constructs batches by label permutations as in the
original test to validate order invariance.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/test_hvg.py`:
- Around line 400-434: Extend test_pearson_residuals_batch_order_invariant to
include numerical parity against a CPU reference implementation and add
edge-case variants: (1) single-row input, (2) empty input if the API supports
it, and (3) a max-size split (e.g., one extremely large batch and one tiny
batch). For parity, compute residual_variances using a pure-CPU reference path
(dense numpy or scipy sparse operations replicating rsc.pp.highly_variable_genes
with flavor="pearson_residuals") and assert values match the AnnData results
(use np.testing.assert_allclose with a tight atol); reuse the same call
signature (flavor="pearson_residuals", batch_key="batch", n_top_genes as
appropriate, check_values=False) on the AnnData instances (a1/a2 and the
edge-case instances) and compare their .var["residual_variances"].to_numpy() to
the CPU reference outputs. Ensure each variant constructs batches by label
permutations as in the original test to validate order invariance.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0c173409-b092-4941-8d03-dffa728c8530

📥 Commits

Reviewing files that changed from the base of the PR and between 32c424a and 3104452.

📒 Files selected for processing (3)
  • docs/release-notes/0.15.2.md
  • src/rapids_singlecell/preprocessing/_hvg/_pearson_residuals.py
  • tests/test_hvg.py

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.83%. Comparing base (32c424a) to head (3104452).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #674   +/-   ##
=======================================
  Coverage   88.83%   88.83%           
=======================================
  Files         100      100           
  Lines        7615     7616    +1     
=======================================
+ Hits         6765     6766    +1     
  Misses        850      850           
Files with missing lines Coverage Δ
...inglecell/preprocessing/_hvg/_pearson_residuals.py 94.31% <100.00%> (+0.06%) ⬆️

@Intron7 Intron7 merged commit 55b331f into scverse:main May 27, 2026
17 of 18 checks passed
@Arshammik Arshammik deleted the fix/hvg-pearson-clip-per-batch branch May 27, 2026 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants