fix(experimental.pp.hvg): scope per-batch clip threshold in pearson_residuals HVG by Arshammik · Pull Request #4138 · scverse/scanpy

Arshammik · 2026-05-23T17:45:51Z

Hi,

In _highly_variable_pearson_residuals, the clip parameter is reassigned inside the per-batch loop when it comes in as None: if clip is None: clip = np.sqrt(n). After the first iteration, clip is no longer None, so every subsequent batch silently reuses the first batch's threshold instead of computing its own sqrt(n_batch).

This matters because Lause, Berens & Kobak (2021) define the clip as sqrt(N) over the cells entering the residual computation, and with batch_key each batch is its own residual computation. For a 5000/200 split, the small batch gets sqrt(5000) ≈ 70.7 instead of sqrt(200) ≈ 14.1, so residuals that should be clipped in the small batch are not, its per-gene variances are inflated, and HVG ranking shifts. There is also a quieter reproducibility consequence: because np.unique sorts the labels, the threshold that leaks into all batches is whichever one is alphabetically first, so renaming "A"↔"B" on the same data changes the HVG output.

The fix is a one-line scoping change — a loop-local clip_batch that is recomputed each iteration from the current batch's n. The user-facing clip argument is untouched, so any caller that passed an explicit value still gets exactly that value applied to every batch.

Two tests in tests/test_highly_variable_genes.py cover this. The first monkeypatches _calculate_res_sparse to capture the clip value passed on each batch and asserts two distinct values appear for a 5000/200 split; the second swaps batch labels A↔B on identical data and asserts residual_variances is unchanged. Both fail on main and pass on this branch, and the rest of the pearson-residual suite is green.

rapids-singlecell ports this routine verbatim and inherits the same bug; I have a parallel fix queued there that will reference this PR number once assigned.

Thanks,
Arsham

codecov · 2026-05-23T17:47:06Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.61%. Comparing base (4ba31e4) to head (402b167).
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4138   +/-   ##
=======================================
  Coverage   79.61%   79.61%           
=======================================
  Files         120      120           
  Lines       12786    12787    +1     
=======================================
+ Hits        10179    10181    +2     
+ Misses       2607     2606    -1

Flag	Coverage Δ
hatch-test.low-vers	`78.86% <100.00%> (+<0.01%)`	⬆️
hatch-test.pre	`79.47% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
...c/scanpy/experimental/pp/_highly_variable_genes.py	`94.28% <100.00%> (+0.05%)`	⬆️

... and 1 file with indirect coverage changes

Zethson · 2026-05-25T06:46:21Z

Thanks! Could you please ensure that the pre-commit checks pass?

…esiduals `_highly_variable_pearson_residuals` reassigned the function-scoped `clip` parameter from `None` to `sqrt(n_batch_0)` on the first batch iteration, which made the `if clip is None:` check False on every subsequent batch. The first batch's clip was therefore reused for every other batch, even though Lause-Berens-Kobak (2021) specifies a per-batch threshold of `sqrt(n_cells_in_batch)`. For an unbalanced split — e.g. two batches of 5000 and 200 cells — the small batch silently inherited `sqrt(5000) ≈ 70.7` instead of using its own `sqrt(200) ≈ 14.1`. Residuals in the small batch that should be clipped were not, inflating per-gene residual variance and shifting HVG ranking. The bug also made the result depend on alphabetical batch-label ordering (`np.unique` picks the first label), which is a reproducibility hazard. The fix introduces a loop-local `clip_batch` variable so the per-batch default is recomputed each iteration. The user-facing `clip` parameter is untouched. Two new tests in `tests/test_highly_variable_genes.py` cover the regression: one instruments `_calculate_res_sparse` to capture the per-batch clip value, the other verifies that swapping batch labels A<->B on identical data produces identical `residual_variances`.

Arshammik · 2026-05-25T21:05:26Z

@Zethson confirming that the pre-commit checks have passed.

Zethson · 2026-05-26T12:32:53Z

@jlause do you have any thoughts on this matter?

…arson_residuals` (#674) * hvg(pearson_residuals): scope per-batch clip threshold `_highly_variable_pearson_residuals` reassigned the function-scoped `clip` parameter inside the per-batch loop via `if clip is None: clip = cp.sqrt(n, ...)`. After the first batch, `clip is None` was False, so every subsequent batch reused the first batch's threshold instead of computing its own `sqrt(n_cells_in_batch)` per Lause-Berens-Kobak 2021. For a 5000/200 split with `clip=None`, the small batch silently inherited `sqrt(5000) ≈ 70.7` instead of using its own `sqrt(200) ≈ 14.1`. Residuals in the small batch that should be clipped were not, inflating their per-gene variance and shifting HVG ranking. The result also depended on which batch label was alphabetically first (since `np.unique` sorts), so renaming "A"↔"B" silently changed the HVG output. The fix introduces a loop-local `clip_batch` variable so the per-batch default is recomputed each iteration. The user-facing `clip` parameter is untouched. Two new tests in `tests/test_hvg.py`: - `test_pearson_residuals_batch_clip_scoped_per_batch` monkeypatches the underlying `_pr_cuda.csc_hvg_res` / `dense_hvg_res` bindings and asserts that two distinct clip values reach the kernel for a 5000/200 split. - `test_pearson_residuals_batch_order_invariant` swaps batch labels A↔B on identical data and asserts that `residual_variances` is unchanged. Both fail on the unpatched code and pass on the fix; the full 58-test pearson HVG subset of `tests/test_hvg.py` still passes. scanpy ports the same routine and has the same bug — see scverse/scanpy#4138. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests/hvg: slim per review Drop the whitebox test_pearson_residuals_batch_clip_scoped_per_batch (the order-invariant test already covers the observable behavior), the file:line reference docstring on the surviving test, and the inline comment above the clip_batch assignment. * docs: add release note for #674 --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Severin Dicks <37635888+Intron7@users.noreply.github.com>

Arshammik mentioned this pull request May 23, 2026

Per-batch clip threshold leaks across batches in _highly_variable_pearson_residuals scverse/rapids-singlecell#674

Merged

Arshammik force-pushed the fix/hvg-pearson-clip-per-batch branch from 47b118c to c588f69 Compare May 25, 2026 14:18

Arshammik changed the title ~~Fix per-batch clip threshold in pearson_residuals HVG~~ fix(experimental.pp.hvg): scope per-batch clip threshold in pearson_residuals HVG May 25, 2026

Arshammik force-pushed the fix/hvg-pearson-clip-per-batch branch from c588f69 to 402b167 Compare May 25, 2026 14:43

Zethson added this to the 1.12.2 milestone May 26, 2026

Zethson added the Area – Preprocessing 🧼 label May 26, 2026

Arshammik force-pushed the fix/hvg-pearson-clip-per-batch branch 2 times, most recently from 2754759 to 402b167 Compare May 26, 2026 17:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(experimental.pp.hvg): scope per-batch clip threshold in pearson_residuals HVG#4138

fix(experimental.pp.hvg): scope per-batch clip threshold in pearson_residuals HVG#4138
Arshammik wants to merge 1 commit into
scverse:mainfrom
Arshammik:fix/hvg-pearson-clip-per-batch

Arshammik commented May 23, 2026

Uh oh!

codecov Bot commented May 23, 2026 •

edited

Loading

Uh oh!

Zethson commented May 25, 2026

Uh oh!

Arshammik commented May 25, 2026

Uh oh!

Zethson commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Arshammik commented May 23, 2026

Uh oh!

codecov Bot commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Zethson commented May 25, 2026

Uh oh!

Arshammik commented May 25, 2026

Uh oh!

Zethson commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented May 23, 2026 •

edited

Loading