Skip to content

fix(experimental.pp.hvg): scope per-batch clip threshold in pearson_residuals HVG#4138

Open
Arshammik wants to merge 1 commit into
scverse:mainfrom
Arshammik:fix/hvg-pearson-clip-per-batch
Open

fix(experimental.pp.hvg): scope per-batch clip threshold in pearson_residuals HVG#4138
Arshammik wants to merge 1 commit into
scverse:mainfrom
Arshammik:fix/hvg-pearson-clip-per-batch

Conversation

@Arshammik
Copy link
Copy Markdown

Hi,

In _highly_variable_pearson_residuals, the clip parameter is reassigned inside the per-batch loop when it comes in as None: if clip is None: clip = np.sqrt(n). After the first iteration, clip is no longer None, so every subsequent batch silently reuses the first batch's threshold instead of computing its own sqrt(n_batch).

This matters because Lause, Berens & Kobak (2021) define the clip as sqrt(N) over the cells entering the residual computation, and with batch_key each batch is its own residual computation. For a 5000/200 split, the small batch gets sqrt(5000) ≈ 70.7 instead of sqrt(200) ≈ 14.1, so residuals that should be clipped in the small batch are not, its per-gene variances are inflated, and HVG ranking shifts. There is also a quieter reproducibility consequence: because np.unique sorts the labels, the threshold that leaks into all batches is whichever one is alphabetically first, so renaming "A"↔"B" on the same data changes the HVG output.

The fix is a one-line scoping change — a loop-local clip_batch that is recomputed each iteration from the current batch's n. The user-facing clip argument is untouched, so any caller that passed an explicit value still gets exactly that value applied to every batch.

Two tests in tests/test_highly_variable_genes.py cover this. The first monkeypatches _calculate_res_sparse to capture the clip value passed on each batch and asserts two distinct values appear for a 5000/200 split; the second swaps batch labels A↔B on identical data and asserts residual_variances is unchanged. Both fail on main and pass on this branch, and the rest of the pearson-residual suite is green.

rapids-singlecell ports this routine verbatim and inherits the same bug; I have a parallel fix queued there that will reference this PR number once assigned.

Thanks,
Arsham

@codecov
Copy link
Copy Markdown

codecov Bot commented May 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.61%. Comparing base (4ba31e4) to head (402b167).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4138   +/-   ##
=======================================
  Coverage   79.61%   79.61%           
=======================================
  Files         120      120           
  Lines       12786    12787    +1     
=======================================
+ Hits        10179    10181    +2     
+ Misses       2607     2606    -1     
Flag Coverage Δ
hatch-test.low-vers 78.86% <100.00%> (+<0.01%) ⬆️
hatch-test.pre 79.47% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...c/scanpy/experimental/pp/_highly_variable_genes.py 94.28% <100.00%> (+0.05%) ⬆️

... and 1 file with indirect coverage changes

@Zethson
Copy link
Copy Markdown
Member

Zethson commented May 25, 2026

Thanks! Could you please ensure that the pre-commit checks pass?

@Arshammik Arshammik force-pushed the fix/hvg-pearson-clip-per-batch branch from 47b118c to c588f69 Compare May 25, 2026 14:18
@Arshammik Arshammik changed the title Fix per-batch clip threshold in pearson_residuals HVG fix(experimental.pp.hvg): scope per-batch clip threshold in pearson_residuals HVG May 25, 2026
…esiduals

`_highly_variable_pearson_residuals` reassigned the function-scoped `clip`
parameter from `None` to `sqrt(n_batch_0)` on the first batch iteration,
which made the `if clip is None:` check False on every subsequent batch.
The first batch's clip was therefore reused for every other batch, even
though Lause-Berens-Kobak (2021) specifies a per-batch threshold of
`sqrt(n_cells_in_batch)`.

For an unbalanced split — e.g. two batches of 5000 and 200 cells — the
small batch silently inherited `sqrt(5000) ≈ 70.7` instead of using its
own `sqrt(200) ≈ 14.1`. Residuals in the small batch that should be
clipped were not, inflating per-gene residual variance and shifting HVG
ranking. The bug also made the result depend on alphabetical batch-label
ordering (`np.unique` picks the first label), which is a reproducibility
hazard.

The fix introduces a loop-local `clip_batch` variable so the per-batch
default is recomputed each iteration. The user-facing `clip` parameter is
untouched.

Two new tests in `tests/test_highly_variable_genes.py` cover the
regression: one instruments `_calculate_res_sparse` to capture the
per-batch clip value, the other verifies that swapping batch labels A<->B
on identical data produces identical `residual_variances`.
@Arshammik Arshammik force-pushed the fix/hvg-pearson-clip-per-batch branch from c588f69 to 402b167 Compare May 25, 2026 14:43
@Arshammik
Copy link
Copy Markdown
Author

@Zethson confirming that the pre-commit checks have passed.

@Zethson
Copy link
Copy Markdown
Member

Zethson commented May 26, 2026

@jlause do you have any thoughts on this matter?

@Zethson Zethson added this to the 1.12.2 milestone May 26, 2026
@Arshammik Arshammik force-pushed the fix/hvg-pearson-clip-per-batch branch 2 times, most recently from 2754759 to 402b167 Compare May 26, 2026 17:58
Intron7 added a commit to scverse/rapids-singlecell that referenced this pull request May 27, 2026
…arson_residuals` (#674)

* hvg(pearson_residuals): scope per-batch clip threshold

`_highly_variable_pearson_residuals` reassigned the function-scoped
`clip` parameter inside the per-batch loop via `if clip is None: clip =
cp.sqrt(n, ...)`. After the first batch, `clip is None` was False, so
every subsequent batch reused the first batch's threshold instead of
computing its own `sqrt(n_cells_in_batch)` per Lause-Berens-Kobak 2021.

For a 5000/200 split with `clip=None`, the small batch silently inherited
`sqrt(5000) ≈ 70.7` instead of using its own `sqrt(200) ≈ 14.1`. Residuals
in the small batch that should be clipped were not, inflating their
per-gene variance and shifting HVG ranking. The result also depended on
which batch label was alphabetically first (since `np.unique` sorts), so
renaming "A"↔"B" silently changed the HVG output.

The fix introduces a loop-local `clip_batch` variable so the per-batch
default is recomputed each iteration. The user-facing `clip` parameter
is untouched.

Two new tests in `tests/test_hvg.py`:
- `test_pearson_residuals_batch_clip_scoped_per_batch` monkeypatches the
  underlying `_pr_cuda.csc_hvg_res` / `dense_hvg_res` bindings and
  asserts that two distinct clip values reach the kernel for a 5000/200
  split.
- `test_pearson_residuals_batch_order_invariant` swaps batch labels A↔B
  on identical data and asserts that `residual_variances` is unchanged.

Both fail on the unpatched code and pass on the fix; the full 58-test
pearson HVG subset of `tests/test_hvg.py` still passes.

scanpy ports the same routine and has the same bug — see scverse/scanpy#4138.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tests/hvg: slim per review

Drop the whitebox test_pearson_residuals_batch_clip_scoped_per_batch (the
order-invariant test already covers the observable behavior), the
file:line reference docstring on the surviving test, and the inline
comment above the clip_batch assignment.

* docs: add release note for #674

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Severin Dicks <37635888+Intron7@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants