fix(experimental.pp.hvg): scope per-batch clip threshold in pearson_residuals HVG#4138
Open
Arshammik wants to merge 1 commit into
Open
fix(experimental.pp.hvg): scope per-batch clip threshold in pearson_residuals HVG#4138Arshammik wants to merge 1 commit into
Arshammik wants to merge 1 commit into
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4138 +/- ##
=======================================
Coverage 79.61% 79.61%
=======================================
Files 120 120
Lines 12786 12787 +1
=======================================
+ Hits 10179 10181 +2
+ Misses 2607 2606 -1
Flags with carried forward coverage won't be shown. Click here to find out more.
|
Member
|
Thanks! Could you please ensure that the pre-commit checks pass? |
47b118c to
c588f69
Compare
…esiduals `_highly_variable_pearson_residuals` reassigned the function-scoped `clip` parameter from `None` to `sqrt(n_batch_0)` on the first batch iteration, which made the `if clip is None:` check False on every subsequent batch. The first batch's clip was therefore reused for every other batch, even though Lause-Berens-Kobak (2021) specifies a per-batch threshold of `sqrt(n_cells_in_batch)`. For an unbalanced split — e.g. two batches of 5000 and 200 cells — the small batch silently inherited `sqrt(5000) ≈ 70.7` instead of using its own `sqrt(200) ≈ 14.1`. Residuals in the small batch that should be clipped were not, inflating per-gene residual variance and shifting HVG ranking. The bug also made the result depend on alphabetical batch-label ordering (`np.unique` picks the first label), which is a reproducibility hazard. The fix introduces a loop-local `clip_batch` variable so the per-batch default is recomputed each iteration. The user-facing `clip` parameter is untouched. Two new tests in `tests/test_highly_variable_genes.py` cover the regression: one instruments `_calculate_res_sparse` to capture the per-batch clip value, the other verifies that swapping batch labels A<->B on identical data produces identical `residual_variances`.
c588f69 to
402b167
Compare
Author
|
@Zethson confirming that the pre-commit checks have passed. |
Member
|
@jlause do you have any thoughts on this matter? |
2754759 to
402b167
Compare
Intron7
added a commit
to scverse/rapids-singlecell
that referenced
this pull request
May 27, 2026
…arson_residuals` (#674) * hvg(pearson_residuals): scope per-batch clip threshold `_highly_variable_pearson_residuals` reassigned the function-scoped `clip` parameter inside the per-batch loop via `if clip is None: clip = cp.sqrt(n, ...)`. After the first batch, `clip is None` was False, so every subsequent batch reused the first batch's threshold instead of computing its own `sqrt(n_cells_in_batch)` per Lause-Berens-Kobak 2021. For a 5000/200 split with `clip=None`, the small batch silently inherited `sqrt(5000) ≈ 70.7` instead of using its own `sqrt(200) ≈ 14.1`. Residuals in the small batch that should be clipped were not, inflating their per-gene variance and shifting HVG ranking. The result also depended on which batch label was alphabetically first (since `np.unique` sorts), so renaming "A"↔"B" silently changed the HVG output. The fix introduces a loop-local `clip_batch` variable so the per-batch default is recomputed each iteration. The user-facing `clip` parameter is untouched. Two new tests in `tests/test_hvg.py`: - `test_pearson_residuals_batch_clip_scoped_per_batch` monkeypatches the underlying `_pr_cuda.csc_hvg_res` / `dense_hvg_res` bindings and asserts that two distinct clip values reach the kernel for a 5000/200 split. - `test_pearson_residuals_batch_order_invariant` swaps batch labels A↔B on identical data and asserts that `residual_variances` is unchanged. Both fail on the unpatched code and pass on the fix; the full 58-test pearson HVG subset of `tests/test_hvg.py` still passes. scanpy ports the same routine and has the same bug — see scverse/scanpy#4138. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests/hvg: slim per review Drop the whitebox test_pearson_residuals_batch_clip_scoped_per_batch (the order-invariant test already covers the observable behavior), the file:line reference docstring on the surviving test, and the inline comment above the clip_batch assignment. * docs: add release note for #674 --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Severin Dicks <37635888+Intron7@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Hi,
In
_highly_variable_pearson_residuals, theclipparameter is reassigned inside the per-batch loop when it comes in asNone:if clip is None: clip = np.sqrt(n). After the first iteration,clipis no longerNone, so every subsequent batch silently reuses the first batch's threshold instead of computing its ownsqrt(n_batch).This matters because Lause, Berens & Kobak (2021) define the clip as
sqrt(N)over the cells entering the residual computation, and withbatch_keyeach batch is its own residual computation. For a 5000/200 split, the small batch getssqrt(5000) ≈ 70.7instead ofsqrt(200) ≈ 14.1, so residuals that should be clipped in the small batch are not, its per-gene variances are inflated, and HVG ranking shifts. There is also a quieter reproducibility consequence: becausenp.uniquesorts the labels, the threshold that leaks into all batches is whichever one is alphabetically first, so renaming "A"↔"B" on the same data changes the HVG output.The fix is a one-line scoping change — a loop-local
clip_batchthat is recomputed each iteration from the current batch'sn. The user-facingclipargument is untouched, so any caller that passed an explicit value still gets exactly that value applied to every batch.Two tests in
tests/test_highly_variable_genes.pycover this. The first monkeypatches_calculate_res_sparseto capture theclipvalue passed on each batch and asserts two distinct values appear for a 5000/200 split; the second swaps batch labels A↔B on identical data and assertsresidual_variancesis unchanged. Both fail onmainand pass on this branch, and the rest of the pearson-residual suite is green.rapids-singlecell ports this routine verbatim and inherits the same bug; I have a parallel fix queued there that will reference this PR number once assigned.
Thanks,
Arsham