Per-batch clip threshold leaks across batches in _highly_variable_pearson_residuals#674
Conversation
`_highly_variable_pearson_residuals` reassigned the function-scoped `clip` parameter inside the per-batch loop via `if clip is None: clip = cp.sqrt(n, ...)`. After the first batch, `clip is None` was False, so every subsequent batch reused the first batch's threshold instead of computing its own `sqrt(n_cells_in_batch)` per Lause-Berens-Kobak 2021. For a 5000/200 split with `clip=None`, the small batch silently inherited `sqrt(5000) ≈ 70.7` instead of using its own `sqrt(200) ≈ 14.1`. Residuals in the small batch that should be clipped were not, inflating their per-gene variance and shifting HVG ranking. The result also depended on which batch label was alphabetically first (since `np.unique` sorts), so renaming "A"↔"B" silently changed the HVG output. The fix introduces a loop-local `clip_batch` variable so the per-batch default is recomputed each iteration. The user-facing `clip` parameter is untouched. Two new tests in `tests/test_hvg.py`: - `test_pearson_residuals_batch_clip_scoped_per_batch` monkeypatches the underlying `_pr_cuda.csc_hvg_res` / `dense_hvg_res` bindings and asserts that two distinct clip values reach the kernel for a 5000/200 split. - `test_pearson_residuals_batch_order_invariant` swaps batch labels A↔B on identical data and asserts that `residual_variances` is unchanged. Both fail on the unpatched code and pass on the fix; the full 58-test pearson HVG subset of `tests/test_hvg.py` still passes. scanpy ports the same routine and has the same bug — see scverse/scanpy#4138.
for more information, see https://pre-commit.ci
|
Thanks for catching it.
Will wait on scverse/scanpy#4138 and land them together. |
Drop the whitebox test_pearson_residuals_batch_clip_scoped_per_batch (the order-invariant test already covers the observable behavior), the file:line reference docstring on the surviving test, and the inline comment above the clip_batch assignment.
|
Also we need a release note |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
📝 WalkthroughSummary by CodeRabbit
WalkthroughThis PR fixes per-batch clip threshold behavior in the Pearson residuals flavor of ChangesPearson Residuals Per-Batch Clipping
🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/test_hvg.py (1)
400-434: 🏗️ Heavy liftAdd CPU-reference parity and edge-case variants for this new test path.
The new test checks invariance on one generated shape only. Please add numerical parity against a CPU reference implementation and cover edge cases (single row, empty input if supported, and max-size split) for this scenario.
As per coding guidelines, "tests/**/*.py: Test validation: include numerical correctness checks against CPU reference implementations, and cover edge cases (single row, empty input, max-size input) beyond just 'runs without error'."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_hvg.py` around lines 400 - 434, Extend test_pearson_residuals_batch_order_invariant to include numerical parity against a CPU reference implementation and add edge-case variants: (1) single-row input, (2) empty input if the API supports it, and (3) a max-size split (e.g., one extremely large batch and one tiny batch). For parity, compute residual_variances using a pure-CPU reference path (dense numpy or scipy sparse operations replicating rsc.pp.highly_variable_genes with flavor="pearson_residuals") and assert values match the AnnData results (use np.testing.assert_allclose with a tight atol); reuse the same call signature (flavor="pearson_residuals", batch_key="batch", n_top_genes as appropriate, check_values=False) on the AnnData instances (a1/a2 and the edge-case instances) and compare their .var["residual_variances"].to_numpy() to the CPU reference outputs. Ensure each variant constructs batches by label permutations as in the original test to validate order invariance.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/test_hvg.py`:
- Around line 400-434: Extend test_pearson_residuals_batch_order_invariant to
include numerical parity against a CPU reference implementation and add
edge-case variants: (1) single-row input, (2) empty input if the API supports
it, and (3) a max-size split (e.g., one extremely large batch and one tiny
batch). For parity, compute residual_variances using a pure-CPU reference path
(dense numpy or scipy sparse operations replicating rsc.pp.highly_variable_genes
with flavor="pearson_residuals") and assert values match the AnnData results
(use np.testing.assert_allclose with a tight atol); reuse the same call
signature (flavor="pearson_residuals", batch_key="batch", n_top_genes as
appropriate, check_values=False) on the AnnData instances (a1/a2 and the
edge-case instances) and compare their .var["residual_variances"].to_numpy() to
the CPU reference outputs. Ensure each variant constructs batches by label
permutations as in the original test to validate order invariance.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0c173409-b092-4941-8d03-dffa728c8530
📒 Files selected for processing (3)
docs/release-notes/0.15.2.mdsrc/rapids_singlecell/preprocessing/_hvg/_pearson_residuals.pytests/test_hvg.py
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #674 +/- ##
=======================================
Coverage 88.83% 88.83%
=======================================
Files 100 100
Lines 7615 7616 +1
=======================================
+ Hits 6765 6766 +1
Misses 850 850
|
In
_highly_variable_pearson_residuals, the default clip threshold is computed inside the per-batch loop guarded byif clip is None. Because the assignment writes back to theclipparameter itself, the guard is only true on the first iteration; every subsequent batch reuses the first batch's threshold instead of computing its own.Lause, Berens & Kobak (2021) define the clip as
sqrt(N)on the cell count entering the residual computation, so withbatch_keythe correct value issqrt(n_cells_in_batch). On a 5000/200 split, the 200-cell batch silently inheritssqrt(5000) ≈ 70.7instead ofsqrt(200) ≈ 14.1. Residuals in the small batch that should be clipped are not, its per-gene variance is inflated, and HVG ranking shifts. The output also depends onnp.unique's alphabetical ordering of batch labels, so renaming"A"to"B"(and vice versa) on otherwise identical data changes the HVG list — not the kind of reproducibility surprise one wants buried in a preprocessing default.The fix introduces a loop-local
clip_batchso the per-batch default is recomputed each iteration; the user-facingclipparameter is left alone. Two tests cover it.test_pearson_residuals_batch_clip_scoped_per_batchmonkeypatches thecsc_hvg_res/dense_hvg_resbindings and asserts that two distinct clip values reach the kernel across batches — on the buggy code both calls receive the same value.test_pearson_residuals_batch_order_invariantswapsA↔Bon identical data and checks thatresidual_variancesmatches bit-for-bit. Both fail on the unpatched code; the rest of the pearson + HVG suite (58 tests) still passes. Verified on an H100.scanpy ports this routine almost verbatim and carries the same bug — I have a parallel PR open at scverse/scanpy#4138 with the same one-line scoping fix and equivalent CPU-side tests. Happy to land them together, or to hold this one until the scanpy side merges, whichever you prefer.