Memory optimization for cal_score() and top_markers()#1
Merged
Conversation
* Drop forced as.matrix(data) in cal_score() matrix method so that dgCMatrix inputs now flow through to the existing sparse-aware row/col aggregators instead of being densified on entry. * Add return.intermediate = FALSE to cal_score() (both methods) and cal_score_init(). Default path no longer stores the full-size tf/idf/iae matrices in SummarizedExperiment metadata, which on a 20k x 100k input saved roughly 3x the dense-matrix footprint per call. Explicitly clear stale metadata when the flag is FALSE to avoid leakage from prior runs. * Remove the df_n_inv <- !df_n dead code in idf_prob(); it was never used downstream and carried a logical matrix the same size as expr. * Add tests/testthat/test-numerical-equivalence.R and the frozen legacy_scores.rds snapshot captured from the pre-refactor HEAD. These pin cal_score() and top_markers() outputs (GLM, GLM+batch, abs-mean, abs-median) at 1e-10 tolerance for all downstream phases. Breaking change: return.intermediate defaults to FALSE. Callers that relied on metadata(se)$tf / $idf / $iae must pass return.intermediate = TRUE. Documented in the roxygen @param block; NEWS entry deferred to Phase D so the user-visible announcement lands alongside the full refactor.
Core change: labelled `idf_prob`, `idf_rf`, `iae_prob` and `iae_rf` now return a compact G x K matrix (columns = unique labels) instead of expanding to G x N via `[, label]`. `cal_score_init()` composes the final score through per-group column blocks in the new `combine_tf_idf_iae()` helper, so at most one G x N matrix is alive at any time regardless of how many factors are compact. Additional Phase B edits: * `tf()` rewritten to stay sparse on `dgCMatrix` inputs by right- multiplying with `Matrix::Diagonal(1 / (colSums + 0.01))` and relying on `log1p(0) == 0`. The legacy `sweep()` path would fall back to `as.array()` and densify. * New internal helper `rowwise_notin_max()` replaces the `apply(mean_row_in, 1, function(x) max(x[names(x) != type]))` pattern in the `multi = TRUE` branches of `idf_prob / idf_rf / iae_prob / iae_rf`. The old form was O(G * K^2); the new form is O(G * K) using a top-1 plus masked-top-2 trick. * New internal helper `pmax0_offset()` replaces `expr - thres` plus `expr[expr < 0] <- 0` in every IAE variant. For `dgCMatrix` it mutates the `@x` slot in place and calls `Matrix::drop0()`; for `thres == 0` (the default) it short-circuits to a no-op. * `idf_hdb()` and `iae_hdb()` now own their internal HDBSCAN cluster labels and expand the compact `idf_prob` / `iae_prob` result back to G x N inside the helper. Their external contract is unchanged, so callers passing `idf = "hdb"` or `iae = "hdb"` behave identically. Their internal naive TF is also routed through the new sparse-safe `tf()` helper. * `combine_tf_idf_iae()` detects each factor's shape (scalar / G-vector / G x K compact / G x N full) and slices appropriately. A small local `%||%` helper avoids adding an `rlang` Imports. Verification: * All 85 existing structural tests remain green. * The 17 numerical-equivalence assertions pass against the legacy snapshot; the `return.intermediate = TRUE` check was updated to compare the expanded form `md$idf[, label]` against the legacy G x N matrix, reflecting the new compact internal contract. * Sparse (`dgCMatrix`) vs dense input produce identical scores; the sparse path now returns a `dgCMatrix` throughout (max abs diff 1.4e-17, i.e. machine epsilon). Breaking internal contract (non-exported helpers only): `idf_prob`, `idf_rf`, `iae_prob`, `iae_rf` now return G x K. None of these are exported (NAMESPACE exports only `idf_iae_methods()`), so the user- facing API is unchanged. When `return.intermediate = TRUE`, `metadata(se)$idf` and `metadata(se)$iae` also become G x K, which is a user-visible shape change; the NEWS entry (Phase D) will document the expansion idiom for callers that relied on the old per-cell form.
… single-copy top_markers_abs: * Drops the `t() |> as.data.frame() |> group_by() |> summarise_all()` chain which materialised a dense N x G data.frame (tens of GB on large inputs). Replaces it with direct G x K aggregation via `sparseMatrixStats::rowMeans2 / rowMedians / rowMads`, then pivots to long form with a fixed-size three-column data.frame. * Factored into small helpers (`apply_row_scaling`, `aggregate_rows_by_group`, `long_format_from_group_matrix`, `finalize_top_markers`) to keep the public entry point small. top_markers_glm: * Adds a closed-form least-squares fast path for the default `gaussian()` + identity link. Uses `Matrix::sparse.model.matrix`, `Matrix::crossprod` and `Matrix::solve` to compute all per-gene label coefficients as a single `K_total x G` matrix, replacing the per-gene `apply(glm)` loop. Rank-deficient designs fall back to the legacy `glm()` loop automatically. * Non-gaussian families continue to use the `apply(glm)` path with no behaviour change, preserving existing users who pass e.g. `family = Gamma()` or `poisson()`. * Factored into `fit_label_betas()` dispatcher plus `fit_label_betas_closed_form()` and `fit_label_betas_glm_loop()` helpers; the 1-vs-max logFC contrast is now `betas_to_logfc_1v_max()`. * Replaces the shared `t(scale(t(data)))` densifying pattern with `row_scale_zmean()`, built on `sparseMatrixStats::rowMeans2` + `rowSds`, matching the commented-out hint already present in the file. scale_mgm: * Caches `split(seq_len(ncol(expr)), label)` once so every per-group `vapply` iteration reuses the same index vectors instead of re-scanning `label == i`. * Replaces `(expr - mgm) / (sds + 1e-8)` with `(expr - mgm) * inv_sd` (pre-computed `1 / (sds + 1e-8)`), collapsing the prior two allocations into a single broadcast. * Introduces `row_pool_sds_from_idx()` taking the pre-computed index list; the public-internal `row_pool_sds()` is kept as a shim so external callers with the old signature still work. Verification: * devtools::test(): all 14 groups pass with 0 failures / 0 errors, including the 7 numerical-equivalence tests (matrix score, SE metadata, GLM path, GLM + batch, abs mean, abs median) at 1e-10 and 1e-8 tolerances. * Dense-vs-sparse parity end-to-end: `cal_score()` + `top_markers()` through both GLM and abs paths on identical inputs expressed as `matrix` and `dgCMatrix` produce identical marker sets and score vectors (max abs diff ~1e-17, i.e. machine epsilon). * Sparse input's `assay(se, "score")` stays a `dgCMatrix` throughout; dense input's stays a plain `matrix`. No forced densification on the fast paths.
* Extend `test-numerical-equivalence.R` with five edge-case test
groups: dgCMatrix-vs-dense parity on `cal_score()` output, a direct
closed-form vs `glm()` apply-loop coefficient check, zero-SD /
all-NA row collapse for `row_scale_zmean()`, sparsity preservation
and threshold semantics for `pmax0_offset()`, and a vectorised-vs-
naive parity check for `rowwise_notin_max()`. Tests use
`out@metadata` and `data.frame()` for colData to avoid the stray
`::` S4Vectors references flagged by R CMD check.
* Add `inst/bench/benchmark_smartid.R`: a standalone script that
simulates a 20,000 x 100,000 dgCMatrix with 10 balanced groups and
reports wall time + R-level memory delta for `cal_score()` and
`top_markers()` under the default gaussian closed-form path and
the abs-mean path. Uses a `gc()`-delta fallback when `bench::mark`
is unavailable.
* Update roxygen documentation:
- `cal_score()` / `cal_score_init()` document the new
`return.intermediate = FALSE` parameter and the memory rationale.
- `top_markers_glm()` documents the gaussian-identity closed-form
fast path and the automatic fallback for non-gaussian families
or rank-deficient designs.
- `tf()` notes that `dgCMatrix` input now stays sparse end-to-end.
Regenerated Rd files via `devtools::document()`.
* NEWS.md gets a `# smartid (development version)` entry summarising
all four phases, including the `return.intermediate` breaking
change, the compact G x K internal IDF/IAE contract, the closed-
form GLM path, the sparse-preserving helpers and the lack of new
dependencies. Header stays as "(development version)" until the
branch is pushed and CI verifies the refactor; Version bump and
NEWS heading rename land in a follow-up commit.
Verification:
* devtools::test(): 14 groups, all pass with 0 failures / 0 errors.
The numerical-equivalence file now runs 12 test groups covering
the full legacy snapshot plus the five edge cases.
* R CMD check (--no-build-vignettes --no-manual, --as-cran):
Status: 1 NOTE. The only remaining NOTE is
"Cannot extract version info from the following section titles:
smartid (development version)" which is expected and will clear
when the version is bumped in the follow-up commit.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
See commit messages for phases A–D. R CMD check: 0 ERROR, 0 WARNING, 1 NOTE (development version placeholder).