Skip to content

Memory optimization for cal_score() and top_markers()#1

Merged
Gene233 merged 6 commits into
develfrom
devel-memopt
Apr 22, 2026
Merged

Memory optimization for cal_score() and top_markers()#1
Gene233 merged 6 commits into
develfrom
devel-memopt

Conversation

@Gene233

@Gene233 Gene233 commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

See commit messages for phases A–D. R CMD check: 0 ERROR, 0 WARNING, 1 NOTE (development version placeholder).

Gene233 added 6 commits April 20, 2026 18:42
* Drop forced as.matrix(data) in cal_score() matrix method so that
  dgCMatrix inputs now flow through to the existing sparse-aware row/col
  aggregators instead of being densified on entry.
* Add return.intermediate = FALSE to cal_score() (both methods) and
  cal_score_init(). Default path no longer stores the full-size tf/idf/iae
  matrices in SummarizedExperiment metadata, which on a 20k x 100k input
  saved roughly 3x the dense-matrix footprint per call. Explicitly clear
  stale metadata when the flag is FALSE to avoid leakage from prior runs.
* Remove the df_n_inv <- !df_n dead code in idf_prob(); it was never used
  downstream and carried a logical matrix the same size as expr.
* Add tests/testthat/test-numerical-equivalence.R and the frozen
  legacy_scores.rds snapshot captured from the pre-refactor HEAD. These
  pin cal_score() and top_markers() outputs (GLM, GLM+batch, abs-mean,
  abs-median) at 1e-10 tolerance for all downstream phases.

Breaking change: return.intermediate defaults to FALSE. Callers that
relied on metadata(se)$tf / $idf / $iae must pass return.intermediate =
TRUE. Documented in the roxygen @param block; NEWS entry deferred to
Phase D so the user-visible announcement lands alongside the full
refactor.
Core change: labelled `idf_prob`, `idf_rf`, `iae_prob` and `iae_rf` now
return a compact G x K matrix (columns = unique labels) instead of
expanding to G x N via `[, label]`. `cal_score_init()` composes the
final score through per-group column blocks in the new
`combine_tf_idf_iae()` helper, so at most one G x N matrix is alive at
any time regardless of how many factors are compact.

Additional Phase B edits:

* `tf()` rewritten to stay sparse on `dgCMatrix` inputs by right-
  multiplying with `Matrix::Diagonal(1 / (colSums + 0.01))` and relying
  on `log1p(0) == 0`. The legacy `sweep()` path would fall back to
  `as.array()` and densify.
* New internal helper `rowwise_notin_max()` replaces the
  `apply(mean_row_in, 1, function(x) max(x[names(x) != type]))` pattern
  in the `multi = TRUE` branches of `idf_prob / idf_rf / iae_prob /
  iae_rf`. The old form was O(G * K^2); the new form is O(G * K) using
  a top-1 plus masked-top-2 trick.
* New internal helper `pmax0_offset()` replaces `expr - thres` plus
  `expr[expr < 0] <- 0` in every IAE variant. For `dgCMatrix` it
  mutates the `@x` slot in place and calls `Matrix::drop0()`; for
  `thres == 0` (the default) it short-circuits to a no-op.
* `idf_hdb()` and `iae_hdb()` now own their internal HDBSCAN cluster
  labels and expand the compact `idf_prob` / `iae_prob` result back to
  G x N inside the helper. Their external contract is unchanged, so
  callers passing `idf = "hdb"` or `iae = "hdb"` behave identically.
  Their internal naive TF is also routed through the new sparse-safe
  `tf()` helper.
* `combine_tf_idf_iae()` detects each factor's shape (scalar /
  G-vector / G x K compact / G x N full) and slices appropriately.
  A small local `%||%` helper avoids adding an `rlang` Imports.

Verification:
* All 85 existing structural tests remain green.
* The 17 numerical-equivalence assertions pass against the legacy
  snapshot; the `return.intermediate = TRUE` check was updated to
  compare the expanded form `md$idf[, label]` against the legacy
  G x N matrix, reflecting the new compact internal contract.
* Sparse (`dgCMatrix`) vs dense input produce identical scores; the
  sparse path now returns a `dgCMatrix` throughout (max abs diff
  1.4e-17, i.e. machine epsilon).

Breaking internal contract (non-exported helpers only): `idf_prob`,
`idf_rf`, `iae_prob`, `iae_rf` now return G x K. None of these are
exported (NAMESPACE exports only `idf_iae_methods()`), so the user-
facing API is unchanged. When `return.intermediate = TRUE`,
`metadata(se)$idf` and `metadata(se)$iae` also become G x K, which is
a user-visible shape change; the NEWS entry (Phase D) will document
the expansion idiom for callers that relied on the old per-cell form.
… single-copy

top_markers_abs:
* Drops the `t() |> as.data.frame() |> group_by() |> summarise_all()`
  chain which materialised a dense N x G data.frame (tens of GB on
  large inputs). Replaces it with direct G x K aggregation via
  `sparseMatrixStats::rowMeans2 / rowMedians / rowMads`, then pivots
  to long form with a fixed-size three-column data.frame.
* Factored into small helpers (`apply_row_scaling`,
  `aggregate_rows_by_group`, `long_format_from_group_matrix`,
  `finalize_top_markers`) to keep the public entry point small.

top_markers_glm:
* Adds a closed-form least-squares fast path for the default
  `gaussian()` + identity link. Uses `Matrix::sparse.model.matrix`,
  `Matrix::crossprod` and `Matrix::solve` to compute all per-gene
  label coefficients as a single `K_total x G` matrix, replacing
  the per-gene `apply(glm)` loop. Rank-deficient designs fall back
  to the legacy `glm()` loop automatically.
* Non-gaussian families continue to use the `apply(glm)` path with no
  behaviour change, preserving existing users who pass e.g.
  `family = Gamma()` or `poisson()`.
* Factored into `fit_label_betas()` dispatcher plus
  `fit_label_betas_closed_form()` and `fit_label_betas_glm_loop()`
  helpers; the 1-vs-max logFC contrast is now `betas_to_logfc_1v_max()`.
* Replaces the shared `t(scale(t(data)))` densifying pattern with
  `row_scale_zmean()`, built on `sparseMatrixStats::rowMeans2` +
  `rowSds`, matching the commented-out hint already present in the
  file.

scale_mgm:
* Caches `split(seq_len(ncol(expr)), label)` once so every per-group
  `vapply` iteration reuses the same index vectors instead of
  re-scanning `label == i`.
* Replaces `(expr - mgm) / (sds + 1e-8)` with `(expr - mgm) * inv_sd`
  (pre-computed `1 / (sds + 1e-8)`), collapsing the prior two
  allocations into a single broadcast.
* Introduces `row_pool_sds_from_idx()` taking the pre-computed index
  list; the public-internal `row_pool_sds()` is kept as a shim so
  external callers with the old signature still work.

Verification:
* devtools::test(): all 14 groups pass with 0 failures / 0 errors,
  including the 7 numerical-equivalence tests (matrix score,
  SE metadata, GLM path, GLM + batch, abs mean, abs median) at
  1e-10 and 1e-8 tolerances.
* Dense-vs-sparse parity end-to-end: `cal_score()` + `top_markers()`
  through both GLM and abs paths on identical inputs expressed as
  `matrix` and `dgCMatrix` produce identical marker sets and score
  vectors (max abs diff ~1e-17, i.e. machine epsilon).
* Sparse input's `assay(se, "score")` stays a `dgCMatrix` throughout;
  dense input's stays a plain `matrix`. No forced densification on
  the fast paths.
* Extend `test-numerical-equivalence.R` with five edge-case test
  groups: dgCMatrix-vs-dense parity on `cal_score()` output, a direct
  closed-form vs `glm()` apply-loop coefficient check, zero-SD /
  all-NA row collapse for `row_scale_zmean()`, sparsity preservation
  and threshold semantics for `pmax0_offset()`, and a vectorised-vs-
  naive parity check for `rowwise_notin_max()`. Tests use
  `out@metadata` and `data.frame()` for colData to avoid the stray
  `::` S4Vectors references flagged by R CMD check.
* Add `inst/bench/benchmark_smartid.R`: a standalone script that
  simulates a 20,000 x 100,000 dgCMatrix with 10 balanced groups and
  reports wall time + R-level memory delta for `cal_score()` and
  `top_markers()` under the default gaussian closed-form path and
  the abs-mean path. Uses a `gc()`-delta fallback when `bench::mark`
  is unavailable.
* Update roxygen documentation:
  - `cal_score()` / `cal_score_init()` document the new
    `return.intermediate = FALSE` parameter and the memory rationale.
  - `top_markers_glm()` documents the gaussian-identity closed-form
    fast path and the automatic fallback for non-gaussian families
    or rank-deficient designs.
  - `tf()` notes that `dgCMatrix` input now stays sparse end-to-end.
  Regenerated Rd files via `devtools::document()`.
* NEWS.md gets a `# smartid (development version)` entry summarising
  all four phases, including the `return.intermediate` breaking
  change, the compact G x K internal IDF/IAE contract, the closed-
  form GLM path, the sparse-preserving helpers and the lack of new
  dependencies. Header stays as "(development version)" until the
  branch is pushed and CI verifies the refactor; Version bump and
  NEWS heading rename land in a follow-up commit.

Verification:
* devtools::test(): 14 groups, all pass with 0 failures / 0 errors.
  The numerical-equivalence file now runs 12 test groups covering
  the full legacy snapshot plus the five edge cases.
* R CMD check (--no-build-vignettes --no-manual, --as-cran):
  Status: 1 NOTE. The only remaining NOTE is
  "Cannot extract version info from the following section titles:
  smartid (development version)" which is expected and will clear
  when the version is bumped in the follow-up commit.
@Gene233 Gene233 merged commit 626ef6a into devel Apr 22, 2026
2 checks passed
@Gene233 Gene233 deleted the devel-memopt branch April 22, 2026 12:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant