Memory optimization for cal_score() and top_markers() by Gene233 · Pull Request #1 · DavisLaboratory/smartid

Gene233 · 2026-04-22T11:03:29Z

See commit messages for phases A–D. R CMD check: 0 ERROR, 0 WARNING, 1 NOTE (development version placeholder).

* Drop forced as.matrix(data) in cal_score() matrix method so that dgCMatrix inputs now flow through to the existing sparse-aware row/col aggregators instead of being densified on entry. * Add return.intermediate = FALSE to cal_score() (both methods) and cal_score_init(). Default path no longer stores the full-size tf/idf/iae matrices in SummarizedExperiment metadata, which on a 20k x 100k input saved roughly 3x the dense-matrix footprint per call. Explicitly clear stale metadata when the flag is FALSE to avoid leakage from prior runs. * Remove the df_n_inv <- !df_n dead code in idf_prob(); it was never used downstream and carried a logical matrix the same size as expr. * Add tests/testthat/test-numerical-equivalence.R and the frozen legacy_scores.rds snapshot captured from the pre-refactor HEAD. These pin cal_score() and top_markers() outputs (GLM, GLM+batch, abs-mean, abs-median) at 1e-10 tolerance for all downstream phases. Breaking change: return.intermediate defaults to FALSE. Callers that relied on metadata(se)$tf / $idf / $iae must pass return.intermediate = TRUE. Documented in the roxygen @param block; NEWS entry deferred to Phase D so the user-visible announcement lands alongside the full refactor.

Core change: labelled `idf_prob`, `idf_rf`, `iae_prob` and `iae_rf` now return a compact G x K matrix (columns = unique labels) instead of expanding to G x N via `[, label]`. `cal_score_init()` composes the final score through per-group column blocks in the new `combine_tf_idf_iae()` helper, so at most one G x N matrix is alive at any time regardless of how many factors are compact. Additional Phase B edits: * `tf()` rewritten to stay sparse on `dgCMatrix` inputs by right- multiplying with `Matrix::Diagonal(1 / (colSums + 0.01))` and relying on `log1p(0) == 0`. The legacy `sweep()` path would fall back to `as.array()` and densify. * New internal helper `rowwise_notin_max()` replaces the `apply(mean_row_in, 1, function(x) max(x[names(x) != type]))` pattern in the `multi = TRUE` branches of `idf_prob / idf_rf / iae_prob / iae_rf`. The old form was O(G * K^2); the new form is O(G * K) using a top-1 plus masked-top-2 trick. * New internal helper `pmax0_offset()` replaces `expr - thres` plus `expr[expr < 0] <- 0` in every IAE variant. For `dgCMatrix` it mutates the `@x` slot in place and calls `Matrix::drop0()`; for `thres == 0` (the default) it short-circuits to a no-op. * `idf_hdb()` and `iae_hdb()` now own their internal HDBSCAN cluster labels and expand the compact `idf_prob` / `iae_prob` result back to G x N inside the helper. Their external contract is unchanged, so callers passing `idf = "hdb"` or `iae = "hdb"` behave identically. Their internal naive TF is also routed through the new sparse-safe `tf()` helper. * `combine_tf_idf_iae()` detects each factor's shape (scalar / G-vector / G x K compact / G x N full) and slices appropriately. A small local `%||%` helper avoids adding an `rlang` Imports. Verification: * All 85 existing structural tests remain green. * The 17 numerical-equivalence assertions pass against the legacy snapshot; the `return.intermediate = TRUE` check was updated to compare the expanded form `md$idf[, label]` against the legacy G x N matrix, reflecting the new compact internal contract. * Sparse (`dgCMatrix`) vs dense input produce identical scores; the sparse path now returns a `dgCMatrix` throughout (max abs diff 1.4e-17, i.e. machine epsilon). Breaking internal contract (non-exported helpers only): `idf_prob`, `idf_rf`, `iae_prob`, `iae_rf` now return G x K. None of these are exported (NAMESPACE exports only `idf_iae_methods()`), so the user- facing API is unchanged. When `return.intermediate = TRUE`, `metadata(se)$idf` and `metadata(se)$iae` also become G x K, which is a user-visible shape change; the NEWS entry (Phase D) will document the expansion idiom for callers that relied on the old per-cell form.

… single-copy top_markers_abs: * Drops the `t() |> as.data.frame() |> group_by() |> summarise_all()` chain which materialised a dense N x G data.frame (tens of GB on large inputs). Replaces it with direct G x K aggregation via `sparseMatrixStats::rowMeans2 / rowMedians / rowMads`, then pivots to long form with a fixed-size three-column data.frame. * Factored into small helpers (`apply_row_scaling`, `aggregate_rows_by_group`, `long_format_from_group_matrix`, `finalize_top_markers`) to keep the public entry point small. top_markers_glm: * Adds a closed-form least-squares fast path for the default `gaussian()` + identity link. Uses `Matrix::sparse.model.matrix`, `Matrix::crossprod` and `Matrix::solve` to compute all per-gene label coefficients as a single `K_total x G` matrix, replacing the per-gene `apply(glm)` loop. Rank-deficient designs fall back to the legacy `glm()` loop automatically. * Non-gaussian families continue to use the `apply(glm)` path with no behaviour change, preserving existing users who pass e.g. `family = Gamma()` or `poisson()`. * Factored into `fit_label_betas()` dispatcher plus `fit_label_betas_closed_form()` and `fit_label_betas_glm_loop()` helpers; the 1-vs-max logFC contrast is now `betas_to_logfc_1v_max()`. * Replaces the shared `t(scale(t(data)))` densifying pattern with `row_scale_zmean()`, built on `sparseMatrixStats::rowMeans2` + `rowSds`, matching the commented-out hint already present in the file. scale_mgm: * Caches `split(seq_len(ncol(expr)), label)` once so every per-group `vapply` iteration reuses the same index vectors instead of re-scanning `label == i`. * Replaces `(expr - mgm) / (sds + 1e-8)` with `(expr - mgm) * inv_sd` (pre-computed `1 / (sds + 1e-8)`), collapsing the prior two allocations into a single broadcast. * Introduces `row_pool_sds_from_idx()` taking the pre-computed index list; the public-internal `row_pool_sds()` is kept as a shim so external callers with the old signature still work. Verification: * devtools::test(): all 14 groups pass with 0 failures / 0 errors, including the 7 numerical-equivalence tests (matrix score, SE metadata, GLM path, GLM + batch, abs mean, abs median) at 1e-10 and 1e-8 tolerances. * Dense-vs-sparse parity end-to-end: `cal_score()` + `top_markers()` through both GLM and abs paths on identical inputs expressed as `matrix` and `dgCMatrix` produce identical marker sets and score vectors (max abs diff ~1e-17, i.e. machine epsilon). * Sparse input's `assay(se, "score")` stays a `dgCMatrix` throughout; dense input's stays a plain `matrix`. No forced densification on the fast paths.

* Extend `test-numerical-equivalence.R` with five edge-case test groups: dgCMatrix-vs-dense parity on `cal_score()` output, a direct closed-form vs `glm()` apply-loop coefficient check, zero-SD / all-NA row collapse for `row_scale_zmean()`, sparsity preservation and threshold semantics for `pmax0_offset()`, and a vectorised-vs- naive parity check for `rowwise_notin_max()`. Tests use `out@metadata` and `data.frame()` for colData to avoid the stray `::` S4Vectors references flagged by R CMD check. * Add `inst/bench/benchmark_smartid.R`: a standalone script that simulates a 20,000 x 100,000 dgCMatrix with 10 balanced groups and reports wall time + R-level memory delta for `cal_score()` and `top_markers()` under the default gaussian closed-form path and the abs-mean path. Uses a `gc()`-delta fallback when `bench::mark` is unavailable. * Update roxygen documentation: - `cal_score()` / `cal_score_init()` document the new `return.intermediate = FALSE` parameter and the memory rationale. - `top_markers_glm()` documents the gaussian-identity closed-form fast path and the automatic fallback for non-gaussian families or rank-deficient designs. - `tf()` notes that `dgCMatrix` input now stays sparse end-to-end. Regenerated Rd files via `devtools::document()`. * NEWS.md gets a `# smartid (development version)` entry summarising all four phases, including the `return.intermediate` breaking change, the compact G x K internal IDF/IAE contract, the closed- form GLM path, the sparse-preserving helpers and the lack of new dependencies. Header stays as "(development version)" until the branch is pushed and CI verifies the refactor; Version bump and NEWS heading rename land in a follow-up commit. Verification: * devtools::test(): 14 groups, all pass with 0 failures / 0 errors. The numerical-equivalence file now runs 12 test groups covering the full legacy snapshot plus the five edge cases. * R CMD check (--no-build-vignettes --no-manual, --as-cran): Status: 1 NOTE. The only remaining NOTE is "Cannot extract version info from the following section titles: smartid (development version)" which is expected and will clear when the version is bumped in the follow-up commit.

Gene233 added 6 commits April 20, 2026 18:42

'update .gitignore'

f52b683

Fix vignette.

ea5c4f1

Gene233 merged commit 626ef6a into devel Apr 22, 2026
2 checks passed

Gene233 deleted the devel-memopt branch April 22, 2026 12:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory optimization for cal_score() and top_markers()#1

Memory optimization for cal_score() and top_markers()#1
Gene233 merged 6 commits into
develfrom
devel-memopt

Gene233 commented Apr 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Gene233 commented Apr 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant