teslajoy · teslajoy · May 27, 2026 · May 19, 2026 · May 19, 2026 · May 20, 2026
diff --git a/.gitignore b/.gitignore
@@ -174,4 +174,13 @@ cython_debug/
 .pypirc
 
 .idea
-scripts/
+scripts/
+paper_draft.*
+TODO.md
+arm_topology_A_B_C.png
+img/aucell_walks_figure.png
+img/aucell_walks_figure.svg
+img/fisher_walks_figure.png
+img/fisher_walks_figure.svg
+img/v1_v2_v3_comparison.png
+img/v1_v2_v3_comparison.svg
diff --git a/README.md b/README.md
@@ -4,6 +4,8 @@ a python package for converting gene sets to biological pathway embeddings with
 
 gene sets (from clusters, niches, studies) are tested against reactome pathways via fisher's exact test, then embedded into a shared vector space using metapath2vec over the pathway hierarchy graph.
 
+enrichment is either fisher's exact (a binary top-N gene set per cluster/niche) or **niche-level AUCell** (the full per-niche expression ranking, no gene-set selection step). AUCell here scores each niche's **aggregated pseudobulk** profile, one score per (niche, pathway); it is **not single-cell AUCell** (the package never sees individual cells, niche construction is upstream). both enrichment sources feed the same reactome hierarchy graph and metapath2vec embedding.
+
 ![gpath2vec.png](./img/gpath2vec.png)
 
 ## pipeline
@@ -89,7 +91,8 @@ the result is a shared embedding space where:
 ## install
 
 ```bash
-pip install -e .
+pip install -e .            # fisher enrichment path
+pip install -e '.[aucell]'  # adds the niche-level AUCell path (decoupler, anndata)
 ```
 
 ## usage
@@ -166,6 +169,17 @@ gpath2vec embeddings --network-path net.pkl --method vae --ea-matrix-path ea_mat
 
 # full pipeline with method choice
 gpath2vec end2end --genes "EGFR,EGF" --level low --method vae --output-dir output/
+
+# niche pipeline: enrichment -> graph -> embeddings in one command.
+# --enrichment fisher : binary top-N gene set per niche, fisher's exact + fdr.
+# --enrichment aucell : niche-level AUCell on each niche's aggregated pseudobulk
+#                       (one score per niche, NOT single-cell), per-niche top-k as edges.
+# inputs: --niche-matrix (niches x genes .npz/.npy), --genes (.npy gene order),
+#         --niche-meta (parquet with a niche_id column).
+gpath2vec niche-pipeline \
+  --niche-matrix niches.npz --genes genes.npy --niche-meta niche_meta.parquet \
+  --enrichment aucell --reactome-level low --topk 50 \
+  --reactome-dir /path/to/reactome/cache --out-dir output/
 ```
 
 ## embedding methods
@@ -219,3 +233,27 @@ reactome data is downloaded once and cached to `~/.gpath2vec/cache/`. set `GPATH
 ```bash
 export GPATH2VEC_REACTOME_DIR=/path/to/reactome/files
 ```
+
+### offline / hpc (air-gapped compute nodes)
+
+clusters like ARC have no internet on compute nodes, so the cache must be staged from a node that does have internet (ex. a login node):
+
+```bash
+./pull_reactome_cache.sh /shared/path/reactome_cache   # run where there IS internet
+```
+
+this drives the package's real fetchers, so the cache matches exactly what gpath2vec expects. on the compute node, point at it without re-downloading:
+
+```bash
+gpath2vec niche-pipeline ... --reactome-dir /shared/path/reactome_cache
+# or: export GPATH2VEC_REACTOME_DIR=/shared/path/reactome_cache
+```
+
+## reproducibility
+
+all stochastic embedders (metapath2vec, line, vae) take a `seed` (default 1234) that pins the python, numpy and torch rngs, so embeddings are bit-reproducible run to run. the seed is re-applied before training (independent of walk-generation rng) and recorded in `run_provenance.json`. svd and spectral are deterministic by construction. cli: `--seed`.
+
+## todo
+
+- **edge2vec**: edge-type transition-matrix biased walks as an embedding method.
+- **lorentz (hyperbolic) pipeline**
diff --git a/gpath2vec/aucell.py b/gpath2vec/aucell.py
@@ -0,0 +1,231 @@
+"""AUCell per-niche pathway scoring, as an alternative enrichment source.
+
+WHAT THIS COMPUTES (read this before using it)
+----------------------------------------------
+The unit of analysis is the *niche*: an upstream-defined spatial group of cells
+(e.g. a center spot plus its spatial neighbors), summarized as one aggregated
+"pseudobulk" expression vector per niche. The input here is therefore an
+(n_niches x n_genes) matrix where each row is that niche's aggregated counts.
+
+For each niche, AUCell:
+  1. ranks all genes by the niche's aggregated expression (descending),
+  2. for each Reactome pathway gene set, walks down that ranking and integrates
+     the recovery curve (how early the pathway's genes appear) over the top
+     `n_up` features. With decoupler's default (n_up=None) the recovery window
+     is the top 5% of ranked genes by magnitude.
+  3. returns one continuous score in [0, ~1] per (niche, pathway): high = the
+     pathway's genes are concentrated among the niche's most-expressed genes.
+
+There is no per-niche gene-set selection step (no top-100, no MAD/mean): AUCell
+uses the *full* gene ranking, so the "which genes count per niche" choice that
+Fisher's exact requires does not exist here. The only parameters are the
+gene-set size band and the AUC recovery window (n_up); both are logged.
+
+This is a faithful generalization of the validated niche_aucell.py pipeline:
+same normalization (normalize_total to 1e4 then log1p), same decoupler call
+(`dc.mt.aucell`, tmin = min gene-set size), same gene-set-size band. It is
+generalized by (a) taking all paths/params as arguments, (b) drawing the
+pathway universe from `ea.filter_pathways` so the AUCell and Fisher enrichment
+paths use the IDENTICAL level-filtered Reactome universe (comparable by
+construction), and (c) writing a provenance JSON for the methods section.
+
+decoupler + anndata are imported lazily; install via the `aucell` extra.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import json
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+import scipy.sparse as sp
+
+from . import ea
+
+
+def _sha256_of_array(X) -> str:
+    """stable content hash of the niche x gene matrix, for provenance."""
+    h = hashlib.sha256()
+    if sp.issparse(X):
+        Xc = X.tocsr()
+        for a in (Xc.data, Xc.indices, Xc.indptr,
+                  np.asarray(Xc.shape, dtype=np.int64)):
+            h.update(np.ascontiguousarray(a).tobytes())
+    else:
+        h.update(np.ascontiguousarray(np.asarray(X)).tobytes())
+    return h.hexdigest()
+
+
+def _build_net(level, gene_filter, min_genes, max_genes):
+    """long-form pathway->gene table for decoupler, drawn from the SAME
+    `ea.filter_pathways` universe the Fisher path uses. `source` is the bare
+    Reactome stId so the resulting score columns are drop-in compatible with
+    `Net` (which keys pathway nodes by stId)."""
+    gm = ea.filter_pathways(level=level, gene_filter=gene_filter,
+                            min_genes=min_genes)
+    if gm.empty:
+        raise ValueError(f"no pathways for level={level!r} "
+                         f"(gene_filter set: {gene_filter is not None})")
+    rows = []
+    kept = 0
+    for r in gm.itertuples(index=False):
+        g = list(r.genes)
+        # gene-set-size band: min_genes (also passed to decoupler as tmin)
+        # and an upper cap to drop very large, non-specific sets.
+        if min_genes <= len(g) <= max_genes:
+            kept += 1
+            for sym in g:
+                rows.append((r.stId, sym))
+    if not rows:
+        raise ValueError(
+            f"no pathways within size band [{min_genes}, {max_genes}] "
+            f"at level={level!r}")
+    net = pd.DataFrame(rows, columns=["source", "target"])
+    return net, kept
+
+
+def compute_aucell(X, genes, niche_ids, *, level="all", gene_filter=None,
+                   min_genes=3, max_genes=500, normalize=True,
+                   provenance_path=None, verbose=True):
+    """per-niche AUCell scores against the level-filtered Reactome universe.
+
+    Parameters
+    ----------
+    X : (n_niches, n_genes) array or scipy.sparse
+        per-niche aggregated (pseudobulk) expression. Raw counts by default;
+        pass `normalize=False` if already library-normalized + log1p'd.
+    genes : sequence of str, length n_genes
+        gene symbols aligned to the columns of X.
+    niche_ids : sequence, length n_niches
+        identifier per row of X (used as the output index).
+    level : {"low","mid","high","all"}
+        Reactome pathway level (same semantics as ea.filter_pathways).
+    gene_filter : optional set/list of gene symbols
+        restrict the pathway universe (e.g. TF genes). None = no restriction.
+    min_genes, max_genes : int
+        pathway gene-set size band. min_genes is also passed to decoupler as
+        `tmin`. Defaults [3, 500] match the validated pipeline.
+    normalize : bool
+        if True (default) apply normalize_total(target_sum=1e4) then log1p
+        in-place, exactly as the validated niche_aucell pipeline. The score
+        ranking is invariant to a per-niche scalar, but this is kept for
+        parity with the reference and is logged.
+    provenance_path : optional path
+        if given, write a JSON of every parameter (decoupler version, n_up
+        policy, tmin/tmax, normalization, input sha256, shapes) for the
+        methods section / reproducibility.
+
+    Returns
+    -------
+    aucell_df : DataFrame (n_niches x n_pathways), index = niche_ids,
+        columns = Reactome stIds, values = continuous AUCell scores.
+    """
+    try:
+        import anndata as ad
+        import decoupler as dc
+    except ImportError as e:  # pragma: no cover - environment-dependent
+        raise ImportError(
+            "AUCell requires 'decoupler' and 'anndata'. Install the extra: "
+            "pip install 'gpath2vec[aucell]'") from e
+
+    genes = list(map(str, genes))
+    niche_ids = list(niche_ids)
+    if sp.issparse(X):
+        X = X.tocsr()
+    if X.shape != (len(niche_ids), len(genes)):
+        raise ValueError(
+            f"X shape {X.shape} does not match "
+            f"(n_niches={len(niche_ids)}, n_genes={len(genes)})")
+
+    input_sha = _sha256_of_array(X)
+
+    adata = ad.AnnData(
+        X=X.tocsr() if sp.issparse(X) else np.asarray(X),
+        obs=pd.DataFrame({"niche_id": niche_ids}),
+        var=pd.DataFrame(index=genes),
+    )
+    adata.obs_names = [str(n) for n in niche_ids]
+    adata.obs_names_make_unique()
+
+    if normalize:
+        # normalize_total(target_sum=1e4) then log1p, done explicitly so the
+        # exact transform is auditable and matches the reference pipeline.
+        if verbose:
+            print("  normalizing: target_sum=1e4 -> log1p")
+        Xc = adata.X.tocsr() if sp.issparse(adata.X) else sp.csr_matrix(adata.X)
+        lib = np.asarray(Xc.sum(axis=1)).ravel()
+        scaling = 1e4 / np.maximum(lib, 1.0)
+        Xc = Xc.multiply(scaling[:, None]).tocsr()
+        Xc.data = np.log1p(Xc.data)
+        adata.X = Xc
+
+    net, n_path = _build_net(level, gene_filter, min_genes, max_genes)
+    if verbose:
+        print(f"  pathway net: {net['source'].nunique()} pathways, "
+              f"{len(net)} pathway-gene edges (level={level}, "
+              f"size band [{min_genes}, {max_genes}])")
+        print("  running AUCell (decoupler)...")
+
+    # tmin = minimum gene-set size; n_up left at decoupler default (None ->
+    # top 5% of ranked features). Both recorded in provenance below.
+    dc.mt.aucell(adata, net=net, tmin=min_genes, verbose=verbose)
+    scores = adata.obsm["score_aucell"].copy()
+    scores.index = niche_ids[: len(scores)] if len(scores) == len(niche_ids) \
+        else scores.index
+
+    if provenance_path is not None:
+        prov = {
+            "method": "AUCell",
+            "decoupler_version": dc.__version__,
+            "anndata_version": ad.__version__,
+            "n_up": None,
+            "n_up_policy": ("decoupler default: top 5% of ranked features "
+                            "by magnitude"),
+            "tmin": int(min_genes),
+            "tmax_genesize_band": int(max_genes),
+            "reactome_level": level,
+            "gene_filter_applied": gene_filter is not None,
+            "n_gene_filter": (len(set(gene_filter))
+                              if gene_filter is not None else 0),
+            "normalize_total_target_sum": 1e4 if normalize else None,
+            "log1p": bool(normalize),
+            "pre_normalized_input": (not normalize),
+            "n_niches": int(len(niche_ids)),
+            "n_genes": int(len(genes)),
+            "n_pathways_scored": int(scores.shape[1]),
+            "input_matrix_sha256": input_sha,
+        }
+        Path(provenance_path).write_text(json.dumps(prov, indent=2))
+        if verbose:
+            print(f"  wrote AUCell provenance -> {provenance_path}")
+
+    return scores
+
+
+def topk_per_niche(aucell_df, k):
+    """top-k AUCell pathways per niche -> clusters dict for `Net`.
+
+    aucell_df : DataFrame (n_niches x n_pathways), continuous AUCell scores.
+    k : pathways to keep per niche (the primary sparsity parameter; ablate
+        over e.g. {20, 50, 100}). Low k = sparse, sharp per-niche identity but
+        risk of under-connected niches; high k = denser, smoother, re-approaches
+        the everything-connects-to-everything dilution. Always ablate + log k.
+
+    Mechanics: each niche row is the aggregated pseudobulk of its constituent
+    cells; its AUCell vector is that niche's pathway-activity profile. Keeping
+    the k highest-scoring pathways (dropping zeros) makes the niche->pathway
+    graph edges, with the RAW AUCell score as edge weight (non-negative, in
+    [0, ~1], directly usable by the weighted random walker).
+
+    returns: {niche_id: {pathway_stId: aucell_score, ...}} -- identical shape
+    to the Fisher path's clusters dict, drop-in for the Net builder.
+    """
+    clusters = {}
+    for niche_id in aucell_df.index:
+        scores = aucell_df.loc[niche_id]
+        top = scores.nlargest(k)
+        top = top[top > 0]  # drop zero-AUCell pathways even if within top-k
+        clusters[str(niche_id)] = top.to_dict()
+    return clusters