Dataset.open(splice_info, var_filter) breaks choose_exonic_variants; with_settings(splice_info, var_filter) works

## Summary

Passing `splice_info` + `var_filter="exonic"` via `Dataset.open(...)` produces broken kernel state inside `choose_exonic_variants` (numerical overflow on `n_variants` → `MemoryError` or `ValueError: negative dimensions not allowed`), while passing the same arguments via `with_settings(splice_info=..., var_filter=...)` after `with_seqs("haplotypes")` works correctly on the same data.

## Symptom

```
Traceback (most recent call last):
  File "<repro>.py", line 89, in <module>
    haps = ds[0, :].to_ak()
  File "genvarloader/_dataset/_impl.py", line 2231, in __getitem__
    return super().__getitem__(idx)
  File "genvarloader/_dataset/_impl.py", line 1561, in __getitem__
    recon, squeeze, out_reshape = self._getitem_spliced(idx, self._sp_idxer)
  File "genvarloader/_dataset/_impl.py", line 1670, in _getitem_spliced
    recon = inner_ds._recon(...)
  File "genvarloader/_dataset/_reconstruct.py", line 354, in __call__
    haps, *_ = self.get_haps_and_shifts(...)
  File "genvarloader/_dataset/_reconstruct.py", line 395, in get_haps_and_shifts
    keep, keep_offsets = choose_exonic_variants(...)
MemoryError: Allocation failed (probably too large).
# (chr1; equivalent run on chr2 raises `ValueError: negative dimensions not allowed`)
```

The kernel computes `lengths[query, hap] = o_e - o_s` then `keep = np.empty(n_variants, np.bool_)` where `n_variants = keep_offsets[-1]`. Either:

- The `(o_s, o_e)` offsets are wrong via the `open()` setup path (negative or huge gap), or
- `_sp_idxer` / `_spliced_bed` get a different internal shape via `open(splice_info=...)` than via `with_settings(splice_info=...)`.

## Environment

| | |
|---|---|
| GenVarLoader | `0.24.1` PyPI **and** `main` post-#170 (both reproduce) |
| genoray | reproduces on `2.3.3`, `2.6.0` (whole range) |
| Python | 3.12 (`runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404`) |
| OS | Ubuntu 24.04 |
| Pod | 16 vCPU / 64 GB / cpu3g on Runpod |

## Full end-to-end reproduction

Self-contained — downloads the 1KG NYGC 30x chr1 phased panel + reference + GTF, builds the SVAR+GVL pair, and demonstrates the asymmetry by opening the same dataset two ways. Tested with `genvarloader==0.24.1` + the merged #170 fix on top.

```bash
# Pod env (assumes bcftools, pip, ~120 GB free disk).
pip install "genvarloader @ git+https://github.com/mcvickerlab/GenVarLoader@main" \
            "genoray==2.3.3" pooch polars pyarrow pysam seqpro
```

```python
# repro.py
import polars as pl
import pooch
import pysam
import subprocess
import sys
from pathlib import Path
import genoray, genvarloader as gvl

WORK = Path("/tmp/gvl-repro-176")
WORK.mkdir(exist_ok=True, parents=True)

# ---- 1. Inputs (1KG NYGC 30x phased chr1 + GRCh38 + Ensembl 113) ----
vcf_url = (
    "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/"
    "1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/"
    "1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"
)
vcf_raw = Path(pooch.retrieve(vcf_url, known_hash=None, path=WORK / "1kg"))
pooch.retrieve(vcf_url + ".tbi", known_hash=None, path=WORK / "1kg")

ref_url = (  # use whatever GRCh38 your environment already has
    "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/"
    "GRCh38.primary_assembly.genome.fa.gz"
)
ref_fa = Path(pooch.retrieve(ref_url, known_hash=None, path=WORK / "ref"))
# Index + faidx if not already:
subprocess.run(["bgzip", "-r", str(ref_fa)], check=False)
subprocess.run(["samtools", "faidx", str(ref_fa)], check=True)

gtf_url = "https://ftp.ensembl.org/pub/release-113/gtf/homo_sapiens/Homo_sapiens.GRCh38.113.chr.gtf.gz"
gtf = Path(pooch.retrieve(gtf_url, known_hash=None, path=WORK / "gtf"))

# ---- 2. bcftools norm (left-align + atomize + split multi-allelic) ----
# GVL/genoray require this. See https://genvarloader.readthedocs.io/en/latest/write.html
vcf_normed = WORK / "chr1.normed.vcf.gz"
if not vcf_normed.exists():
    subprocess.run([
        "bcftools", "norm",
        "-f", str(ref_fa),
        "-a", "--atom-overlaps", ".",
        "-m", "-any", "--multi-overlaps", ".",
        "--threads", "8",
        "-O", "z", "-o", str(vcf_normed),
        str(vcf_raw),
    ], check=True)
    subprocess.run(["bcftools", "index", "--tbi", str(vcf_normed)], check=True)

# ---- 3. Build the splice BED (per-transcript MANE Select CDS regions) ----
splice_bed = gvl.get_splice_bed(
    str(gtf), transcript_support_level=None, m3=True,
).filter(pl.col("chrom") == "chr1")

# ---- 4. Build SVAR via genoray (two-step pattern: avoids gvl.write's
#         _write_from_vcf shortcut that hit #162's max_mem overshoot). ----
svar_dir = WORK / "chr1.svar"
if not svar_dir.exists():
    genoray.SparseVar.from_vcf(
        out=svar_dir,
        vcf=genoray.VCF(str(vcf_normed)),
        max_mem="32g",
        overwrite=False,
    )

# ---- 5. Build GVL dataset (BED-indexed, uses the pre-built SVAR) ----
gvl_dir = WORK / "chr1.gvl"
if not gvl_dir.exists():
    gvl.write(
        path=str(gvl_dir),
        bed=splice_bed,
        variants=genoray.SparseVar(svar_dir),
        max_mem="2g",
        overwrite=False,
    )

# ---- 6. The asymmetry. Same dataset, two open paths. ----

# 6a. open(splice_info, var_filter) — BROKEN
print("=== Path A: Dataset.open(splice_info, var_filter) ===", flush=True)
try:
    ds_a = gvl.Dataset.open(
        gvl_dir,
        reference=str(ref_fa),
        splice_info=("transcript_id", "exon_number"),
        var_filter="exonic",
    ).with_seqs("haplotypes")
    haps = ds_a[0, :].to_ak()
    print(f"Path A succeeded — shape {haps.layout}")
except Exception as e:
    print(f"Path A FAILED — {type(e).__name__}: {e}")

# 6b. with_seqs.with_settings(splice_info, var_filter) — WORKS
print("\n=== Path B: .with_seqs().with_settings(splice_info, var_filter) ===", flush=True)
ds_b = (
    gvl.Dataset.open(gvl_dir, reference=str(ref_fa))
    .with_seqs("haplotypes")
    .with_settings(
        splice_info=("transcript_id", "exon_number"),
        var_filter="exonic",
    )
)
haps_b = ds_b[0, :].to_ak()
print(f"Path B succeeded — first sample/phase bytes len={len(bytes(haps_b[0, 0]))}")
```

Expected output:
```
=== Path A: Dataset.open(splice_info, var_filter) ===
Path A FAILED — MemoryError: Allocation failed (probably too large).

=== Path B: .with_seqs().with_settings(splice_info, var_filter) ===
Path B succeeded — first sample/phase bytes len=NNNNN
```

Same SVAR, same GVL dataset, same reference, same kwargs — only the call order differs.

## Bisection evidence

Across two chromosomes (chr1 + chr2 of the same cohort) and four (GVL × genoray) version combos:

| seqlab + GVL config | chr1 | chr2 |
|---|---|---|
| Path A on GVL `main` (= 0.25.0) | MemoryError | MemoryError |
| Path A on GVL v0.24.1 + #170 cherry-pick | MemoryError | MemoryError |
| Path A on the above + `genoray==2.3.3` pin | MemoryError | negative-dim ValueError |
| **Path B** on the same stack | ✅ extract_done + upload_done | ✅ extract_done + upload_done |

Same pods, same data, same env — only the call order differs.

## Related issues / PRs

- **#170** (closed, merged today): fixed a *prior* layer of this kernel — `TypingError: slice(array, array)` from a missing `ndim` guard on the second prange loop. #176 is the next-layer bug that became visible only after #170 unmasked it.
- **#162** (open): different code path (`gvl.write` max_mem), but shows that the read-side and write-side both have memory-accounting / pathological-allocation sore spots.
- Seqlab-side workaround / revert that switches back to Path B: standardmodelbio/seqlab#31.

## Why this matters

`Dataset.open(splice_info=..., var_filter=...)` is a documented entry point and the natural call shape. Anyone relying on the `open()` signature today would hit this on cohort-scale data. Worth either fixing the `open()` path internally to route through the same code as `with_settings()`, or documenting that `splice_info` + `var_filter` MUST be applied via `with_settings()` until then.

I'm happy to dig further into where the state diverges between the two paths if it'd help — let me know.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dataset.open(splice_info, var_filter) breaks choose_exonic_variants; with_settings(splice_info, var_filter) works #176

Summary

Symptom

Environment

Full end-to-end reproduction

Bisection evidence

Related issues / PRs

Why this matters

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development


GenVarLoader	`0.24.1` PyPI and `main` post-#170 (both reproduce)
genoray	reproduces on `2.3.3`, `2.6.0` (whole range)
Python	3.12 (`runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404`)
OS	Ubuntu 24.04
Pod	16 vCPU / 64 GB / cpu3g on Runpod

seqlab + GVL config	chr1	chr2
Path A on GVL `main` (= 0.25.0)	MemoryError	MemoryError
Path A on GVL v0.24.1 + #170 cherry-pick	MemoryError	MemoryError
Path A on the above + `genoray==2.3.3` pin	MemoryError	negative-dim ValueError
Path B on the same stack	✅ extract_done + upload_done	✅ extract_done + upload_done

Dataset.open(splice_info, var_filter) breaks choose_exonic_variants; with_settings(splice_info, var_filter) works #176

Description

Summary

Symptom

Environment

Full end-to-end reproduction

Bisection evidence

Related issues / PRs

Why this matters

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions