Skip to content

Dataset.open(splice_info, var_filter) breaks choose_exonic_variants; with_settings(splice_info, var_filter) works #176

@bschilder

Description

@bschilder

Summary

Passing splice_info + var_filter="exonic" via Dataset.open(...) produces broken kernel state inside choose_exonic_variants (numerical overflow on n_variantsMemoryError or ValueError: negative dimensions not allowed), while passing the same arguments via with_settings(splice_info=..., var_filter=...) after with_seqs("haplotypes") works correctly on the same data.

Symptom

Traceback (most recent call last):
  File "<repro>.py", line 89, in <module>
    haps = ds[0, :].to_ak()
  File "genvarloader/_dataset/_impl.py", line 2231, in __getitem__
    return super().__getitem__(idx)
  File "genvarloader/_dataset/_impl.py", line 1561, in __getitem__
    recon, squeeze, out_reshape = self._getitem_spliced(idx, self._sp_idxer)
  File "genvarloader/_dataset/_impl.py", line 1670, in _getitem_spliced
    recon = inner_ds._recon(...)
  File "genvarloader/_dataset/_reconstruct.py", line 354, in __call__
    haps, *_ = self.get_haps_and_shifts(...)
  File "genvarloader/_dataset/_reconstruct.py", line 395, in get_haps_and_shifts
    keep, keep_offsets = choose_exonic_variants(...)
MemoryError: Allocation failed (probably too large).
# (chr1; equivalent run on chr2 raises `ValueError: negative dimensions not allowed`)

The kernel computes lengths[query, hap] = o_e - o_s then keep = np.empty(n_variants, np.bool_) where n_variants = keep_offsets[-1]. Either:

  • The (o_s, o_e) offsets are wrong via the open() setup path (negative or huge gap), or
  • _sp_idxer / _spliced_bed get a different internal shape via open(splice_info=...) than via with_settings(splice_info=...).

Environment

GenVarLoader 0.24.1 PyPI and main post-#170 (both reproduce)
genoray reproduces on 2.3.3, 2.6.0 (whole range)
Python 3.12 (runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404)
OS Ubuntu 24.04
Pod 16 vCPU / 64 GB / cpu3g on Runpod

Full end-to-end reproduction

Self-contained — downloads the 1KG NYGC 30x chr1 phased panel + reference + GTF, builds the SVAR+GVL pair, and demonstrates the asymmetry by opening the same dataset two ways. Tested with genvarloader==0.24.1 + the merged #170 fix on top.

# Pod env (assumes bcftools, pip, ~120 GB free disk).
pip install "genvarloader @ git+https://github.com/mcvickerlab/GenVarLoader@main" \
            "genoray==2.3.3" pooch polars pyarrow pysam seqpro
# repro.py
import polars as pl
import pooch
import pysam
import subprocess
import sys
from pathlib import Path
import genoray, genvarloader as gvl

WORK = Path("/tmp/gvl-repro-176")
WORK.mkdir(exist_ok=True, parents=True)

# ---- 1. Inputs (1KG NYGC 30x phased chr1 + GRCh38 + Ensembl 113) ----
vcf_url = (
    "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/"
    "1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/"
    "1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"
)
vcf_raw = Path(pooch.retrieve(vcf_url, known_hash=None, path=WORK / "1kg"))
pooch.retrieve(vcf_url + ".tbi", known_hash=None, path=WORK / "1kg")

ref_url = (  # use whatever GRCh38 your environment already has
    "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/"
    "GRCh38.primary_assembly.genome.fa.gz"
)
ref_fa = Path(pooch.retrieve(ref_url, known_hash=None, path=WORK / "ref"))
# Index + faidx if not already:
subprocess.run(["bgzip", "-r", str(ref_fa)], check=False)
subprocess.run(["samtools", "faidx", str(ref_fa)], check=True)

gtf_url = "https://ftp.ensembl.org/pub/release-113/gtf/homo_sapiens/Homo_sapiens.GRCh38.113.chr.gtf.gz"
gtf = Path(pooch.retrieve(gtf_url, known_hash=None, path=WORK / "gtf"))

# ---- 2. bcftools norm (left-align + atomize + split multi-allelic) ----
# GVL/genoray require this. See https://genvarloader.readthedocs.io/en/latest/write.html
vcf_normed = WORK / "chr1.normed.vcf.gz"
if not vcf_normed.exists():
    subprocess.run([
        "bcftools", "norm",
        "-f", str(ref_fa),
        "-a", "--atom-overlaps", ".",
        "-m", "-any", "--multi-overlaps", ".",
        "--threads", "8",
        "-O", "z", "-o", str(vcf_normed),
        str(vcf_raw),
    ], check=True)
    subprocess.run(["bcftools", "index", "--tbi", str(vcf_normed)], check=True)

# ---- 3. Build the splice BED (per-transcript MANE Select CDS regions) ----
splice_bed = gvl.get_splice_bed(
    str(gtf), transcript_support_level=None, m3=True,
).filter(pl.col("chrom") == "chr1")

# ---- 4. Build SVAR via genoray (two-step pattern: avoids gvl.write's
#         _write_from_vcf shortcut that hit #162's max_mem overshoot). ----
svar_dir = WORK / "chr1.svar"
if not svar_dir.exists():
    genoray.SparseVar.from_vcf(
        out=svar_dir,
        vcf=genoray.VCF(str(vcf_normed)),
        max_mem="32g",
        overwrite=False,
    )

# ---- 5. Build GVL dataset (BED-indexed, uses the pre-built SVAR) ----
gvl_dir = WORK / "chr1.gvl"
if not gvl_dir.exists():
    gvl.write(
        path=str(gvl_dir),
        bed=splice_bed,
        variants=genoray.SparseVar(svar_dir),
        max_mem="2g",
        overwrite=False,
    )

# ---- 6. The asymmetry. Same dataset, two open paths. ----

# 6a. open(splice_info, var_filter) — BROKEN
print("=== Path A: Dataset.open(splice_info, var_filter) ===", flush=True)
try:
    ds_a = gvl.Dataset.open(
        gvl_dir,
        reference=str(ref_fa),
        splice_info=("transcript_id", "exon_number"),
        var_filter="exonic",
    ).with_seqs("haplotypes")
    haps = ds_a[0, :].to_ak()
    print(f"Path A succeeded — shape {haps.layout}")
except Exception as e:
    print(f"Path A FAILED — {type(e).__name__}: {e}")

# 6b. with_seqs.with_settings(splice_info, var_filter) — WORKS
print("\n=== Path B: .with_seqs().with_settings(splice_info, var_filter) ===", flush=True)
ds_b = (
    gvl.Dataset.open(gvl_dir, reference=str(ref_fa))
    .with_seqs("haplotypes")
    .with_settings(
        splice_info=("transcript_id", "exon_number"),
        var_filter="exonic",
    )
)
haps_b = ds_b[0, :].to_ak()
print(f"Path B succeeded — first sample/phase bytes len={len(bytes(haps_b[0, 0]))}")

Expected output:

=== Path A: Dataset.open(splice_info, var_filter) ===
Path A FAILED — MemoryError: Allocation failed (probably too large).

=== Path B: .with_seqs().with_settings(splice_info, var_filter) ===
Path B succeeded — first sample/phase bytes len=NNNNN

Same SVAR, same GVL dataset, same reference, same kwargs — only the call order differs.

Bisection evidence

Across two chromosomes (chr1 + chr2 of the same cohort) and four (GVL × genoray) version combos:

seqlab + GVL config chr1 chr2
Path A on GVL main (= 0.25.0) MemoryError MemoryError
Path A on GVL v0.24.1 + #170 cherry-pick MemoryError MemoryError
Path A on the above + genoray==2.3.3 pin MemoryError negative-dim ValueError
Path B on the same stack ✅ extract_done + upload_done ✅ extract_done + upload_done

Same pods, same data, same env — only the call order differs.

Related issues / PRs

Why this matters

Dataset.open(splice_info=..., var_filter=...) is a documented entry point and the natural call shape. Anyone relying on the open() signature today would hit this on cohort-scale data. Worth either fixing the open() path internally to route through the same code as with_settings(), or documenting that splice_info + var_filter MUST be applied via with_settings() until then.

I'm happy to dig further into where the state diverges between the two paths if it'd help — let me know.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions