Summary
Passing splice_info + var_filter="exonic" via Dataset.open(...) produces broken kernel state inside choose_exonic_variants (numerical overflow on n_variants → MemoryError or ValueError: negative dimensions not allowed), while passing the same arguments via with_settings(splice_info=..., var_filter=...) after with_seqs("haplotypes") works correctly on the same data.
Symptom
Traceback (most recent call last):
File "<repro>.py", line 89, in <module>
haps = ds[0, :].to_ak()
File "genvarloader/_dataset/_impl.py", line 2231, in __getitem__
return super().__getitem__(idx)
File "genvarloader/_dataset/_impl.py", line 1561, in __getitem__
recon, squeeze, out_reshape = self._getitem_spliced(idx, self._sp_idxer)
File "genvarloader/_dataset/_impl.py", line 1670, in _getitem_spliced
recon = inner_ds._recon(...)
File "genvarloader/_dataset/_reconstruct.py", line 354, in __call__
haps, *_ = self.get_haps_and_shifts(...)
File "genvarloader/_dataset/_reconstruct.py", line 395, in get_haps_and_shifts
keep, keep_offsets = choose_exonic_variants(...)
MemoryError: Allocation failed (probably too large).
# (chr1; equivalent run on chr2 raises `ValueError: negative dimensions not allowed`)
The kernel computes lengths[query, hap] = o_e - o_s then keep = np.empty(n_variants, np.bool_) where n_variants = keep_offsets[-1]. Either:
- The
(o_s, o_e) offsets are wrong via the open() setup path (negative or huge gap), or
_sp_idxer / _spliced_bed get a different internal shape via open(splice_info=...) than via with_settings(splice_info=...).
Environment
|
|
| GenVarLoader |
0.24.1 PyPI and main post-#170 (both reproduce) |
| genoray |
reproduces on 2.3.3, 2.6.0 (whole range) |
| Python |
3.12 (runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404) |
| OS |
Ubuntu 24.04 |
| Pod |
16 vCPU / 64 GB / cpu3g on Runpod |
Full end-to-end reproduction
Self-contained — downloads the 1KG NYGC 30x chr1 phased panel + reference + GTF, builds the SVAR+GVL pair, and demonstrates the asymmetry by opening the same dataset two ways. Tested with genvarloader==0.24.1 + the merged #170 fix on top.
# Pod env (assumes bcftools, pip, ~120 GB free disk).
pip install "genvarloader @ git+https://github.com/mcvickerlab/GenVarLoader@main" \
"genoray==2.3.3" pooch polars pyarrow pysam seqpro
# repro.py
import polars as pl
import pooch
import pysam
import subprocess
import sys
from pathlib import Path
import genoray, genvarloader as gvl
WORK = Path("/tmp/gvl-repro-176")
WORK.mkdir(exist_ok=True, parents=True)
# ---- 1. Inputs (1KG NYGC 30x phased chr1 + GRCh38 + Ensembl 113) ----
vcf_url = (
"https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/"
"1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/"
"1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"
)
vcf_raw = Path(pooch.retrieve(vcf_url, known_hash=None, path=WORK / "1kg"))
pooch.retrieve(vcf_url + ".tbi", known_hash=None, path=WORK / "1kg")
ref_url = ( # use whatever GRCh38 your environment already has
"https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/"
"GRCh38.primary_assembly.genome.fa.gz"
)
ref_fa = Path(pooch.retrieve(ref_url, known_hash=None, path=WORK / "ref"))
# Index + faidx if not already:
subprocess.run(["bgzip", "-r", str(ref_fa)], check=False)
subprocess.run(["samtools", "faidx", str(ref_fa)], check=True)
gtf_url = "https://ftp.ensembl.org/pub/release-113/gtf/homo_sapiens/Homo_sapiens.GRCh38.113.chr.gtf.gz"
gtf = Path(pooch.retrieve(gtf_url, known_hash=None, path=WORK / "gtf"))
# ---- 2. bcftools norm (left-align + atomize + split multi-allelic) ----
# GVL/genoray require this. See https://genvarloader.readthedocs.io/en/latest/write.html
vcf_normed = WORK / "chr1.normed.vcf.gz"
if not vcf_normed.exists():
subprocess.run([
"bcftools", "norm",
"-f", str(ref_fa),
"-a", "--atom-overlaps", ".",
"-m", "-any", "--multi-overlaps", ".",
"--threads", "8",
"-O", "z", "-o", str(vcf_normed),
str(vcf_raw),
], check=True)
subprocess.run(["bcftools", "index", "--tbi", str(vcf_normed)], check=True)
# ---- 3. Build the splice BED (per-transcript MANE Select CDS regions) ----
splice_bed = gvl.get_splice_bed(
str(gtf), transcript_support_level=None, m3=True,
).filter(pl.col("chrom") == "chr1")
# ---- 4. Build SVAR via genoray (two-step pattern: avoids gvl.write's
# _write_from_vcf shortcut that hit #162's max_mem overshoot). ----
svar_dir = WORK / "chr1.svar"
if not svar_dir.exists():
genoray.SparseVar.from_vcf(
out=svar_dir,
vcf=genoray.VCF(str(vcf_normed)),
max_mem="32g",
overwrite=False,
)
# ---- 5. Build GVL dataset (BED-indexed, uses the pre-built SVAR) ----
gvl_dir = WORK / "chr1.gvl"
if not gvl_dir.exists():
gvl.write(
path=str(gvl_dir),
bed=splice_bed,
variants=genoray.SparseVar(svar_dir),
max_mem="2g",
overwrite=False,
)
# ---- 6. The asymmetry. Same dataset, two open paths. ----
# 6a. open(splice_info, var_filter) — BROKEN
print("=== Path A: Dataset.open(splice_info, var_filter) ===", flush=True)
try:
ds_a = gvl.Dataset.open(
gvl_dir,
reference=str(ref_fa),
splice_info=("transcript_id", "exon_number"),
var_filter="exonic",
).with_seqs("haplotypes")
haps = ds_a[0, :].to_ak()
print(f"Path A succeeded — shape {haps.layout}")
except Exception as e:
print(f"Path A FAILED — {type(e).__name__}: {e}")
# 6b. with_seqs.with_settings(splice_info, var_filter) — WORKS
print("\n=== Path B: .with_seqs().with_settings(splice_info, var_filter) ===", flush=True)
ds_b = (
gvl.Dataset.open(gvl_dir, reference=str(ref_fa))
.with_seqs("haplotypes")
.with_settings(
splice_info=("transcript_id", "exon_number"),
var_filter="exonic",
)
)
haps_b = ds_b[0, :].to_ak()
print(f"Path B succeeded — first sample/phase bytes len={len(bytes(haps_b[0, 0]))}")
Expected output:
=== Path A: Dataset.open(splice_info, var_filter) ===
Path A FAILED — MemoryError: Allocation failed (probably too large).
=== Path B: .with_seqs().with_settings(splice_info, var_filter) ===
Path B succeeded — first sample/phase bytes len=NNNNN
Same SVAR, same GVL dataset, same reference, same kwargs — only the call order differs.
Bisection evidence
Across two chromosomes (chr1 + chr2 of the same cohort) and four (GVL × genoray) version combos:
| seqlab + GVL config |
chr1 |
chr2 |
Path A on GVL main (= 0.25.0) |
MemoryError |
MemoryError |
| Path A on GVL v0.24.1 + #170 cherry-pick |
MemoryError |
MemoryError |
Path A on the above + genoray==2.3.3 pin |
MemoryError |
negative-dim ValueError |
| Path B on the same stack |
✅ extract_done + upload_done |
✅ extract_done + upload_done |
Same pods, same data, same env — only the call order differs.
Related issues / PRs
Why this matters
Dataset.open(splice_info=..., var_filter=...) is a documented entry point and the natural call shape. Anyone relying on the open() signature today would hit this on cohort-scale data. Worth either fixing the open() path internally to route through the same code as with_settings(), or documenting that splice_info + var_filter MUST be applied via with_settings() until then.
I'm happy to dig further into where the state diverges between the two paths if it'd help — let me know.
Summary
Passing
splice_info+var_filter="exonic"viaDataset.open(...)produces broken kernel state insidechoose_exonic_variants(numerical overflow onn_variants→MemoryErrororValueError: negative dimensions not allowed), while passing the same arguments viawith_settings(splice_info=..., var_filter=...)afterwith_seqs("haplotypes")works correctly on the same data.Symptom
The kernel computes
lengths[query, hap] = o_e - o_sthenkeep = np.empty(n_variants, np.bool_)wheren_variants = keep_offsets[-1]. Either:(o_s, o_e)offsets are wrong via theopen()setup path (negative or huge gap), or_sp_idxer/_spliced_bedget a different internal shape viaopen(splice_info=...)than viawith_settings(splice_info=...).Environment
0.24.1PyPI andmainpost-#170 (both reproduce)2.3.3,2.6.0(whole range)runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404)Full end-to-end reproduction
Self-contained — downloads the 1KG NYGC 30x chr1 phased panel + reference + GTF, builds the SVAR+GVL pair, and demonstrates the asymmetry by opening the same dataset two ways. Tested with
genvarloader==0.24.1+ the merged #170 fix on top.Expected output:
Same SVAR, same GVL dataset, same reference, same kwargs — only the call order differs.
Bisection evidence
Across two chromosomes (chr1 + chr2 of the same cohort) and four (GVL × genoray) version combos:
main(= 0.25.0)genoray==2.3.3pinSame pods, same data, same env — only the call order differs.
Related issues / PRs
TypingError: slice(array, array)from a missingndimguard on the second prange loop. Dataset.open(splice_info, var_filter) breaks choose_exonic_variants; with_settings(splice_info, var_filter) works #176 is the next-layer bug that became visible only after fix: ndim guard on geno_offsets in choose_exonic_variants second loop #170 unmasked it.gvl.writeexceedsmax_memby >2x on chrom-scale builds (peak >64 GiB withmax_mem="32g") #162 (open): different code path (gvl.writemax_mem), but shows that the read-side and write-side both have memory-accounting / pathological-allocation sore spots.Why this matters
Dataset.open(splice_info=..., var_filter=...)is a documented entry point and the natural call shape. Anyone relying on theopen()signature today would hit this on cohort-scale data. Worth either fixing theopen()path internally to route through the same code aswith_settings(), or documenting thatsplice_info+var_filterMUST be applied viawith_settings()until then.I'm happy to dig further into where the state diverges between the two paths if it'd help — let me know.