Stream /scores & /counts CSV responses (no row cap) #770

@bencap

Context

The CSV endpoints build the entire response in memory and return it as a fake stream: StreamingResponse(iter([csv_str])) in src/mavedb/routers/score_sets.py. This memory bomb is what let a ChatGPT-User crawler pull full, unpaginated CSVs and saturate prod-api. Full-CSV download is a legitimate feature (the UI download button requests the file with no pagination), so we stream rather than cap rows.
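To make the difference concrete, here is a minimal stdlib-only sketch of the two shapes. `buffered_csv` and `streamed_csv` are hypothetical names for illustration, not functions in the codebase; the first mimics the `iter([csv_str])` pattern, the second yields the header and then one chunk per row.

```python
import csv
import io
from typing import Iterable, Iterator, Sequence


def buffered_csv(rows: Iterable[dict], columns: Sequence[str]) -> Iterator[str]:
    """Mimic the current pattern: build the whole CSV, then wrap it in an iterator."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
    # A single chunk that holds the entire body: memory scales with CSV size.
    return iter([buf.getvalue()])


def streamed_csv(rows: Iterable[dict], columns: Sequence[str]) -> Iterator[str]:
    """Yield the header, then one small chunk per row, reusing a single buffer."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns)
    writer.writeheader()
    yield buf.getvalue()
    for row in rows:
        # Reset the buffer so only the current row is held in memory.
        buf.seek(0)
        buf.truncate(0)
        writer.writerow(row)
        yield buf.getvalue()
```

Both produce the same bytes once joined; only the chunking, and therefore peak memory, differs.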

Part of the effort to make /scores & /counts safe to reopen to AI agents.

Scope

  • Extract _derive_csv_columns(...) from get_score_set_variants_as_csv (src/mavedb/lib/score_sets.py), behavior-preserving, so the buffered and streaming paths share column logic.
  • Add stream_score_set_variants_as_csv(...): write the header, iterate the Variant query with stream_results=True, flush ~1k-row batches via variant_to_csv_row.
  • Use the request-scoped db session (no extra session).
  • Make drop_na_columns streamable via a single up-front aggregate query — bool_or(...) over hgvs_nt / hgvs_splice / hgvs_pro, with a Postgres regex mirroring is_null — instead of the all-rows Python scan; use DictWriter(extrasaction='ignore').
  • Retain the buffered build as a fallback only for namespaces needing all-rows side-queries (clingen/vep/gnomad/clinvar/post-mapped) — i.e. the custom-download endpoint when those are selected.
  • Point all three CSV endpoints at the generator.
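The batching and column-dropping steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: `keep_columns` and `stream_csv_in_batches` are hypothetical names, plain dicts stand in for `Variant` rows, and the `has_value` mapping stands in for the result of the up-front `bool_or(...)` aggregate (the actual SQL and regex are not shown here).

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator, List, Mapping, Sequence

# Columns covered by the up-front aggregate, per the scope above.
HGVS_COLUMNS = ("hgvs_nt", "hgvs_splice", "hgvs_pro")


def keep_columns(columns: Sequence[str], has_value: Mapping[str, bool]) -> List[str]:
    """Drop hgvs columns the aggregate reported as entirely null/NA.

    Assumption: `has_value` maps each hgvs column to True when at least one
    row carries a real value; a missing key is treated as "no value".
    """
    return [c for c in columns if c not in HGVS_COLUMNS or has_value.get(c, False)]


def stream_csv_in_batches(
    rows: Iterable[dict], columns: Sequence[str], batch_size: int = 1000
) -> Iterator[str]:
    """Yield the header, then one chunk of CSV text per batch of rows."""
    buf = io.StringIO()
    # extrasaction="ignore" lets rows keep keys for dropped columns
    # without DictWriter raising ValueError.
    writer = csv.DictWriter(buf, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    yield buf.getvalue()

    row_iter = iter(rows)
    while batch := list(islice(row_iter, batch_size)):
        buf.seek(0)
        buf.truncate(0)
        writer.writerows(batch)
        yield buf.getvalue()
```

Because the kept columns are decided before the first row is written, the header can go out immediately and no second pass over the rows is needed.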

Acceptance criteria

  • Byte-for-byte parity with current output across scores / counts / custom-namespace + drop_na_columns, including a test case with an all-"NA" hgvs column.
  • A streaming test on a score set large enough to force multiple chunks streams fully and correctly, and fails loudly if the request DB session closes mid-stream (guards the FastAPI 0.121 yield-dependency timing).
  • Peak memory for /scores & /counts no longer scales with full CSV size (spot-checked).
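One possible shape for the multi-chunk streaming test is sketched below. Everything here is a stand-in: `FakeSession` is a hypothetical substitute for the request-scoped DB session, and `stream_csv` is a simplified generator rather than the real `stream_score_set_variants_as_csv`. A real test would exercise the endpoint and the actual session, whose behavior on close may differ.

```python
import csv
import io
from itertools import islice


class FakeSession:
    """Hypothetical stand-in for the request-scoped DB session."""

    def __init__(self, n_rows):
        self.n_rows = n_rows
        self.closed = False

    def close(self):
        self.closed = True

    def iter_variants(self):
        for i in range(self.n_rows):
            # Fail loudly instead of silently truncating the CSV.
            if self.closed:
                raise RuntimeError("DB session closed mid-stream")
            yield {"accession": f"v{i}", "score": i}


def stream_csv(session, columns, batch_size=1000):
    """Simplified streaming generator: header first, then batched rows."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns)
    writer.writeheader()
    yield buf.getvalue()
    rows = session.iter_variants()
    while batch := list(islice(rows, batch_size)):
        buf.seek(0)
        buf.truncate(0)
        writer.writerows(batch)
        yield buf.getvalue()


def test_streams_all_chunks():
    chunks = list(stream_csv(FakeSession(2500), ["accession", "score"]))
    assert len(chunks) == 4  # header + batches of 1000, 1000, 500
    assert "".join(chunks).count("\r\n") == 2501  # header + 2500 rows


def test_fails_loudly_when_session_closes():
    session = FakeSession(2500)
    stream = stream_csv(session, ["accession", "score"])
    next(stream)  # header
    next(stream)  # first batch
    session.close()
    try:
        next(stream)
    except RuntimeError:
        return
    raise AssertionError("expected a failure after the session closed")
```

The second test closes the session between chunks, which is the situation the yield-dependency timing guard is meant to catch.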

Dependencies

None. Highest leverage — ship first.

Metadata

Assignees: No one assigned
Labels: app: backend (Task implementation touches the backend), type: enhancement (Enhancement to an existing feature)
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests