feat(enrichr): retrieve gene sets incl. MSigDB collections (#139) #241

Draft
Elarwei001 wants to merge 13 commits into scverse:dev from Elarwei001:feature/enrichr-msigdb-139

@Elarwei001

Contributor

Resolves #139

Summary

gget enrichr: Added support for retrieving gene sets, including MSigDB collections (fixes issue 139).

Testing

Unit tests added/extended in tests/test_enrichr.py with fixture entries in tests/fixtures/test_enrichr.json; run with pytest.

lauraluebbert and others added 12 commits June 21, 2026 22:01
…#178, scverse#177) (scverse#222)

* feat(pdb): support PDBx/mmCIF format and auto-fallback (scverse#178, scverse#177)

The legacy PDB format is being phased out by RCSB and is unavailable for
large structures (e.g. 6Q38, 7A01), causing `gget pdb` to fail with
"not found" — the bug reported in scverse#177.

- Add `resource="mmcif"` to download the structure in PDBx/mmCIF (.cif).
- `resource="pdb"` (default) now automatically falls back to PDBx/mmCIF
  when the legacy PDB file is unavailable, logging a warning. Saved files
  use the correct extension (.cif vs .pdb) based on the format fetched.
- Backward compatible: existing commands that already worked are unchanged.
- Tests: explicit mmcif download + legacy->mmcif fallback regression (6Q38).
- Docs + updates.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
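
The fallback described in this commit can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not gget's implementation: `fetch_structure`, the injectable `fetch` callable, and `fake_fetch` are hypothetical names, and the RCSB download URL pattern is assumed for illustration.

```python
import logging
from typing import Callable, Optional, Tuple

logger = logging.getLogger(__name__)

# Assumed URL pattern for RCSB file downloads (illustrative only).
RCSB_URL = "https://files.rcsb.org/download/{pdb_id}.{ext}"


def fetch_structure(
    pdb_id: str,
    fetch: Callable[[str], Optional[str]],
    resource: str = "pdb",
) -> Tuple[str, str]:
    """Return (file text, file extension), falling back from PDB to mmCIF.

    `fetch` takes a URL and returns the file text, or None when the file
    is unavailable. It is injected so the sketch runs without a network.
    """
    if resource not in ("pdb", "mmcif"):
        raise ValueError(f"Unsupported resource: {resource!r}")

    if resource == "pdb":
        text = fetch(RCSB_URL.format(pdb_id=pdb_id, ext="pdb"))
        if text is not None:
            return text, "pdb"
        # Legacy file missing: warn, then try PDBx/mmCIF below.
        logger.warning(
            "Legacy PDB file unavailable for %s; falling back to PDBx/mmCIF.",
            pdb_id,
        )

    text = fetch(RCSB_URL.format(pdb_id=pdb_id, ext="cif"))
    if text is None:
        raise ValueError(f"Structure {pdb_id} not found.")
    return text, "cif"


def fake_fetch(url: str) -> Optional[str]:
    """Stand-in for a network call: only the mmCIF file 'exists'."""
    return "data_6Q38" if url.endswith(".cif") else None
```

Returning the extension alongside the text lets the caller save the file as `.cif` or `.pdb` according to the format that was actually fetched.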

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: ruff lint errors flagged by pre-commit.ci

- gget_g2p.py: collapse the multi-line docstring summary into a
  single line + blank line (ruff D205).
- main.py: add the missing # noqa: E402 to the 6 new import lines
  (g2p, ref, search, seq, setup, virus). All earlier imports already
  carry this noqa because dt_string is computed at module top before
  the import block, so E402 fires on any unmarked later import.
  Also drops the stray "# Module functions" comment that was
  splitting one alphabetical import list into two.
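
A small sketch of the two lint patterns named above. The module contents are hypothetical (the `dt_string` format and the `summarize` function are made up for illustration); only the rule codes D205 and E402 come from the commit message.

```python
import datetime

# Module-level code that runs before the remaining imports.
dt_string = datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S")

# An import placed after executable code triggers E402 unless it is marked.
import json  # noqa: E402


def summarize(payload: dict) -> str:
    """Serialize a payload together with the module start time.

    D205 expects the one-line summary above to be followed by a blank
    line before any further description, as done here.
    """
    return json.dumps({"started": dt_string, **payload})
```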

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Laura Luebbert <laura.lbt60@gmail.com>
Send `User-Agent: gget/<version> (+https://github.com/scverse/gget)` on
all Bgee API calls so the upstream service can attribute traffic to gget
and reach the project if needed.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
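
A minimal sketch of how such a header could be built. `bgee_headers` and `PROJECT_URL` are hypothetical names introduced for illustration; only the header format itself comes from the commit message.

```python
from typing import Dict

PROJECT_URL = "https://github.com/scverse/gget"


def bgee_headers(version: str) -> Dict[str, str]:
    """Build request headers that identify gget to the upstream service."""
    return {"User-Agent": f"gget/{version} (+{PROJECT_URL})"}


# With the requests library, the headers would be passed on every call, e.g.
# requests.get(url, headers=bgee_headers(version)).
```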
…+ recent main bot commits

# Conflicts:
#	.github/badges/tests.json
#	tests/pytest_results.txt
…cverse#225)

* feat(types): pay down 67 mypy errors (var-annotated + json overloads)

mypy baseline: 613 → 546 (−67, ~11%). No behavior changes — every
edit is annotation-only and verified with python -m py_compile +
`import gget` smoke test + `pytest --collect-only` (400 tests, 0
collection errors). Resolves part of scverse#216.

Two passes:

1. var-annotated quick wins (−21 errors, → 0 remaining)
   - Added explicit type annotations to 13 empty container literals
     across utils.py, gget_ref.py, gget_info.py, gget_blat.py,
     gget_muscle.py, gget_virus.py. Inferred element type from
     surrounding code (list[str], dict[str, Any], etc.) — fell back
     to Any only when the type was genuinely dynamic.

2. typing.overload for the json= flag pattern (−~20 union-attr,
   plus ~26 other category errors that depended on the narrowed
   return type)
   - Added @typing.overload signatures for the 12 modules with
     `def f(..., json: bool = False, ...) -> DataFrame | dict`:
     gget_8cube, gget_archs4, gget_bgee, gget_blast, gget_blat,
     gget_cosmic, gget_diamond, gget_elm, gget_enrichr, gget_info,
     gget_opentargets, gget_search.
   - Now `f(...)` returns DataFrame and `f(..., json=True)` returns
     dict at the type-check level. Implementation signature unchanged.
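
Both passes can be illustrated with a short sketch. It is an assumption-laden example rather than code from the PR: `Frame` is a stand-in for `pandas.DataFrame` so the block has no third-party dependency, and `lookup` is a hypothetical function following the `json: bool = False` pattern.

```python
from typing import Any, Dict, List, Literal, Union, overload


class Frame:
    """Stand-in for pandas.DataFrame (hypothetical, dependency-free)."""

    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self.rows = rows


@overload
def lookup(genes: List[str], json: Literal[False] = False) -> Frame: ...
@overload
def lookup(genes: List[str], json: Literal[True]) -> Dict[str, Any]: ...


def lookup(genes: List[str], json: bool = False) -> Union[Frame, Dict[str, Any]]:
    """Return a Frame by default, or a dict when json=True."""
    # Pass 1: an explicit annotation on an empty literal avoids [var-annotated].
    rows: List[Dict[str, Any]] = []
    for gene in genes:
        rows.append({"gene": gene})
    # Pass 2: the overloads above let a type checker narrow the return type.
    if json:
        return {"results": rows}
    return Frame(rows)
```

With these overloads a type checker treats `lookup(genes).rows` as valid and `lookup(genes, json=True)` as a dict, so no `union-attr` error is reported at either call site. A call that passes a non-literal `bool` would need a third overload accepting `bool`.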

Why only 67 and not the predicted ~150:
- Most remaining [union-attr] errors come from BeautifulSoup
  (`Tag | None` from `.find()`) and `str | None` checks, not the
  json= flag pattern. Those need per-callsite None-guards, which
  is the next batch.

Remaining categories (sorted, top 6):
  [index]        ~157  (pandas df["col"] indexing — needs cast() or # type: ignore)
  [union-attr]   115   (BeautifulSoup / str|None — needs None-guards)
  [attr-defined]  68   (dynamic JSON response shapes)
  [call-overload] 58   (pandas/numpy stubs)
  [assignment]   ~56
  [arg-type]     ~53

* fix(bgee): restore Literal + overload imports lost in dev merge

The merge of dev into feat/mypy-cleanup (925f66d) collided on
gget_bgee.py's typing import line. Git auto-resolved by taking
dev's version (`from typing import TYPE_CHECKING, Any` — added by
the bgee user-agent PR scverse#224) and silently dropped the `Literal,
overload` additions from this branch, while keeping the @overload
decorators at lines 183/192 that use them. Result: module-load
NameError that broke test collection for every test file that
imports gget.

Restore the full import: TYPE_CHECKING, Any, Literal, overload.

* ci: re-trigger pre-commit.ci

* fix(pre-commit): exclude .github/badges/*.json from formatting

The badge JSON is regenerated by ci.yml's "Generate tests badge JSON"
step using json.dumps() with no `indent` parameter — single-line
compact output. biome's default JSON formatter wants multi-line tab-
indented output. So every CI run writes the compact form, and every
pre-commit.ci run reformats it back, and we get a permanent
biome-format failure that never resolves.

Same fix as the tests/pytest_results.txt entry: just exclude the
auto-generated file from formatting hooks.
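
The formatting mismatch can be reproduced with the standard library alone. The badge payload below is hypothetical; the point is only the difference between `json.dumps()` with and without `indent`.

```python
import json

# Hypothetical badge payload, used only to show the two output shapes.
badge = {"label": "tests", "message": "passing"}

# What the CI step writes: no `indent`, so everything is on one line.
compact = json.dumps(badge)

# What a tab-indenting formatter expects: one key per line.
formatted = json.dumps(badge, indent="\t")

# Same data, different text, so each tool keeps rewriting the other's output.
same_data = json.loads(compact) == json.loads(formatted)
```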
Add `gget.enrichr_library()` (CLI: `gget enrichr --get_library`) to fetch the
gene sets (members) of any Enrichr gene-set library — the recommended way to
retrieve MSigDB gene sets (e.g. MSigDB_Hallmark_2020) without MSigDB login.

- Returns a long-format DataFrame (gene_set, gene), or a {gene_set: [genes]}
  dict with json=True. `gene_set=` returns a single set; `species` selects the
  non-human Enrichr variants.
- CLI: new --get_library/-gl and --gene_set/-gs; genes/--database made optional
  in library mode (still enforced for enrichment). Backward compatible.
- Detects Enrichr's HTML-404 (HTTP 200) response for unknown libraries.
- Tests + fixtures (live Enrichr) and docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
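
A rough sketch of the parsing and error detection this commit describes. It is not the PR's code: `parse_library` and `to_long_format` are hypothetical helpers, the tab-separated layout (set name, empty field, then genes) is an assumption about the library text, and the sample set names in the usage note are made up. Plain tuples stand in for DataFrame rows to keep the block dependency-free.

```python
from typing import Dict, List, Optional, Tuple


def parse_library(text: str, gene_set: Optional[str] = None) -> Dict[str, List[str]]:
    """Parse tab-separated library text into {gene_set: [genes]}."""
    # An unknown library may come back as an HTML page with HTTP 200,
    # so inspect the body instead of relying on the status code.
    if text.lstrip().lower().startswith(("<!doctype", "<html")):
        raise ValueError("Library not found (received an HTML page).")

    library: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, *fields = line.split("\t")
        library[name] = [gene for gene in fields if gene.strip()]

    if not library:
        raise ValueError("Library is empty.")
    if gene_set is not None:
        if gene_set not in library:
            raise ValueError(f"Gene set {gene_set!r} not found.")
        return {gene_set: library[gene_set]}
    return library


def to_long_format(library: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Flatten {gene_set: [genes]} into (gene_set, gene) rows."""
    return [(name, gene) for name, genes in library.items() for gene in genes]
```

For example, `parse_library("SET_A\t\tTP53\tBRCA1\nSET_B\t\tMYC\n")` yields a two-entry dict, and `to_long_format` turns it into three `(gene_set, gene)` rows.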
@codecov-commenter

codecov-commenter commented Jun 24, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.62%. Comparing base (5cf607f) to head (69cc3ea).
⚠️ Report is 1 commit behind head on dev.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #241      +/-   ##
==========================================
+ Coverage   56.14%   56.62%   +0.48%     
==========================================
  Files          29       29              
  Lines        9244     9285      +41     
==========================================
+ Hits         5190     5258      +68     
+ Misses       4054     4027      -27     

Add a network-free TestEnrichrLibraryOffline class that mocks requests
to cover enrichr_library: invalid species, verbose logging, blank-line
parsing, bad/empty library errors, gene_set filter + not-found, and the
json/json+save/CSV-save branches. All PR-added lines now covered.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
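
The network-free approach can be sketched with `unittest.mock`. This is an illustration under assumptions, not the PR's test class: `fetch_library_text` is a hypothetical helper that takes the HTTP getter as an argument, and the URL is a placeholder. The actual tests are described as mocking `requests`; injection is used here so the sketch needs only the standard library.

```python
from unittest import mock


def fetch_library_text(library: str, http_get) -> str:
    """Hypothetical helper: fetch raw library text via an injected getter."""
    response = http_get(
        "https://example.invalid/geneSetLibrary",  # placeholder URL
        params={"libraryName": library},
    )
    if not response.text.strip():
        raise ValueError(f"Library {library!r} is empty.")
    return response.text


def check_empty_library_error() -> str:
    """Offline check: a mocked response with a blank body must raise."""
    fake_get = mock.Mock(return_value=mock.Mock(status_code=200, text=" \n"))
    try:
        fetch_library_text("Missing_Library", fake_get)
    except ValueError as err:
        # The getter was called exactly once and no real request was made.
        fake_get.assert_called_once()
        return str(err)
    raise AssertionError("expected ValueError")
```

When the code under test imports `requests` directly, the same fake can be installed with `mock.patch("requests.get", fake_get)` instead of being passed in.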
@Elarwei001 Elarwei001 marked this pull request as draft June 25, 2026 03:44
@lauraluebbert lauraluebbert deleted the branch scverse:dev June 28, 2026 20:31
@lauraluebbert lauraluebbert reopened this Jun 28, 2026