fix: add thread cleanup conftest for data tests to prevent CI hangs by NickGeneva · Pull Request #891 · NVIDIA/earth2studio

NickGeneva · 2026-05-30T00:15:15Z

Problem

Data source tests (test_cmip6.py, test_cds.py, etc.) cause CI to hang for 29+ minutes after pytest-timeout fires. The process was killed after 1748s with exit -15.

Stack traces show non-daemon threads blocking on:

ThreadPoolExecutor-1_* workers: queue.get(block=True)
fsspecIO: asyncio event loop select()
asyncio_0/1/2: default executor threads

Root Cause

When pytest-timeout fires via SIGALRM (signal method), it raises an exception in the main thread and the test is marked xfail. However, non-daemon ThreadPoolExecutor workers from intake-esm/xarray/dask remain alive, idle but blocking on work_queue.get(). At process shutdown, Python joins all non-daemon threads indefinitely — hence the hang.

Solution

Add test/data/conftest.py with an autouse fixture that cleans up orphaned threads after each data source test:

Stops fsspec's global asyncio IO loop and resets its singleton state (fsspec.asyn.loop[0], fsspec.asyn.iothread[0])
Discovers orphaned ThreadPoolExecutor instances via gc.get_objects() and calls shutdown(wait=False, cancel_futures=True)
Joins remaining non-daemon threads with a 1s timeout

Testing

Verified with a synthetic test that creates stuck threads + timeout → cleanup completes in ~2s
Existing data tests (wb2era5, arco): 15 xpassed, no regressions
Lexicon tests: 170 passed in 2.43s (unaffected)
All pre-commit hooks pass

Add test/data/conftest.py with an autouse fixture that cleans up orphaned threads after each data source test. When pytest-timeout fires via SIGALRM, non-daemon ThreadPoolExecutor workers from intake-esm/xarray/dask remain alive blocking on queue.get(), causing Python shutdown to join them indefinitely (observed 1748s hang in CI). The fixture: - Stops fsspec's global asyncio IO loop and resets its singleton state - Discovers orphaned ThreadPoolExecutor instances via gc and shuts them down - Joins remaining non-daemon threads with a 1s timeout This ensures timed-out tests no longer leave zombie threads that block process exit.

greptile-apps · 2026-05-30T00:18:09Z

Greptile Summary

This PR adds a test/data/conftest.py autouse fixture that tears down fsspec's global IO loop and orphaned ThreadPoolExecutor instances after each data source test, preventing CI hangs when pytest-timeout fires. It also fixes a device-resolution bug in cbottle.py where device_buffer.device could diverge from the actual device used by the wrapped CBottle3d model.

test/data/conftest.py: New autouse fixture snapshots live threads before each test, then on teardown stops fsspec's asyncio loop, resets its singleton state (loop[0], iothread[0]), and calls shutdown(wait=False, cancel_futures=True) on all live ThreadPoolExecutor objects found via gc.get_objects(), followed by a brief join on remaining non-daemon threads.
earth2studio/data/cbottle.py: __call__ now uses self.core_model.device (the actual parameter device of the wrapped cbottle model) instead of self.device_buffer.device (a registered-buffer proxy that can lag behind when the inner model is moved independently).

Confidence Score: 5/5

Safe to merge; changes are scoped to test infrastructure cleanup and a one-line device-resolution fix in the CBottle data source.

The conftest fixture is careful and well-documented, and the cbottle device fix is a targeted correction with no logic regressions. The only non-blocking concern is that the fixture accesses fsspec private internals without guarding against an AttributeError, but this is a future-proofing concern rather than a current defect.

No files require special attention; test/data/conftest.py is worth a second look only if the team is concerned about long-term fsspec API stability.

Important Files Changed

Filename	Overview
test/data/conftest.py	New autouse fixture cleanly handles fsspec teardown; accesses private fsspec internals (`loop[0]`, `iothread[0]`) without guarding against `AttributeError`, which could break all data tests if fsspec's internal API changes.
earth2studio/data/cbottle.py	Fixes device-resolution in `__call__` by using `core_model.device` instead of the buffer proxy; `get_cbottle_input` still uses `device_buffer.device` for the SST tensor but is functionally unaffected since the result is moved back to CPU before returning.

_{Reviews (2): Last reviewed commit: "fix: use core model device in CBottle3D ..." | Re-trigger Greptile}

NickGeneva · 2026-05-30T00:33:59Z

/blossom-ci

NickGeneva · 2026-05-30T01:11:41Z

@greptile-apps

CBottle3d's __init__ auto-selects CUDA when available (device=None). This means the model ends up on CUDA even when CBottle3D.to() is never called, causing device mismatches with device_buffer (on CPU) for sigma_max and regridder. Fix: pass device="cpu" to CBottle3d() so the model starts on CPU and only moves when CBottle3D.to(device) is explicitly called. Also use self.core_model.device in __call__ to derive the compute device from where model parameters actually reside.

Snapshot live ThreadPoolExecutor instances before each test via gc and only shut down executors that are new (not in the pre-test snapshot). This prevents the fixture from killing shared/global executors (e.g. dask global thread pool, pytest-asyncio worker pool) that later tests may depend on.

NickGeneva · 2026-05-30T01:33:16Z

/blossom-ci

Per-test executor shutdown broke dask/zarr/asyncio singleton thread pools that are lazily created and reused across tests. The new approach: 1. Per-test: only reset fsspec's global IO loop (daemon, safe to reset) 2. Session-end: shut down all executors and force os._exit(0) if any non-daemon threads remain stuck after a grace period This fixes RuntimeError('cannot schedule new futures after shutdown') in test_data_utils.py while still preventing CI hangs from stuck background threads after timeouts.

NickGeneva · 2026-05-30T02:52:41Z

/blossom-ci

…instances Filesystem objects (s3fs.S3FileSystem, etc.) cache the fsspec IO loop at construction time. Resetting the loop per-test invalidates any module-level filesystem instances (e.g. GEFS_FX in test_gefs.py), causing 'Loop is not running' errors in subsequent tests. Since the fsspecIO thread is a daemon (won't block process exit), there is no need to reset it between tests. The session-end fixture handles the actual CI hang problem by shutting down stuck non-daemon threads.

NickGeneva · 2026-05-30T04:08:23Z

/blossom-ci

pzharrington

Approving, but just double checking you intended to include the cbottle changes as well?

greptile-apps Bot reviewed May 30, 2026

View reviewed changes

Comment thread test/data/conftest.py

Comment thread test/data/conftest.py Outdated

NickGeneva force-pushed the ngeneva/pytest_data_fixes branch from 1746b99 to 02b03a6 Compare May 30, 2026 01:17

NickGeneva requested a review from pzharrington June 1, 2026 16:02

pzharrington approved these changes Jun 1, 2026

View reviewed changes

NickGeneva merged commit 9e7bbae into NVIDIA:main Jun 1, 2026
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: add thread cleanup conftest for data tests to prevent CI hangs#891

fix: add thread cleanup conftest for data tests to prevent CI hangs#891
NickGeneva merged 5 commits into
NVIDIA:mainfrom
NickGeneva:ngeneva/pytest_data_fixes

NickGeneva commented May 30, 2026

Uh oh!

greptile-apps Bot commented May 30, 2026 •

edited

Loading

Confidence Score: 5/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

NickGeneva commented May 30, 2026

Uh oh!

NickGeneva commented May 30, 2026

Uh oh!

NickGeneva commented May 30, 2026

Uh oh!

NickGeneva commented May 30, 2026

Uh oh!

NickGeneva commented May 30, 2026

Uh oh!

pzharrington left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

NickGeneva commented May 30, 2026

Problem

Root Cause

Solution

Testing

Uh oh!

greptile-apps Bot commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

NickGeneva commented May 30, 2026

Uh oh!

NickGeneva commented May 30, 2026

Uh oh!

NickGeneva commented May 30, 2026

Uh oh!

NickGeneva commented May 30, 2026

Uh oh!

NickGeneva commented May 30, 2026

Uh oh!

pzharrington left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps Bot commented May 30, 2026 •

edited

Loading