Skip to content

fix: add thread cleanup conftest for data tests to prevent CI hangs#891

Merged
NickGeneva merged 5 commits into
NVIDIA:mainfrom
NickGeneva:ngeneva/pytest_data_fixes
Jun 1, 2026
Merged

fix: add thread cleanup conftest for data tests to prevent CI hangs#891
NickGeneva merged 5 commits into
NVIDIA:mainfrom
NickGeneva:ngeneva/pytest_data_fixes

Conversation

@NickGeneva
Copy link
Copy Markdown
Collaborator

Problem

Data source tests (test_cmip6.py, test_cds.py, etc.) cause CI to hang for 29+ minutes after pytest-timeout fires. The process was killed after 1748s with exit -15.

Stack traces show non-daemon threads blocking on:

  • ThreadPoolExecutor-1_* workers: queue.get(block=True)
  • fsspecIO: asyncio event loop select()
  • asyncio_0/1/2: default executor threads

Root Cause

When pytest-timeout fires via SIGALRM (signal method), it raises an exception in the main thread and the test is marked xfail. However, non-daemon ThreadPoolExecutor workers from intake-esm/xarray/dask remain alive, idle but blocking on work_queue.get(). At process shutdown, Python joins all non-daemon threads indefinitely — hence the hang.

Solution

Add test/data/conftest.py with an autouse fixture that cleans up orphaned threads after each data source test:

  1. Stops fsspec's global asyncio IO loop and resets its singleton state (fsspec.asyn.loop[0], fsspec.asyn.iothread[0])
  2. Discovers orphaned ThreadPoolExecutor instances via gc.get_objects() and calls shutdown(wait=False, cancel_futures=True)
  3. Joins remaining non-daemon threads with a 1s timeout

Testing

  • Verified with a synthetic test that creates stuck threads + timeout → cleanup completes in ~2s
  • Existing data tests (wb2era5, arco): 15 xpassed, no regressions
  • Lexicon tests: 170 passed in 2.43s (unaffected)
  • All pre-commit hooks pass

Add test/data/conftest.py with an autouse fixture that cleans up orphaned
threads after each data source test. When pytest-timeout fires via SIGALRM,
non-daemon ThreadPoolExecutor workers from intake-esm/xarray/dask remain
alive blocking on queue.get(), causing Python shutdown to join them
indefinitely (observed 1748s hang in CI).

The fixture:
- Stops fsspec's global asyncio IO loop and resets its singleton state
- Discovers orphaned ThreadPoolExecutor instances via gc and shuts them down
- Joins remaining non-daemon threads with a 1s timeout

This ensures timed-out tests no longer leave zombie threads that block
process exit.
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 30, 2026

Greptile Summary

This PR adds a test/data/conftest.py autouse fixture that tears down fsspec's global IO loop and orphaned ThreadPoolExecutor instances after each data source test, preventing CI hangs when pytest-timeout fires. It also fixes a device-resolution bug in cbottle.py where device_buffer.device could diverge from the actual device used by the wrapped CBottle3d model.

  • test/data/conftest.py: New autouse fixture snapshots live threads before each test, then on teardown stops fsspec's asyncio loop, resets its singleton state (loop[0], iothread[0]), and calls shutdown(wait=False, cancel_futures=True) on all live ThreadPoolExecutor objects found via gc.get_objects(), followed by a brief join on remaining non-daemon threads.
  • earth2studio/data/cbottle.py: __call__ now uses self.core_model.device (the actual parameter device of the wrapped cbottle model) instead of self.device_buffer.device (a registered-buffer proxy that can lag behind when the inner model is moved independently).

Confidence Score: 5/5

Safe to merge; changes are scoped to test infrastructure cleanup and a one-line device-resolution fix in the CBottle data source.

The conftest fixture is careful and well-documented, and the cbottle device fix is a targeted correction with no logic regressions. The only non-blocking concern is that the fixture accesses fsspec private internals without guarding against an AttributeError, but this is a future-proofing concern rather than a current defect.

No files require special attention; test/data/conftest.py is worth a second look only if the team is concerned about long-term fsspec API stability.

Important Files Changed

Filename Overview
test/data/conftest.py New autouse fixture cleanly handles fsspec teardown; accesses private fsspec internals (loop[0], iothread[0]) without guarding against AttributeError, which could break all data tests if fsspec's internal API changes.
earth2studio/data/cbottle.py Fixes device-resolution in __call__ by using core_model.device instead of the buffer proxy; get_cbottle_input still uses device_buffer.device for the SST tensor but is functionally unaffected since the result is moved back to CPU before returning.

Reviews (2): Last reviewed commit: "fix: use core model device in CBottle3D ..." | Re-trigger Greptile

Comment thread test/data/conftest.py
Comment thread test/data/conftest.py Outdated
@NickGeneva
Copy link
Copy Markdown
Collaborator Author

/blossom-ci

@NickGeneva
Copy link
Copy Markdown
Collaborator Author

@greptile-apps

CBottle3d's __init__ auto-selects CUDA when available (device=None).
This means the model ends up on CUDA even when CBottle3D.to() is never
called, causing device mismatches with device_buffer (on CPU) for
sigma_max and regridder.

Fix: pass device="cpu" to CBottle3d() so the model starts on CPU and
only moves when CBottle3D.to(device) is explicitly called. Also use
self.core_model.device in __call__ to derive the compute device from
where model parameters actually reside.
@NickGeneva NickGeneva force-pushed the ngeneva/pytest_data_fixes branch from 1746b99 to 02b03a6 Compare May 30, 2026 01:17
Snapshot live ThreadPoolExecutor instances before each test via gc and
only shut down executors that are new (not in the pre-test snapshot).
This prevents the fixture from killing shared/global executors (e.g.
dask global thread pool, pytest-asyncio worker pool) that later tests
may depend on.
@NickGeneva
Copy link
Copy Markdown
Collaborator Author

/blossom-ci

Per-test executor shutdown broke dask/zarr/asyncio singleton thread
pools that are lazily created and reused across tests.  The new approach:

1. Per-test: only reset fsspec's global IO loop (daemon, safe to reset)
2. Session-end: shut down all executors and force os._exit(0) if any
   non-daemon threads remain stuck after a grace period

This fixes RuntimeError('cannot schedule new futures after shutdown')
in test_data_utils.py while still preventing CI hangs from stuck
background threads after timeouts.
@NickGeneva
Copy link
Copy Markdown
Collaborator Author

/blossom-ci

…instances

Filesystem objects (s3fs.S3FileSystem, etc.) cache the fsspec IO loop
at construction time. Resetting the loop per-test invalidates any
module-level filesystem instances (e.g. GEFS_FX in test_gefs.py),
causing 'Loop is not running' errors in subsequent tests.

Since the fsspecIO thread is a daemon (won't block process exit), there
is no need to reset it between tests. The session-end fixture handles
the actual CI hang problem by shutting down stuck non-daemon threads.
@NickGeneva
Copy link
Copy Markdown
Collaborator Author

/blossom-ci

@NickGeneva NickGeneva requested a review from pzharrington June 1, 2026 16:02
Copy link
Copy Markdown
Collaborator

@pzharrington pzharrington left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving, but just double checking you intended to include the cbottle changes as well?

@NickGeneva NickGeneva merged commit 9e7bbae into NVIDIA:main Jun 1, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants