fix: add thread cleanup conftest for data tests to prevent CI hangs #891
Add test/data/conftest.py with an autouse fixture that cleans up orphaned threads after each data source test. When pytest-timeout fires via SIGALRM, non-daemon ThreadPoolExecutor workers from intake-esm/xarray/dask remain alive blocking on queue.get(), causing Python shutdown to join them indefinitely (observed 1748s hang in CI).

The fixture:
- Stops fsspec's global asyncio IO loop and resets its singleton state
- Discovers orphaned ThreadPoolExecutor instances via gc and shuts them down
- Joins remaining non-daemon threads with a 1s timeout

This ensures timed-out tests no longer leave zombie threads that block process exit.
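For reference, the executor discovery and thread join steps could look roughly like the sketch below. It uses only the standard library; the function names are hypothetical and not taken from the actual conftest.py, and the fsspec step is left out because it depends on fsspec internals.

```python
import gc
import threading
from concurrent.futures import ThreadPoolExecutor


def shutdown_orphaned_executors() -> int:
    """Shut down every ThreadPoolExecutor the garbage collector can see.

    Hypothetical helper: returns how many executors were asked to stop.
    """
    count = 0
    for obj in gc.get_objects():
        try:
            is_executor = isinstance(obj, ThreadPoolExecutor)
        except Exception:
            # Some tracked objects (e.g. dead weakref proxies) raise on inspection.
            continue
        if is_executor:
            # wait=False returns immediately; cancel_futures (Python 3.9+)
            # drops queued work that has not started yet.
            obj.shutdown(wait=False, cancel_futures=True)
            count += 1
    return count


def join_non_daemon_threads(timeout: float = 1.0) -> list[threading.Thread]:
    """Join leftover non-daemon threads; return the ones still alive."""
    skip = {threading.main_thread(), threading.current_thread()}
    stuck = []
    for thread in threading.enumerate():
        if thread in skip or thread.daemon:
            continue
        thread.join(timeout=timeout)
        if thread.is_alive():
            stuck.append(thread)
    return stuck
```

Note that this version stops every executor in the process, including ones that other code may still rely on.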
Greptile Summary
This PR adds a
| Filename | Overview |
|---|---|
| test/data/conftest.py | New autouse fixture cleanly handles fsspec teardown; accesses private fsspec internals (loop[0], iothread[0]) without guarding against AttributeError, which could break all data tests if fsspec's internal API changes. |
| earth2studio/data/cbottle.py | Fixes device-resolution in __call__ by using core_model.device instead of the buffer proxy; get_cbottle_input still uses device_buffer.device for the SST tensor but is functionally unaffected since the result is moved back to CPU before returning. |
Reviews (2): Last reviewed commit: "fix: use core model device in CBottle3D ..."
/blossom-ci
CBottle3d's __init__ auto-selects CUDA when available (device=None). This means the model ends up on CUDA even when CBottle3D.to() is never called, causing device mismatches with device_buffer (on CPU) for sigma_max and regridder. Fix: pass device="cpu" to CBottle3d() so the model starts on CPU and only moves when CBottle3D.to(device) is explicitly called. Also use self.core_model.device in __call__ to derive the compute device from where model parameters actually reside.
1746b99 to 02b03a6
Snapshot live ThreadPoolExecutor instances before each test via gc and only shut down executors that are new (not in the pre-test snapshot). This prevents the fixture from killing shared/global executors (e.g. dask global thread pool, pytest-asyncio worker pool) that later tests may depend on.
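A minimal sketch of the snapshot idea, written as a plain context manager so it runs without pytest (in a conftest the same body would presumably sit around the `yield` of an autouse fixture). The names `live_executors` and `shutdown_new_executors` are hypothetical. A `WeakSet` is used for the snapshot here so that it does not keep executors alive; that is a choice of this sketch, not something the commit message specifies.

```python
import gc
import weakref
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


def live_executors() -> list:
    """Return all ThreadPoolExecutor instances currently tracked by gc."""
    found = []
    for obj in gc.get_objects():
        try:
            if isinstance(obj, ThreadPoolExecutor):
                found.append(obj)
        except Exception:
            # Skip objects that raise when inspected.
            continue
    return found


@contextmanager
def shutdown_new_executors():
    """Shut down only the executors created inside the `with` block."""
    before = weakref.WeakSet(live_executors())  # pre-test snapshot
    try:
        yield
    finally:
        for executor in live_executors():
            if executor not in before:
                executor.shutdown(wait=False, cancel_futures=True)
```

An executor that a library creates lazily inside the block and then reuses globally also counts as "new" here, so it would be shut down as well.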
/blossom-ci
Per-test executor shutdown broke dask/zarr/asyncio singleton thread
pools that are lazily created and reused across tests. The new approach:
1. Per-test: only reset fsspec's global IO loop (daemon, safe to reset)
2. Session-end: shut down all executors and force os._exit(0) if any
non-daemon threads remain stuck after a grace period
This fixes RuntimeError('cannot schedule new futures after shutdown')
in test_data_utils.py while still preventing CI hangs from stuck
background threads after timeouts.
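The grace-period and forced-exit step might be sketched as follows, assuming executors have already been shut down. The helper names and the injectable `exit_fn` parameter are hypothetical (added so the logic can be exercised without killing the interpreter); the exit code 0 follows the commit message above.

```python
import os
import threading
import time


def stuck_non_daemon_threads(grace: float = 2.0) -> list:
    """Wait up to `grace` seconds in total; return non-daemon threads still alive."""
    deadline = time.monotonic() + grace
    skip = {threading.main_thread(), threading.current_thread()}
    stuck = []
    for thread in threading.enumerate():
        if thread in skip or thread.daemon:
            continue
        # Share one deadline across all threads instead of waiting `grace` for each.
        thread.join(timeout=max(0.0, deadline - time.monotonic()))
        if thread.is_alive():
            stuck.append(thread)
    return stuck


def force_exit_if_stuck(grace: float = 2.0, exit_fn=os._exit) -> None:
    """Call exit_fn(0) if any non-daemon thread outlives the grace period."""
    if stuck_non_daemon_threads(grace):
        # os._exit skips normal interpreter shutdown, so Python never tries
        # to join the stuck threads. It also skips flushing stdio buffers.
        exit_fn(0)
```

Because `os._exit` bypasses cleanup handlers, anything that must reach the CI log should be flushed before this point.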
/blossom-ci
…instances
Filesystem objects (s3fs.S3FileSystem, etc.) cache the fsspec IO loop at construction time. Resetting the loop per-test invalidates any module-level filesystem instances (e.g. GEFS_FX in test_gefs.py), causing 'Loop is not running' errors in subsequent tests. Since the fsspecIO thread is a daemon (won't block process exit), there is no need to reset it between tests. The session-end fixture handles the actual CI hang problem by shutting down stuck non-daemon threads.
/blossom-ci
pzharrington left a comment
Approving, but just double checking you intended to include the cbottle changes as well?
Problem
Data source tests (test_cmip6.py, test_cds.py, etc.) cause CI to hang for 29+ minutes after pytest-timeout fires. The process was killed after 1748s with exit -15.

Stack traces show non-daemon threads blocking on:
- ThreadPoolExecutor-1_* workers: queue.get(block=True)
- fsspecIO: asyncio event loop select()
- asyncio_0/1/2: default executor threads

Root Cause
When pytest-timeout fires via SIGALRM (signal method), it raises an exception in the main thread and the test is marked xfail. However, non-daemon ThreadPoolExecutor workers from intake-esm/xarray/dask remain alive, idle but blocking on work_queue.get(). At process shutdown, Python joins all non-daemon threads indefinitely, hence the hang.

Solution
Add test/data/conftest.py with an autouse fixture that cleans up orphaned threads after each data source test:
- Stops fsspec's global asyncio IO loop and resets its singleton state (fsspec.asyn.loop[0], fsspec.asyn.iothread[0])
- Discovers orphaned ThreadPoolExecutor instances via gc.get_objects() and calls shutdown(wait=False, cancel_futures=True)

Testing