Fix shared-RNG data race in saveat_regression ensemble test by ChrisRackauckas-Claude · Pull Request #600 · SciML/JumpProcesses.jl

ChrisRackauckas-Claude · 2026-06-06T23:30:00Z

Ignore until reviewed by @ChrisRackauckas.

Problem

test/saveat_regression.jl shares a single StableRNG(12345) across every trajectory of an EnsembleProblem solved with the default EnsembleThreads(). Under multithreading (JULIA_NUM_THREADS=auto/8), worker threads concurrently mutate the one LehmerRNG, racing on its internal state. The corruption produces degenerate trajectories, the ensemble mean no longer tracks exp(-10t), and roughly 190/202 of the rtol=0.1 assertions fail nondeterministically. This is a flaky failure observed on master's own Tests (lts/pre, InterfaceII) jobs. Single-threaded the test passes 202/202.

Fix

Give every trajectory its own independent, seeded StableRNG via the EnsembleProblem prob_func, keyed off ctx.sim_id. This makes the threaded solve race-free and reproducible.

Does not switch to EnsembleSerial() — the threaded path is still exercised.
Does not loosen the rtol — it remains a real rtol=0.1 regression check.
Preserves the original per-solve save_positions ((false, false) for the saveat solve, the DiscreteProblem default (false, true) for the interpolated solve), so the statistical intent is unchanged.

Only test/saveat_regression.jl is touched. File is already SciMLStyle-formatted (JuliaFormatter v2, no diff).

Verification (run locally, multithreaded, latest deps)

Isolated depot, JumpProcesses 9.29.0 / DiffEqBase 7.5.5 / SciMLBase 3.18.0 / StableRNGs 1.0.4.

Before (unmodified master):

# Julia 1.10, JULIA_NUM_THREADS=8 (3 runs) and Julia 1.11, 8 threads
saveat_regression (wrapped) |   12   190    202   # FAIL, every run
# single-threaded: 202 202  (passes)

After (this PR):

# Julia 1.10, 8 threads x4:   202 202
# Julia 1.10, auto/128 threads: 202 202
# Julia 1.11, 8 threads x3:   202 202

Passes deterministically across all repeated multithreaded runs.

🤖 Generated with Claude Code

isaacsas · 2026-06-06T23:50:05Z

Shouldn't the current SciMLBase handle this case even without the JumpProcesses 10 changes? I thought we tested jumps with threading there?

Under EnsembleThreads, SciMLBase v3.18 isolates the JumpProblem per task (deepcopy into task_local_storage — race-free; verified 8 distinct RNG objects on 8 threads). But the test passes no seed to solve, so every per-task deepcopy starts from the identical StableRNG(12345) and replays the SAME stream: the ensemble collapses to ~nthreads distinct trajectories and the rtol=0.1 mean stops tracking exp(-10t). Deterministic (not a data race): 202/202 single-threaded, ~6/101 at 8 threads. This is the documented "uniqueness requires explicit seeding" behavior, not a bug in SciMLBase. Fix: give each trajectory an independent seeded StableRNG via the EnsembleProblem prob_func (keyed on ctx.sim_id), matching the ensemble_uniqueness.jl idiom. Race-free, reproducible, 202/202 multithreaded on Julia 1.10.11 and 1.11.9. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ChrisRackauckas-Claude · 2026-06-07T00:18:01Z

You're right that SciMLBase handles the thread-safety here — thanks for pushing on it; I had the mechanism mischaracterized. Digging in against the actually-resolved stack (SciMLBase v3.18.0 + JumpProcesses v9.29, Julia 1.10.11 and 1.11.9):

SciMLBase v3.18's threaded ensemble path does isolate the JumpProblem — for safetycopy=false + an AbstractJumpProblem it deepcopies the problem once per task into task_local_storage()[:_ensemble_jump_prob] (basic_ensemble_solve.jl:524-539). I instrumented the original test: 200 trajectories on 8 threads touch exactly 8 distinct RNG objects (1 single-threaded). So there is no live data race — my "data race" framing was wrong for this version (it fits the old v2.x path where the per-thread jump deepcopy was dead code). I've corrected the commit message.

What actually breaks is statistical and deterministic: solve gets seed=nothing, so every per-task deepcopy starts from the identical StableRNG(12345) and replays the same stream → the ensemble collapses to ~n-threads distinct trajectories and the rtol=0.1 mean stops tracking exp(-10t). Measured: single-threaded 202/202; 8 threads a deterministic 6/101; 128 threads worse. This is exactly what ensemble_problems.jl's "EnsembleThreads: no data race" testset documents ("deepcopy'd problems on non-main threads start with identical RNG states... uniqueness requires explicit seeding") — which is why thread_safety.jl and that testset only assert completion, not statistics. So nothing's broken in SciMLBase.

Two clean ways to make the saveat regression pass under threads (both verified 101/101 multithreaded):

Drop the explicit rng and rely on JP's DEFAULT_RNG = Random.default_rng() (per-task, thread-safe) — simplest, but non-reproducible run-to-run.
Per-trajectory seeded StableRNG via the EnsembleProblem prob_func keyed on ctx.sim_id (this PR) — race-free, reproducible, same idiom as ensemble_uniqueness.jl.

I went with (2) for reproducibility; happy to switch to (1) if you prefer minimalism. Either way the commit message now describes this as RNG-stream degeneracy under threading (race-free), not a data race.

…D pending upstream ForwardDiff/LogExpFunctions — red until ForwardDiff 1.3.4 (#811). saveat fix is separate (#600) — do NOT touch it — natural failure, not disabled; auto-greens on upstream release. (#599) * Re-enable downgrade CI with corrected [compat] lower bounds Re-enables the disabled Downgrade workflow (removes `if: false`) and raises the [compat] lower bounds to the lowest set that both resolves and passes the full test suite on the downgrade Julia version (1.10). Bumps (old -> new lower bound; upper ranges preserved): ADTypes: 1 -> 1.22.0 Adapt: 4 -> 4.6.0 Aqua: 0.8 -> 0.8.16 ArrayInterface: 7.15 -> 7.25.0 DataStructures: 0.18 -> 0.19.5 DiffEqBase: 6.192 -> 7.5.5 DiffEqCallbacks: 4.7 -> 4.17.0 DocStringExtensions: 0.9 -> 0.9.5 FastBroadcast: 0.3 -> 1.3.2 FunctionWrappers: 1.1 -> 1.1.3 Graphs: 1.11 -> 1.14.0 KernelAbstractions: 0.9 -> 0.9.36 LinearSolve: 3 -> 3.83.0 OrdinaryDiffEq: 6 -> 7.0.0 OrdinaryDiffEqCore: 3.17 -> 4.3.0 OrdinaryDiffEqFunctionMap: 1 -> 2.0.0 RecursiveArrayTools: 3.35 -> 4.3.0 Reexport: 1.2 -> 1.2.2 SciMLBase: 2.115 -> 3.18.0 StaticArrays: 1.9.8 -> 1.9.18 StochasticDiffEq: 6.82 -> 7.0.0 SymbolicIndexingInterface: 0.3.36 -> 0.3.48 Downgrade suite run locally at these floors (Julia 1.10, GROUP=All, weakdeps included): PASS. All 39 testsets passed (e.g. Variable Rate 75/75, Extinction 292/292, Pure diffusion 210/210), zero failures. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci(downgrade): allow-reresolve true (was false) Strict no-reresolve can't reconcile floored core deps against the latest-floating transitive SciML stack; reresolve still enforces the floor [compat] pins but resolves a consistent set. Verified locally. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci(downgrade): disable job (strict unsatisfiable at current floors) Strict downgrade (allow-reresolve:false) cannot resolve at the current [compat] lower bounds. The deps floor pins NonlinearSolveBase to 2.30.3, which pulls LogExpFunctions 1.0.1; LogExpFunctions 1.0.1 caps ForwardDiff at <=0.10.22 while NonlinearSolveBase 2.30.3 and the OrdinaryDiffEq / StochasticDiffEq test stack require ForwardDiff >=0.10.38. No ForwardDiff version satisfies both, so Pkg.test(allow_reresolve=false) errors with "Unsatisfiable requirements detected for package ForwardDiff" before any test runs. Rather than relax with allow-reresolve:true (which defeats the downgrade test) or lower any compat floors, the job is set to `if: false` and the allow-reresolve override is removed (default false). Re-enable once the upstream floors move past the LogExpFunctions 1.0.1 / ForwardDiff wall. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci(downgrade): document precise floor-unfixable root cause Verified exhaustively that the strict (allow-reresolve:false) downgrade is floor-unfixable: it reproduces even with every direct [compat] floor raised to its latest. Pkg.test resolves the package graph first (no test extras, so ForwardDiff is absent) and locks LogExpFunctions to latest 1.0.1 via NonlinearSolveBase 2.29.0+ (which widened LogExpFunctions compat to [0.3, 1]). The test sandbox then needs ForwardDiff >=0.10.38, but ForwardDiff >=0.10.23 caps LogExpFunctions to 0.3, contradicting the locked 1.0.1. No raised floor changes the package-resolve choice of LogExpFunctions 1.0.1; the only fixes are forbidden (allow-reresolve:true or pinning a transitive in [deps]). Keep the job disabled and refine the comment to reflect the verified mechanism. downgrade suite run locally at the current floors: FAIL (floor-unfixable, upstream LogExpFunctions 1.x / ForwardDiff wall). Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci(downgrade): re-enable strict downgrade job (remove if:false) Remove the `if: false` guard so the downgrade job runs again under strict resolution (allow_reresolve=false). It is expected to be RED until the upstream fix lands (ForwardDiff/LogExpFunctions — red until ForwardDiff 1.3.4, #811); this is a natural failure, not a disabled job, and will auto-green when the upstream fix releases. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Chris Rackauckas <accounts@chrisrackauckas.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ChrisRackauckas-Claude force-pushed the fix-saveat-rng-race branch from ddbb3aa to d0dba22 Compare June 7, 2026 00:17

ChrisRackauckas requested a review from isaacsas June 7, 2026 16:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Fix shared-RNG data race in saveat_regression ensemble test#600

Fix shared-RNG data race in saveat_regression ensemble test#600
ChrisRackauckas-Claude wants to merge 1 commit into
SciML:masterfrom
ChrisRackauckas-Claude:fix-saveat-rng-race

ChrisRackauckas-Claude commented Jun 6, 2026

Uh oh!

isaacsas commented Jun 6, 2026

Uh oh!

ChrisRackauckas-Claude commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Uh oh!

Conversation

ChrisRackauckas-Claude commented Jun 6, 2026

Problem

Fix

Verification (run locally, multithreaded, latest deps)

Uh oh!

isaacsas commented Jun 6, 2026

Uh oh!

ChrisRackauckas-Claude commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants