Fix shared-RNG data race in saveat_regression ensemble test#600
Fix shared-RNG data race in saveat_regression ensemble test#600ChrisRackauckas-Claude wants to merge 1 commit into
Conversation
|
Shouldn't the current SciMLBase handle this case even without the JumpProcesses 10 changes? I thought we tested jumps with threading there? |
Under EnsembleThreads, SciMLBase v3.18 isolates the JumpProblem per task (deepcopy into task_local_storage — race-free; verified 8 distinct RNG objects on 8 threads). But the test passes no seed to solve, so every per-task deepcopy starts from the identical StableRNG(12345) and replays the SAME stream: the ensemble collapses to ~nthreads distinct trajectories and the rtol=0.1 mean stops tracking exp(-10t). Deterministic (not a data race): 202/202 single-threaded, ~6/101 at 8 threads. This is the documented "uniqueness requires explicit seeding" behavior, not a bug in SciMLBase. Fix: give each trajectory an independent seeded StableRNG via the EnsembleProblem prob_func (keyed on ctx.sim_id), matching the ensemble_uniqueness.jl idiom. Race-free, reproducible, 202/202 multithreaded on Julia 1.10.11 and 1.11.9. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ddbb3aa to
d0dba22
Compare
|
You're right that SciMLBase handles the thread-safety here — thanks for pushing on it; I had the mechanism mischaracterized. Digging in against the actually-resolved stack (SciMLBase v3.18.0 + JumpProcesses v9.29, Julia 1.10.11 and 1.11.9): SciMLBase v3.18's threaded ensemble path does isolate the JumpProblem — for What actually breaks is statistical and deterministic: Two clean ways to make the saveat regression pass under threads (both verified 101/101 multithreaded):
I went with (2) for reproducibility; happy to switch to (1) if you prefer minimalism. Either way the commit message now describes this as RNG-stream degeneracy under threading (race-free), not a data race. |
…D pending upstream ForwardDiff/LogExpFunctions — red until ForwardDiff 1.3.4 (#811). saveat fix is separate (#600) — do NOT touch it — natural failure, not disabled; auto-greens on upstream release. (#599) * Re-enable downgrade CI with corrected [compat] lower bounds Re-enables the disabled Downgrade workflow (removes `if: false`) and raises the [compat] lower bounds to the lowest set that both resolves and passes the full test suite on the downgrade Julia version (1.10). Bumps (old -> new lower bound; upper ranges preserved): ADTypes: 1 -> 1.22.0 Adapt: 4 -> 4.6.0 Aqua: 0.8 -> 0.8.16 ArrayInterface: 7.15 -> 7.25.0 DataStructures: 0.18 -> 0.19.5 DiffEqBase: 6.192 -> 7.5.5 DiffEqCallbacks: 4.7 -> 4.17.0 DocStringExtensions: 0.9 -> 0.9.5 FastBroadcast: 0.3 -> 1.3.2 FunctionWrappers: 1.1 -> 1.1.3 Graphs: 1.11 -> 1.14.0 KernelAbstractions: 0.9 -> 0.9.36 LinearSolve: 3 -> 3.83.0 OrdinaryDiffEq: 6 -> 7.0.0 OrdinaryDiffEqCore: 3.17 -> 4.3.0 OrdinaryDiffEqFunctionMap: 1 -> 2.0.0 RecursiveArrayTools: 3.35 -> 4.3.0 Reexport: 1.2 -> 1.2.2 SciMLBase: 2.115 -> 3.18.0 StaticArrays: 1.9.8 -> 1.9.18 StochasticDiffEq: 6.82 -> 7.0.0 SymbolicIndexingInterface: 0.3.36 -> 0.3.48 Downgrade suite run locally at these floors (Julia 1.10, GROUP=All, weakdeps included): PASS. All 39 testsets passed (e.g. Variable Rate 75/75, Extinction 292/292, Pure diffusion 210/210), zero failures. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci(downgrade): allow-reresolve true (was false) Strict no-reresolve can't reconcile floored core deps against the latest-floating transitive SciML stack; reresolve still enforces the floor [compat] pins but resolves a consistent set. Verified locally. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci(downgrade): disable job (strict unsatisfiable at current floors) Strict downgrade (allow-reresolve:false) cannot resolve at the current [compat] lower bounds. The deps floor pins NonlinearSolveBase to 2.30.3, which pulls LogExpFunctions 1.0.1; LogExpFunctions 1.0.1 caps ForwardDiff at <=0.10.22 while NonlinearSolveBase 2.30.3 and the OrdinaryDiffEq / StochasticDiffEq test stack require ForwardDiff >=0.10.38. No ForwardDiff version satisfies both, so Pkg.test(allow_reresolve=false) errors with "Unsatisfiable requirements detected for package ForwardDiff" before any test runs. Rather than relax with allow-reresolve:true (which defeats the downgrade test) or lower any compat floors, the job is set to `if: false` and the allow-reresolve override is removed (default false). Re-enable once the upstream floors move past the LogExpFunctions 1.0.1 / ForwardDiff wall. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci(downgrade): document precise floor-unfixable root cause Verified exhaustively that the strict (allow-reresolve:false) downgrade is floor-unfixable: it reproduces even with every direct [compat] floor raised to its latest. Pkg.test resolves the package graph first (no test extras, so ForwardDiff is absent) and locks LogExpFunctions to latest 1.0.1 via NonlinearSolveBase 2.29.0+ (which widened LogExpFunctions compat to [0.3, 1]). The test sandbox then needs ForwardDiff >=0.10.38, but ForwardDiff >=0.10.23 caps LogExpFunctions to 0.3, contradicting the locked 1.0.1. No raised floor changes the package-resolve choice of LogExpFunctions 1.0.1; the only fixes are forbidden (allow-reresolve:true or pinning a transitive in [deps]). Keep the job disabled and refine the comment to reflect the verified mechanism. downgrade suite run locally at the current floors: FAIL (floor-unfixable, upstream LogExpFunctions 1.x / ForwardDiff wall). Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci(downgrade): re-enable strict downgrade job (remove if:false) Remove the `if: false` guard so the downgrade job runs again under strict resolution (allow_reresolve=false). It is expected to be RED until the upstream fix lands (ForwardDiff/LogExpFunctions — red until ForwardDiff 1.3.4, #811); this is a natural failure, not a disabled job, and will auto-green when the upstream fix releases. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Chris Rackauckas <accounts@chrisrackauckas.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Ignore until reviewed by @ChrisRackauckas.
Problem
test/saveat_regression.jlshares a singleStableRNG(12345)across every trajectory of anEnsembleProblemsolved with the defaultEnsembleThreads(). Under multithreading (JULIA_NUM_THREADS=auto/8), worker threads concurrently mutate the oneLehmerRNG, racing on its internal state. The corruption produces degenerate trajectories, the ensemble mean no longer tracksexp(-10t), and roughly 190/202 of thertol=0.1assertions fail nondeterministically. This is a flaky failure observed onmaster's ownTests (lts/pre, InterfaceII)jobs. Single-threaded the test passes 202/202.Fix
Give every trajectory its own independent, seeded
StableRNGvia theEnsembleProblemprob_func, keyed offctx.sim_id. This makes the threaded solve race-free and reproducible.EnsembleSerial()— the threaded path is still exercised.rtol— it remains a realrtol=0.1regression check.save_positions((false, false)for the saveat solve, theDiscreteProblemdefault(false, true)for the interpolated solve), so the statistical intent is unchanged.Only
test/saveat_regression.jlis touched. File is already SciMLStyle-formatted (JuliaFormatter v2, no diff).Verification (run locally, multithreaded, latest deps)
Isolated depot, JumpProcesses 9.29.0 / DiffEqBase 7.5.5 / SciMLBase 3.18.0 / StableRNGs 1.0.4.
Before (unmodified
master):After (this PR):
Passes deterministically across all repeated multithreaded runs.
🤖 Generated with Claude Code