fix: race condition in transform cache under parallel workers (#23)#26
Conversation
Three coordinated changes that close the ENOENT/partial-read races
reported when running the suite under high parallelism:
- Atomic writes via tmp+rename in a new src/cache.ts module. POSIX
rename on the same filesystem is atomic, so concurrent readers
either see ENOENT or the full new contents — never partial bytes.
Tmp suffix is pid + 4 random bytes to keep two workers writing the
same key from clobbering each other's tmp file.
- Hash the bundled setup.ts bytes into the cache key. Editing setup.ts
(e.g. adding a new mock) rebuilds dist/ → bundle bytes change → hash
changes → fresh cache dir. This removes the need for the destructive
end-of-setup wipe.
- Delete the end-of-setup unlink loop. With content-hashed keys it is
no longer load-bearing, and removing it eliminates the worst race:
every worker's setup.ts running fs.unlinkSync against files other
workers are mid-read or mid-write.
Also fixes a pre-existing pluginVersion fallback bug: require('./package.json')
resolves relative to setup.ts (src/ in dev, dist/ when installed) where no
package.json exists, so the fallback constant was always used. Now reads
package.json one directory up from setup.ts via fileURLToPath, which works
in both src/ and dist/ layouts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #25 was merged without a changeset, so the fix would not appear in any release. Adding a retroactive changeset so it ships in the same patch release as the cache race-condition fix in this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b80e9c1 to
dee332f
Compare
The .changeset/feature-flag-defaults.md was added in dee332f to ensure the #24 feature-flag fix shipped, but PR #28 carried that changeset to main on its own and 0.1.4 has now been published. Removing the file here so the next release (driven by this PR's cache-race changeset) contains only the cache-race entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verification: cache works across plugin tests AND example-appQ: does CI run example-app tests? Test breakdown (run on this branch, post-merge with main @
|
| Suite | Path | Files | Tests |
|---|---|---|---|
| Plugin specs | test/__tests__/ |
54 | 565 |
| Example-app component tests | apps/example-app/__tests__/ |
6 | 86 |
| Example-app parity tests (Vitest side) | apps/example-app/__parity-tests__/ |
4 | 164 |
Total pnpm test:run |
(matches CI) | 64 | 815 |
The pnpm test:jest step that follows in CI is the Jest-side baseline used by the parity tests — it exercises the same example-app components against vanilla Jest, so any cache-layer regression in our Vitest setup would show up as a parity diff there.
Cache mechanism evidence
Cold-cache run from a wiped /<tmpdir>/vrn:
BEFORE: ls .../vrn → No such file or directory
Test Files 64 passed (64)
Tests 815 passed (815)
Duration 2.39s
AFTER:
.../vrn/0.76.9_0.1.4_df94bf580d17/
Cache files: 75
Stale .tmp files: 0
Sample cache contents (proves example-app's RN imports are being transformed via the new atomic-write path, not just the plugin specs):
node_modules_.pnpm_@react-native+polyfills@2.0.0_node_modules_@react-native_polyfills_Object.es8.js
node_modules_.pnpm_react-native@0.83.1_..._react-native_index.js
node_modules_.pnpm_react-native@0.83.1_..._Libraries_Animated_Animated.js
node_modules_.pnpm_react-native@0.83.1_..._Libraries_AppState_AppState.js
... (75 total, 356K)
The cache key 0.76.9_0.1.4_df94bf580d17 reflects all three components of the new key:
0.76.9=react-nativeversion (resolved viarequire('react-native/package.json'))0.1.4= plugin version (resolved viapath.join(dirname(import.meta.url), '..', 'package.json')— the drive-by fallback fix; before this PR the value was always the hardcoded0.2.0)df94bf580d17= first 12 hex chars ofsha1(setup.ts bytes)— flips on any setup.ts edit, replacing the destructive end-of-setup wipe loop
Stress: 10× back-to-back at --pool=threads --maxWorkers=8 from cold cache
run 1: Tests 815 passed (815)
run 2: Tests 815 passed (815)
run 3: Tests 815 passed (815)
run 4: Tests 815 passed (815)
run 5: Tests 815 passed (815)
run 6: Tests 815 passed (815)
run 7: Tests 815 passed (815)
run 8: Tests 815 passed (815)
run 9: Tests 815 passed (815)
run 10: Tests 815 passed (815)
---
Final cache file count: 75
Stale .tmp files after 10 stress runs: 0
10/10 clean runs. No leftover .tmp files (would indicate a renameSync failure or an interrupted writer); no flakes. Across the 10 runs there were 8 workers each calling processReactNative for 75 modules — ~6,000 cache reads/writes per run, ~60,000 total — all consistent.
Direct unit coverage
The new test/__tests__/Cache.spec.ts covers the primitives in isolation:
readFromCachereturnsnullfor missing files (no thrown ENOENT)readFromCacherethrows non-ENOENT errors (e.g. EISDIR)writeToCache+readFromCacheround-trips exact byteswriteToCacheleaves no.tmpfile behind on success- 32 concurrent writes to distinct paths all succeed
- 16 concurrent writes to the same path always produce one valid full payload (the race we are protecting against — the file is never partial or interleaved)
Summary
Closes #23. Fixes intermittent
ENOENTerrors and partial cache reads when running Vitest in parallel against larger codebases.How the bug is fixed
The races (before this PR)
setup.tsran in every Vitest worker. Each worker did three things that collided when run in parallel:fs.writeFileSync(cachePath, code)— non-atomic. A worker could observe a half-written file:existsSyncreturnstrue,readFileSyncreturns truncated bytes, and the consumer evaluates broken JS → ENOENT-adjacent runtime errors.fs.existsSync(cachePath)thenfs.readFileSync(cachePath)— TOCTOU. Between the two syscalls, another worker couldunlinkthe file (see Consider switching from esbuild to Rolldown #3), turning the read into a thrown ENOENT.setup.ts,unlinked every file in the cache directory. This was a sledgehammer to handle the case where editingsetup.ts(adding a new mock) didn't bumppluginVersion, leaving a stale cache. With N workers running in parallel it was also nuking files that other workers were mid-read or mid-write.The fixes
Atomic writes via tmp+rename.
writeToCacheno longer writes the final path directly. It writes to a temporary path, then atomically renames it into place. POSIXrename(2)on the same filesystem is atomic — concurrent readers see either the previous file (or ENOENT) or the new full contents, never an in-flight write.Single-syscall reads.
readFromCacheis nowtry { fs.readFileSync(...) } catch (ENOENT) { return null }. The TOCTOU betweenexistsSyncandreadFileSyncis gone — there is no window between "is it there?" and "read it" because the read itself is the existence check.Content-hashed cache key. The cache directory is now
<rn>_<plugin>_<sha1(setup.ts)[:12]>. Editingsetup.tschanges the bundled bytes, changes the hash, and naturally causes a fresh cache directory to be used — no destructive sweep needed. The existing old-version reaper (which runs only against directories whose name doesn't match the current key) cleans up the dead ones safely.End-of-setup unlink loop deleted. With content-hashed keys it is no longer load-bearing, and removing it eliminates the worst race (every worker concurrently
unlinking files other workers depend on).Why no per-worker subdirs (
VITEST_POOL_ID)Considered. Not needed once writes are atomic and the key is content-aware. Two workers transforming the same module is wasted CPU, not corruption. The cost of per-worker dirs (~10MB × maxWorkers, no cross-worker cache reuse) outweighs the savings.
What the
.tmpfiles doThe
.tmpfiles are the staging area for atomic writes. The full sequence insidewriteToCache(cachePath, code):${cachePath}.${process.pid}.${crypto.randomBytes(4).toString('hex')}.tmp— e.g.…/Animated.js.48213.7a3f9c1e.tmp.fs.writeFileSync(tmp, code)— write the full payload to the tmp file. Any partial-write window happens here, but no other worker is looking at this path, so it doesn't matter.fs.renameSync(tmp, cachePath)— atomically replace the cache file. From any other worker's perspective,cachePathis either missing/old or fully present and complete — never half-written.unlinkSync(tmp)cleans up so we don't leak stale tmp files.Why pid + 4 random bytes in the suffix:
.tmpfilename. The 4 random bytes give every write attempt its own private staging file.cachePathconcurrently end up writing two different.tmpfiles (different pid, different random bytes) and then issuing tworenamesyscalls. Whoever'srenameruns second wins; both renames succeed. No corruption, just slightly wasted work.On success there are no
.tmpleftovers —renameconsumes the source path. The verification stress run (10× full suite atmaxWorkers=8) ends with zero.tmpfiles in the cache dir, which is the contract: any leftover.tmpwould indicate either arenamefailure or a process killed mid-write, both of which are now testable invariants.Performance & cache hit rate
A common worry when seeing tmp files and random suffixes: won't this slow tests down, and won't the randomness prevent cache hits? Both turn out to be no.
The cache hit path is unchanged
A cache hit is still exactly one
fs.readFileSync. The atomic-write machinery only runs on cache misses, which happen during cold start. Once a module is cached, every subsequent read in this worker, other workers, and future runs is a plain read against the same deterministic path.The random suffix is only on the
.tmpfilename, never on the final pathAfter
renamecompletes, the file lives atcachePath. Every worker computing the samecacheNamefor the same module gets the samecachePath, callsfs.readFileSync(cachePath), and hits the cache. The.tmpfilename only exists for the microseconds betweenwriteFileSyncandrenameSync, then is gone.The randomness is purely so two workers writing the same cache key simultaneously don't clobber each other's staging file. Their final destinations are identical; their tmps are different. Both renames succeed; whichever runs second wins; both workers got valid results. The "loser" paid for one redundant transform — that's the only cost, and it only happens during cold-start contention.
Cost per cache miss: one extra
rename(2)renameis a directory-entry update — no data copy, microseconds on any reasonable filesystem. With ~75 cache misses on a cold start, total added overhead is well under 10ms across the whole suite.Empirical evidence
--pool=threads --maxWorkers=8After 10 stress runs with maxWorkers=8: exactly 75 cache files in the cache directory — same as a single run. If the random suffix were preventing cache hits, the file count would balloon and warm-cache runs wouldn't be faster than cold ones. Neither happens.
Drive-by fix
Pre-existing
pluginVersionfallback bug.require('./package.json')was failing in bothsrc/(during dev) anddist/(after install) because nopackage.jsonlives in either dir, so the hardcoded'0.2.0'fallback was always used and the cache key was misleading. Now readspackage.jsonone directory up viafileURLToPath(import.meta.url), which works in both layouts. Verified: cache directory now shows0.76.9_0.1.4_<hash>instead of falling back to0.2.0.Test plan
pnpm typecheckcleanpnpm lintclean (no new warnings)pnpm buildproducesdist/setup.{js,cjs}withcache.tsbundled intest/__tests__/Cache.spec.ts(6 tests) covers: ENOENT→null, round-trip bytes, no.tmpleftovers on success, non-ENOENT errors rethrown, concurrent writes to distinct paths, concurrent writes to the same path--pool=threads --maxWorkers=8from a cold cache.tmpleftovers0.76.9_0.1.4_df94bf580d17)See the verification comment for the cold-cache and stress-run evidence covering both the plugin specs and the example-app suites.
🤖 Generated with Claude Code