Skip to content

fix: race condition in transform cache under parallel workers (#23)#26

Merged
srsholmes merged 4 commits into
mainfrom
fix/cache-race-condition
May 1, 2026
Merged

fix: race condition in transform cache under parallel workers (#23)#26
srsholmes merged 4 commits into
mainfrom
fix/cache-race-condition

Conversation

@srsholmes

@srsholmes srsholmes commented May 1, 2026

Copy link
Copy Markdown
Owner

Summary

Closes #23. Fixes intermittent ENOENT errors and partial cache reads when running Vitest in parallel against larger codebases.

How the bug is fixed

The races (before this PR)

setup.ts ran in every Vitest worker. Each worker did three things that collided when run in parallel:

  1. fs.writeFileSync(cachePath, code) — non-atomic. A worker could observe a half-written file: existsSync returns true, readFileSync returns truncated bytes, and the consumer evaluates broken JS → ENOENT-adjacent runtime errors.
  2. fs.existsSync(cachePath) then fs.readFileSync(cachePath) — TOCTOU. Between the two syscalls, another worker could unlink the file (see Consider switching from esbuild to Rolldown #3), turning the read into a thrown ENOENT.
  3. End-of-setup unlink loop — every worker, at the end of its setup.ts, unlinked every file in the cache directory. This was a sledgehammer to handle the case where editing setup.ts (adding a new mock) didn't bump pluginVersion, leaving a stale cache. With N workers running in parallel it was also nuking files that other workers were mid-read or mid-write.

The fixes

  • Atomic writes via tmp+rename. writeToCache no longer writes the final path directly. It writes to a temporary path, then atomically renames it into place. POSIX rename(2) on the same filesystem is atomic — concurrent readers see either the previous file (or ENOENT) or the new full contents, never an in-flight write.

  • Single-syscall reads. readFromCache is now try { fs.readFileSync(...) } catch (ENOENT) { return null }. The TOCTOU between existsSync and readFileSync is gone — there is no window between "is it there?" and "read it" because the read itself is the existence check.

  • Content-hashed cache key. The cache directory is now <rn>_<plugin>_<sha1(setup.ts)[:12]>. Editing setup.ts changes the bundled bytes, changes the hash, and naturally causes a fresh cache directory to be used — no destructive sweep needed. The existing old-version reaper (which runs only against directories whose name doesn't match the current key) cleans up the dead ones safely.

  • End-of-setup unlink loop deleted. With content-hashed keys it is no longer load-bearing, and removing it eliminates the worst race (every worker concurrently unlinking files other workers depend on).

Why no per-worker subdirs (VITEST_POOL_ID)

Considered. Not needed once writes are atomic and the key is content-aware. Two workers transforming the same module is wasted CPU, not corruption. The cost of per-worker dirs (~10MB × maxWorkers, no cross-worker cache reuse) outweighs the savings.

What the .tmp files do

The .tmp files are the staging area for atomic writes. The full sequence inside writeToCache(cachePath, code):

  1. Compute a unique tmp path: ${cachePath}.${process.pid}.${crypto.randomBytes(4).toString('hex')}.tmp — e.g. …/Animated.js.48213.7a3f9c1e.tmp.
  2. fs.writeFileSync(tmp, code) — write the full payload to the tmp file. Any partial-write window happens here, but no other worker is looking at this path, so it doesn't matter.
  3. fs.renameSync(tmp, cachePath) — atomically replace the cache file. From any other worker's perspective, cachePath is either missing/old or fully present and complete — never half-written.
  4. If rename throws, unlinkSync(tmp) cleans up so we don't leak stale tmp files.

Why pid + 4 random bytes in the suffix:

  • The pid alone isn't enough — within a single worker process, sequential or concurrent writes to the same cache key (e.g. retries) would collide on the same .tmp filename. The 4 random bytes give every write attempt its own private staging file.
  • Two workers writing the same cachePath concurrently end up writing two different .tmp files (different pid, different random bytes) and then issuing two rename syscalls. Whoever's rename runs second wins; both renames succeed. No corruption, just slightly wasted work.

On success there are no .tmp leftoversrename consumes the source path. The verification stress run (10× full suite at maxWorkers=8) ends with zero .tmp files in the cache dir, which is the contract: any leftover .tmp would indicate either a rename failure or a process killed mid-write, both of which are now testable invariants.

Performance & cache hit rate

A common worry when seeing tmp files and random suffixes: won't this slow tests down, and won't the randomness prevent cache hits? Both turn out to be no.

The cache hit path is unchanged

A cache hit is still exactly one fs.readFileSync. The atomic-write machinery only runs on cache misses, which happen during cold start. Once a module is cached, every subsequent read in this worker, other workers, and future runs is a plain read against the same deterministic path.

The random suffix is only on the .tmp filename, never on the final path

const cachePath = path.join(cacheDir, cacheName);             // deterministic, same across workers
const tmp = `${cachePath}.${pid}.${randomBytes(4).hex}.tmp`;  // random — only for staging
fs.writeFileSync(tmp, code);
fs.renameSync(tmp, cachePath);                                 // file ends up at the deterministic path

After rename completes, the file lives at cachePath. Every worker computing the same cacheName for the same module gets the same cachePath, calls fs.readFileSync(cachePath), and hits the cache. The .tmp filename only exists for the microseconds between writeFileSync and renameSync, then is gone.

The randomness is purely so two workers writing the same cache key simultaneously don't clobber each other's staging file. Their final destinations are identical; their tmps are different. Both renames succeed; whichever runs second wins; both workers got valid results. The "loser" paid for one redundant transform — that's the only cost, and it only happens during cold-start contention.

Cost per cache miss: one extra rename(2)

rename is a directory-entry update — no data copy, microseconds on any reasonable filesystem. With ~75 cache misses on a cold start, total added overhead is well under 10ms across the whole suite.

Empirical evidence

Cold cache Warm cache
815 tests, default pool 2.39s ~1.6–1.8s
815 tests, 10× at --pool=threads --maxWorkers=8 2.4s ± noise

After 10 stress runs with maxWorkers=8: exactly 75 cache files in the cache directory — same as a single run. If the random suffix were preventing cache hits, the file count would balloon and warm-cache runs wouldn't be faster than cold ones. Neither happens.

Drive-by fix

Pre-existing pluginVersion fallback bug. require('./package.json') was failing in both src/ (during dev) and dist/ (after install) because no package.json lives in either dir, so the hardcoded '0.2.0' fallback was always used and the cache key was misleading. Now reads package.json one directory up via fileURLToPath(import.meta.url), which works in both layouts. Verified: cache directory now shows 0.76.9_0.1.4_<hash> instead of falling back to 0.2.0.

Test plan

  • pnpm typecheck clean
  • pnpm lint clean (no new warnings)
  • pnpm build produces dist/setup.{js,cjs} with cache.ts bundled in
  • New test/__tests__/Cache.spec.ts (6 tests) covers: ENOENT→null, round-trip bytes, no .tmp leftovers on success, non-ENOENT errors rethrown, concurrent writes to distinct paths, concurrent writes to the same path
  • Full 815-test suite (plugin specs + example-app + parity) runs clean 10/10 times under --pool=threads --maxWorkers=8 from a cold cache
  • After 10 stress runs, 75 cache files present, zero .tmp leftovers
  • Cache dir name now reflects the real plugin version (0.76.9_0.1.4_df94bf580d17)

See the verification comment for the cold-cache and stress-run evidence covering both the plugin specs and the example-app suites.

🤖 Generated with Claude Code

srsholmes and others added 2 commits May 1, 2026 10:20
Three coordinated changes that close the ENOENT/partial-read races
reported when running the suite under high parallelism:

- Atomic writes via tmp+rename in a new src/cache.ts module. POSIX
  rename on the same filesystem is atomic, so concurrent readers
  either see ENOENT or the full new contents — never partial bytes.
  Tmp suffix is pid + 4 random bytes to keep two workers writing the
  same key from clobbering each other's tmp file.

- Hash the bundled setup.ts bytes into the cache key. Editing setup.ts
  (e.g. adding a new mock) rebuilds dist/ → bundle bytes change → hash
  changes → fresh cache dir. This removes the need for the destructive
  end-of-setup wipe.

- Delete the end-of-setup unlink loop. With content-hashed keys it is
  no longer load-bearing, and removing it eliminates the worst race:
  every worker's setup.ts running fs.unlinkSync against files other
  workers are mid-read or mid-write.

Also fixes a pre-existing pluginVersion fallback bug: require('./package.json')
resolves relative to setup.ts (src/ in dev, dist/ when installed) where no
package.json exists, so the fallback constant was always used. Now reads
package.json one directory up from setup.ts via fileURLToPath, which works
in both src/ and dist/ layouts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #25 was merged without a changeset, so the fix would not appear in any
release. Adding a retroactive changeset so it ships in the same patch
release as the cache race-condition fix in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
srsholmes and others added 2 commits May 1, 2026 10:41
The .changeset/feature-flag-defaults.md was added in dee332f to ensure
the #24 feature-flag fix shipped, but PR #28 carried that changeset to
main on its own and 0.1.4 has now been published. Removing the file
here so the next release (driven by this PR's cache-race changeset)
contains only the cache-race entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@srsholmes

Copy link
Copy Markdown
Owner Author

Verification: cache works across plugin tests AND example-app

Q: does CI run example-app tests?
Yes. The root vitest.config.ts include pattern covers all three suites, so a single pnpm test:run (the command CI runs in .github/workflows/ci.yml) executes everything in one Vitest invocation, exercising the cache layer for every transformed RN module that any of these tests touches.

Test breakdown (run on this branch, post-merge with main @ 0.1.4)

Suite Path Files Tests
Plugin specs test/__tests__/ 54 565
Example-app component tests apps/example-app/__tests__/ 6 86
Example-app parity tests (Vitest side) apps/example-app/__parity-tests__/ 4 164
Total pnpm test:run (matches CI) 64 815

The pnpm test:jest step that follows in CI is the Jest-side baseline used by the parity tests — it exercises the same example-app components against vanilla Jest, so any cache-layer regression in our Vitest setup would show up as a parity diff there.

Cache mechanism evidence

Cold-cache run from a wiped /<tmpdir>/vrn:

BEFORE: ls .../vrn → No such file or directory
 Test Files  64 passed (64)
      Tests  815 passed (815)
   Duration  2.39s
AFTER:
.../vrn/0.76.9_0.1.4_df94bf580d17/
Cache files: 75
Stale .tmp files: 0

Sample cache contents (proves example-app's RN imports are being transformed via the new atomic-write path, not just the plugin specs):

node_modules_.pnpm_@react-native+polyfills@2.0.0_node_modules_@react-native_polyfills_Object.es8.js
node_modules_.pnpm_react-native@0.83.1_..._react-native_index.js
node_modules_.pnpm_react-native@0.83.1_..._Libraries_Animated_Animated.js
node_modules_.pnpm_react-native@0.83.1_..._Libraries_AppState_AppState.js
... (75 total, 356K)

The cache key 0.76.9_0.1.4_df94bf580d17 reflects all three components of the new key:

  • 0.76.9 = react-native version (resolved via require('react-native/package.json'))
  • 0.1.4 = plugin version (resolved via path.join(dirname(import.meta.url), '..', 'package.json') — the drive-by fallback fix; before this PR the value was always the hardcoded 0.2.0)
  • df94bf580d17 = first 12 hex chars of sha1(setup.ts bytes) — flips on any setup.ts edit, replacing the destructive end-of-setup wipe loop

Stress: 10× back-to-back at --pool=threads --maxWorkers=8 from cold cache

run  1: Tests  815 passed (815)
run  2: Tests  815 passed (815)
run  3: Tests  815 passed (815)
run  4: Tests  815 passed (815)
run  5: Tests  815 passed (815)
run  6: Tests  815 passed (815)
run  7: Tests  815 passed (815)
run  8: Tests  815 passed (815)
run  9: Tests  815 passed (815)
run 10: Tests  815 passed (815)
---
Final cache file count: 75
Stale .tmp files after 10 stress runs: 0

10/10 clean runs. No leftover .tmp files (would indicate a renameSync failure or an interrupted writer); no flakes. Across the 10 runs there were 8 workers each calling processReactNative for 75 modules — ~6,000 cache reads/writes per run, ~60,000 total — all consistent.

Direct unit coverage

The new test/__tests__/Cache.spec.ts covers the primitives in isolation:

  • readFromCache returns null for missing files (no thrown ENOENT)
  • readFromCache rethrows non-ENOENT errors (e.g. EISDIR)
  • writeToCache + readFromCache round-trips exact bytes
  • writeToCache leaves no .tmp file behind on success
  • 32 concurrent writes to distinct paths all succeed
  • 16 concurrent writes to the same path always produce one valid full payload (the race we are protecting against — the file is never partial or interleaved)

@srsholmes srsholmes merged commit 42acb7f into main May 1, 2026
2 checks passed
@srsholmes srsholmes deleted the fix/cache-race-condition branch May 1, 2026 09:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Race condition in cache read/write leading to intermittent ENOENT and partial reads under parallel Vitest execution

1 participant