fix(skstore): purge DeletedDir tombstones at end of update transaction#1245
fix(skstore): purge DeletedDir tombstones at end of update transaction#1245zaccharyfavere wants to merge 2 commits into
Conversation
|
Hi, thanks for your contribution! |
gen2 CircleCI machines enforce virtual memory limits at the cgroup level. The Skip runtime's default SKIP_CAPACITY of 16 GB (palloc.c) fails to mmap on any class with <16 GB RAM, producing "ERROR (MAP FAILED): Cannot allocate memory" early in the job. Observed on skipruntime/large.gen2 in PR #1245 and on an earlier skdb-wasm/large.gen2 run that died at 61 s. gen1 tolerated the same 16 GB virtual mmap because only RSS was enforced and the mmap is lazy. Set SKIP_CAPACITY explicitly per gen2 job, leaving ~25% RAM headroom for OS and tooling: check-ts medium.gen2 (4 GB) -> 3G skdb large.gen2 (8 GB) -> 6G skdb-wasm large.gen2 (8 GB) -> 6G skipruntime large.gen2 (8 GB)* -> 6G compiler xlarge.gen2 (16 GB) -> 12G * postgres + kafka sidecars get their own RAM allocation Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary Gen2 CircleCI machines enforce virtual memory limits at the cgroup level. The Skip runtime's default `SKIP_CAPACITY` of 16 GB (`skiplang/prelude/runtime/palloc.c:449-451`) fails to `mmap` on any class with <16 GB RAM, producing `ERROR (MAP FAILED): Cannot allocate memory` early in the job. Observed on `skipruntime` / `large.gen2` in PR #1245 (job 16682) and on an earlier `skdb-wasm` / `large.gen2` run that died at 61 s (workflow `f78dd673`, 2026-05-27 09:37). Gen1 tolerated the same 16 GB virtual mmap because only RSS was enforced and the mmap is lazy — gen2 evidently does not. Set `SKIP_CAPACITY` explicitly per gen2 job, leaving ~25% RAM headroom for OS + tooling: | Job | Class | RAM | `SKIP_CAPACITY` | |---|---|---|---| | check-ts | medium.gen2 | 4 GB | `3G` | | skdb | large.gen2 | 8 GB | `6G` | | skdb-wasm | large.gen2 | 8 GB | `6G` | | skipruntime | large.gen2 (+ pg/kafka sidecars) | 8 GB primary | `6G` | | compiler | xlarge.gen2 | 16 GB | `12G` | Why also setting the ones that haven't (yet) failed: same `mmap` happens on every Skip-toolchain invocation, so any gen2 job using `skiplabs/skip*` images is one coin flip away from the same failure. `skipruntime` on `large.gen2` succeeded 3× in a row on the Phase 1 PR before failing on #1245 — the variance is real. `check-examples` is on `large.gen2` but uses `cimg/base` and only runs the Skip toolchain inside docker-compose containers (separate cgroups), so it's not affected. ## Test plan - [ ] Land this, then trigger a PR that exercises each gen2 workflow (touch `skiplang/prelude/src/foo` for compiler + skdb + skdb-wasm + skipruntime, and a ts workspace file for check-ts). - [ ] Confirm `skipruntime` no longer dies with `MAP FAILED` at `skargo build`. - [ ] Confirm `compiler` still passes — 12 G cap is well above measured peak (~10 GB). - [ ] Watch `skdb` and `skdb-wasm` over the next few PRs for any new failures specifically at allocation time; if they appear, drop the cap further. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
|
Hi, alright thanks for your return ! |
eb65f41 to
a04d30d
Compare
|
Sorry, still fixing the CI... |
|
No worries, I just re pushed because I realised there was a format error still in my code. That should be fixed now since it made one of the CI pass |
a04d30d to
1b2a3bb
Compare
|
I think the CI is back to normal, I've rebased your commit to check |
|
Concretely, a I understand the memory leak but a safe purge needs to be bounded by the slowest consumer of the change history, e.g. track the min tick across active synchronizers/subscribers and only drop tombstones below it, or do the compaction at a synchronization boundary that's provably past the last reader. |
1b2a3bb to
7671ddd
Compare
Previously, DeletedDir tombstones accumulated indefinitely in the context's directory map, causing memory growth proportional to the number of removed directories. Then, I naively tried purging at the end of updateWithStatus, and it broke Synchronizer-based propagation: exportToRoot/importNext rely on DeletedDir tombstones in the source context after updateWithStatus returns to propagate deletions to root. This commit purges at the start of the two entry points that obtain a mutable context for an update cycle (runWithGc and runWithResult), just after mclone. At that point, the tombstones present in the context are leftovers from the previous tick, and all downstream consumers from that previous tick have already had the chance to observe them. A dedicated set of marked-deleted dir names is maintained to avoid scanning the full directory map at each update.
7671ddd to
a8191ad
Compare
DeletedDir entries were created in Dirs.state by removeDir() to mark directories as deleted, but were never cleaned up. They accumulated indefinitely, causing a persistent memory leak proportional to the number of subdirectories created and destroyed over time.
The fix adds a purgeDeletedDirs() method on the Dirs class and calls it at the end of updateWithStatus(), right after the cascade finishes and fromContext is reset. At that point the tombstones have served their purpose and can safely be removed.
This is the natural cleanup point: it runs after each merge of changes into the context state, matching the lifecycle of the tombstones themselves.
cc @skiplabsdaniel