engine: release job resources on exit-rule abandon (ACP adapter leak) by rickgorman · Pull Request #12 · zackham/stepwise

rickgorman · 2026-06-10T08:53:45Z

fix: release job resources when a job is abandoned via exit rule

Summary

Jobs that reach terminal state through an exit rule with action: abandon (evaluated on a completed step in _process_completion) never release their per-job resources. The job is marked FAILED and the handler returns without calling _cleanup_job_sessions, so release_for_job is never invoked on the registered ResourceManagers (the ACP process pool and, when containment is on, the VM pool).

The practical effect: every abandoned job leaks its full claude-agent-acp adapter chain — npm exec wrapper + node adapter + agent SDK CLI, ~500MB RSS combined — parented to the live server, for the remaining life of the server process. The startup orphan sweep (0.45.0/0.46.x) can't help because it only runs at boot, and shutdown()'s release_all only fires on graceful exit.

On a host running a flow whose triage step has a routine cancel → action: abandon exit rule, this accumulated 49 adapter chains (~5GB) over one evening and OOM-starved an 8GB machine (MemAvailable fell from ~5GB to 31MB; kill -9 of the adapters instantly recovered the memory).

Mechanism

src/stepwise/engine.py, _process_completion, the case "abandon": arm (~line 3443). Compare with every other terminal transition, which all drain resource managers via _cleanup_job_sessions:

Terminal path	Releases resources?
Job completion (sync `_tick_job` + async `_check_job_terminal`)	yes
Settled-but-failed (`no_terminal_reached`, both engines)	yes
`_halt_job` (step failure)	yes
`abandon` arm of failure-routing (`_fail_run`, ~line 3911)	yes
`cancel_job`	yes
`abandon` arm of completion exit rules (~line 3443)	no — this bug

ACP processes are job-scoped in ResourceLifecycleManager (job_refs ref-counting). With release_for_job never called, the refs never drain, _kill_process never runs, and the adapter subprocess (spawned with start_new_session=True, stdin held open by the live server) stays alive indefinitely.

Empirical verification (0.46.4, live server)

Two jobs abandoned via a cancel exit rule on their completed triage step at 08:09 (job.failed {"reason": "abandoned", "step": "triage", "rule": "cancel"}). Their adapter chains (pids 13252, 13537, one per job — confirmed via executor_state.pid and per-job --allowedPaths) were still alive 40+ minutes later, parented to the running server, ~530MB RSS each.
Control: a job whose agent step completes normally has its adapter reaped at job completion (process table identical before/after).
The handshake-failure path (AcpError during session/new → lifecycle.discard) also cleans up correctly.

Fix

One line: call self._cleanup_job_sessions(job.id, job) after emitting JOB_FAILED in the completion-path abandon arm, matching the failure-routing abandon arm.

Test

tests/test_resource_manager_lifecycle.py::TestJobTerminalRelease::test_abandoned_job_triggers_release — a workflow whose only step completes and hits an always/abandon exit rule must trigger release_for_job on registered resource managers. Fails on master (released_jobs == []), passes with the fix.

Full suite: 3113 passed, 3 skipped, 6 failed — all 6 failures also fail on pristine master in this environment (macOS dev box): 4× test_orphan_sweep.py and 1× test_streaming.py depend on Linux /proc / non-symlinked /tmp; 1× test_jobs_api_stress.py concurrent-sqlite-reads test is timing-sensitive. No new failures introduced.

Notes / adjacent observations (not addressed here)

When a job is abandoned while other steps of the same job are still in flight, those step runs are not cancelled and can be left in running state under a FAILED job (matches "stale running runs while the job is marked failed" seen in the field). cancel_job handles this for the cancel path; the abandon paths don't. Possibly worth a follow-up.
.stepwise/pids/acp-*.json files written by _spawn_process are only unlinked by _kill_process; adapters that die any other way (e.g. startup sweep, manual kill) leave stale pid files behind. Cosmetic — nothing reads them today.

The abandon arm of exit-rule resolution in _process_completion marked the job FAILED and returned without calling _cleanup_job_sessions, so the job's refs in every registered ResourceManager (ACP process pool, containment VMs) were never dropped. Each job abandoned this way leaked its claude-agent-acp adapter chain (~500MB RSS: npm exec + node adapter + agent SDK CLI) for the remaining life of the server process. Every other terminal transition already drains resource managers: job completion (sync + async), settled-but-failed, _halt_job, the abandon arm of failure-routing, and cancel_job. This was the one terminal path that didn't. Observed in production: a flow whose triage step has a 'cancel' exit rule (action: abandon) leaked one ~500MB adapter tree per abandoned job; 49 accumulated adapters (~5GB) OOM-starved an 8GB host. Verified live on 0.46.4: two jobs abandoned via this rule at 08:09 still had their adapter chains alive 40+ minutes later, parented to the running server, while completed and cancelled jobs' adapters were reaped correctly. Regression test: an always/abandon exit rule on a completed step must trigger release_for_job on registered resource managers.

rickgorman · 2026-06-10T08:55:07Z

@zackham psst found a memory leak^

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

engine: release job resources on exit-rule abandon (ACP adapter leak)#12

engine: release job resources on exit-rule abandon (ACP adapter leak)#12
rickgorman wants to merge 1 commit into
zackham:masterfrom
rickgorman:fix/acp-session-leak

rickgorman commented Jun 10, 2026

Uh oh!

rickgorman commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rickgorman commented Jun 10, 2026

fix: release job resources when a job is abandoned via exit rule

Summary

Mechanism

Empirical verification (0.46.4, live server)

Fix

Test

Notes / adjacent observations (not addressed here)

Uh oh!

rickgorman commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant