Skip to content

engine: release job resources on exit-rule abandon (ACP adapter leak)#12

Open
rickgorman wants to merge 1 commit into
zackham:masterfrom
rickgorman:fix/acp-session-leak
Open

engine: release job resources on exit-rule abandon (ACP adapter leak)#12
rickgorman wants to merge 1 commit into
zackham:masterfrom
rickgorman:fix/acp-session-leak

Conversation

@rickgorman

Copy link
Copy Markdown

fix: release job resources when a job is abandoned via exit rule

Summary

Jobs that reach terminal state through an exit rule with action: abandon (evaluated on a completed step in _process_completion) never release their per-job resources. The job is marked FAILED and the handler returns without calling _cleanup_job_sessions, so release_for_job is never invoked on the registered ResourceManagers (the ACP process pool and, when containment is on, the VM pool).

The practical effect: every abandoned job leaks its full claude-agent-acp adapter chain — npm exec wrapper + node adapter + agent SDK CLI, ~500MB RSS combined — parented to the live server, for the remaining life of the server process. The startup orphan sweep (0.45.0/0.46.x) can't help because it only runs at boot, and shutdown()'s release_all only fires on graceful exit.

On a host running a flow whose triage step has a routine cancel → action: abandon exit rule, this accumulated 49 adapter chains (~5GB) over one evening and OOM-starved an 8GB machine (MemAvailable fell from ~5GB to 31MB; kill -9 of the adapters instantly recovered the memory).

Mechanism

src/stepwise/engine.py, _process_completion, the case "abandon": arm (~line 3443). Compare with every other terminal transition, which all drain resource managers via _cleanup_job_sessions:

Terminal path Releases resources?
Job completion (sync _tick_job + async _check_job_terminal) yes
Settled-but-failed (no_terminal_reached, both engines) yes
_halt_job (step failure) yes
abandon arm of failure-routing (_fail_run, ~line 3911) yes
cancel_job yes
abandon arm of completion exit rules (~line 3443) no — this bug

ACP processes are job-scoped in ResourceLifecycleManager (job_refs ref-counting). With release_for_job never called, the refs never drain, _kill_process never runs, and the adapter subprocess (spawned with start_new_session=True, stdin held open by the live server) stays alive indefinitely.

Empirical verification (0.46.4, live server)

  • Two jobs abandoned via a cancel exit rule on their completed triage step at 08:09 (job.failed {"reason": "abandoned", "step": "triage", "rule": "cancel"}). Their adapter chains (pids 13252, 13537, one per job — confirmed via executor_state.pid and per-job --allowedPaths) were still alive 40+ minutes later, parented to the running server, ~530MB RSS each.
  • Control: a job whose agent step completes normally has its adapter reaped at job completion (process table identical before/after).
  • The handshake-failure path (AcpError during session/newlifecycle.discard) also cleans up correctly.

Fix

One line: call self._cleanup_job_sessions(job.id, job) after emitting JOB_FAILED in the completion-path abandon arm, matching the failure-routing abandon arm.

Test

tests/test_resource_manager_lifecycle.py::TestJobTerminalRelease::test_abandoned_job_triggers_release — a workflow whose only step completes and hits an always/abandon exit rule must trigger release_for_job on registered resource managers. Fails on master (released_jobs == []), passes with the fix.

Full suite: 3113 passed, 3 skipped, 6 failed — all 6 failures also fail on pristine master in this environment (macOS dev box): 4× test_orphan_sweep.py and 1× test_streaming.py depend on Linux /proc / non-symlinked /tmp; 1× test_jobs_api_stress.py concurrent-sqlite-reads test is timing-sensitive. No new failures introduced.

Notes / adjacent observations (not addressed here)

  • When a job is abandoned while other steps of the same job are still in flight, those step runs are not cancelled and can be left in running state under a FAILED job (matches "stale running runs while the job is marked failed" seen in the field). cancel_job handles this for the cancel path; the abandon paths don't. Possibly worth a follow-up.
  • .stepwise/pids/acp-*.json files written by _spawn_process are only unlinked by _kill_process; adapters that die any other way (e.g. startup sweep, manual kill) leave stale pid files behind. Cosmetic — nothing reads them today.

The abandon arm of exit-rule resolution in _process_completion marked
the job FAILED and returned without calling _cleanup_job_sessions, so
the job's refs in every registered ResourceManager (ACP process pool,
containment VMs) were never dropped. Each job abandoned this way leaked
its claude-agent-acp adapter chain (~500MB RSS: npm exec + node adapter
+ agent SDK CLI) for the remaining life of the server process.

Every other terminal transition already drains resource managers:
job completion (sync + async), settled-but-failed, _halt_job, the
abandon arm of failure-routing, and cancel_job. This was the one
terminal path that didn't.

Observed in production: a flow whose triage step has a 'cancel' exit
rule (action: abandon) leaked one ~500MB adapter tree per abandoned
job; 49 accumulated adapters (~5GB) OOM-starved an 8GB host. Verified
live on 0.46.4: two jobs abandoned via this rule at 08:09 still had
their adapter chains alive 40+ minutes later, parented to the running
server, while completed and cancelled jobs' adapters were reaped
correctly.

Regression test: an always/abandon exit rule on a completed step must
trigger release_for_job on registered resource managers.
@rickgorman

Copy link
Copy Markdown
Author

@zackham psst found a memory leak^

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant