engine: release job resources on exit-rule abandon (ACP adapter leak)#12
Open
rickgorman wants to merge 1 commit into
Open
engine: release job resources on exit-rule abandon (ACP adapter leak)#12rickgorman wants to merge 1 commit into
rickgorman wants to merge 1 commit into
Conversation
The abandon arm of exit-rule resolution in _process_completion marked the job FAILED and returned without calling _cleanup_job_sessions, so the job's refs in every registered ResourceManager (ACP process pool, containment VMs) were never dropped. Each job abandoned this way leaked its claude-agent-acp adapter chain (~500MB RSS: npm exec + node adapter + agent SDK CLI) for the remaining life of the server process. Every other terminal transition already drains resource managers: job completion (sync + async), settled-but-failed, _halt_job, the abandon arm of failure-routing, and cancel_job. This was the one terminal path that didn't. Observed in production: a flow whose triage step has a 'cancel' exit rule (action: abandon) leaked one ~500MB adapter tree per abandoned job; 49 accumulated adapters (~5GB) OOM-starved an 8GB host. Verified live on 0.46.4: two jobs abandoned via this rule at 08:09 still had their adapter chains alive 40+ minutes later, parented to the running server, while completed and cancelled jobs' adapters were reaped correctly. Regression test: an always/abandon exit rule on a completed step must trigger release_for_job on registered resource managers.
Author
|
@zackham psst found a memory leak^ |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
fix: release job resources when a job is abandoned via exit rule
Summary
Jobs that reach terminal state through an exit rule with
action: abandon(evaluated on a completed step in_process_completion) never release their per-job resources. The job is marked FAILED and the handler returns without calling_cleanup_job_sessions, sorelease_for_jobis never invoked on the registered ResourceManagers (the ACP process pool and, when containment is on, the VM pool).The practical effect: every abandoned job leaks its full
claude-agent-acpadapter chain —npm execwrapper + node adapter + agent SDK CLI, ~500MB RSS combined — parented to the live server, for the remaining life of the server process. The startup orphan sweep (0.45.0/0.46.x) can't help because it only runs at boot, andshutdown()'srelease_allonly fires on graceful exit.On a host running a flow whose triage step has a routine
cancel → action: abandonexit rule, this accumulated 49 adapter chains (~5GB) over one evening and OOM-starved an 8GB machine (MemAvailable fell from ~5GB to 31MB;kill -9of the adapters instantly recovered the memory).Mechanism
src/stepwise/engine.py,_process_completion, thecase "abandon":arm (~line 3443). Compare with every other terminal transition, which all drain resource managers via_cleanup_job_sessions:_tick_job+ async_check_job_terminal)no_terminal_reached, both engines)_halt_job(step failure)abandonarm of failure-routing (_fail_run, ~line 3911)cancel_jobabandonarm of completion exit rules (~line 3443)ACP processes are job-scoped in
ResourceLifecycleManager(job_refsref-counting). Withrelease_for_jobnever called, the refs never drain,_kill_processnever runs, and the adapter subprocess (spawned withstart_new_session=True, stdin held open by the live server) stays alive indefinitely.Empirical verification (0.46.4, live server)
cancelexit rule on their completedtriagestep at 08:09 (job.failed {"reason": "abandoned", "step": "triage", "rule": "cancel"}). Their adapter chains (pids 13252, 13537, one per job — confirmed viaexecutor_state.pidand per-job--allowedPaths) were still alive 40+ minutes later, parented to the running server, ~530MB RSS each.AcpErrorduringsession/new→lifecycle.discard) also cleans up correctly.Fix
One line: call
self._cleanup_job_sessions(job.id, job)after emittingJOB_FAILEDin the completion-path abandon arm, matching the failure-routing abandon arm.Test
tests/test_resource_manager_lifecycle.py::TestJobTerminalRelease::test_abandoned_job_triggers_release— a workflow whose only step completes and hits analways/abandonexit rule must triggerrelease_for_jobon registered resource managers. Fails on master (released_jobs == []), passes with the fix.Full suite: 3113 passed, 3 skipped, 6 failed — all 6 failures also fail on pristine master in this environment (macOS dev box): 4×
test_orphan_sweep.pyand 1×test_streaming.pydepend on Linux/proc/ non-symlinked/tmp; 1×test_jobs_api_stress.pyconcurrent-sqlite-reads test is timing-sensitive. No new failures introduced.Notes / adjacent observations (not addressed here)
runningstate under a FAILED job (matches "stale running runs while the job is marked failed" seen in the field).cancel_jobhandles this for the cancel path; the abandon paths don't. Possibly worth a follow-up..stepwise/pids/acp-*.jsonfiles written by_spawn_processare only unlinked by_kill_process; adapters that die any other way (e.g. startup sweep, manual kill) leave stale pid files behind. Cosmetic — nothing reads them today.