Skip to content

fix(core): add a per-iteration watchdog to bound the agent loop#521

Merged
lmorchard merged 1 commit into
mainfrom
fix/agent-iteration-watchdog
Jun 9, 2026
Merged

fix(core): add a per-iteration watchdog to bound the agent loop#521
lmorchard merged 1 commit into
mainfrom
fix/agent-iteration-watchdog

Conversation

@lmorchard

Copy link
Copy Markdown
Collaborator

Closes #515 and #516 (filed them as one PR — they overlap; a per-iteration watchdog is the natural mechanism to bound both the stalled stream and the iteration).

Problem

Abort was previously external-only (this.abortSignal from the caller); no internal watchdog.

Fix — per-iteration watchdog

runWithIterationWatchdog(work) in runMainLoop:

  • Creates an AbortController + a timer (iterationTimeoutMs), and sets this.currentIterationSignal = AbortSignal.any([external, watchdog]).
  • Wraps the iteration body (snapshot + generateAndProcessAction) in Promise.race([work(), timeout]). The race is the correctness guarantee — the iteration is bounded even if an operation ignores the abort signal. The abort is best-effort cleanup of the in-flight LLM call.
  • The action streamText and the validation generateTextWithRetry now use this.currentIterationSignal ?? this.abortSignal (streamText fullStream iteration has no iteration-level timeout (can hang the agent) #515 — a stalled stream gets aborted). Planning (runs before the loop) and tool creation (once) keep the external signal.
  • On timeout: IterationTimeoutError → dedicated catch branch → ends the task with the new TaskErrorCode.ITERATION_TIMEOUT, distinct from the user-abort path.

Scope

  • iterationTimeoutMs is a WebAgent constructor option + DEFAULT_ITERATION_TIMEOUT_MS = 300000 (5 min), mirroring fix(core): bound aria-tree frame.evaluate with a timeout to prevent indefinite agent freeze #511's ariaFrameEvaluateTimeoutMs. Not wired through the CLI/PiloConfig system this pass (separable follow-up). 5 min is generous: an iteration legitimately spans an LLM call (~120s) + tool execution (extract can be another ~120s call), so it only fires on a genuine hang.
  • Per-iteration only and watchdog-only for the stream (no per-chunk inactivity timer) — both decided up front. The open question "does abortSignal interrupt an in-flight for await?" doesn't affect correctness here: the Promise.race bounds the iteration regardless; the abort is propagation/cleanup. A tighter stream inactivity timer remains a follow-up only if needed.

Testing

New "iteration watchdog" tests (real timers + iterationTimeoutMs: 50):

  • A hung LLM stream → execute returns success:false, error.code === "ITERATION_TIMEOUT" (vs. hanging forever today).
  • streamText receives a watchdog-derived abort signal that fires on timeout.

Full core suite: 863 pass (861 baseline + 2). Typecheck + prettier --check clean. One live run vs https://theverge.com confirms normal operation still completes (watchdog does not fire at the 300s default).

Context

Follow-up to #511 / #514. Last of the three hang-class reliability issues filed from #511.

🤖 Generated with Claude Code

@lmorchard lmorchard requested a review from Copilot June 4, 2026 22:32
@lmorchard lmorchard force-pushed the fix/agent-iteration-watchdog branch from 13fa67d to 1576a07 Compare June 4, 2026 22:36

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a per-iteration watchdog to WebAgent’s main loop to guarantee that a single iteration (snapshot + action generation/tooling) cannot hang indefinitely, addressing the stalled-stream and unbounded-iteration issues described in #515 and #516.

Changes:

  • Introduces an iteration-level timeout (iterationTimeoutMs, default 5 minutes) enforced via Promise.race and a watchdog AbortController.
  • Propagates the per-iteration abort signal to the main action streamText call and validation generateTextWithRetry, and surfaces timeouts as TaskErrorCode.ITERATION_TIMEOUT.
  • Adds Vitest coverage for a hung LLM stream and for verifying the watchdog-derived abort signal is passed to streamText.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

File Description
packages/core/src/webAgent.ts Adds per-iteration watchdog wrapper, new task error code, and uses per-iteration abort signal for action + validation LLM calls.
packages/core/src/errors.ts Adds IterationTimeoutError used to signal watchdog timeouts.
packages/core/src/constants.ts Adds DEFAULT_ITERATION_TIMEOUT_MS (5 minutes).
packages/core/test/webAgent.test.ts Adds tests for watchdog timeout behavior and abort signal propagation to streamText.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +480 to +497
let timer!: ReturnType<typeof setTimeout>;
const timeout = new Promise<never>((_, reject) => {
timer = setTimeout(() => {
watchdog.abort();
reject(
new IterationTimeoutError(
`Iteration exceeded the ${this.iterationTimeoutMs}ms watchdog timeout`,
),
);
}, this.iterationTimeoutMs);
});

try {
return await Promise.race([work(), timeout]);
} finally {
clearTimeout(timer);
this.currentIterationSignal = undefined;
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f6d5c39 — on the timeout path the finally now leaves the (already-aborted) currentIterationSignal in place (only resets it on the normal path), so any work() still running in the background cancels its remaining LLM calls instead of falling back to the un-aborted external signal.

Comment on lines +466 to +469
*
* Sets `this.currentIterationSignal` to the caller's abortSignal combined with a fresh
* watchdog controller, so the iteration's LLM calls abort when the watchdog fires.
* Races the work against the timeout — the race is the correctness guarantee: the

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — corrected the docstring in f6d5c39 to scope it accurately: only the action streamText and validation read the watchdog signal; the extract tool uses the external signal captured at tool-creation, and the Promise.race bounds the iteration regardless. Threading the per-iteration signal into the tools is a separable follow-up.

A single agent iteration had no wall-clock bound: maxIterations caps the count,
not time, and the streamText fullStream consumption could stall mid-stream even
with the SDK timeout. Either could freeze the whole task indefinitely.

Add a per-iteration watchdog in runMainLoop: runWithIterationWatchdog wraps the
snapshot + action work in Promise.race against a timeout (default 300s, via the
new iterationTimeoutMs option), and exposes an AbortController combined with the
caller's signal as this.currentIterationSignal. The race guarantees the
iteration is bounded even if an operation ignores the abort signal; the abort is
best-effort cleanup of the in-flight LLM call. The action streamText and the
validation call now use the per-iteration signal (planning, which runs before
the loop, keeps the external signal).

On timeout the iteration throws IterationTimeoutError, handled by a dedicated
branch that ends the task with the new TaskErrorCode.ITERATION_TIMEOUT, distinct
from the user-abort path.

Closes #515
Closes #516

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lmorchard lmorchard force-pushed the fix/agent-iteration-watchdog branch from 1576a07 to f6d5c39 Compare June 4, 2026 22:39
@lmorchard lmorchard requested a review from Copilot June 4, 2026 22:39

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@lmorchard lmorchard merged commit 511c655 into main Jun 9, 2026
14 checks passed
@lmorchard lmorchard deleted the fix/agent-iteration-watchdog branch June 10, 2026 00:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

streamText fullStream iteration has no iteration-level timeout (can hang the agent)

2 participants