fix(core): add a per-iteration watchdog to bound the agent loop#521
Conversation
13fa67d to
1576a07
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds a per-iteration watchdog to WebAgent’s main loop to guarantee that a single iteration (snapshot + action generation/tooling) cannot hang indefinitely, addressing the stalled-stream and unbounded-iteration issues described in #515 and #516.
Changes:
- Introduces an iteration-level timeout (
iterationTimeoutMs, default 5 minutes) enforced viaPromise.raceand a watchdogAbortController. - Propagates the per-iteration abort signal to the main action
streamTextcall and validationgenerateTextWithRetry, and surfaces timeouts asTaskErrorCode.ITERATION_TIMEOUT. - Adds Vitest coverage for a hung LLM stream and for verifying the watchdog-derived abort signal is passed to
streamText.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/core/src/webAgent.ts | Adds per-iteration watchdog wrapper, new task error code, and uses per-iteration abort signal for action + validation LLM calls. |
| packages/core/src/errors.ts | Adds IterationTimeoutError used to signal watchdog timeouts. |
| packages/core/src/constants.ts | Adds DEFAULT_ITERATION_TIMEOUT_MS (5 minutes). |
| packages/core/test/webAgent.test.ts | Adds tests for watchdog timeout behavior and abort signal propagation to streamText. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let timer!: ReturnType<typeof setTimeout>; | ||
| const timeout = new Promise<never>((_, reject) => { | ||
| timer = setTimeout(() => { | ||
| watchdog.abort(); | ||
| reject( | ||
| new IterationTimeoutError( | ||
| `Iteration exceeded the ${this.iterationTimeoutMs}ms watchdog timeout`, | ||
| ), | ||
| ); | ||
| }, this.iterationTimeoutMs); | ||
| }); | ||
|
|
||
| try { | ||
| return await Promise.race([work(), timeout]); | ||
| } finally { | ||
| clearTimeout(timer); | ||
| this.currentIterationSignal = undefined; | ||
| } |
There was a problem hiding this comment.
Fixed in f6d5c39 — on the timeout path the finally now leaves the (already-aborted) currentIterationSignal in place (only resets it on the normal path), so any work() still running in the background cancels its remaining LLM calls instead of falling back to the un-aborted external signal.
| * | ||
| * Sets `this.currentIterationSignal` to the caller's abortSignal combined with a fresh | ||
| * watchdog controller, so the iteration's LLM calls abort when the watchdog fires. | ||
| * Races the work against the timeout — the race is the correctness guarantee: the |
There was a problem hiding this comment.
Good catch — corrected the docstring in f6d5c39 to scope it accurately: only the action streamText and validation read the watchdog signal; the extract tool uses the external signal captured at tool-creation, and the Promise.race bounds the iteration regardless. Threading the per-iteration signal into the tools is a separable follow-up.
A single agent iteration had no wall-clock bound: maxIterations caps the count, not time, and the streamText fullStream consumption could stall mid-stream even with the SDK timeout. Either could freeze the whole task indefinitely. Add a per-iteration watchdog in runMainLoop: runWithIterationWatchdog wraps the snapshot + action work in Promise.race against a timeout (default 300s, via the new iterationTimeoutMs option), and exposes an AbortController combined with the caller's signal as this.currentIterationSignal. The race guarantees the iteration is bounded even if an operation ignores the abort signal; the abort is best-effort cleanup of the in-flight LLM call. The action streamText and the validation call now use the per-iteration signal (planning, which runs before the loop, keeps the external signal). On timeout the iteration throws IterationTimeoutError, handled by a dedicated branch that ends the task with the new TaskErrorCode.ITERATION_TIMEOUT, distinct from the user-abort path. Closes #515 Closes #516 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1576a07 to
f6d5c39
Compare
Closes #515 and #516 (filed them as one PR — they overlap; a per-iteration watchdog is the natural mechanism to bound both the stalled stream and the iteration).
Problem
maxIterationscaps the count, not time, so one hung operation freezes the whole task forever.generateAndProcessActionpassestimeout/abortSignaltostreamText, but thefor await (fullStream)loop + trailingPromise.allcan stall mid-stream with no independent bound.Abort was previously external-only (
this.abortSignalfrom the caller); no internal watchdog.Fix — per-iteration watchdog
runWithIterationWatchdog(work)inrunMainLoop:AbortController+ a timer (iterationTimeoutMs), and setsthis.currentIterationSignal = AbortSignal.any([external, watchdog]).generateAndProcessAction) inPromise.race([work(), timeout]). The race is the correctness guarantee — the iteration is bounded even if an operation ignores the abort signal. The abort is best-effort cleanup of the in-flight LLM call.streamTextand the validationgenerateTextWithRetrynow usethis.currentIterationSignal ?? this.abortSignal(streamText fullStream iteration has no iteration-level timeout (can hang the agent) #515 — a stalled stream gets aborted). Planning (runs before the loop) and tool creation (once) keep the external signal.IterationTimeoutError→ dedicated catch branch → ends the task with the newTaskErrorCode.ITERATION_TIMEOUT, distinct from the user-abort path.Scope
iterationTimeoutMsis a WebAgent constructor option +DEFAULT_ITERATION_TIMEOUT_MS = 300000(5 min), mirroring fix(core): bound aria-tree frame.evaluate with a timeout to prevent indefinite agent freeze #511'sariaFrameEvaluateTimeoutMs. Not wired through the CLI/PiloConfig system this pass (separable follow-up). 5 min is generous: an iteration legitimately spans an LLM call (~120s) + tool execution (extract can be another ~120s call), so it only fires on a genuine hang.abortSignalinterrupt an in-flightfor await?" doesn't affect correctness here: thePromise.racebounds the iteration regardless; the abort is propagation/cleanup. A tighter stream inactivity timer remains a follow-up only if needed.Testing
New "iteration watchdog" tests (real timers +
iterationTimeoutMs: 50):executereturnssuccess:false,error.code === "ITERATION_TIMEOUT"(vs. hanging forever today).streamTextreceives a watchdog-derived abort signal that fires on timeout.Full core suite: 863 pass (861 baseline + 2). Typecheck +
prettier --checkclean. One live run vshttps://theverge.comconfirms normal operation still completes (watchdog does not fire at the 300s default).Context
Follow-up to #511 / #514. Last of the three hang-class reliability issues filed from #511.
🤖 Generated with Claude Code