Skip to content

fix(core): bound streamText fullStream iteration with an inactivity timeout (#515)#518

Closed
MohammadYusif wants to merge 2 commits into
mozilla:mainfrom
MohammadYusif:fix/issue-515
Closed

fix(core): bound streamText fullStream iteration with an inactivity timeout (#515)#518
MohammadYusif wants to merge 2 commits into
mozilla:mainfrom
MohammadYusif:fix/issue-515

Conversation

@MohammadYusif

Copy link
Copy Markdown

Closes #515.

Problem

WebAgent.generateAndProcessAction passes timeout/abortSignal to streamText, but those only bound the initial request. The for await (...streamResult.fullStream) that consumes the stream is not independently bounded. A provider that stalls mid-stream (partial data then silence, or a half-open TCP connection that never sends FIN) can leave the iteration hanging forever and freeze the agent with no error and no progress.

Same indefinite-hang class as the aria-tree frame.evaluate fix in #511.

Fix

Wrap the stream consumption in iterateWithInactivityTimeout, which races each next() against the existing llmProviderTimeoutMs as an inactivity window and rejects with a descriptive error if no chunk arrives in time. The rejection flows into the existing try/catch and surfaces like any generation failure; the iterator's return() is invoked on exit to release stream resources. Reuses the configured provider timeout rather than adding a new knob — mirrors the request-level bound.

Testing

  • New: healthy stream passes through, mid-iteration stall rejects (hangs without the fix), return() is called on timeout — 3 tests in packages/core/test/webAgent.test.ts
  • Core suite: 863/864 pass (1 pre-existing Windows path-separator failure in config.test.ts, confirmed on clean checkout, unrelated)
  • Prettier clean

PR Type

  • Bug Fix

Checklist

  • I understand the code I am submitting
  • I have tested this code locally
  • New and existing tests pass locally (pnpm test)
  • I have added tests that prove my fix works
  • I have read and followed the contribution guidelines

AI Usage

  • AI was used for drafting/refactoring

MohammadYusif and others added 2 commits June 4, 2026 22:18
…imeout (mozilla#515)

The `timeout`/`abortSignal` passed to `streamText` in
`WebAgent.generateAndProcessAction` only bound the initial request. The
`for await (...streamResult.fullStream)` that consumes the stream was not
independently bounded, so a provider that stalled mid-stream (partial data
then silence, or a half-open TCP connection that never sends a FIN) could
leave the iteration pending forever and freeze the agent with no error and
no progress.

Wrap the stream consumption in `iterateWithInactivityTimeout`, which races
each `next()` against the existing `llmProviderTimeoutMs` as an inactivity
window and rejects with a descriptive error if no chunk arrives in time. The
rejection flows into the existing try/catch and is surfaced like any other
generation failure; the underlying iterator's `return()` is invoked on exit
so the stream's resources are released. Reuses the configured provider
timeout rather than adding a new knob, mirroring the request-level bound.

Same indefinite-hang class as the aria-tree frame.evaluate fix in mozilla#511.
@lmorchard

Copy link
Copy Markdown
Collaborator

Thank you for this, @MohammadYusif — and apologies for the slow turnaround. 🙏

You actually filed this before #521, and you correctly diagnosed the same unbounded-stream hang. I'm going to close this one as superseded rather than rejected — and I want to explain the why, because your approach is genuinely the tighter of the two.

#521 ended up bounding the whole iteration with a watchdog (Promise.race around the iteration body), which also let it close #516 in the same pass, so #515 is now resolved on main. The watchdog is the correctness guarantee: the iteration is bounded even if an in-flight operation ignores the abort signal.

Your iterateWithInactivityTimeout is the per-chunk inactivity timer that #521 deliberately left as a possible follow-up. The two actually nest cleanly — yours would sit inside the watchdog-wrapped iteration:

So merging on top wouldn't break anything. The reason I'm holding off is that the marginal benefit is small relative to the cost:

  • Faster detection is the only real gain — a dead stream surfaces ~1 inactivity-window after the last chunk instead of waiting up to the 300s watchdog. Real, but modest now that the bound is already guaranteed.
  • Error taxonomy. fix(core): add a per-iteration watchdog to bound the agent loop #521 added a dedicated TaskErrorCode.ITERATION_TIMEOUT. A stream stall caught by the inactivity timer would instead surface as a generic generation failure, bypassing that code — so we'd lose a little observability clarity.
  • Knob reuse. llm_provider_timeout_ms (120s) was tuned as a whole-request bound (~2× the p99 of the full generate span), not as a per-chunk inactivity window. It's loose enough that false positives are unlikely, but it's a slightly off-label reuse that could drift if that timeout is ever retuned for an unrelated reason.

If we find in practice that the 300s watchdog is too coarse for real mid-stream stalls, your next()-racing pattern (including the return() cleanup) is exactly the design we'd reach for — likely with the stall throwing into #521's dedicated timeout branch so it keeps the specific error code.

So: close-as-superseded, not a rejection — just unlucky timing on parallel work. I really appreciate the clear writeup, the tests, and the careful reasoning, and I hope to see more PRs from you. 🙂

@lmorchard lmorchard closed this Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

streamText fullStream iteration has no iteration-level timeout (can hang the agent)

2 participants