fix(engine): recover from stalled in-progress turns #2283

Open
HUQIANTAO wants to merge 3 commits into Hmbown:main from HUQIANTAO:fix/turn-stall-watchdog
Conversation

@HUQIANTAO HUQIANTAO commented May 27, 2026

Summary

  • Fix watchdog blind spot in reconcile_turn_liveness() where a turn stuck in "in_progress" was never recovered, leaving is_loading permanently true
  • Add TURN_STALL_WATCHDOG_TIMEOUT (5 min) constant matched to stream idle timeout
  • Add Branch 3 that checks turn_started_at for staleness when no sub-agents are running

Problem

Users reported requests hanging indefinitely — the UI showed a spinner, new messages were queued ("edit last queued message"), and the app appeared frozen.

Root cause: reconcile_turn_liveness() had two branches that missed the "in_progress" state:

| Branch | Checks | Covers |
| --- | --- | --- |
| 1 | runtime_turn_status.is_none() | Before TurnStarted arrives |
| 2 | runtime_turn_status ∈ {completed, interrupted, failed} | After TurnComplete |
| 3 (new) | runtime_turn_status == "in_progress" + timeout | Turn started but never completed |

When TurnStarted arrived but TurnComplete was lost (sub-agent hang, engine panic, etc.), neither existing branch triggered. dispatch_started_at is cleared on TurnStarted (line 1387), so Branch 1's 30s timeout only covers the dispatch window, not the actual turn execution.
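
To make the gap concrete, here is a minimal sketch of the pre-fix decision logic. It is not the code in ui.rs: `old_watchdog_fires` and its parameters are hypothetical names, `is_loading` is assumed to be true, and Branch 2 is assumed to fire as soon as a terminal status is seen.

```rust
use std::time::Duration;

/// Branch 1's dispatch-window timeout (30 s, as stated in the description).
const DISPATCH_TIMEOUT: Duration = Duration::from_secs(30);

/// Hypothetical model of the two original watchdog branches.
/// Returns true when either branch would recover the busy state.
/// `dispatch_age` is the time since dispatch_started_at, or None if cleared.
fn old_watchdog_fires(status: Option<&str>, dispatch_age: Option<Duration>) -> bool {
    match status {
        // Branch 1: TurnStarted has not arrived; fire once the dispatch window expires.
        None => dispatch_age.map_or(false, |age| age > DISPATCH_TIMEOUT),
        // Branch 2: terminal status while the UI is still busy (assumed to fire at once).
        Some("completed" | "interrupted" | "failed") => true,
        // "in_progress" matches neither branch, so nothing ever fires.
        Some(_) => false,
    }
}
```

Because dispatch_started_at is cleared on TurnStarted, an "in_progress" turn reaches the last arm with no timestamp left to compare against, which is the blind spot this PR closes.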

Fix

Added a third recovery branch that fires when:

  • is_loading is true
  • runtime_turn_status is "in_progress"
  • No running sub-agents (they legitimately extend turn lifetime)
  • Not compacting
  • Time elapsed since turn_started_at exceeds TURN_STALL_WATCHDOG_TIMEOUT (300s)

Recovery clears is_loading, turn_started_at, runtime_turn_status, and dispatch_started_at, then shows an Error toast.
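
The guard and recovery described above can be sketched as follows. This is an illustration under stated assumptions rather than the actual diff: the `App` struct is a hypothetical subset of the real state, `is_compacting`, `running_sub_agents`, and `last_toast` are invented field names, and the toast text is made up.

```rust
use std::time::{Duration, Instant};

/// 5-minute stall timeout, matching the constant named in this PR.
const TURN_STALL_WATCHDOG_TIMEOUT: Duration = Duration::from_secs(300);

/// Hypothetical subset of the TUI app state (field names partly assumed).
#[derive(Debug, Default)]
struct App {
    is_loading: bool,
    is_compacting: bool,
    running_sub_agents: usize,
    runtime_turn_status: Option<String>,
    turn_started_at: Option<Instant>,
    dispatch_started_at: Option<Instant>,
    last_toast: Option<String>,
}

/// Sketch of Branch 3: returns true if the stalled turn was recovered.
fn recover_stalled_turn(app: &mut App, now: Instant) -> bool {
    let stalled = app.is_loading
        && app.runtime_turn_status.as_deref() == Some("in_progress")
        // Running sub-agents legitimately extend the turn lifetime.
        && app.running_sub_agents == 0
        && !app.is_compacting
        && app
            .turn_started_at
            .map_or(false, |t| now.duration_since(t) > TURN_STALL_WATCHDOG_TIMEOUT);
    if !stalled {
        return false;
    }
    // Clear the busy state so queued messages can be sent again.
    app.is_loading = false;
    app.turn_started_at = None;
    app.runtime_turn_status = None;
    app.dispatch_started_at = None;
    // Placeholder for the Error toast; the real message may differ.
    app.last_toast = Some("Turn stalled; recovered by watchdog".to_string());
    true
}
```

Passing `now` in as a parameter, instead of calling `Instant::now()` inside the function, is a choice made for this sketch so the timeout can be exercised without sleeping.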

Test plan

  • turn_liveness_leaves_active_turn_running — turn within timeout does NOT trigger recovery
  • turn_liveness_recovers_stalled_in_progress_turn — turn exceeding timeout DOES trigger recovery
  • Existing tests pass (turn_liveness_watchdog_clears_stale_dispatch, turn_liveness_reconciles_completed_busy_state)
  • No new clippy warnings in modified files

Greptile Summary

This PR fixes a watchdog blind spot in reconcile_turn_liveness() where a turn stuck in "in_progress" (engine panic, lost completion event) was never recovered, leaving is_loading permanently true and the UI frozen. All three issues flagged in the previous review round are addressed in this version.

  • Branch 3 recovery clears is_loading, turn_started_at, runtime_turn_status, runtime_turn_id, dispatch_started_at, and user_scrolled_during_stream, calls the same cell-finalisation helpers as apply_engine_error_to_app, and fires an Error toast after 300 s with no completion signal.
  • Tests cover both the happy path (within-timeout turn stays running) and the recovery path (turn past timeout triggers cleanup), with assertions on every field the branch modifies.

Confidence Score: 5/5

Safe to merge; the new watchdog branch is well-guarded and mirrors the existing error-recovery helpers correctly.

The change adds one new conditional branch with tight guards. All fields touched by the branch are the same ones cleared in the existing TurnComplete and apply_engine_error_to_app paths. The only open item is a broken rustdoc link, which has no runtime impact.

No files require special attention; both changed files are narrow in scope.

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/tui/src/tui/ui.rs | Adds TURN_STALL_WATCHDOG_TIMEOUT constant (300s) and Branch 3 in reconcile_turn_liveness() that recovers a stalled in-progress turn; cleanly clears streaming state, resets per-turn flags, and fires an Error toast. |
| crates/tui/src/tui/ui/tests.rs | Adds turn_liveness_recovers_stalled_in_progress_turn and updates turn_liveness_leaves_active_turn_running to assert Branch 3 does not fire within the timeout window. |

Reviews (3): Last reviewed commit: "fix(tui): reset user_scrolled_during_str..."

reconcile_turn_liveness() had a blind spot: when TurnStarted arrived
(setting runtime_turn_status to "in_progress") but TurnComplete never
came (sub-agent hang, engine panic, lost event), neither existing
watchdog branch fired. is_loading stayed true permanently, queuing
all subsequent messages.

Add Branch 3 with a 5-minute timeout (matched to stream idle timeout)
that checks turn_started_at for staleness when the turn is stuck in
"in_progress" with no running sub-agents.
@HUQIANTAO HUQIANTAO force-pushed the fix/turn-stall-watchdog branch from a3c73bf to 8ed924f Compare May 27, 2026 13:29
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a watchdog timeout (TURN_STALL_WATCHDOG_TIMEOUT set to 300 seconds) to detect and recover from stalled turns in the TUI, along with corresponding unit tests. Feedback on the changes points out that when a turn stalls and is recovered, active tool executions or streaming assistant messages are left in a running state, causing permanent spinners in the UI. It is recommended to finalize the active cell and streaming assistant as interrupted and reset the streaming state during recovery.

Comment thread crates/tui/src/tui/ui.rs
Comment thread crates/tui/src/tui/ui.rs
Comment thread crates/tui/src/tui/ui.rs
The watchdog Branch 3 recovery left in-flight tool cells and streaming
assistant messages in a running state, causing permanent spinners in the
transcript. Also left runtime_turn_id stale, showing "(in progress)"
for a turn that had already been recovered.

Align the cleanup with apply_engine_error_to_app: finalize thinking,
streaming assistant, and active cells as interrupted; reset streaming
state; clear runtime_turn_id and streaming indices.
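
As a rough illustration of that cleanup, the sketch below marks every still-running cell as interrupted and clears the per-turn identifiers. The `CellState` and `Transcript` types and the `finalize_as_interrupted` helper are hypothetical; the real code reuses the helpers called by apply_engine_error_to_app, whose signatures are not shown in this conversation.

```rust
/// Hypothetical display state of a transcript cell (tool call, thinking, assistant text).
#[derive(Debug, Clone, Copy, PartialEq)]
enum CellState {
    Running,
    Done,
    Interrupted,
}

/// Hypothetical subset of transcript state touched during stall recovery.
struct Transcript {
    cells: Vec<CellState>,
    streaming_index: Option<usize>,
    runtime_turn_id: Option<String>,
}

/// Marks all running cells as interrupted and clears per-turn identifiers.
/// Returns how many cells were changed.
fn finalize_as_interrupted(t: &mut Transcript) -> usize {
    let mut changed = 0;
    for cell in t.cells.iter_mut() {
        if *cell == CellState::Running {
            // A running cell would otherwise keep its spinner forever.
            *cell = CellState::Interrupted;
            changed += 1;
        }
    }
    // No stream is active after recovery, so drop the streaming index.
    t.streaming_index = None;
    // Clearing the turn id avoids showing "(in progress)" for a recovered turn.
    t.runtime_turn_id = None;
    changed
}
```

Cells that already finished are left untouched, so only the in-flight work from the stalled turn is relabelled.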
Comment thread crates/tui/src/tui/ui.rs
Without this, the turn immediately after a stall recovery would inherit
the scroll-lock from the stalled turn and silently skip auto-scroll,
leaving the user staring at stale content.