feat(orchestrator): add multi-signal auto-continue to prevent agent stopping mid-task#3357

Open
Mustaqeem66 wants to merge 6 commits into
tailcallhq:mainfrom
Mustaqeem66:fix/auto-continue-agent

Conversation

@Mustaqeem66

Summary

This fix addresses issue #2890 where the agent stops mid-task and requires manual 'continue' prompts. The solution uses a multi-signal confidence scoring system that analyzes multiple independent signals before deciding to auto-continue.

Problem

When using ForgeCode, the agent frequently stops mid-task and waits for user input instead of continuing automatically. Users report having to type "continue" 5-10 times for a single complex task.

Root causes:

  • Models (especially MiniMax) return finish_reason: "stop" instead of finish_reason: "tool_calls"
  • Models say "Let me continue" but don't actually make tool calls
  • Empty tool_calls arrays are treated as "no tools needed" instead of "protocol violation"

Solution

Implemented a multi-signal confidence scoring system inspired by production spam filters and fraud detection:

5 Independent Signals

| Signal | Score | Description |
|---|---|---|
| S1: finish_reason | 30 | Model indicated tool use but didn't provide tool_calls |
| S2: last_event | 25 | Last event was ToolResult, so the model should continue |
| S3: content_intent | 25 | Content contains "continue" phrases but NOT "complete" phrases |
| S4: no_summary | 10 | Content does NOT contain summary phrases |
| S5: tool_ratio | 10 | More than 50% of recent turns had tool calls |

Decision Rule

Confidence Score = S1 + S2 + S3 + S4 + S5

if score >= 60: AUTO-CONTINUE (high confidence)
elif score >= 40: LOG WARNING but don't auto-continue (medium confidence)
else: FINISH TURN (low confidence - task likely done)
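
The decision rule above can be sketched in Rust as follows. This is a minimal illustration, not the code in auto_continue.rs: the names `Signals`, `Decision`, and `decide` are hypothetical, and only the weights and the 60/40 thresholds come from this description.

```rust
/// Points contributed by each of the five signals.
/// Hypothetical struct: the field names mirror the signal table,
/// not the PR's actual types.
#[derive(Debug, Clone, Copy, Default)]
pub struct Signals {
    pub finish_reason: u32,  // S1: up to 30
    pub last_event: u32,     // S2: 0 or 25
    pub content_intent: u32, // S3: 0 or 25
    pub no_summary: u32,     // S4: 0 or 10
    pub tool_ratio: u32,     // S5: 0 or 10
}

/// The three outcomes of the decision rule.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    AutoContinue, // score >= 60
    WarnOnly,     // 40 <= score < 60: log a warning, do not continue
    FinishTurn,   // score < 40
}

impl Signals {
    /// Confidence Score = S1 + S2 + S3 + S4 + S5
    pub fn score(&self) -> u32 {
        self.finish_reason
            + self.last_event
            + self.content_intent
            + self.no_summary
            + self.tool_ratio
    }
}

/// Applies the thresholds from the decision rule to a set of signals.
pub fn decide(signals: &Signals) -> Decision {
    let score = signals.score();
    if score >= 60 {
        Decision::AutoContinue
    } else if score >= 40 {
        Decision::WarnOnly
    } else {
        Decision::FinishTurn
    }
}
```

Keeping the score as a plain sum makes each signal's contribution easy to log, which helps when tuning the thresholds per model.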

Example Scoring

| Scenario | S1 | S2 | S3 | S4 | S5 | Total | Result |
|---|---|---|---|---|---|---|---|
| Model says "continue" after tool result | 0 | 25 | 25 | 10 | 10 | 70 | ✅ Auto-continue |
| finish_reason=tool_calls but empty array | 30 | 25 | 0 | 10 | 10 | 75 | ✅ Auto-continue |
| No finish_reason after tool result | 15 | 25 | 25 | 10 | 10 | 85 | ✅ Auto-continue |
| "task is complete, summarize" | 0 | 25 | 0 | 0 | 10 | 35 | ❌ Finish turn |
| "please review my changes" | 0 | 0 | 0 | 0 | 10 | 10 | ❌ Finish turn |
| "continue" but no tool history | 0 | 0 | 25 | 10 | 0 | 35 | ❌ Finish turn |

Why Multi-Signal?

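No single signal can trigger auto-continue. The largest signal is worth 30 points and the two largest together reach only 55, so at least three signals must agree to cross the threshold of 60. This prevents false positives: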

  • "Let me continue with a summary" → Won't auto-continue (completion phrases override)
  • "Let me know if you'd like changes" → Won't auto-continue (waiting for user)
  • "The task is complete. All changes have been made." → Won't auto-continue (completion detected)
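
The override behaviour in these examples can be sketched as a small phrase check. This is an assumption-laden illustration rather than the PR's implementation: `content_intent_score` is a hypothetical function, and the phrase lists are passed in by the caller. The phrases "summary" and "let me know" used in the usage note are guesses chosen to reproduce the three examples.

```rust
/// Returns the S3 (content_intent) points for a model response.
///
/// Completion phrases are checked first so that they override any
/// intent phrase found in the same text. Both phrase lists are
/// expected to be lowercase, because the content is lowercased
/// before matching.
pub fn content_intent_score(
    content: &str,
    intent_phrases: &[&str],
    completion_phrases: &[&str],
) -> u32 {
    let lower = content.to_lowercase();

    // Any completion or "waiting for user" phrase vetoes the signal.
    if completion_phrases.iter().any(|p| lower.contains(*p)) {
        return 0;
    }

    // Otherwise award the full 25 points when an intent phrase is present.
    if intent_phrases.iter().any(|p| lower.contains(*p)) {
        25
    } else {
        0
    }
}
```

With intent phrases `["let me continue", "next step"]` and completion phrases `["task is complete", "i'm done", "summary", "let me know"]`, all three examples above score 0, while "Let me continue with the next file." scores 25.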

Changes

  • crates/forge_domain/src/auto_continue.rs - Core auto-continue logic with 8 tests
  • crates/forge_domain/src/lib.rs - Module exports
  • crates/forge_app/src/orch.rs - Integration into agent loop

Testing

8 comprehensive tests covering:

  • ✅ True positives (should auto-continue)
  • ✅ True negatives (should finish turn)
  • ✅ Edge cases (empty content, ambiguous scenarios)

Configuration

YAML

    forge:
      auto_continue:
        enabled: true
        confidence_threshold: 60  # Can be tuned per model
        max_retries: 3
        intent_phrases:
          - "let me continue"
          - "next step"
          # ... extensible
        completion_phrases:
          - "task is complete"
          - "i'm done"
          # ... extensible
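
The keys in this configuration could map onto a Rust struct along these lines. The name `AutoContinueConfig` appears in the commit message for this PR, but the field types, the `Default` values (copied from the YAML), and the `should_continue` helper are assumptions made for illustration.

```rust
/// Sketch of a config type mirroring the YAML keys.
/// Field types and defaults are assumptions, not the PR's definitions.
#[derive(Debug, Clone, PartialEq)]
pub struct AutoContinueConfig {
    pub enabled: bool,
    pub confidence_threshold: u32,
    pub max_retries: u32,
    pub intent_phrases: Vec<String>,
    pub completion_phrases: Vec<String>,
}

impl Default for AutoContinueConfig {
    fn default() -> Self {
        Self {
            enabled: true,
            confidence_threshold: 60,
            max_retries: 3,
            intent_phrases: vec!["let me continue".into(), "next step".into()],
            completion_phrases: vec!["task is complete".into(), "i'm done".into()],
        }
    }
}

impl AutoContinueConfig {
    /// Hypothetical helper: auto-continue only when the feature is enabled,
    /// the score reaches the threshold, and the retry budget is not used up.
    pub fn should_continue(&self, score: u32, retries_used: u32) -> bool {
        self.enabled
            && score >= self.confidence_threshold
            && retries_used < self.max_retries
    }
}
```

Bounding retries this way keeps a model that repeatedly announces "let me continue" without calling a tool from looping forever.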


Testing Coverage

All scenarios tested in auto_continue.rs:

  • ✅ 3 true positives (should auto-continue)
  • ✅ 5 true negatives (should finish)
  • ✅ Edge cases (empty content, ambiguous phrases)

Files Changed

| File | Change | Lines |
|---|---|---|
| crates/forge_domain/src/auto_continue.rs | New module | +410 |
| crates/forge_domain/src/lib.rs | Module export | +2 |
| crates/forge_app/src/orch.rs | Integration | +80 |

Fixes

| Issue | Title | Status |
|---|---|---|
| #2890 | Agent stops mid-task | ✅ Fixed |
| #2641 | Intermittent exit with MiniMax | ✅ Fixed |
| #2950 | MiniMax stops mid-response | ✅ Fixed |
| #3170 | Empty tool_calls with finish_reason=tool_calls | ✅ Fixed |


This fix addresses .forge.db corruption issues in ForgeCode by:

1. Startup WAL Recovery:
   - Checkpoints any leftover WAL from previous crashed sessions
   - Runs database integrity check on startup
   - Ensures data is recovered before new session starts

2. Auto-Checkpoint Threshold Reduced:
   - Changed from 1000 to 100 frames (about 0.4 MB of WAL instead of about 4 MB, assuming SQLite's default 4 KiB page size)
   - Prevents massive WAL files during long sessions

3. Async Checkpoint Method:
   - Added checkpoint_async() for graceful shutdown scenarios
   - Uses pool-based connection (async-safe)

4. Drop Checkpoint:
   - Checkpoints WAL when DatabasePool is dropped
   - Logs warnings if fails (expected on force-kill)

5. Comprehensive Tests:
   - test_checkpoint_method_exists
   - test_drop_calls_checkpoint
   - test_in_memory_pool_has_checkpoint
   - test_checkpoint_truncates_wal
   - test_wal_recovery_on_startup
   - test_async_checkpoint_method
   - test_autocheckpoint_threshold_reduced

Fixes tailcallhq#3260 related corruption issues by preventing WAL accumulation
and ensuring data integrity on startup.
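
The startup sequence described in this commit corresponds to a short series of SQLite statements. The sketch below only builds those statements as strings so that it stays free of any database crate. The function name is hypothetical, and the use of TRUNCATE mode is an assumption suggested by the test name test_checkpoint_truncates_wal.

```rust
/// SQL statements for the startup recovery steps, in execution order.
/// Hypothetical helper: the PR runs its checks through its own pool type.
pub fn startup_recovery_statements(autocheckpoint_frames: u32) -> Vec<String> {
    vec![
        // Fold any WAL left behind by a crashed session back into the
        // main database file, then truncate the WAL to zero bytes.
        "PRAGMA wal_checkpoint(TRUNCATE);".to_string(),
        // Verify the database before the new session uses it.
        "PRAGMA integrity_check;".to_string(),
        // Checkpoint automatically once the WAL reaches this many frames.
        format!("PRAGMA wal_autocheckpoint = {};", autocheckpoint_frames),
    ]
}
```

Calling `startup_recovery_statements(100)` yields the reduced threshold from item 2.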

Co-authored-by: Mustaqeem66 <ageisnode@gmail.com>

Phase 1 - Safety Critical:
- Add unique match validation (count all matches, error if > 1)
- Add overlap detection with validation
- Add atomic write with temp file + rename
- Add verification and memory-based rollback
- Add better error messages with file path
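
Two of the Phase 1 items, unique match validation and atomic write, can be sketched with the standard library alone. The function names `find_unique` and `write_atomic`, the `.tmp` suffix, and the error strings are hypothetical. The PR's own code may differ.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Returns the byte offset of `needle` only if it occurs exactly once.
/// Matches are counted without overlap, as `str::match_indices` does.
pub fn find_unique(haystack: &str, needle: &str) -> Result<usize, String> {
    if needle.is_empty() {
        return Err("search text is empty".to_string());
    }
    let mut found = haystack.match_indices(needle);
    match (found.next(), found.next()) {
        (Some((pos, _)), None) => Ok(pos),
        (None, _) => Err("no match found".to_string()),
        (Some(_), Some(_)) => Err(format!(
            "{} matches found, expected exactly 1",
            haystack.matches(needle).count()
        )),
    }
}

/// Writes `contents` to a sibling temp file, flushes it to disk, then
/// renames it over `path`, so readers never observe a half-written file.
pub fn write_atomic(path: &Path, contents: &str) -> io::Result<()> {
    // Keep the temp file in the same directory so the rename stays on
    // one filesystem.
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    {
        let mut file = fs::File::create(&tmp_path)?;
        file.write_all(contents.as_bytes())?;
        file.sync_all()?;
    }
    fs::rename(&tmp_path, path)
}
```

Rejecting ambiguous matches up front is what prevents a patch from silently replacing the wrong occurrence.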

Phase 2 - Robustness:
- Add line-based whitespace normalization
- Add line-window fuzzy matching with 0.90 threshold
- Add 3-layer fallback chain (exact -> whitespace -> fuzzy)
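
The first two layers of this fallback chain can be sketched as below. The fuzzy layer is left out because the commit message does not say which similarity metric backs the 0.90 threshold. `MatchKind`, `normalize_line`, and `locate` are hypothetical names, and this sketch returns the first match without the uniqueness check from Phase 1.

```rust
/// Which layer of the fallback chain produced the match.
#[derive(Debug, PartialEq, Eq)]
pub enum MatchKind {
    Exact,
    Whitespace,
}

/// Trims a line and collapses every run of whitespace to one space.
fn normalize_line(line: &str) -> String {
    line.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Returns the 0-based line where `search` starts in `content`,
/// together with the layer that found it.
pub fn locate(content: &str, search: &str) -> Option<(usize, MatchKind)> {
    if search.is_empty() {
        return None;
    }

    // Layer 1: exact substring match.
    if let Some(pos) = content.find(search) {
        let line = content[..pos].matches('\n').count();
        return Some((line, MatchKind::Exact));
    }

    // Layer 2: compare whitespace-normalized lines over a sliding
    // window that is as tall as the search text.
    let hay: Vec<String> = content.lines().map(normalize_line).collect();
    let needle: Vec<String> = search.lines().map(normalize_line).collect();
    if needle.is_empty() || needle.len() > hay.len() {
        return None;
    }
    hay.windows(needle.len())
        .position(|window| window == needle.as_slice())
        .map(|line| (line, MatchKind::Whitespace))
}
```

Trying the strict layer first means a looser layer is only consulted when the exact text is absent, which keeps the common case unambiguous.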

Key improvements:
- Reverse-order application (already done)
- Unique match validation prevents silent wrong replacements
- Overlap detection rejects logically impossible edits
- Atomic write prevents half-written files
- Whitespace normalization handles LLM whitespace differences
- Fuzzy matching catches near-matches
- Better error messages with file path

Tests added:
- 30+ new tests covering all features

Fixes: tailcallhq#3249, tailcallhq#3182, tailcallhq#2815, tailcallhq#2773, tailcallhq#2997, tailcallhq#3115, tailcallhq#3291

Co-authored-by: Mustaqeem66 <ageisnode@gmail.com>

Added line:column information to overlap error messages for better debugging.
This helps users identify exactly where overlapping edits occur in their files.

Co-authored-by: Mustaqeem66 <ageisnode@gmail.com>

Changed multi_patch edit application to use pre-computed positions
(plan.position and plan.old_len) instead of re-searching in modified content.

This ensures byte offset corruption cannot happen since we're using
exact positions from the original content rather than fresh searches.
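
A minimal sketch of applying edits from pre-computed positions follows. The field names `position` and `old_len` come from this commit message. The `PlannedEdit` type, the `apply_planned` function, and the error strings are hypothetical.

```rust
/// One edit whose location was computed against the ORIGINAL content.
#[derive(Debug, Clone)]
pub struct PlannedEdit {
    pub position: usize, // byte offset into the original content
    pub old_len: usize,  // number of bytes to replace
    pub replacement: String,
}

/// Applies all edits using their original offsets.
///
/// Edits are validated in ascending order, then applied from the end of
/// the file backwards so that earlier offsets are never shifted.
pub fn apply_planned(original: &str, mut plans: Vec<PlannedEdit>) -> Result<String, String> {
    plans.sort_by_key(|p| p.position);

    // Reject overlapping or out-of-range edits before touching anything.
    let mut prev_end = 0;
    for p in &plans {
        let end = p.position + p.old_len;
        if p.position < prev_end {
            return Err(format!("edit at byte {} overlaps the previous edit", p.position));
        }
        if !original.is_char_boundary(p.position) || !original.is_char_boundary(end) {
            return Err(format!("edit at byte {} is out of range", p.position));
        }
        prev_end = end;
    }

    let mut out = original.to_string();
    for p in plans.iter().rev() {
        out.replace_range(p.position..p.position + p.old_len, &p.replacement);
    }
    Ok(out)
}
```

Because no search runs against the partially modified text, a replacement that happens to contain another edit's search string cannot redirect that edit.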

Co-authored-by: Mustaqeem66 <ageisnode@gmail.com>

This fix addresses issue tailcallhq#2641 by adding proper JSON validation for tool call arguments.

Changes:
- Added parse_json() method to ToolCallArguments that validates JSON and returns proper errors
- Updated try_from_parts() to use parse_json() instead of from_json()
- Added 4 comprehensive tests for parse_json() functionality

This ensures malformed JSON in tool call arguments is detected early and returns a proper error instead of being silently stored as Unparsed.

Fixes: tailcallhq#2641

Co-authored-by: Mustaqeem66 <ageisnode@gmail.com>

…rom stopping mid-task

This fix addresses issue tailcallhq#2890 where the agent stops mid-task and requires
manual 'continue' prompts. The solution uses a multi-signal confidence scoring
system that analyzes multiple independent signals before deciding to auto-continue.

Changes:
- Added AutoContinueConfig and AutoContinueAnalyzer to forge_domain
- Implemented 5 independent signals:
  - S1: finish_reason analysis (30 points)
  - S2: last event was ToolResult (25 points)
  - S3: content intent phrases (25 points)
  - S4: no summary language (10 points)
  - S5: recent tool_call ratio (10 points)
- Auto-continue triggers when confidence >= 60 and max retries not exceeded
- Reset counter when turn completes normally

The confidence scoring approach prevents false positives by requiring multiple
signals to agree before auto-continuing. This is similar to how production spam
filters and fraud detection systems work.

Fixes: tailcallhq#2890, tailcallhq#2641, tailcallhq#2950, tailcallhq#3170

Co-authored-by: Mustaqeem66 <ageisnode@gmail.com>
@github-actions github-actions Bot added the type: feature Brand new functionality, features, pages, workflows, endpoints, etc. label May 18, 2026