feat(orchestrator): add multi-signal auto-continue to prevent agent stopping mid-task #3357
Open
Mustaqeem66 wants to merge 6 commits into
This fix addresses .forge.db corruption issues in ForgeCode by:

1. Startup WAL Recovery:
   - Checkpoints any leftover WAL from previous crashed sessions
   - Runs database integrity check on startup
   - Ensures data is recovered before new session starts
2. Auto-Checkpoint Threshold Reduced:
   - Changed from 1000 to 100 frames (~5MB max instead of ~50MB)
   - Prevents massive WAL files during long sessions
3. Async Checkpoint Method:
   - Added checkpoint_async() for graceful shutdown scenarios
   - Uses pool-based connection (async-safe)
4. Drop Checkpoint:
   - Checkpoints WAL when DatabasePool is dropped
   - Logs warnings if it fails (expected on force-kill)
5. Comprehensive Tests:
   - test_checkpoint_method_exists
   - test_drop_calls_checkpoint
   - test_in_memory_pool_has_checkpoint
   - test_checkpoint_truncates_wal
   - test_wal_recovery_on_startup
   - test_async_checkpoint_method
   - test_autocheckpoint_threshold_reduced

Fixes tailcallhq#3260 related corruption issues by preventing WAL accumulation and ensuring data integrity on startup.

Co-authored-by: Mustaqeem66 <ageisnode@gmail.com>
Phase 1 - Safety Critical:
- Add unique match validation (count all matches, error if > 1)
- Add overlap detection with validation
- Add atomic write with temp file + rename
- Add verification and memory-based rollback
- Add better error messages with file path

Phase 2 - Robustness:
- Add line-based whitespace normalization
- Add line-window fuzzy matching with 0.90 threshold
- Add 3-layer fallback chain (exact -> whitespace -> fuzzy)

Key improvements:
- Reverse-order application (already done)
- Unique match validation prevents silent wrong replacements
- Overlap detection rejects logically impossible edits
- Atomic write prevents half-written files
- Whitespace normalization handles LLM whitespace differences
- Fuzzy matching catches near-matches
- Better error messages with file path

Tests added:
- 30+ new tests covering all features

Fixes: tailcallhq#3249, tailcallhq#3182, tailcallhq#2815, tailcallhq#2773, tailcallhq#2997, tailcallhq#3115, tailcallhq#3291

Co-authored-by: Mustaqeem66 <ageisnode@gmail.com>
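The unique-match and overlap checks listed in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `PatchError`, `find_unique`, and `check_overlaps` are hypothetical names, and the sketch counts non-overlapping occurrences only.

```rust
use std::ops::Range;

/// Illustrative error type; the variant names are assumptions, not the PR's API.
#[derive(Debug, PartialEq)]
pub enum PatchError {
    NoMatch { path: String },
    MultipleMatches { path: String, count: usize },
    Overlap { path: String, first: Range<usize>, second: Range<usize> },
}

/// Returns the byte offset of `needle` only if it occurs exactly once in `haystack`.
/// `match_indices` reports non-overlapping occurrences, so "aa" in "aaa" counts once.
pub fn find_unique(path: &str, haystack: &str, needle: &str) -> Result<usize, PatchError> {
    // An empty search string matches everywhere, so treat it as "no match".
    if needle.is_empty() {
        return Err(PatchError::NoMatch { path: path.to_string() });
    }
    let positions: Vec<usize> = haystack.match_indices(needle).map(|(i, _)| i).collect();
    match positions.as_slice() {
        [] => Err(PatchError::NoMatch { path: path.to_string() }),
        [pos] => Ok(*pos),
        _ => Err(PatchError::MultipleMatches {
            path: path.to_string(),
            count: positions.len(),
        }),
    }
}

/// Rejects a set of byte ranges if any two of them overlap.
/// Ranges that merely touch (one ends where the next starts) are accepted.
pub fn check_overlaps(path: &str, ranges: &[Range<usize>]) -> Result<(), PatchError> {
    let mut sorted: Vec<Range<usize>> = ranges.to_vec();
    sorted.sort_by_key(|r| r.start);
    for pair in sorted.windows(2) {
        if pair[1].start < pair[0].end {
            return Err(PatchError::Overlap {
                path: path.to_string(),
                first: pair[0].clone(),
                second: pair[1].clone(),
            });
        }
    }
    Ok(())
}
```

Carrying the file path in every error variant mirrors the "better error messages with file path" item; whether the real implementation structures its errors this way is an assumption.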
Added line:column information to overlap error messages for better debugging. This helps users identify exactly where overlapping edits occur in their files.

Co-authored-by: Mustaqeem66 <ageisnode@gmail.com>
Changed multi_patch edit application to use pre-computed positions (plan.position and plan.old_len) instead of re-searching in modified content. This ensures byte offset corruption cannot happen since we're using exact positions from the original content rather than fresh searches.

Co-authored-by: Mustaqeem66 <ageisnode@gmail.com>
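One way to picture applying edits from pre-computed positions is the sketch below. Only the field names `position` and `old_len` come from the commit message; `EditPlan`, `new_text`, and `apply_plans` are hypothetical, and the real multi_patch code may differ.

```rust
/// One planned replacement. `position` and `old_len` are byte offsets into the
/// ORIGINAL content; `new_text` is a hypothetical field added for this sketch.
pub struct EditPlan {
    pub position: usize,
    pub old_len: usize,
    pub new_text: String,
}

/// Applies all plans to `original`, starting from the end of the file.
/// Working in descending order keeps every remaining offset valid, because
/// each edit only changes bytes at or after its own position.
/// Returns `None` if a plan is out of bounds, overlaps another plan, or
/// does not fall on a UTF-8 character boundary.
pub fn apply_plans(original: &str, plans: &[EditPlan]) -> Option<String> {
    let mut order: Vec<&EditPlan> = plans.iter().collect();
    order.sort_by(|a, b| b.position.cmp(&a.position));

    let mut content = original.to_string();
    // Start of the most recently applied edit; later plans must end before it.
    let mut upper = original.len();
    for plan in order {
        let end = plan.position.checked_add(plan.old_len)?;
        if end > upper
            || !original.is_char_boundary(plan.position)
            || !original.is_char_boundary(end)
        {
            return None;
        }
        content.replace_range(plan.position..end, &plan.new_text);
        upper = plan.position;
    }
    Some(content)
}
```

Because no search runs against the partially modified string, a replacement that happens to contain another edit's search text cannot shift or hijack that edit.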
This fix addresses issue tailcallhq#2641 by adding proper JSON validation for tool call arguments.

Changes:
- Added parse_json() method to ToolCallArguments that validates JSON and returns proper errors
- Updated try_from_parts() to use parse_json() instead of from_json()
- Added 4 comprehensive tests for parse_json() functionality

This ensures malformed JSON in tool call arguments is detected early and returns a proper error instead of being silently stored as Unparsed.

Fixes: tailcallhq#2641

Co-authored-by: Mustaqeem66 <ageisnode@gmail.com>
…rom stopping mid-task

This fix addresses issue tailcallhq#2890 where the agent stops mid-task and requires manual 'continue' prompts. The solution uses a multi-signal confidence scoring system that analyzes multiple independent signals before deciding to auto-continue.

Changes:
- Added AutoContinueConfig and AutoContinueAnalyzer to forge_domain
- Implemented 5 independent signals:
  - S1: finish_reason analysis (30 points)
  - S2: last event was ToolResult (25 points)
  - S3: content intent phrases (25 points)
  - S4: no summary language (10 points)
  - S5: recent tool_call ratio (10 points)
- Auto-continue triggers when confidence >= 60 and max retries not exceeded
- Reset counter when turn completes normally

The confidence scoring approach prevents false positives by requiring multiple signals to agree before auto-continuing. This is similar to how production spam filters and fraud detection systems work.

Fixes: tailcallhq#2890, tailcallhq#2641, tailcallhq#2950, tailcallhq#3170

Co-authored-by: Mustaqeem66 <ageisnode@gmail.com>
Summary
This fix addresses issue #2890 where the agent stops mid-task and requires manual 'continue' prompts. The solution uses a multi-signal confidence scoring system that analyzes multiple independent signals before deciding to auto-continue.
Problem
When using ForgeCode, the agent frequently stops mid-task and waits for user input instead of continuing automatically. Users report having to type "continue" 5-10 times for a single complex task.
Root causes:
- finish_reason: "stop" instead of finish_reason: "tool_calls"
- tool_calls arrays are treated as "no tools needed" instead of "protocol violation"
Solution
Implemented a multi-signal confidence scoring system inspired by production spam filters and fraud detection:
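5 Independent Signals
- S1: finish_reason analysis (30 points)
- S2: last event was ToolResult (25 points)
- S3: content intent phrases (25 points)
- S4: no summary language (10 points)
- S5: recent tool_call ratio (10 points)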
Decision Rule
Confidence Score = S1 + S2 + S3 + S4 + S5
if score >= 60: AUTO-CONTINUE (high confidence)
elif score >= 40: LOG WARNING but don't auto-continue (medium confidence)
else: FINISH TURN (low confidence - task likely done)
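The rule above could look roughly like this in Rust. This is a sketch under stated assumptions, not the PR's `AutoContinueAnalyzer`: the `Signals` struct, the `Decision` enum, and both function names are hypothetical; each signal is assumed to contribute either its full weight (taken from the commit message) or nothing; and finishing the turn once retries run out is also an assumption, since the description only says auto-continue requires that max retries is not exceeded.

```rust
/// Outcome of the decision rule; variant names are illustrative.
#[derive(Debug, PartialEq)]
pub enum Decision {
    AutoContinue,
    LogWarning,
    FinishTurn,
}

/// Hypothetical all-or-nothing view of the five signals.
pub struct Signals {
    pub finish_reason_suspicious: bool, // S1: 30 points
    pub last_event_tool_result: bool,   // S2: 25 points
    pub has_intent_phrase: bool,        // S3: 25 points
    pub no_summary_language: bool,      // S4: 10 points
    pub high_tool_call_ratio: bool,     // S5: 10 points
}

/// Confidence Score = S1 + S2 + S3 + S4 + S5.
pub fn confidence(s: &Signals) -> u32 {
    let weighted = [
        (s.finish_reason_suspicious, 30),
        (s.last_event_tool_result, 25),
        (s.has_intent_phrase, 25),
        (s.no_summary_language, 10),
        (s.high_tool_call_ratio, 10),
    ];
    weighted.iter().filter(|(on, _)| *on).map(|(_, w)| *w).sum()
}

/// Applies the 60 / 40 thresholds, plus a retry cap.
pub fn decide(score: u32, retries_used: u32, max_retries: u32) -> Decision {
    if score >= 60 {
        if retries_used < max_retries {
            Decision::AutoContinue
        } else {
            // Assumption: once the retry budget is spent, end the turn.
            Decision::FinishTurn
        }
    } else if score >= 40 {
        Decision::LogWarning
    } else {
        Decision::FinishTurn
    }
}
```

With these weights, S1 and S2 alone give 55, which only logs a warning; adding S3 raises the score to 80 and triggers auto-continue.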
Example Scoring
finish_reason=tool_calls but empty array
Why Multi-Signal?
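No single signal can trigger auto-continue. With the signal weights (30, 25, 25, 10, 10) and the 60-point threshold, at least three signals must agree: even the two strongest together (S1 + S2 = 55) fall short. This prevents false positives: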
Changes
- crates/forge_domain/src/auto_continue.rs - Core auto-continue logic with 8 tests
- crates/forge_domain/src/lib.rs - Module exports
- crates/forge_app/src/orch.rs - Integration into agent loop
Testing
8 comprehensive tests covering:
Fixes
YAML
forge:
  auto_continue:
    enabled: true
    confidence_threshold: 60  # Can be tuned per model
    max_retries: 3
    intent_phrases:
      - "let me continue"
      - "next step"
      # ... extensible
    completion_phrases:
      - "task is complete"
      - "i'm done"
      # ... extensible
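A configuration like this might map onto a Rust struct along the following lines. `AutoContinueConfig` is named in the commit message, but the field layout, the `Default` values (copied from the YAML above), and the `contains_phrase` helper are assumptions made for illustration; the actual crate presumably loads these values from the config file rather than hard-coding them.

```rust
/// Hypothetical shape for the auto-continue settings shown in the YAML.
#[derive(Debug, Clone, PartialEq)]
pub struct AutoContinueConfig {
    pub enabled: bool,
    pub confidence_threshold: u32,
    pub max_retries: u32,
    pub intent_phrases: Vec<String>,
    pub completion_phrases: Vec<String>,
}

impl Default for AutoContinueConfig {
    /// Defaults mirror the example values; the real phrase lists are longer.
    fn default() -> Self {
        Self {
            enabled: true,
            confidence_threshold: 60,
            max_retries: 3,
            intent_phrases: vec!["let me continue".to_string(), "next step".to_string()],
            completion_phrases: vec!["task is complete".to_string(), "i'm done".to_string()],
        }
    }
}

/// Case-insensitive substring check of `content` against a phrase list.
/// A sketch of how intent and completion phrases could be detected.
pub fn contains_phrase(content: &str, phrases: &[String]) -> bool {
    let lowered = content.to_lowercase();
    phrases
        .iter()
        .any(|phrase| lowered.contains(phrase.to_lowercase().as_str()))
}
```

Keeping the phrase lists in configuration rather than in code is what makes them extensible per model, as the YAML comments suggest.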
Testing Coverage
All scenarios tested in auto_continue.rs:
Files Changed
- crates/forge_domain/src/auto_continue.rs
- crates/forge_domain/src/lib.rs
- crates/forge_app/src/orch.rs
Fixes