fix: Agent max_retry_limit config is silently ignored; tool failures have no retry/backoff path#2081
Conversation
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
|
Caution Review failedAn error occurred during the review process. Please try again later. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces robust exponential backoff and jitter retry mechanisms for both tool execution and guardrail validation within the agent framework. It adds a new BackoffPolicy utility, updates ExecutionConfig to support and validate retry parameters, and integrates this retry logic across synchronous and asynchronous execution paths. The review feedback highlights two important issues: first, a bug where self._execution_config is incorrectly accessed as self.execution_config in tool_execution.py, causing it to silently fall back to default values; and second, a gap in error handling where exceptions raised outside the retry loop (such as in hooks) bypass the ToolExecutionError wrapping.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| logging.warning(f"Tool {function_name} timed out after {tool_timeout}s") | ||
| result = {"error": f"Tool timed out after {tool_timeout}s", "timeout": True} | ||
| # Apply tool retry logic with exponential backoff | ||
| execution_config = getattr(self, 'execution_config', None) |
There was a problem hiding this comment.
The execution_config attribute is stored as self._execution_config (with a leading underscore) in the Agent class initialization. Accessing self.execution_config without the underscore will always return None, causing the tool retry logic to silently fall back to the default hardcoded values instead of using the user-configured settings.
| execution_config = getattr(self, 'execution_config', None) | |
| execution_config = getattr(self, '_execution_config', None) |
| except Exception as e: | ||
| # Emit tool call end with error | ||
| # Emit tool call end with error for exceptions that escape the retry loop | ||
| _duration_ms = (_time.time() - _tool_start_time) * 1000 | ||
| _trace_emitter.tool_call_end(self.name, function_name, None, _duration_ms, str(e)) | ||
|
|
||
| # Gap 3a fix: Wrap exceptions in ToolExecutionError for better observability | ||
| is_retryable = not isinstance(e, (ValueError, TypeError, AttributeError)) | ||
| raise ToolExecutionError( | ||
| f"Tool '{function_name}' failed: {e}", | ||
| tool_name=function_name, | ||
| agent_id=self.name, | ||
| is_retryable=is_retryable, | ||
| ) from e | ||
| raise # Re-raise the exception as it's already properly wrapped |
There was a problem hiding this comment.
Any exceptions raised outside of the retry loop (such as in the BEFORE_TOOL or AFTER_TOOL hooks) will propagate to this outer except block. Since they are not wrapped in ToolExecutionError, simply calling raise will bypass the custom error wrapping and break the uniform error handling/observability contract (Gap 3a fix) for those exceptions. We should check if the exception is already a ToolExecutionError and wrap it if not.
except Exception as e:
# Emit tool call end with error for exceptions that escape the retry loop
_duration_ms = (_time.time() - _tool_start_time) * 1000
_trace_emitter.tool_call_end(self.name, function_name, None, _duration_ms, str(e))
if not isinstance(e, ToolExecutionError):
is_retryable = not isinstance(e, (ValueError, TypeError, AttributeError))
raise ToolExecutionError(
f"Tool '{function_name}' failed: {e}",
tool_name=function_name,
agent_id=self.name,
is_retryable=is_retryable,
) from e
raise
Greptile SummaryThis PR adds exponential-backoff retry logic for tool failures on the synchronous execution path and wires guardrail retries to the same
Confidence Score: 3/5The core retry logic is sound and the attribute-name bug is fixed, but the newly added input validation in ExecutionConfig.post_init is unreachable in every normal agent initialization because it is placed after early-return blocks that fire for all internal callers. The post_init validation for retry_initial_delay, retry_backoff_factor, and retry_jitter is silently skipped in normal operation — internal callers set _context_compaction_internal to True on the first call and all subsequent instantiations return before reaching the validation block. Invalid values will pass through without error. src/praisonai-agents/praisonaiagents/config/feature_configs.py — the validation block in post_init needs to be moved before the context_compaction early-return guards. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[execute_tool called] --> B{blocked_result?}
B -- yes --> C[result = blocked_result]
B -- no --> D[Read _execution_config from self]
D --> E{_execution_config found?}
E -- yes --> F[Use config retry params]
E -- no --> G[Use hardcoded defaults]
F --> H[loop: attempt in range 1..max_retry_limit+2]
G --> H
H --> I{tool_timeout set?}
I -- yes --> J[Submit to ThreadPoolExecutor with timeout]
I -- no --> K[_execute_tool_with_circuit_breaker]
J --> L{TimeoutError?}
L -- yes --> M[result = error dict timeout=True]
L -- no --> K
K --> N{result is error dict?}
M --> N
N -- circuit_open or timeout --> O[raise ToolExecutionError is_retryable=True]
N -- other error dict or success --> P[break - pass to LLM]
O --> Q{retryable AND attempts left?}
Q -- yes --> R[BackoffPolicy.delay / time.sleep]
R --> H
Q -- no --> S[raise ToolExecutionError]
P --> T[wrap_if_external / truncate output]
T --> U[Return result to LLM]
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[execute_tool called] --> B{blocked_result?}
B -- yes --> C[result = blocked_result]
B -- no --> D[Read _execution_config from self]
D --> E{_execution_config found?}
E -- yes --> F[Use config retry params]
E -- no --> G[Use hardcoded defaults]
F --> H[loop: attempt in range 1..max_retry_limit+2]
G --> H
H --> I{tool_timeout set?}
I -- yes --> J[Submit to ThreadPoolExecutor with timeout]
I -- no --> K[_execute_tool_with_circuit_breaker]
J --> L{TimeoutError?}
L -- yes --> M[result = error dict timeout=True]
L -- no --> K
K --> N{result is error dict?}
M --> N
N -- circuit_open or timeout --> O[raise ToolExecutionError is_retryable=True]
N -- other error dict or success --> P[break - pass to LLM]
O --> Q{retryable AND attempts left?}
Q -- yes --> R[BackoffPolicy.delay / time.sleep]
R --> H
Q -- no --> S[raise ToolExecutionError]
P --> T[wrap_if_external / truncate output]
T --> U[Return result to LLM]
Reviews (3): Last reviewed commit: "fix: correct critical retry logic bugs i..." | Re-trigger Greptile |
| def execute_with_context(): | ||
| with with_injection_context(state): | ||
| return self._execute_tool_with_circuit_breaker(function_name, arguments) |
There was a problem hiding this comment.
Late-binding closure in retry loop captures
function_name/arguments at call time
execute_with_context is re-defined on every iteration without explicitly capturing the current loop values. Because neither variable is reassigned inside the loop this works correctly today, but it is a fragile pattern — any future refactor that reassigns those names between iterations would silently pass wrong values to the submitted thread. Capturing them explicitly (e.g. def execute_with_context(fn=function_name, args=arguments): ...) makes the intent clear and resilient to future changes.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
|
@copilot Do a thorough review of this PR. Read ALL existing reviewer comments above from Qodo, Coderabbit, and Gemini first — incorporate their findings. Review areas:
|
|
@claude You are the FINAL architecture reviewer. If the branch is under MervinPraison/PraisonAI (not a fork), you are able to make modifications to this branch and push directly. SCOPE: Focus ONLY on Python packages (praisonaiagents, praisonai). Do NOT modify praisonai-rust or praisonai-ts. Read ALL comments above from Gemini, Qodo, CodeRabbit, and Copilot carefully before responding. Phase 1: Review per AGENTS.md
Phase 2: FIX Valid Issues Phase 3: Final Verdict |
|
Claude finished @MervinPraison's task —— View job ✅ Final Architecture Review & Fixes CompletePhase 1: Architecture Review (per AGENTS.md)
Findings:
Phase 2: Fix Valid IssuesCritical bugs fixed in commit abc1734:
Phase 3: Final Verdict
🎯 Final Recommendation: APPROVEDThis PR successfully implements exponential backoff retry logic for tool execution failures and guardrail validation retries. All critical issues identified by reviewers have been fixed: Files Modified:
What I didn't change (and why):
The PR is now ready to merge. The retry logic will prevent transient tool failures from breaking agent execution, improving overall reliability. |
| # Apply tool retry logic with exponential backoff | ||
| execution_config = getattr(self, '_execution_config', None) | ||
| if execution_config is None: | ||
| # Fall back to reading individual config attributes for backward compatibility | ||
| max_retry_limit = getattr(self, 'max_retry_limit', 2) | ||
| retry_initial_delay = 1.0 | ||
| retry_backoff_factor = 2.0 | ||
| retry_jitter = 0.1 | ||
| else: | ||
| with with_injection_context(state): | ||
| result = self._execute_tool_with_circuit_breaker(function_name, arguments) | ||
| max_retry_limit = execution_config.max_retry_limit | ||
| retry_initial_delay = execution_config.retry_initial_delay | ||
| retry_backoff_factor = execution_config.retry_backoff_factor | ||
| retry_jitter = execution_config.retry_jitter | ||
|
|
||
| result = None | ||
| last_exception = None | ||
|
|
||
| # max_retry_limit is the number of retries (not total attempts) | ||
| # So total attempts = 1 (initial) + max_retry_limit (retries) | ||
| for attempt in range(1, max_retry_limit + 2): | ||
| try: | ||
| # P8/G11: Apply tool timeout if configured | ||
| tool_timeout = getattr(self, '_tool_timeout', None) | ||
| if tool_timeout and tool_timeout > 0: | ||
| # Use copy_context to preserve injection context in executor thread | ||
| ctx = contextvars.copy_context() | ||
|
|
||
| def execute_with_context(): | ||
| with with_injection_context(state): | ||
| return self._execute_tool_with_circuit_breaker(function_name, arguments) | ||
|
|
||
| # Use reusable executor to prevent resource leaks | ||
| if not hasattr(self, '_tool_executor'): | ||
| self._tool_executor = concurrent.futures.ThreadPoolExecutor( | ||
| max_workers=2, thread_name_prefix=f"tool-{self.name}" | ||
| ) | ||
|
|
||
| future = self._tool_executor.submit(ctx.run, execute_with_context) | ||
| try: | ||
| result = future.result(timeout=tool_timeout) | ||
| except concurrent.futures.TimeoutError: | ||
| future.cancel() | ||
| logging.warning(f"Tool {function_name} timed out after {tool_timeout}s") | ||
| result = {"error": f"Tool timed out after {tool_timeout}s", "timeout": True} | ||
| else: | ||
| with with_injection_context(state): | ||
| result = self._execute_tool_with_circuit_breaker(function_name, arguments) | ||
|
|
||
| # Check if the result indicates a retryable error | ||
| if isinstance(result, dict) and result.get("error"): | ||
| # Check if this is a circuit breaker error (always retryable) | ||
| if result.get("circuit_open"): | ||
| raise ToolExecutionError( | ||
| result["error"], | ||
| tool_name=function_name, | ||
| agent_id=self.name, | ||
| is_retryable=True, | ||
| ) | ||
| # Check if this is a timeout error (retryable) | ||
| elif result.get("timeout"): | ||
| raise ToolExecutionError( | ||
| result["error"], | ||
| tool_name=function_name, | ||
| agent_id=self.name, | ||
| is_retryable=True, | ||
| ) | ||
| # For other error dicts, treat as non-retryable unless specified | ||
| else: | ||
| # Success path - return the result | ||
| break | ||
| else: | ||
| # Success path - return the result | ||
| break | ||
|
|
||
| except ToolExecutionError as e: | ||
| last_exception = e | ||
| # Only retry if the error is marked as retryable and we have retries left | ||
| # attempt starts at 1, so (attempt - 1) gives us the retry count | ||
| if not e.is_retryable or (attempt - 1) >= max_retry_limit: | ||
| raise e | ||
|
|
||
| # Calculate delay for exponential backoff | ||
| delay = BackoffPolicy.delay(attempt, retry_initial_delay, retry_backoff_factor, retry_jitter) | ||
| logging.warning( | ||
| f"Tool {function_name} failed (attempt {attempt}/{max_retry_limit + 1}): {e}. " | ||
| f"Retrying in {delay:.2f}s..." | ||
| ) | ||
| time.sleep(delay) | ||
|
|
||
| except Exception as e: | ||
| # Wrap unexpected exceptions in ToolExecutionError | ||
| # Most tool errors are considered retryable unless they're programming errors | ||
| is_retryable = not isinstance(e, (ValueError, TypeError, AttributeError)) | ||
| tool_error = ToolExecutionError( | ||
| f"Tool '{function_name}' failed: {e}", | ||
| tool_name=function_name, | ||
| agent_id=self.name, | ||
| is_retryable=is_retryable, | ||
| ) | ||
| last_exception = tool_error | ||
|
|
||
| # attempt starts at 1, so (attempt - 1) gives us the retry count | ||
| if not is_retryable or (attempt - 1) >= max_retry_limit: | ||
| raise tool_error from e | ||
|
|
||
| # Calculate delay for exponential backoff | ||
| delay = BackoffPolicy.delay(attempt, retry_initial_delay, retry_backoff_factor, retry_jitter) | ||
| logging.warning( | ||
| f"Tool {function_name} failed (attempt {attempt}/{max_retry_limit + 1}): {e}. " | ||
| f"Retrying in {delay:.2f}s..." | ||
| ) | ||
| time.sleep(delay) |
There was a problem hiding this comment.
Retry logic bypassed on the async execution path
The new retry loop lives entirely inside _execute_tool_with_context, which is only reached through the synchronous execute_tool → _execute_tool_with_context call chain. The async path used by agents running astart() binds execute_tool_fn to self.execute_tool_async (chat_mixin.py lines 1021 and 2173), which is a completely separate implementation in execution_mixin.py that calls the tool function directly — there is no retry loop there.
As a result, any agent that goes through the async LLM completion path still silently ignores max_retry_limit for tool failures, which is the same bug this PR set out to fix. The sync path is now protected; the async path is not.
| def __post_init__(self) -> None: | ||
| """Validate retry configuration parameters.""" | ||
| if self.retry_initial_delay <= 0: | ||
| raise ValueError("ExecutionConfig.retry_initial_delay must be positive.") | ||
| if self.retry_backoff_factor < 1.0: | ||
| raise ValueError("ExecutionConfig.retry_backoff_factor must be >= 1.0.") | ||
| if self.retry_jitter < 0: | ||
| raise ValueError("ExecutionConfig.retry_jitter must be non-negative.") |
There was a problem hiding this comment.
Missing
max_retry_limit lower-bound guard in __post_init__
The new validation checks the three delay fields but not max_retry_limit itself. A caller passing max_retry_limit=-1 would produce range(1, 1) — an empty loop — so result stays None and is silently returned to the LLM with no error and no retry. Adding if self.max_retry_limit < 0: raise ValueError(...) makes the contract explicit.
| def __post_init__(self) -> None: | |
| """Validate retry configuration parameters.""" | |
| if self.retry_initial_delay <= 0: | |
| raise ValueError("ExecutionConfig.retry_initial_delay must be positive.") | |
| if self.retry_backoff_factor < 1.0: | |
| raise ValueError("ExecutionConfig.retry_backoff_factor must be >= 1.0.") | |
| if self.retry_jitter < 0: | |
| raise ValueError("ExecutionConfig.retry_jitter must be non-negative.") | |
| def __post_init__(self) -> None: | |
| """Validate retry configuration parameters.""" | |
| if self.retry_initial_delay <= 0: | |
| raise ValueError("ExecutionConfig.retry_initial_delay must be positive.") | |
| if self.retry_backoff_factor < 1.0: | |
| raise ValueError("ExecutionConfig.retry_backoff_factor must be >= 1.0.") | |
| if self.retry_jitter < 0: | |
| raise ValueError("ExecutionConfig.retry_jitter must be non-negative.") | |
| if self.max_retry_limit < 0: | |
| raise ValueError("ExecutionConfig.max_retry_limit must be non-negative.") |
|
@claude Merge conflict vs Older branch ( Conflict files
Representative hunksOverlaps #2080 on Resolution hint: after rebase, honour consolidated |
|
Claude finished @MervinPraison's task —— View job ✅ Successfully resolved merge conflicts with mainI've successfully rebased this PR onto main and resolved all conflicts while preserving the retry/backoff logic implementation. 📋 Completed Tasks:
🔧 Resolution Details:Files resolved:
✅ Tests passed:
The PR is now cleanly rebased on main and ready for review. |
…failures (fixes #1809) - Add retry configuration fields to ExecutionConfig (retry_initial_delay, retry_backoff_factor, retry_jitter) - Implement BackoffPolicy class for exponential backoff with jitter - Add retry loop to tool execution that respects max_retry_limit from ExecutionConfig - Honor ToolExecutionError.is_retryable for retry decisions - Add backoff delays to guardrail retries (sync and async versions) - Tool failures now retry with exponential backoff instead of immediate termination - Guardrail retries no longer hammer LLM API with immediate re-calls Co-authored-by: MervinPraison <MervinPraison@users.noreply.github.com>
- Wire ExecutionConfig into guardrail retries by storing _execution_config on Agent instance - Add max_delay cap (60s) to BackoffPolicy to prevent unbounded exponential growth - Add validation for retry parameters in ExecutionConfig.__post_init__ - Use shared BackoffPolicy in guardrail retry logic to eliminate code duplication - Fixes issues identified by Greptile, CodeRabbit, and Copilot reviewers Co-authored-by: Mervin Praison <MervinPraison@users.noreply.github.com>
- Fix attribute name mismatch: '_execution_config' not 'execution_config' This was causing all user-configured retry settings to be silently ignored - Fix off-by-one retry count: max_retry_limit now correctly means number of retries With max_retry_limit=1, we now get 1 retry (2 total attempts) instead of 0 retries - Fix error wrapping: exceptions from hooks are now properly wrapped in ToolExecutionError This ensures uniform error handling and observability Co-authored-by: Mervin Praison <MervinPraison@users.noreply.github.com>
abc1734 to
d7dabae
Compare
| def __post_init__(self) -> None: | ||
| """Post-initialization processing with deprecation warnings.""" | ||
| """Post-initialization processing with deprecation warnings and validation.""" | ||
| # Handle context_compaction serialization round-trip | ||
| if isinstance(self.context_compaction, dict): |
There was a problem hiding this comment.
Retry validation is silently bypassed on every normal instantiation.
__post_init__ returns early (lines 784, 786, 801) whenever context_compaction is False (the default) and the caller is internal library code — which is the case for every ExecutionConfig() created during agent startup. The three raise ValueError guards at lines 815–820 are therefore never reached in practice, so callers can pass retry_initial_delay=-1.0 or retry_backoff_factor=0.5 without error. Moving the validation before the context_compaction guard block ensures it always runs.
| def __post_init__(self) -> None: | |
| """Post-initialization processing with deprecation warnings.""" | |
| """Post-initialization processing with deprecation warnings and validation.""" | |
| # Handle context_compaction serialization round-trip | |
| if isinstance(self.context_compaction, dict): | |
| def __post_init__(self) -> None: | |
| """Post-initialization processing with deprecation warnings and validation.""" | |
| # Validate retry configuration parameters (must run before any early returns) | |
| if self.retry_initial_delay <= 0: | |
| raise ValueError("ExecutionConfig.retry_initial_delay must be positive.") | |
| if self.retry_backoff_factor < 1.0: | |
| raise ValueError("ExecutionConfig.retry_backoff_factor must be >= 1.0.") | |
| if self.retry_jitter < 0: | |
| raise ValueError("ExecutionConfig.retry_jitter must be non-negative.") | |
| # Handle context_compaction serialization round-trip | |
| if isinstance(self.context_compaction, dict): |
Fixes #1809
Auto-opened from Claude triage branch
claude/issue-1809-20260603-0711.Made with Cursor