Description
Add a `tool_pruning` truncation strategy that strips `tool_use`/`tool_result` messages from prior turns.
When the harness passes conversation history from prior turns to the model for the current turn, it currently includes the full message sequence: user prompt, assistant tool_use messages, tool_result messages, and the final assistant response — for every prior turn. In agentic workloads where the agent invokes multiple tools per turn, this causes significant context bloat with information that is rarely useful for subsequent reasoning.
See the attached Markdown file for details: `token_details_222-session-padded-to-meet-min-length-REDACTED.md`
We propose a new truncation/pruning strategy (alongside the existing `sliding_window` and `summarization`) that retains only the user prompt and final assistant response from each prior turn, while stripping all intermediate `tool_use` and `tool_result` messages. Within the current turn, full tool-call context is preserved as normal so the agent can reason over its in-progress work.
What is sent to the model under this strategy:
- System prompt
- For each prior turn: user message + final assistant response only (no tool_use/tool_result)
- RAG/long-term memory chunks (if applicable)
- Current turn: full context including all tool_use and tool_result messages
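As a minimal sketch of the filtering described above, assuming Anthropic-style messages (dicts with a `role` and either string content or a list of typed blocks) — the function names and message shape here are illustrative, not the harness's actual API:

```python
def is_tool_message(msg: dict) -> bool:
    """True if the message carries a tool_use or tool_result block."""
    content = msg.get("content", [])
    if isinstance(content, str):
        return False
    return any(block.get("type") in ("tool_use", "tool_result")
               for block in content)

def prune_history(prior_turns: list, current_turn: list) -> list:
    """Drop tool messages from prior turns; keep the current turn intact."""
    pruned = [m for turn in prior_turns for m in turn
              if not is_tool_message(m)]
    return pruned + list(current_turn)
```

Each prior turn collapses to its user prompt plus final assistant response, while the current turn passes through untouched.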
Where things get worse:
- RAG context compounds: If RAG injects 3,000 tokens at the start of a turn, every subsequent model call within that turn re-sends those 3,000 tokens alongside the growing tool_use/tool_result history. Then the entire compounded context carries forward to the next turn.
- High tool-count turns: The measured session uses 3–5 tools per turn. With 10+ tools (common in production agents), context growth within a single turn becomes severe.
- Large tool outputs: A single URL fetch or file read returning 2,000 tokens inflates every subsequent model call in that turn — and every model call in every future turn — by that amount.
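To make the last bullet concrete, a back-of-envelope calculation (the cycle and call counts below are assumed for illustration; only the 2,000-token figure comes from the bullet):

```python
# One 2,000-token tool result re-sent on every later model call.
tool_output_tokens = 2_000
remaining_calls_this_turn = 3   # model calls after the tool returns (assumed)
calls_in_future_turns = 8       # model calls across the rest of the session (assumed)
extra_input_tokens = tool_output_tokens * (
    remaining_calls_this_turn + calls_in_future_turns
)
print(extra_input_tokens)  # 22000 extra input tokens from a single fetch
```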
What should be excluded from prior turns:
- Prior turns' `tool_use` messages (assistant requesting tool calls)
- Prior turns' `tool_result` messages (tool execution outputs)
Acceptance Criteria
- A new truncation strategy value (e.g., `tool_pruning`) is accepted by `--truncation-strategy` in the CLI and `HarnessTruncationConfiguration.strategy` in the API.
- When enabled, model calls for turn N include only user+assistant(final) pairs from turns 1..N-1, not intermediate tool_use/tool_result messages from those turns.
- The current turn (turn N) retains full tool_use and tool_result context for in-progress reasoning cycles.
- The strategy can be combined with `sliding_window` or `summarization` (e.g., prune tool calls first, then apply sliding window on the pruned history).
- Token usage for turn N's first model call grows proportionally to the sum of (prompt + response) sizes of prior turns, not the sum of full tool-call histories.
- No change in agent behavior for single-turn invocations.
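One way the combinability criterion could be realized is as composable history transforms — the names and signatures here are illustrative, not the harness's actual API:

```python
from typing import Callable, List

History = List[dict]
Strategy = Callable[[History], History]

def compose(*strategies: Strategy) -> Strategy:
    """Apply strategies left to right: compose(prune, window) prunes first."""
    def run(history: History) -> History:
        for strategy in strategies:
            history = strategy(history)
        return history
    return run

def sliding_window(max_messages: int) -> Strategy:
    """Keep only the most recent max_messages messages."""
    return lambda history: history[-max_messages:]

def tool_pruning(history: History) -> History:
    """Drop messages flagged as tool traffic (the flag is illustrative)."""
    return [m for m in history if not m.get("is_tool", False)]

# Prune tool calls first, then window the already-pruned history.
pipeline = compose(tool_pruning, sliding_window(4))
```

Ordering matters: windowing the pruned history means the window budget is spent on prompts and final responses rather than tool traffic.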
Additional Context
Measured data from a 3-question session (session ID: `222-session-padded-to-meet-min-length`):
| Question | Cycle 1 Input Tokens | Final Cycle Input Tokens | Tools Invoked |
|----------|----------------------|--------------------------|---------------|
| Q1       | 964                  | 1,349                    | 3             |
| Q2       | 1,537                | 2,370                    | 4             |
| Q3       | 2,670                | 3,500                    | 5             |
- Q1's full tool history adds 573 tokens to every model call in Q2 (4 cycles × 573 = 2,292 extra input tokens).
- Q1+Q2's full tool history adds 1,706 tokens to every model call in Q3 (3 cycles × 1,706 = 5,118 extra input tokens).
- With tool pruning, prior-turn context would be ~161 tokens (Q1 prompt+response) and ~268 tokens (Q2 prompt+response) instead of 573 and 1,133 respectively.
- Projected savings at turn 10: ~67% reduction in input tokens per model call from prior-turn context alone.
- Total session input tokens: 21,837 across 11 model calls for just 3 questions. This grows super-linearly with turn count.
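The Q3 figures above imply the per-call reduction directly; as an arithmetic check on the reported numbers:

```python
full_prior_context = 573 + 1_133   # Q1 + Q2 full tool histories seen by Q3 calls
pruned_prior_context = 161 + 268   # prompt + final response only
reduction = 1 - pruned_prior_context / full_prior_context
print(f"{reduction:.0%}")  # roughly a 75% cut in prior-turn context for Q3 calls
```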
Trace data source: CloudWatch Logs EMF entries (`strands.event_loop.input.tokens` / `strands.event_loop.output.tokens`) under namespace `bedrock-agentcore`.
Why this matters: In agentic workloads, tool results (shell output, file contents, API responses) are often large and ephemeral — they served their purpose in the turn they were generated. Carrying them forward adds noise, increases latency (more tokens to process), and increases cost without improving reasoning quality for subsequent turns.
Reduces pressure on other strategies: By pruning tool-call noise from every turn upfront, `tool_pruning` keeps the baseline context lean. This means `sliding_window` and `summarization` kick in later (or not at all for shorter sessions), since the context they operate on is already optimized. The strategies become complementary layers rather than competing last-resort mechanisms.