Description
Add a `tool_pruning` truncation strategy that strips `tool_use`/`tool_result` messages from prior turns.
When the harness passes conversation history from prior turns to the model for the current turn, it currently includes the full message sequence: user prompt, assistant tool_use messages, tool_result messages, and the final assistant response — for every prior turn. In agentic workloads where the agent invokes multiple tools per turn, this causes significant context bloat with information that is rarely useful for subsequent reasoning.
See the attached Markdown file for details: `token_details_222-session-padded-to-meet-min-length-REDACTED.md`
We propose a new truncation/pruning strategy (alongside the existing `sliding_window` and `summarization`) that retains only the user prompt and final assistant response from each prior turn, while stripping all intermediate `tool_use` and `tool_result` messages. Within the current turn, full tool-call context is preserved as normal so the agent can reason over its in-progress work.
What is sent to the model under this strategy:
- System prompt
- For each prior turn: user message + final assistant response only (no tool_use/tool_result)
- RAG/long-term memory chunks (if applicable)
- Current turn: full context including all tool_use and tool_result messages
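As a minimal sketch of the filtering described above, assuming Anthropic-style messages (dicts with a `role` and either string content or a list of typed blocks) — the function names and message shape here are illustrative, not the harness's actual API:

```python
def is_tool_message(msg: dict) -> bool:
    """True if the message carries a tool_use or tool_result block."""
    content = msg.get("content", [])
    if isinstance(content, str):
        return False
    return any(block.get("type") in ("tool_use", "tool_result")
               for block in content)

def prune_history(prior_turns: list, current_turn: list) -> list:
    """Drop tool messages from prior turns; keep the current turn intact."""
    pruned = [m for turn in prior_turns for m in turn
              if not is_tool_message(m)]
    return pruned + list(current_turn)
```

Each prior turn collapses to its user prompt plus final assistant response, while the current turn passes through untouched.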
Where things get worse:
- RAG context compounds: If RAG injects 3,000 tokens at the start of a turn, every subsequent model call within that turn re-sends those 3,000 tokens alongside the growing tool_use/tool_result history. Then the entire compounded context carries forward to the next turn.
- High tool-count turns: The measured session uses 3–5 tools per turn. With 10+ tools (common in production agents), context growth within a single turn becomes severe.
- Large tool outputs: A single URL fetch or file read returning 2,000 tokens inflates every subsequent model call in that turn — and every model call in every future turn — by that amount.
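To make the last bullet concrete, a back-of-envelope calculation (the cycle and call counts below are assumed for illustration; only the 2,000-token figure comes from the bullet):

```python
# One 2,000-token tool result re-sent on every later model call.
tool_output_tokens = 2_000
remaining_calls_this_turn = 3   # model calls after the tool returns (assumed)
calls_in_future_turns = 8       # model calls across the rest of the session (assumed)
extra_input_tokens = tool_output_tokens * (
    remaining_calls_this_turn + calls_in_future_turns
)
print(extra_input_tokens)  # 22000 extra input tokens from a single fetch
```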
What should be excluded from prior turns:
- Prior turns' `tool_use` messages (assistant requesting tool calls)
- Prior turns' `tool_result` messages (tool execution outputs)
Acceptance Criteria
- A new truncation strategy value (e.g., `tool_pruning`) is accepted by `--truncation-strategy` in the CLI and `HarnessTruncationConfiguration.strategy` in the API.
- When enabled, model calls for turn N include only user+assistant(final) pairs from turns 1..N-1, not intermediate tool_use/tool_result messages from those turns.
- The current turn (turn N) retains full tool_use and tool_result context for in-progress reasoning cycles.
- The strategy can be combined with `sliding_window` or `summarization` (e.g., prune tool calls first, then apply sliding window on the pruned history).
- Token usage for turn N's first model call grows proportionally to the sum of (prompt + response) sizes of prior turns, not the sum of full tool-call histories.
- No change in agent behavior for single-turn invocations.
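One way the combinability criterion could be realized is as composable history transforms — the names and signatures here are illustrative, not the harness's actual API:

```python
from typing import Callable, List

History = List[dict]
Strategy = Callable[[History], History]

def compose(*strategies: Strategy) -> Strategy:
    """Apply strategies left to right: compose(prune, window) prunes first."""
    def run(history: History) -> History:
        for strategy in strategies:
            history = strategy(history)
        return history
    return run

def sliding_window(max_messages: int) -> Strategy:
    """Keep only the most recent max_messages messages."""
    return lambda history: history[-max_messages:]

def tool_pruning(history: History) -> History:
    """Drop messages flagged as tool traffic (the flag is illustrative)."""
    return [m for m in history if not m.get("is_tool", False)]

# Prune tool calls first, then window the already-pruned history.
pipeline = compose(tool_pruning, sliding_window(4))
```

Ordering matters: windowing the pruned history means the window budget is spent on prompts and final responses rather than tool traffic.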
Additional Context
Measured data from a 3-question session (session ID: `222-session-padded-to-meet-min-length`):
| Question | Cycle 1 Input Tokens | Final Cycle Input Tokens | Tools Invoked |
|----------|----------------------|--------------------------|---------------|
| Q1       | 964                  | 1,349                    | 3             |
| Q2       | 1,537                | 2,370                    | 4             |
| Q3       | 2,670                | 3,500                    | 5             |
- Q1's full tool history adds 573 tokens to every model call in Q2 (4 cycles × 573 = 2,292 extra input tokens).
- Q1+Q2's full tool history adds 1,706 tokens to every model call in Q3 (3 cycles × 1,706 = 5,118 extra input tokens).
- With tool pruning, prior-turn context would be ~161 tokens (Q1 prompt+response) and ~268 tokens (Q2 prompt+response) instead of 573 and 1,133 respectively.
- Projected savings at turn 10: ~67% reduction in input tokens per model call from prior-turn context alone.
- Total session input tokens: 21,837 across 11 model calls for just 3 questions. This grows super-linearly with turn count.
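The Q3 figures above imply the per-call reduction directly; as an arithmetic check on the reported numbers:

```python
full_prior_context = 573 + 1_133   # Q1 + Q2 full tool histories seen by Q3 calls
pruned_prior_context = 161 + 268   # prompt + final response only
reduction = 1 - pruned_prior_context / full_prior_context
print(f"{reduction:.0%}")  # roughly a 75% cut in prior-turn context for Q3 calls
```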
Trace data source: CloudWatch Logs EMF entries (`strands.event_loop.input.tokens` / `strands.event_loop.output.tokens`) under namespace `bedrock-agentcore`.
Why this matters: In agentic workloads, tool results (shell output, file contents, API responses) are often large and ephemeral — they served their purpose in the turn they were generated. Carrying them forward adds noise, increases latency (more tokens to process), and increases cost without improving reasoning quality for subsequent turns.
Reduces pressure on other strategies: By pruning tool-call noise from every turn upfront, `tool_pruning` keeps the baseline context lean. This means `sliding_window` and `summarization` kick in later (or not at all for shorter sessions), since the context they operate on is already optimized. The strategies become complementary layers rather than competing last-resort mechanisms.