Commit b692ec7

fix(core): size-based context pruning + artifact-wrapper override
Drop window-based compaction in favor of per-block size caps (8 KB for assistant text, toolCall.input, and toolResult content; 2 KB aggressive fallback). The window approach missed the dominant failure mode: the latest assistant message carrying a 9 MB artifact dump inside the "keep verbatim" window, which produced 3.97M-token requests and 400s.

New philosophy: history is intent tracking, not payload storage. Large outputs get stubbed regardless of position; current state is always recoverable via view().

Also resolves the conflicting instructions that produced the 9 MB payload in the first place: base output-rules §"Artifact wrapper" and workflow step 7 both require the model to emit the full design as assistant text inside <artifact>, while the agent guidance says do NOT emit it. Adds an explicit OVERRIDE block at the top of AGENTIC_TOOL_GUIDANCE so the tool-mode branch wins unambiguously, and tightens the final-turn note to spell out that pasting the file crashes the next request.

- packages/core/src/context-prune.ts: rewrite around per-block caps
- packages/core/src/context-prune.test.ts: 7 tests for new contract
- packages/core/src/agent.ts: OVERRIDE block + final-turn warning
1 parent 1cdf006 commit b692ec7
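
The rewritten context-prune.ts is not shown in this view, so here is a minimal sketch of the per-block capping the commit message describes. The 8 KB and 2 KB caps, the `_summarized` / `_origBytes` fields, and the stub prefixes come from the message and the tests. The message shapes, the name `pruneBySize`, the value of `HARD_CAP_BYTES`, and the rest of the stub wording are assumptions made for illustration.

```typescript
// Hypothetical message shapes; the real AgentMessage type is not shown in this commit view.
type Block =
  | { type: 'text'; text: string }
  | { type: 'toolCall'; id: string; name: string; input: unknown };

type Msg =
  | { role: 'user'; content: Block[] }
  | { role: 'assistant'; content: Block[] }
  | { role: 'toolResult'; toolCallId: string; content: Block[] };

const CAP_BYTES = 8 * 1024; // per-block cap from the commit message
const AGGRESSIVE_CAP_BYTES = 2 * 1024; // fallback cap from the commit message
const HARD_CAP_BYTES = 200 * 1024; // ASSUMED value; the real threshold is not shown

// UTF-8 byte length of a string.
const bytes = (s: string): number => new TextEncoder().encode(s).length;

// Replace one oversized block with a small stub; small blocks pass through untouched.
function capBlock(role: Msg['role'], block: Block, cap: number): Block {
  if (block.type === 'text') {
    const size = bytes(block.text);
    if (size <= cap) return block;
    const label = role === 'toolResult' ? 'tool result dropped' : 'prior assistant output dropped';
    return { type: 'text', text: `[${label}: ${size}B; call view() for current state]` };
  }
  const size = bytes(JSON.stringify(block.input ?? null));
  if (size <= cap) return block;
  // Keep id and name so the call still pairs with its toolResult.
  return { ...block, input: { _summarized: true, _origBytes: size } };
}

// Apply one cap to every non-user block, regardless of position in the history.
function applyCap(messages: Msg[], cap: number): Msg[] {
  return messages.map((m) => {
    if (m.role === 'user') return m; // user messages are never modified
    const content = m.content.map((b) => capBlock(m.role, b, cap));
    // Return the original object when nothing changed.
    return content.every((b, i) => b === m.content[i]) ? m : ({ ...m, content } as Msg);
  });
}

const totalBytes = (messages: Msg[]): number => bytes(JSON.stringify(messages));

// First pass at 8 KB; fall back to 2 KB only if the result is still too large.
function pruneBySize(messages: Msg[]): Msg[] {
  const firstPass = applyCap(messages, CAP_BYTES);
  return totalBytes(firstPass) <= HARD_CAP_BYTES
    ? firstPass
    : applyCap(messages, AGGRESSIVE_CAP_BYTES);
}
```

In this sketch position plays no role: a 50 KB assistant text block on the latest message is stubbed exactly like one from twenty turns earlier, which is the behavior the window-based approach lacked.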

3 files changed

Lines changed: 211 additions & 220 deletions

packages/core/src/agent.ts

Lines changed: 23 additions & 8 deletions
@@ -307,6 +307,17 @@ async function collectSkills(
 // ---------------------------------------------------------------------------
 
 const AGENTIC_TOOL_GUIDANCE = [
+  '## OVERRIDE: artifact-wrapper rules do not apply in this mode',
+  '',
+  'The base system prompt (output-rules §"Artifact wrapper", workflow step 7 ',
+  '"Deliver — Output the artifact tag") instructs you to emit the design ',
+  'inside an `<artifact>...</artifact>` tag as assistant text. **Those rules ',
+  'are superseded by this section.** You have a `str_replace_based_edit_tool`; ',
+  'the file is written via that tool and extracted from the virtual filesystem ',
+  'by the host. Emitting the file contents as assistant text (either wrapped in ',
+  '`<artifact>`, a ```jsx fence, or raw) duplicates the design, doubles token ',
+  'cost, and blows past the LLM context limit on the next turn. Never do it.',
+  '',
   '## Output format (STRICT — no exceptions)',
   '',
   'Your artifact lives in `index.html` and follows this template — write it via',
@@ -355,13 +366,17 @@ const AGENTIC_TOOL_GUIDANCE = [
   ' 4. One short prose line reflecting on what landed ("Three KPIs in place — the deltas use mono tnum so they line up.").',
   ' 5. Tick the matching todo via `set_todos`.',
   ' That is **2 prose lines + 3 tool calls per turn**. Never batch multiple sections into a single str_replace; never run two str_replace tools in the same turn without a prose line in between.',
-  '4. **Polish passes — interactive depth.** At least ONE dedicated turn that ADDS interactive depth, not just cosmetic tweaks. Before `done`, make sure:',
-  ' (a) ≥2 state changes actually wire up (tab switch in a section, accordion open, favorite toggle, dropdown on avatar, drawer "See details", inline-edit click)',
-  ' (b) ≥1 list / grid / table has a believable empty state component defined (icon + reason + CTA), even if the current data is non-empty',
-  ' (c) all buttons / cards have hover + press feedback (`transition: transform 120ms var(--ease-out), background-color 120ms`; press = `scale(0.96)`; cards lift 2px on hover)',
-  ' (d) data is real-sounding, not Lorem: varied names, realistic numbers, relative dates ("3h ago")',
-  ' Add an explicit `Interactive polish` todo item in `set_todos` so the user sees it ticked.',
-  '5. **Final turn — summary.** 2–4 sentences of natural-language prose explaining 2–3 design decisions worth noting (e.g. "Used three distinct surface tones for depth"). Do NOT re-emit the file content; the host extracts it from the virtual fs.',
+  '4. **Polish passes — interactive depth (MANDATORY, ≥2 dedicated turns).** The first polish turn wires interactions; the second adds small-detail craft. These are NOT optional — if the user sees static pixels where they expected live UI, the artifact fails. Before `done` every item on this list must be TRUE:',
+  ' (a) **≥3 functional state changes** that a user can trigger and observe. Tab switch revealing a different view, accordion open/close, drawer slide-in, favorite/like toggle that persists, dropdown/menu expand, inline-edit, filter chip toggle, modal open. Pure hover effects do NOT count toward this three.',
+  ' (b) **≥1 animated view/page transition** if there is any nav (tabs, sidebar, bottom bar, breadcrumbs). 180–260ms, opacity + small translate. A hard cut between views is a failure.',
+  ' (c) **Every `<button>` and `<a>` does something.** No decorative buttons. Wire a state change, open a modal, fire a toast, or remove it. Login / Sign-up / CTA buttons on marketing pages may open a modal stub — still real, not dead.',
+  ' (d) **Uniform hover + press + focus** across ALL clickable elements. Required cadence: `transition: transform 120ms var(--ease-out), background-color 120ms, box-shadow 160ms;` hover lifts 2px; press = `scale(0.96)`; focus = 2px offset ring in accent color (never rely on browser default outline).',
+  ' (e) **≥3 small-detail "craft-surplus" touches** from the craft-directives catalog. Pick from: stateful counter/badge with pop animation, keyboard shortcut chip (`⌘K`, `/`, `esc`), inline-editable field, copy-to-clipboard with "Copied ✓" feedback, dismissible toast/banner, contextual tooltip with directional arrow, scroll-linked header shrink, relative-time tick ("3m ago"), segmented control with weighted active state, thoughtful empty-state SVG scene, expandable accordion inside a card, a deliberate visual rhythm-break section. Adding a gradient and shadow does NOT count.',
+  ' (f) **≥1 empty-state variant** visible or coded (icon + one-sentence reason + CTA) on a list/grid/table, even when current data is non-empty.',
+  ' (g) **Active nav indicator uses weight/shape, not color alone** — underline, inset background, side-accent bar, or pill — so color-blind users can tell where they are.',
+  ' (h) Data reads real: varied names, non-round numbers (87 %, $14.2k), relative dates ("3h ago", "yesterday"), not Lorem / 100 % / Jan 1 2020.',
+  ' Break this into TWO todo items: `Interactive wiring (state + transitions)` and `Craft surplus (small details)`. Tick them explicitly so the user can see both phases landed.',
+  '5. **Final turn — summary.** 2–4 sentences of natural-language prose explaining 2–3 design decisions worth noting (e.g. "Used three distinct surface tones for depth"). Do NOT re-emit the file content; the host extracts it from the virtual fs. Pasting the full file here wastes ~2M tokens on the next turn and will crash the request — this is a hard failure, not a style nit.',
   '',
   '### File output policy (STRICT)',
   "- Use `str_replace_based_edit_tool` for ALL file content. Do NOT emit `<artifact>` tags or fenced ```jsx/```html blocks containing the source in your prose — the host extracts the artifact from the virtual fs and any inline source spams the user's chat.",
@@ -714,7 +729,7 @@ export async function generateViaAgent(
     // Without this, assistant.toolCall.input + big view results grow O(N²)
     // in LLM-facing size across a long tool-using run and blow past 1 M
     // tokens. See context-prune.ts for the full strategy.
-    transformContext: buildTransformContext(),
+    transformContext: buildTransformContext(log),
     getApiKey: () => input.apiKey || 'open-codesign-keyless',
   });
 
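
The O(N²) remark in the comment above can be checked with a little arithmetic. If every turn appends roughly the same payload and every request resends the whole history, the bytes sent over N turns sum to payload × N(N+1)/2. The sketch below compares an uncapped history with one whose per-turn contribution is held to the 8 KB cap from the commit message. The 30 turns and the 50,000-byte payload are made-up numbers for illustration, and `cumulativeBytesSent` is a hypothetical helper, not part of the codebase.

```typescript
// Total bytes sent to the model across `turns` requests when each request
// resends the full history and every turn appends `perTurnBytes`.
function cumulativeBytesSent(turns: number, perTurnBytes: number): number {
  let history = 0;
  let sent = 0;
  for (let i = 0; i < turns; i += 1) {
    history += perTurnBytes; // history grows linearly with each turn
    sent += history; // so the running total grows quadratically
  }
  return sent;
}

const TURNS = 30; // illustrative
const RAW_PER_TURN = 50_000; // illustrative: one large view() result per turn
const CAPPED_PER_TURN = 8 * 1024; // 8 KB cap from the commit message

const raw = cumulativeBytesSent(TURNS, RAW_PER_TURN);
const capped = cumulativeBytesSent(TURNS, CAPPED_PER_TURN);

console.log({ raw, capped, ratio: raw / capped });
```

With these inputs the uncapped run sends 23,250,000 bytes in total and the capped run 3,809,280. Capping shrinks the constant factor; it does not remove the growth, which is presumably why the commit also keeps an aggressive fallback for histories that are still too large.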
packages/core/src/context-prune.test.ts

Lines changed: 68 additions & 87 deletions
@@ -9,12 +9,12 @@ function userMsg(text: string): AgentMessage {
   } as unknown as AgentMessage;
 }
 
-function assistantWithToolCall(toolCallId: string, big: string): AgentMessage {
+function assistantWithToolCall(toolCallId: string, inputArg: string): AgentMessage {
   return {
     role: 'assistant',
     content: [
       { type: 'text', text: 'ok' },
-      { type: 'toolCall', id: toolCallId, name: 'str_replace_based_edit_tool', input: { big } },
+      { type: 'toolCall', id: toolCallId, name: 'str_replace_based_edit_tool', input: { inputArg } },
     ],
   } as unknown as AgentMessage;
 }
@@ -34,8 +34,8 @@ function assistantText(text: string): AgentMessage {
   } as unknown as AgentMessage;
 }
 
-describe('buildTransformContext — sliding-window compaction', () => {
-  it('is a no-op when well under the cap', async () => {
+describe('buildTransformContext — size-based block compaction', () => {
+  it('is a no-op when every block is under its cap', async () => {
     const transform = buildTransformContext();
     const messages: AgentMessage[] = [
       userMsg('hi'),
@@ -47,117 +47,98 @@ describe('buildTransformContext — sliding-window compaction', () => {
     expect(out).toEqual(messages);
   });
 
-  it('leaves the last N tool-use rounds verbatim, stubs older toolResult content', async () => {
+  it('stubs a large assistant text block even on the LATEST message', async () => {
+    // The production bug: model streamed a 9MB artifact as assistant text
+    // on the final turn. v1 window-based prune preserved it verbatim.
     const transform = buildTransformContext();
-    const messages: AgentMessage[] = [userMsg('build this')];
-    const bulk = 'x'.repeat(2_000);
-    // 10 rounds — default window is 6, so first ~4 should be compacted.
-    for (let i = 0; i < 10; i += 1) {
-      messages.push(assistantWithToolCall(`t${i}`, `args-${i}`));
-      messages.push(toolResult(`t${i}`, `result body ${i} ${bulk}`));
-    }
+    const huge = 'x'.repeat(50_000);
+    const messages: AgentMessage[] = [
+      userMsg('build it'),
+      assistantText(huge),
+    ];
     const out = await transform(messages);
-    const resultRows = out.filter((m) => m.role === 'toolResult');
-    expect(resultRows).toHaveLength(10);
-    const early = resultRows.slice(0, 3);
-    const recent = resultRows.slice(-6);
-    for (const row of early) {
-      const first = (row as { content: Array<{ text?: string }> }).content[0]?.text ?? '';
-      expect(first.startsWith('[dropped')).toBe(true);
-    }
-    for (const row of recent) {
-      const first = (row as { content: Array<{ text?: string }> }).content[0]?.text ?? '';
-      expect(first.startsWith('result body')).toBe(true);
-    }
+    const last = out[out.length - 1] as { content: Array<{ text?: string }> };
+    const text = last.content[0]?.text ?? '';
+    expect(text.startsWith('[prior assistant output dropped')).toBe(true);
+    expect(text).toContain('50000B');
   });
 
-  it('compacts assistant.toolCall.input on old rounds but preserves name + id', async () => {
+  it('summarizes a large toolCall.input, preserving name + id', async () => {
     const transform = buildTransformContext();
-    const messages: AgentMessage[] = [userMsg('build')];
-    const bulk = 'a'.repeat(4_000);
-    // 10 rounds with big toolCall args — older ones should have args summarized.
-    for (let i = 0; i < 10; i += 1) {
-      messages.push(assistantWithToolCall(`call-${i}`, bulk));
-      messages.push(toolResult(`call-${i}`, 'ok'));
-    }
+    const bulk = 'a'.repeat(20_000);
+    const messages: AgentMessage[] = [
+      userMsg('build'),
+      assistantWithToolCall('call-0', bulk),
+      toolResult('call-0', 'ok'),
+    ];
     const out = await transform(messages);
-    // Oldest assistant message's toolCall block should have summarized input.
-    const oldest = out.find(
-      (m) =>
-        m.role === 'assistant' &&
-        Array.isArray((m as { content?: unknown }).content) &&
-        (m as { content: Array<{ type?: string; id?: string }> }).content.some(
-          (c) => c?.id === 'call-0',
-        ),
-    ) as { content: Array<{ id?: string; name?: string; input?: unknown }> } | undefined;
-    expect(oldest).toBeDefined();
-    const tc = oldest?.content.find((c) => c.id === 'call-0');
+    const a = out[1] as { content: Array<{ type?: string; id?: string; name?: string; input?: unknown }> };
+    const tc = a.content.find((c) => c.type === 'toolCall');
+    expect(tc?.id).toBe('call-0');
     expect(tc?.name).toBe('str_replace_based_edit_tool');
-    const input = tc?.input as { _summarized?: boolean; _origBytes?: number } | undefined;
+    const input = tc?.input as { _summarized?: boolean; _origBytes?: number };
     expect(input?._summarized).toBe(true);
-    expect(input?._origBytes).toBeGreaterThan(1_000);
+    expect(input?._origBytes).toBeGreaterThan(10_000);
   });
 
-  it('keeps the toolCallId on stubbed toolResult rows (pi-ai shape requirement)', async () => {
+  it('stubs a large toolResult body, keeping toolCallId for pi-ai shape', async () => {
     const transform = buildTransformContext();
-    const messages: AgentMessage[] = [userMsg('x')];
-    const bulk = 'y'.repeat(3_000);
-    for (let i = 0; i < 10; i += 1) {
-      messages.push(assistantWithToolCall(`call-${i}`, 'a'));
-      messages.push(toolResult(`call-${i}`, `body ${bulk}`));
-    }
+    const bulk = 'y'.repeat(20_000);
+    const messages: AgentMessage[] = [
+      userMsg('x'),
+      assistantWithToolCall('call-0', 'a'),
+      toolResult('call-0', bulk),
+    ];
     const out = await transform(messages);
-    const first = out.find(
-      (m) => m.role === 'toolResult' && (m as { toolCallId?: string }).toolCallId === 'call-0',
-    ) as { toolCallId?: string; content: Array<{ text?: string }> } | undefined;
-    expect(first).toBeDefined();
-    expect(first?.toolCallId).toBe('call-0');
-    expect(first?.content[0]?.text?.startsWith('[dropped')).toBe(true);
+    const tr = out[2] as { toolCallId?: string; content: Array<{ text?: string }> };
+    expect(tr.toolCallId).toBe('call-0');
+    expect(tr.content[0]?.text?.startsWith('[tool result dropped')).toBe(true);
   });
 
-  it('preserves user messages and assistant-text messages unchanged', async () => {
+  it('leaves small blocks untouched regardless of position', async () => {
     const transform = buildTransformContext();
-    const bulk = 'z'.repeat(3_000);
-    const openingUser = userMsg('initial brief, do not mangle');
-    const openingNote = assistantText('I will start now.');
-    const messages: AgentMessage[] = [openingUser, openingNote];
-    for (let i = 0; i < 10; i += 1) {
-      messages.push(assistantWithToolCall(`c${i}`, 'op'));
-      messages.push(toolResult(`c${i}`, `r ${bulk}`));
+    const messages: AgentMessage[] = [userMsg('go')];
+    for (let i = 0; i < 20; i += 1) {
+      messages.push(assistantWithToolCall(`t${i}`, 'tiny'));
+      messages.push(toolResult(`t${i}`, `tiny result ${i}`));
     }
-    messages.push(assistantText('final summary line'));
     const out = await transform(messages);
-    expect(out.find((m) => m.role === 'user')).toBe(openingUser);
-    const textOnlyAssistants = out.filter(
-      (m) =>
-        m.role === 'assistant' &&
-        (m as { content: Array<{ type: string }> }).content.every((c) => c.type === 'text'),
-    );
-    expect(textOnlyAssistants.length).toBeGreaterThanOrEqual(2);
+    expect(out).toEqual(messages);
+  });
+
+  it('never modifies user messages', async () => {
+    const transform = buildTransformContext();
+    const opening = userMsg('x'.repeat(50_000));
+    const messages: AgentMessage[] = [opening, assistantText('ok')];
+    const out = await transform(messages);
+    expect(out[0]).toBe(opening);
   });
 
-  it('tightens to the aggressive window when HARD_CAP_BYTES is exceeded', async () => {
+  it('tightens to aggressive caps when HARD_CAP_BYTES is exceeded', async () => {
     const transform = buildTransformContext();
     const messages: AgentMessage[] = [userMsg('go')];
-    const hugeArgs = 'p'.repeat(40_000);
-    for (let i = 0; i < 10; i += 1) {
-      messages.push(assistantWithToolCall(`big-${i}`, hugeArgs));
-      messages.push(toolResult(`big-${i}`, 'small-response'));
+    // 30 rounds with tool input just over the 8KB cap = 30 summarized at first
+    // pass, but the metadata itself adds up. Force the hard cap by also adding
+    // many text blocks between 2KB and 8KB — first pass keeps them, aggressive
+    // compacts them.
+    const midText = 'p'.repeat(6_000);
+    for (let i = 0; i < 40; i += 1) {
+      messages.push(assistantText(midText));
+      messages.push(assistantWithToolCall(`t${i}`, 'p'.repeat(10_000)));
+      messages.push(toolResult(`t${i}`, 'p'.repeat(10_000)));
     }
     const out = await transform(messages);
-    // In aggressive mode only the last 3 rounds stay verbatim. Count
-    // assistant toolCall blocks with summarized input.
-    let summarizedCount = 0;
+    let droppedTextCount = 0;
     for (const m of out) {
       if (m.role !== 'assistant') continue;
-      const content = (m as { content: Array<{ type?: string; input?: unknown }> }).content;
+      const content = (m as { content: Array<{ type?: string; text?: string }> }).content;
       for (const c of content) {
-        if (c.type === 'toolCall') {
-          const input = c.input as { _summarized?: boolean } | undefined;
-          if (input?._summarized === true) summarizedCount += 1;
+        if (c.type === 'text' && c.text?.startsWith('[prior assistant output dropped')) {
+          droppedTextCount += 1;
         }
       }
     }
-    expect(summarizedCount).toBeGreaterThanOrEqual(7);
+    // Aggressive cap is 2KB — the 6KB midText blocks should all be stubbed.
+    expect(droppedTextCount).toBeGreaterThanOrEqual(35);
   });
 });
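
The last test reaches the aggressive path only if the 8 KB pass leaves the history above HARD_CAP_BYTES. The following rough estimate shows what the fixture retains after each pass. It assumes a stub or summary costs about 100 bytes and ignores JSON overhead; both are assumptions, since the actual stub format and the value of HARD_CAP_BYTES live in context-prune.ts, which is not shown here.

```typescript
const ROUNDS = 40; // loop count in the test
const MID_TEXT = 6_000; // assistant text per round, under the 8 KB cap
const TOOL_INPUT = 10_000; // toolCall.input per round, over the 8 KB cap
const TOOL_RESULT = 10_000; // toolResult body per round, over the 8 KB cap
const CAP = 8 * 1024;
const AGGRESSIVE_CAP = 2 * 1024;
const STUB_COST = 100; // ASSUMED size of a stub or summary

// Bytes a block contributes after capping: kept verbatim, or replaced by a stub.
const afterCap = (size: number, cap: number): number => (size <= cap ? size : STUB_COST);

// Bytes retained per round (text + tool input + tool result) under a given cap.
const perRound = (cap: number): number =>
  afterCap(MID_TEXT, cap) + afterCap(TOOL_INPUT, cap) + afterCap(TOOL_RESULT, cap);

const firstPassBytes = ROUNDS * perRound(CAP);
const aggressiveBytes = ROUNDS * perRound(AGGRESSIVE_CAP);

console.log({ firstPassBytes, aggressiveBytes });
```

Under these assumptions the fixture retains about 248,000 bytes after the first pass, almost all of it the 6,000-byte text blocks, so HARD_CAP_BYTES would need to sit below roughly that figure for the aggressive pass to run. Once it does, every mid-size text block exceeds the 2 KB cap, which is consistent with the test expecting at least 35 stubbed blocks.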
