fix(core): size-based context pruning + artifact-wrapper override

hqhq1025 · hqhq1025 · commit b692ec76564f · 2026-04-21T01:28:35.000+08:00
Drop window-based compaction in favor of per-block size caps (8 KB for
assistant text, toolCall.input, and toolResult content; 2 KB aggressive
fallback). The window approach missed the dominant failure mode: the
latest assistant message carrying a 9 MB artifact dump inside the "keep
verbatim" window, which produced 3.97M-token requests and HTTP 400
errors.

New philosophy: history is intent tracking, not payload storage. Large
outputs get stubbed regardless of position — current state is always
recoverable via view().
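
A minimal sketch of the per-block contract described above, not the
actual context-prune.ts. The message shapes are simplified, the helper
names (pruneBlock, pruneWithCap, totalSize, prune) are hypothetical, the
HARD_CAP_BYTES value is an assumption, and sizes are measured as string
length rather than true byte counts. The stub prefixes and the
_summarized/_origBytes fields follow what the tests assert; the rest of
the stub wording is illustrative.

```typescript
// Hypothetical, simplified shapes; the real AgentMessage type is richer.
type Block =
  | { type: 'text'; text: string }
  | { type: 'toolCall'; id: string; name: string; input: unknown };

type Message =
  | { role: 'user'; content: Block[] }
  | { role: 'assistant'; content: Block[] }
  | { role: 'toolResult'; toolCallId: string; content: Block[] };

const NORMAL_CAP = 8 * 1024; // 8 KB per block (from the commit message)
const AGGRESSIVE_CAP = 2 * 1024; // 2 KB fallback (from the commit message)
const HARD_CAP_BYTES = 200 * 1024; // ASSUMED value, for illustration only

// Size proxy: length of the string, or of the JSON-serialized value.
function sizeOf(value: unknown): number {
  if (typeof value === 'string') return value.length;
  return (JSON.stringify(value) ?? '').length;
}

// Stub one block if it exceeds the cap; position in history is irrelevant.
function pruneBlock(block: Block, role: Message['role'], cap: number): Block {
  if (block.type === 'text') {
    if (block.text.length <= cap) return block;
    const label =
      role === 'toolResult' ? 'tool result dropped' : 'prior assistant output dropped';
    return { type: 'text', text: `[${label}: ${block.text.length}B]` };
  }
  const bytes = sizeOf(block.input);
  if (bytes <= cap) return block;
  // Keep id + name so the toolCall/toolResult pairing stays intact.
  return { ...block, input: { _summarized: true, _origBytes: bytes } };
}

function pruneWithCap(messages: Message[], cap: number): Message[] {
  return messages.map((m): Message => {
    if (m.role === 'user') return m; // user messages are never modified
    return { ...m, content: m.content.map((b) => pruneBlock(b, m.role, cap)) };
  });
}

function totalSize(messages: Message[]): number {
  return messages.reduce((sum, m) => sum + sizeOf(m.content), 0);
}

// First pass uses the 8 KB cap; if the total still exceeds the hard cap,
// redo the pass from the original messages with the 2 KB cap.
function prune(messages: Message[]): Message[] {
  const firstPass = pruneWithCap(messages, NORMAL_CAP);
  if (totalSize(firstPass) <= HARD_CAP_BYTES) return firstPass;
  return pruneWithCap(messages, AGGRESSIVE_CAP);
}
```

Because the cap is checked per block, a 50,000-character assistant text
block is stubbed even when it is the latest message, which is the case
the old window kept verbatim.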

Also resolves the conflicting instructions that produced the 9 MB
payload in the first place: base output-rules §"Artifact wrapper" and
workflow step 7 both require the model to emit the full design as
assistant text inside <artifact>, while the agent guidance says do NOT
emit it. Adds an explicit OVERRIDE block at the top of
AGENTIC_TOOL_GUIDANCE so the tool-mode branch wins unambiguously, and
tightens the final-turn note to spell out that pasting the file crashes
the next request.

- packages/core/src/context-prune.ts: rewrite around per-block caps
- packages/core/src/context-prune.test.ts: 7 tests for new contract
- packages/core/src/agent.ts: OVERRIDE block + final-turn warning
diff --git a/packages/core/src/agent.ts b/packages/core/src/agent.ts
@@ -307,6 +307,17 @@ async function collectSkills(
 // ---------------------------------------------------------------------------
 
 const AGENTIC_TOOL_GUIDANCE = [
+  '## OVERRIDE: artifact-wrapper rules do not apply in this mode',
+  '',
+  'The base system prompt (output-rules §"Artifact wrapper", workflow step 7 ',
+  '"Deliver — Output the artifact tag") instructs you to emit the design ',
+  'inside an `<artifact>...</artifact>` tag as assistant text. **Those rules ',
+  'are superseded by this section.** You have a `str_replace_based_edit_tool`; ',
+  'the file is written via that tool and extracted from the virtual filesystem ',
+  'by the host. Emitting the file contents as assistant text (either wrapped in ',
+  '`<artifact>`, a ```jsx fence, or raw) duplicates the design, doubles token ',
+  'cost, and blows past the LLM context limit on the next turn. Never do it.',
+  '',
   '## Output format (STRICT — no exceptions)',
   '',
   'Your artifact lives in `index.html` and follows this template — write it via',
@@ -355,13 +366,17 @@ const AGENTIC_TOOL_GUIDANCE = [
   '   4. One short prose line reflecting on what landed ("Three KPIs in place — the deltas use mono tnum so they line up.").',
   '   5. Tick the matching todo via `set_todos`.',
   '   That is **2 prose lines + 3 tool calls per turn**. Never batch multiple sections into a single str_replace; never run two str_replace tools in the same turn without a prose line in between.',
-  '4. **Polish passes — interactive depth.** At least ONE dedicated turn that ADDS interactive depth, not just cosmetic tweaks. Before `done`, make sure:',
-  '   (a) ≥2 state changes actually wire up (tab switch in a section, accordion open, favorite toggle, dropdown on avatar, drawer "See details", inline-edit click)',
-  '   (b) ≥1 list / grid / table has a believable empty state component defined (icon + reason + CTA), even if the current data is non-empty',
-  '   (c) all buttons / cards have hover + press feedback (`transition: transform 120ms var(--ease-out), background-color 120ms`; press = `scale(0.96)`; cards lift 2px on hover)',
-  '   (d) data is real-sounding, not Lorem: varied names, realistic numbers, relative dates ("3h ago")',
-  '   Add an explicit `Interactive polish` todo item in `set_todos` so the user sees it ticked.',
-  '5. **Final turn — summary.** 2–4 sentences of natural-language prose explaining 2–3 design decisions worth noting (e.g. "Used three distinct surface tones for depth"). Do NOT re-emit the file content; the host extracts it from the virtual fs.',
+  '4. **Polish passes — interactive depth (MANDATORY, ≥2 dedicated turns).** The first polish turn wires interactions; the second adds small-detail craft. These are NOT optional — if the user sees static pixels where they expected live UI, the artifact fails. Before `done` every item on this list must be TRUE:',
+  '   (a) **≥3 functional state changes** that a user can trigger and observe. Tab switch revealing a different view, accordion open/close, drawer slide-in, favorite/like toggle that persists, dropdown/menu expand, inline-edit, filter chip toggle, modal open. Pure hover effects do NOT count toward this three.',
+  '   (b) **≥1 animated view/page transition** if there is any nav (tabs, sidebar, bottom bar, breadcrumbs). 180–260ms, opacity + small translate. A hard cut between views is a failure.',
+  '   (c) **Every `<button>` and `<a>` does something.** No decorative buttons. Wire a state change, open a modal, fire a toast, or remove it. Login / Sign-up / CTA buttons on marketing pages may open a modal stub — still real, not dead.',
+  '   (d) **Uniform hover + press + focus** across ALL clickable elements. Required cadence: `transition: transform 120ms var(--ease-out), background-color 120ms, box-shadow 160ms;` hover lifts 2px; press = `scale(0.96)`; focus = 2px offset ring in accent color (never rely on browser default outline).',
+  '   (e) **≥3 small-detail "craft-surplus" touches** from the craft-directives catalog. Pick from: stateful counter/badge with pop animation, keyboard shortcut chip (`⌘K`, `/`, `esc`), inline-editable field, copy-to-clipboard with "Copied ✓" feedback, dismissible toast/banner, contextual tooltip with directional arrow, scroll-linked header shrink, relative-time tick ("3m ago"), segmented control with weighted active state, thoughtful empty-state SVG scene, expandable accordion inside a card, a deliberate visual rhythm-break section. Adding a gradient and shadow does NOT count.',
+  '   (f) **≥1 empty-state variant** visible or coded (icon + one-sentence reason + CTA) on a list/grid/table, even when current data is non-empty.',
+  '   (g) **Active nav indicator uses weight/shape, not color alone** — underline, inset background, side-accent bar, or pill — so color-blind users can tell where they are.',
+  '   (h) Data reads real: varied names, non-round numbers (87 %, $14.2k), relative dates ("3h ago", "yesterday"), not Lorem / 100 % / Jan 1 2020.',
+  '   Break this into TWO todo items: `Interactive wiring (state + transitions)` and `Craft surplus (small details)`. Tick them explicitly so the user can see both phases landed.',
+  '5. **Final turn — summary.** 2–4 sentences of natural-language prose explaining 2–3 design decisions worth noting (e.g. "Used three distinct surface tones for depth"). Do NOT re-emit the file content; the host extracts it from the virtual fs. Pasting the full file here wastes ~2M tokens on the next turn and will crash the request — this is a hard failure, not a style nit.',
   '',
   '### File output policy (STRICT)',
   "- Use `str_replace_based_edit_tool` for ALL file content. Do NOT emit `<artifact>` tags or fenced ```jsx/```html blocks containing the source in your prose — the host extracts the artifact from the virtual fs and any inline source spams the user's chat.",
@@ -714,7 +729,7 @@ export async function generateViaAgent(
     // Without this, assistant.toolCall.input + big view results grow O(N²)
     // in LLM-facing size across a long tool-using run and blow past 1 M
     // tokens. See context-prune.ts for the full strategy.
-    transformContext: buildTransformContext(),
+    transformContext: buildTransformContext(log),
     getApiKey: () => input.apiKey || 'open-codesign-keyless',
   });
 
diff --git a/packages/core/src/context-prune.test.ts b/packages/core/src/context-prune.test.ts
@@ -9,12 +9,12 @@ function userMsg(text: string): AgentMessage {
   } as unknown as AgentMessage;
 }
 
-function assistantWithToolCall(toolCallId: string, big: string): AgentMessage {
+function assistantWithToolCall(toolCallId: string, inputArg: string): AgentMessage {
   return {
     role: 'assistant',
     content: [
       { type: 'text', text: 'ok' },
-      { type: 'toolCall', id: toolCallId, name: 'str_replace_based_edit_tool', input: { big } },
+      { type: 'toolCall', id: toolCallId, name: 'str_replace_based_edit_tool', input: { inputArg } },
     ],
   } as unknown as AgentMessage;
 }
@@ -34,8 +34,8 @@ function assistantText(text: string): AgentMessage {
   } as unknown as AgentMessage;
 }
 
-describe('buildTransformContext — sliding-window compaction', () => {
-  it('is a no-op when well under the cap', async () => {
+describe('buildTransformContext — size-based block compaction', () => {
+  it('is a no-op when every block is under its cap', async () => {
     const transform = buildTransformContext();
     const messages: AgentMessage[] = [
       userMsg('hi'),
@@ -47,117 +47,98 @@ describe('buildTransformContext — sliding-window compaction', () => {
     expect(out).toEqual(messages);
   });
 
-  it('leaves the last N tool-use rounds verbatim, stubs older toolResult content', async () => {
+  it('stubs a large assistant text block even on the LATEST message', async () => {
+    // The production bug: model streamed a 9MB artifact as assistant text
+    // on the final turn. v1 window-based prune preserved it verbatim.
     const transform = buildTransformContext();
-    const messages: AgentMessage[] = [userMsg('build this')];
-    const bulk = 'x'.repeat(2_000);
-    // 10 rounds — default window is 6, so first ~4 should be compacted.
-    for (let i = 0; i < 10; i += 1) {
-      messages.push(assistantWithToolCall(`t${i}`, `args-${i}`));
-      messages.push(toolResult(`t${i}`, `result body ${i} ${bulk}`));
-    }
+    const huge = 'x'.repeat(50_000);
+    const messages: AgentMessage[] = [
+      userMsg('build it'),
+      assistantText(huge),
+    ];
     const out = await transform(messages);
-    const resultRows = out.filter((m) => m.role === 'toolResult');
-    expect(resultRows).toHaveLength(10);
-    const early = resultRows.slice(0, 3);
-    const recent = resultRows.slice(-6);
-    for (const row of early) {
-      const first = (row as { content: Array<{ text?: string }> }).content[0]?.text ?? '';
-      expect(first.startsWith('[dropped')).toBe(true);
-    }
-    for (const row of recent) {
-      const first = (row as { content: Array<{ text?: string }> }).content[0]?.text ?? '';
-      expect(first.startsWith('result body')).toBe(true);
-    }
+    const last = out[out.length - 1] as { content: Array<{ text?: string }> };
+    const text = last.content[0]?.text ?? '';
+    expect(text.startsWith('[prior assistant output dropped')).toBe(true);
+    expect(text).toContain('50000B');
   });
 
-  it('compacts assistant.toolCall.input on old rounds but preserves name + id', async () => {
+  it('summarizes a large toolCall.input, preserving name + id', async () => {
     const transform = buildTransformContext();
-    const messages: AgentMessage[] = [userMsg('build')];
-    const bulk = 'a'.repeat(4_000);
-    // 10 rounds with big toolCall args — older ones should have args summarized.
-    for (let i = 0; i < 10; i += 1) {
-      messages.push(assistantWithToolCall(`call-${i}`, bulk));
-      messages.push(toolResult(`call-${i}`, 'ok'));
-    }
+    const bulk = 'a'.repeat(20_000);
+    const messages: AgentMessage[] = [
+      userMsg('build'),
+      assistantWithToolCall('call-0', bulk),
+      toolResult('call-0', 'ok'),
+    ];
     const out = await transform(messages);
-    // Oldest assistant message's toolCall block should have summarized input.
-    const oldest = out.find(
-      (m) =>
-        m.role === 'assistant' &&
-        Array.isArray((m as { content?: unknown }).content) &&
-        (m as { content: Array<{ type?: string; id?: string }> }).content.some(
-          (c) => c?.id === 'call-0',
-        ),
-    ) as { content: Array<{ id?: string; name?: string; input?: unknown }> } | undefined;
-    expect(oldest).toBeDefined();
-    const tc = oldest?.content.find((c) => c.id === 'call-0');
+    const a = out[1] as { content: Array<{ type?: string; id?: string; name?: string; input?: unknown }> };
+    const tc = a.content.find((c) => c.type === 'toolCall');
+    expect(tc?.id).toBe('call-0');
     expect(tc?.name).toBe('str_replace_based_edit_tool');
-    const input = tc?.input as { _summarized?: boolean; _origBytes?: number } | undefined;
+    const input = tc?.input as { _summarized?: boolean; _origBytes?: number };
     expect(input?._summarized).toBe(true);
-    expect(input?._origBytes).toBeGreaterThan(1_000);
+    expect(input?._origBytes).toBeGreaterThan(10_000);
   });
 
-  it('keeps the toolCallId on stubbed toolResult rows (pi-ai shape requirement)', async () => {
+  it('stubs a large toolResult body, keeping toolCallId for pi-ai shape', async () => {
     const transform = buildTransformContext();
-    const messages: AgentMessage[] = [userMsg('x')];
-    const bulk = 'y'.repeat(3_000);
-    for (let i = 0; i < 10; i += 1) {
-      messages.push(assistantWithToolCall(`call-${i}`, 'a'));
-      messages.push(toolResult(`call-${i}`, `body ${bulk}`));
-    }
+    const bulk = 'y'.repeat(20_000);
+    const messages: AgentMessage[] = [
+      userMsg('x'),
+      assistantWithToolCall('call-0', 'a'),
+      toolResult('call-0', bulk),
+    ];
     const out = await transform(messages);
-    const first = out.find(
-      (m) => m.role === 'toolResult' && (m as { toolCallId?: string }).toolCallId === 'call-0',
-    ) as { toolCallId?: string; content: Array<{ text?: string }> } | undefined;
-    expect(first).toBeDefined();
-    expect(first?.toolCallId).toBe('call-0');
-    expect(first?.content[0]?.text?.startsWith('[dropped')).toBe(true);
+    const tr = out[2] as { toolCallId?: string; content: Array<{ text?: string }> };
+    expect(tr.toolCallId).toBe('call-0');
+    expect(tr.content[0]?.text?.startsWith('[tool result dropped')).toBe(true);
   });
 
-  it('preserves user messages and assistant-text messages unchanged', async () => {
+  it('leaves small blocks untouched regardless of position', async () => {
     const transform = buildTransformContext();
-    const bulk = 'z'.repeat(3_000);
-    const openingUser = userMsg('initial brief, do not mangle');
-    const openingNote = assistantText('I will start now.');
-    const messages: AgentMessage[] = [openingUser, openingNote];
-    for (let i = 0; i < 10; i += 1) {
-      messages.push(assistantWithToolCall(`c${i}`, 'op'));
-      messages.push(toolResult(`c${i}`, `r ${bulk}`));
+    const messages: AgentMessage[] = [userMsg('go')];
+    for (let i = 0; i < 20; i += 1) {
+      messages.push(assistantWithToolCall(`t${i}`, 'tiny'));
+      messages.push(toolResult(`t${i}`, `tiny result ${i}`));
     }
-    messages.push(assistantText('final summary line'));
     const out = await transform(messages);
-    expect(out.find((m) => m.role === 'user')).toBe(openingUser);
-    const textOnlyAssistants = out.filter(
-      (m) =>
-        m.role === 'assistant' &&
-        (m as { content: Array<{ type: string }> }).content.every((c) => c.type === 'text'),
-    );
-    expect(textOnlyAssistants.length).toBeGreaterThanOrEqual(2);
+    expect(out).toEqual(messages);
+  });
+
+  it('never modifies user messages', async () => {
+    const transform = buildTransformContext();
+    const opening = userMsg('x'.repeat(50_000));
+    const messages: AgentMessage[] = [opening, assistantText('ok')];
+    const out = await transform(messages);
+    expect(out[0]).toBe(opening);
   });
 
-  it('tightens to the aggressive window when HARD_CAP_BYTES is exceeded', async () => {
+  it('tightens to aggressive caps when HARD_CAP_BYTES is exceeded', async () => {
     const transform = buildTransformContext();
     const messages: AgentMessage[] = [userMsg('go')];
-    const hugeArgs = 'p'.repeat(40_000);
-    for (let i = 0; i < 10; i += 1) {
-      messages.push(assistantWithToolCall(`big-${i}`, hugeArgs));
-      messages.push(toolResult(`big-${i}`, 'small-response'));
+    // 40 rounds with tool input just over the 8KB cap = 40 summarized at first
+    // pass, but the metadata itself adds up. Force the hard cap by also adding
+    // many text blocks between 2KB and 8KB — first pass keeps them, aggressive
+    // compacts them.
+    const midText = 'p'.repeat(6_000);
+    for (let i = 0; i < 40; i += 1) {
+      messages.push(assistantText(midText));
+      messages.push(assistantWithToolCall(`t${i}`, 'p'.repeat(10_000)));
+      messages.push(toolResult(`t${i}`, 'p'.repeat(10_000)));
     }
     const out = await transform(messages);
-    // In aggressive mode only the last 3 rounds stay verbatim. Count
-    // assistant toolCall blocks with summarized input.
-    let summarizedCount = 0;
+    let droppedTextCount = 0;
     for (const m of out) {
       if (m.role !== 'assistant') continue;
-      const content = (m as { content: Array<{ type?: string; input?: unknown }> }).content;
+      const content = (m as { content: Array<{ type?: string; text?: string }> }).content;
       for (const c of content) {
-        if (c.type === 'toolCall') {
-          const input = c.input as { _summarized?: boolean } | undefined;
-          if (input?._summarized === true) summarizedCount += 1;
+        if (c.type === 'text' && c.text?.startsWith('[prior assistant output dropped')) {
+          droppedTextCount += 1;
         }
       }
     }
-    expect(summarizedCount).toBeGreaterThanOrEqual(7);
+    // Aggressive cap is 2KB — the 6KB midText blocks should all be stubbed.
+    expect(droppedTextCount).toBeGreaterThanOrEqual(35);
   });
 });
diff --git a/packages/core/src/context-prune.ts b/packages/core/src/context-prune.ts