feat(design-audit): make reference-grounded redesign job-first, not aesthetic-first #124
…esthetic-first
The engine grounded every page in an exemplar's visual DNA and judged on visual craft, so it regressed functional pages into generic brochures (docs lost its table of contents + density for marketing cards; an aggregator dropped 30 items to 9; a dashboard shed services into spacious cards).
- generate/prompt.ts: persona art director -> product designer; hard rules in priority order: task-first, preserve affordances (never delete nav/ToC), preserve density where it is the value, right-size (don't reskin a tool into a landing page), exemplar is craft-only, not a template. Plus a data-driven FUNCTIONAL CONTRACT derived from the page's own DNA (nav count, density, archetype); density is required only when the page is measured dense.
- judge/prompt.ts: score task fitness + functional preservation before visual craft; a polished direction that strips nav or density loses.
Validated by re-running the regressed pages: docs keeps its ToC + nav + dense code; HN keeps all 30 stories; the dashboard stays a dense service grid.
tangletools
left a comment
✅ Auto-approved PR — a31d2d42
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-23T23:01:30Z
tangletools
left a comment
🟡 Value Audit — sound-with-nits
| | |
|---|---|
| Verdict | sound-with-nits |
| Concerns | 2 (1 medium-concern, 1 low) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 117.2s (2 bridge agents) |
| Total | 117.2s |
💰 Value — sound
Reframes the reference-grounded redesign engine from aesthetic-first to job-first via system-prompt rules and a data-driven functional contract, with matching judge priority order; tested, linted, and changeset'd.
- What it does: Changes two prompt builders and their tests. In src/design/audit/reference/generate/prompt.ts, the generator persona shifts from "world-class art director" to "senior product designer", the exemplar is explicitly craft-only (not a structural template), and hard rules are reordered so task fitness and preserving functional affordances/density outrank visual polish. It also adds `renderFunctionalC…`
- Goals it achieves: Stops the redesign engine from regressing functional pages (docs, dashboards, aggregators) into generic, sparse marketing pages when grounded against tasteful but structurally different exemplars. It makes "keep navigation/ToC/density" an explicit, measured constraint rather than an optional aesthetic preference, and re-aligns the LLM judge so it does not reward prettier-but-less-usable directions.
- Assessment: Good change. It is narrowly scoped to the prompt layer, builds directly on existing DesignDNA fields and the pure prompt-builder pattern already in the codebase, adds regression tests, includes a changeset, and passes pnpm lint and pnpm check:boundaries. The data-driven density gating (density === 'dense') avoids forcing sparse pages to stay sparse, which matches the stated problem.
- Better / existing approach: none; this is the right approach. I checked the existing rubric/anchor system (src/design/audit/rubric/anchors/docs.yaml, src/design/audit/rubric/anchors/dashboard.yaml, src/design/audit/rubric/fragments/type-docs.md, src/design/audit/rubric/rollup-weights.ts) and it already encodes page-type priorities, but rubricBody is optional in both GenerationContext and JudgePairInput, so the…
- Model: opencode/kimi-for-coding/k2p7
- Bridge attempts: 1
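To make the intended priority order concrete, here is a minimal sketch that ranks two candidate directions lexicographically: task fitness first, then functional preservation, then visual craft. This is only one strict reading of "score task fitness + functional preservation before visual craft". The PR itself expresses the ordering as prompt rules for an LLM judge, not as code, and the `DirectionScores` shape, the `compareJobFirst` name, and the sample scores are all hypothetical.

```typescript
// Hypothetical per-direction scores; the real judge output shape is not shown in the PR.
interface DirectionScores {
  taskFitness: number;
  functionalPreservation: number;
  visualCraft: number;
}

// Criteria in priority order: later keys only break ties on earlier ones.
const PRIORITY: (keyof DirectionScores)[] = [
  'taskFitness',
  'functionalPreservation',
  'visualCraft',
];

// Comparator for Array.prototype.sort: the better direction sorts first.
function compareJobFirst(a: DirectionScores, b: DirectionScores): number {
  for (const key of PRIORITY) {
    if (a[key] !== b[key]) {
      return b[key] - a[key]; // higher score on the first differing criterion wins
    }
  }
  return 0;
}

// A polished direction that strips nav/density vs. a plainer one that keeps them.
const polished: DirectionScores = { taskFitness: 3, functionalPreservation: 2, visualCraft: 9 };
const faithful: DirectionScores = { taskFitness: 4, functionalPreservation: 5, visualCraft: 6 };

const ranked = [polished, faithful].sort(compareJobFirst);
console.log(ranked[0] === faithful); // true
```

Under this ordering, visual craft can only decide between directions that are tied on the two functional criteria.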
🎯 Usefulness — sound-with-nits
A well-integrated job-first reframe of the redesign+judge prompts that fixes a real AI-judging-AI regression; the one data-driven piece (density gate) is keyed to a signal that misclassifies the cited content-dense pages.
- Integration: Fully reachable. buildDirectionPrompt is consumed by generate/generator.ts:78; buildPairwisePrompt/buildQualityPrompt by judge/text-judge.ts:46-47 and judge/vision-judge.ts:162-163. The new renderFunctionalContract is a private helper invoked inline at prompt.ts:193 — no new public surface, no dead code.
- Fit with existing patterns: Follows the established pure-prompt-builder grain exactly. renderFunctionalContract mirrors the existing renderConstraints/renderExemplarBlock helpers (deterministic, pure, section-joined). All referenced DNA fields (layout.density, layout.archetype, components.nav) exist as required fields on DesignDNA (contracts.ts:271,277,288); Density is 'sparse'|'balanced'|'dense' (contracts.ts:85). No compet
- Real-world viability: Mostly holds: persona reframe, unconditional system-prompt priority rules, nav preservation, and judge priority ordering are signal-independent and will fire on every call. The density gate does not — see finding.
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
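The data-driven contract described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `PageDNA` is a trimmed-down, hypothetical stand-in for `DesignDNA`, `components.nav` is assumed to be an array, and apart from the quoted DENSE directive the contract wording is invented for the example.

```typescript
// Density union as quoted in the review (contracts.ts:85).
type Density = 'sparse' | 'balanced' | 'dense';

// Hypothetical, minimal stand-in for the real DesignDNA contract.
interface PageDNA {
  layout: { density: Density; archetype: string };
  components: { nav: unknown[] }; // assumed to be a list of nav patterns
}

// Pure and deterministic: the same DNA always yields the same contract text.
function renderFunctionalContractSketch(dna: PageDNA): string {
  const lines: string[] = ['FUNCTIONAL CONTRACT'];

  lines.push(`- Archetype: ${dna.layout.archetype}. Keep it this kind of page.`);

  const navCount = dna.components.nav.length;
  if (navCount > 0) {
    lines.push(`- Preserve all ${navCount} navigation pattern(s); never delete nav or ToC.`);
  }

  // Density is demanded only when the page was measured as dense.
  if (dna.layout.density === 'dense') {
    lines.push('- This page is DENSE: keep at least as many items/rows.');
  }

  return lines.join('\n');
}

const docsContract = renderFunctionalContractSketch({
  layout: { density: 'dense', archetype: 'docs' },
  components: { nav: [{}, {}] },
});
const landingContract = renderFunctionalContractSketch({
  layout: { density: 'sparse', archetype: 'landing' },
  components: { nav: [{}] },
});
console.log(docsContract);
console.log(landingContract);
```

Only the first call emits the DENSE line, which is the gating behaviour the finding below examines.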
🔎 Heuristic Signals
🟡 Cruft: todo added src/design/audit/reference/generate/prompt.ts
- ' facts or use placeholders like "TODO" or "lorem ipsum".',
🎯 Usefulness Audit
🟠 Density gate is keyed to component-pattern variety, not information density; it no-ops on the cited regressions [problem-fit]
renderFunctionalContract gates the 'This page is DENSE: keep at least as many items/rows' directive on density === 'dense' (prompt.ts:139). But layout.density is derived at dna/derive.ts:308 as deriveDensity(buttons+inputs+cards+nav) — and deriveDensity (line 109-118) is called WITHOUT whitespaceRatio (hardcoded undefined at derive.ts:343), so it runs purely off distinctComponentCount, which returns set.size of component FINGERPRINTS (line 226) = pattern variety, not instance count. A page needs
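The gap between pattern variety and instance count can be shown with a small sketch. The `ComponentInstance` shape and both helper names are hypothetical; the only thing taken from the finding is that the density signal counts distinct fingerprints rather than instances.

```typescript
// Hypothetical component record; `fingerprint` stands in for whatever
// structural signature the real derivation computes.
interface ComponentInstance {
  fingerprint: string;
}

// Pattern variety: how many distinct fingerprints appear on the page.
function distinctPatternCount(components: ComponentInstance[]): number {
  return new Set(components.map((c) => c.fingerprint)).size;
}

// Instance count: how many components actually appear on the page.
function instanceCount(components: ComponentInstance[]): number {
  return components.length;
}

// A feed with one nav bar and 30 structurally identical story rows.
const feed: ComponentInstance[] = [
  { fingerprint: 'nav-bar' },
  ...Array.from({ length: 30 }, () => ({ fingerprint: 'story-row' })),
];

console.log(distinctPatternCount(feed)); // 2
console.log(instanceCount(feed)); // 31
```

A page like this is information-dense by instance count but looks nearly empty to a variety-based signal, which is why a gate keyed to the latter may not fire on feeds or docs.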
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
Problem
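The reference-grounded redesign engine grounded every page in a world-class exemplar's visual DNA, and the ranker judged visual craft, so it optimized for "looks tasteful," not "serves the user's task." On functional pages, that regressed them into generic brochures:
- docs lost its table of contents and density in exchange for marketing cards
- an aggregator dropped from 30 items to 9
- a dashboard shed services into spacious cards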
A blind LLM-judge panel scored these "redesigns" as decisive wins — because LLM judges share the aesthetic LLMs generate, and the rubric discounted density. The metric was an AI judging AI; the redesigns would make real apps worse.
Fix — job-first, not aesthetic-first
- reference/generate/prompt.ts: persona art director → product designer. Hard rules in priority order: task-first → preserve functional affordances (never delete navigation/ToC/search to look cleaner) → preserve density where it is the value (docs/dashboards/feeds keep their item count) → right-size the intervention (never turn one kind of page into another) → the exemplar is visual craft only, never a structural template.
- reference/judge/prompt.ts: scores task fitness + functional preservation before visual craft; a polished direction that strips nav or density loses.
Validation: re-ran the regressed pages with the fixed engine
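This was not an AI beauty panel but a check of whether the brief now keeps what makes each page work:
- docs keeps its ToC, nav, and dense code
- HN keeps all 30 stories
- the dashboard stays a dense service grid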
Gates: tsc --noEmit clean · check:boundaries pass · full suite 1898 pass (2 pre-existing telemetry-rollup-remote nvm-shim failures, unrelated). Regression tests added across generator + judge.
Changeset: minor.