feat: snippet skeletonization + content-aware rendering #17
Ports headroom's AST structure handler + StructureMask idea into the retrieval layer: focus skeletons (signatures + matched lines, bodies elided) so more results fit the token budget. Reversible, content-aware (code/markdown/structured), retrieval-time only, raw-fallback safe. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…llback Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Edit canonical skill_template/SKILL.md + sync all copies; CHANGELOG entry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Goldens gain additive skeletonized/elided_lines fields; budget.py types the compactor as Optional[Callable[[Candidate], Compacted]]; tidy test imports. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d58c231c27
    for t in _TERM_RE.findall(query):
        tl = t.lower()
        if len(tl) >= 3 and tl not in _STOPWORDS:
            out.append(tl)
Preserve matches found through identifier subtokens
When a camelCase query retrieves a snake_case hit via the existing FTS subtoken expansion, this extractor only focus-keeps the unsplit lowercased token (for example refreshAccessToken), so the actual matching body line refresh_access_token can still be elided. That breaks the feature's guarantee that the query-matching line is preserved and makes cross-style identifier searches return skeletons that omit the line that caused the hit; please expand query terms the same way retrieval does before applying focus.
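One possible shape for that fix, as a minimal sketch: split each query term into identifier subtokens and also add their snake_case join, so a camelCase query still focus-keeps a snake_case body line. The regexes, the stopword set, and the helper names `split_identifier` and `focus_terms` are hypothetical stand-ins; the PR's real extractor and the FTS expansion it should mirror are not shown in this thread.

```python
import re

# Hypothetical stand-ins for the PR's _TERM_RE and _STOPWORDS (not shown in full).
_TERM_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_STOPWORDS = frozenset({"the", "and", "for", "with"})
# Splits camelCase / PascalCase runs, acronyms, and digit runs.
_SUBTOKEN_RE = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_identifier(term: str) -> list[str]:
    """Split camelCase or snake_case into lowercased subtokens."""
    parts: list[str] = []
    for chunk in term.split("_"):
        parts.extend(m.lower() for m in _SUBTOKEN_RE.findall(chunk))
    return parts


def focus_terms(query: str) -> list[str]:
    """Return focus terms: the whole token, its subtokens, and the snake_case join."""
    out: list[str] = []
    seen: set[str] = set()
    for t in _TERM_RE.findall(query):
        subs = split_identifier(t)
        candidates = [t.lower(), *subs]
        if len(subs) > 1:
            candidates.append("_".join(subs))
        for c in candidates:
            # Same length and stopword filter as the hunk above, applied per variant.
            if len(c) >= 3 and c not in _STOPWORDS and c not in seen:
                seen.add(c)
                out.append(c)
    return out
```

Keeping individual subtokens such as `token` as focus terms may preserve more lines than intended; whether to keep only the joined variants is a design choice that should follow whatever the retrieval-side expansion actually does.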
        Scans forward (bounded) for the line that opens the body so multi-line
        signatures stay visible; defaults to ``start`` when nothing matches.
        """
        limit = min(end, start + _MAX_SIG_SCAN)
Preserve multi-line signatures past five lines
For declarations whose body opener is more than five lines after the start line, this cap makes _signature_end fall back to start; _classify_code then treats the rest of the declaration as body and elides parameter/type lines. This is common for long typed Python/TS/Go signatures and contradicts the skeleton contract that signatures stay visible, so either scan to the actual body opener within the symbol span or fall back to the raw snippet when the opener is not found.
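A rough sketch of the second option (scan the whole symbol span, and signal a raw fallback when no opener is found). It only handles Python-style declarations, where the body opens on a line ending in `:` once brackets are balanced. The function name `signature_end` and its `None` return are assumptions for illustration, not the PR's actual `_signature_end` contract.

```python
from typing import Optional


def signature_end(lines: list[str], start: int, end: int) -> Optional[int]:
    """Return the index of the line that opens the body, or None if not found.

    Scans the full [start, end] span instead of a fixed five-line window.
    Bracket depth is tracked so that annotations such as ``url: str,`` inside
    the parameter list are not mistaken for the body opener.
    """
    depth = 0
    last = min(end, len(lines) - 1)
    for i in range(start, last + 1):
        # Naive comment stripping; a '#' inside a string literal would fool it.
        code = lines[i].split("#", 1)[0].rstrip()
        for ch in code:
            if ch in "([":
                depth += 1
            elif ch in ")]":
                depth -= 1
        if depth <= 0 and code.endswith(":"):
            return i
    return None  # caller should fall back to the raw snippet
```

A caller would treat `None` as "do not skeletonize this symbol" rather than defaulting to `start`, so parameter and type lines are never classified as body.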
What & why

Ports the core idea behind headroomlabs-ai/headroom, separating structural tokens from compressible ones (its `StructureMask` + AST `code_handler`), into the retrieval layer, adapted for a code-search tool rather than a generic compression proxy.

Retrieval snippets are now focus skeletons: imports/signatures/class headers and the query-matching line are kept, while function/method bodies collapse to a marker like `... 24 lines elided (read 88-134)`. Result: more ranked results fit the same `token_budget`. On a real source file the transform cut a snippet from 2211 → 896 tokens (a 59% reduction) while keeping every signature.

Key adaptations vs. headroom (it's a proxy; this is a retriever):

- `detect_language(path)`: no ML content detector; we already know the path.
- `recommended_reads` / `line_start`-`line_end` remain the expand path.
- `compactor=None` / `--raw` reproduce current output byte-for-byte.

How it works

`pipeline.search` builds a `compactor` (intent → context width; `query` → focus terms) and injects it into `apply_budget`. Per candidate: classify lines → render skeleton → redact → budget on the reduced token estimate. New `skeleton.py` routes code (AST via existing `parse_file`), markdown (headings), and structured config (key lines); everything else is untouched.

Surface

- `skeletonized: bool`, `elided_lines: int`.
- `retrieval.compact_snippets` (default `true`), `retrieval.compact_min_reduction` (default `0.25`): retrieval-time only, no reindex.
- `--raw` (CLI `search`/`explain`) or `raw: true` (MCP `search_code`/`explain_code`).
- `SKILL.md` updated (all synced copies) so the agent interprets the new fields.

Tests

TDD throughout. `tests/test_skeleton.py` (render/classify/focus/guard/fallback/determinism), budget injection, pipeline on/off, config (incl. `config_hash` unchanged), CLI/MCP flag, regenerated CLI+MCP goldens (additive fields only). 440 passed / 2 skipped, 85.38% coverage; ruff + mypy clean.

Design & plan: `docs/superpowers/specs/2026-06-24-snippet-skeletonization-design.md`, `docs/superpowers/plans/2026-06-24-snippet-skeletonization.md`.

🤖 Generated with Claude Code
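For readers skimming the description, the render step and the minimum-reduction guard can be illustrated roughly as follows. This is a sketch under stated assumptions, not the code in `skeleton.py`: the `Compacted` dataclass is a hypothetical stand-in for the type named in `budget.py`, `estimate_tokens` uses a crude four-characters-per-token guess, and the marker's `read` range here covers only the elided lines, which may differ from the PR's actual marker semantics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compacted:
    # Hypothetical stand-in; field names follow the additive result fields.
    text: str
    skeletonized: bool
    elided_lines: int


def estimate_tokens(text: str) -> int:
    """Crude token estimate (assumption): about four characters per token."""
    return max(1, len(text) // 4)


def render_skeleton(
    lines: list[str],
    keep: set[int],
    first_line_no: int = 1,
    min_reduction: float = 0.25,
) -> Compacted:
    """Collapse runs of non-kept lines into one marker; fall back to raw text
    when the skeleton does not save at least ``min_reduction`` of the tokens."""
    raw = "\n".join(lines)
    out: list[str] = []
    elided = 0
    i = 0
    while i < len(lines):
        if i in keep:
            out.append(lines[i])
            i += 1
            continue
        j = i
        while j < len(lines) and j not in keep:
            j += 1
        # Marker reports the run length and its 1-based file line range.
        out.append(
            f"... {j - i} lines elided (read {first_line_no + i}-{first_line_no + j - 1})"
        )
        elided += j - i
        i = j
    text = "\n".join(out)
    reduction = 1 - estimate_tokens(text) / estimate_tokens(raw)
    if elided == 0 or reduction < min_reduction:
        return Compacted(raw, False, 0)  # raw fallback, output unchanged
    return Compacted(text, True, elided)
```

The guard is what keeps short snippets untouched: replacing a single line with a marker usually makes the text longer, so the sketch returns the raw snippet with `skeletonized=False` in that case.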