feat: snippet skeletonization + content-aware rendering #17

Merged: denfry merged 11 commits into main from feat/snippet-skeletonization on Jun 24, 2026

denfry (Owner) commented on Jun 24, 2026

What & why

Ports the core idea behind headroomlabs-ai/headroom — separating structural tokens from compressible ones (its StructureMask + AST code_handler) — into the retrieval layer, adapted for a code-search tool rather than a generic compression proxy.

Retrieval snippets are now focus skeletons: imports, signatures, class headers, and the query-matching line are kept, while function and method bodies collapse to a marker like ... 24 lines elided (read 88-134). As a result, more ranked results fit in the same token_budget. On a real source file the transform cut a snippet from 2211 to 896 tokens (a 59% reduction) while keeping every signature.
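For illustration, the focus-skeleton idea can be sketched in a few lines. This is a simplified sketch, not the code in skeleton.py: the regex classifier stands in for the AST-based one, render_skeleton and its parameters are hypothetical names, and the read range in the marker is assumed to be the elided run itself.

```python
import re

# Hypothetical, simplified classifier: a line counts as structural if it looks
# like an import, a def/class header, or a decorator. The PR classifies via AST.
_STRUCTURAL_RE = re.compile(r"^\s*(import |from \S+ import |def |async def |class |@)")


def render_skeleton(lines, first_line_no, focus_terms):
    """Keep structural and focus-matching lines; collapse other runs to a marker."""
    out = []
    run_start = None  # absolute line number where the current elided run began
    run_len = 0
    elided = 0

    def flush():
        # Emit one marker for the pending run of elided lines, if any.
        nonlocal run_start, run_len
        if run_len:
            run_end = run_start + run_len - 1
            out.append(f"... {run_len} lines elided (read {run_start}-{run_end})")
        run_start, run_len = None, 0

    for offset, line in enumerate(lines):
        lineno = first_line_no + offset
        lowered = line.lower()
        keep = bool(_STRUCTURAL_RE.match(line)) or any(t in lowered for t in focus_terms)
        if keep:
            flush()
            out.append(line)
        else:
            if run_start is None:
                run_start = lineno
            run_len += 1
            elided += 1
    flush()
    return "\n".join(out), elided
```

The second return value corresponds loosely to the elided_lines result field; the focus terms are expected to be lowercase.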

Key adaptations vs. headroom (it's a proxy; this is a retriever):

  • Focus skeleton: the matched line is never elided (a body line is often the answer).
  • Routing by detect_language(path): no ML content detector; we already know the path.
  • Line-granularity mask: robust for partial window chunks; matches the codebase's line-range idiom.
  • Reversible: recommended_reads / line_start-line_end remain the expand path.
  • Lossless-safe: a savings guard (≥25%) and a tree-sitter→regex→raw fallback chain mean output is never worse than today; compactor=None / --raw reproduce current output byte-for-byte.
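The savings guard and fallback chain from the last bullet could look roughly like the sketch below. It is an assumption-laden illustration: compact_with_guard, estimate_tokens, and the 4-characters-per-token estimate are hypothetical, and whether a skeleton that misses the guard falls through to the next stage or straight to raw is not stated in this description (the sketch falls through).

```python
def estimate_tokens(text):
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


def compact_with_guard(raw, renderers, min_reduction=0.25):
    """Try renderers in order; return the first skeleton that saves enough tokens.

    `renderers` is an ordered list of callables (for example AST-based first,
    then regex-based). If none succeeds, the raw snippet is returned unchanged.
    """
    raw_tokens = estimate_tokens(raw)
    for render in renderers:
        try:
            skeleton = render(raw)
        except Exception:
            continue  # this stage failed; fall through to the next one
        if skeleton is None:
            continue
        saved = 1 - estimate_tokens(skeleton) / raw_tokens
        if saved >= min_reduction:
            return skeleton, True
    return raw, False  # raw fallback: output is never worse than the input
```

The boolean in the returned pair plays the role of the skeletonized flag.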

How it works

pipeline.search builds a compactor (intent → context width; query → focus terms) and injects it into apply_budget. Per candidate: classify lines → render skeleton → redact → budget on the reduced token estimate. New skeleton.py routes code (AST via existing parse_file), markdown (headings), and structured config (key lines); everything else is untouched.
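A minimal sketch of the injection point, assuming simplified shapes for Candidate and Compacted (the real dataclasses are not shown in this description) and omitting the redaction step. The Optional[Callable[[Candidate], Compacted]] type matches what a later commit message says budget.py uses; the stop-at-first-overflow policy is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:  # hypothetical shape
    path: str
    text: str


@dataclass
class Compacted:  # hypothetical shape
    text: str
    skeletonized: bool
    elided_lines: int


Compactor = Optional[Callable[[Candidate], Compacted]]


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


def apply_budget(
    candidates: List[Candidate], token_budget: int, compactor: Compactor = None
) -> List[Compacted]:
    """Admit ranked candidates until the budget is spent, charging the compacted size."""
    results: List[Compacted] = []
    used = 0
    for cand in candidates:
        if compactor is None:
            item = Compacted(cand.text, False, 0)  # raw path, unchanged text
        else:
            item = compactor(cand)
        cost = estimate_tokens(item.text)
        if used + cost > token_budget:
            break
        used += cost
        results.append(item)
    return results
```

Because the budget is charged on the reduced estimate, the same token_budget admits more candidates when the compactor shrinks each snippet.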

Surface

  • New result fields: skeletonized: bool, elided_lines: int.
  • Config: retrieval.compact_snippets (default true), retrieval.compact_min_reduction (default 0.25) — retrieval-time only, no reindex.
  • Disable per-call: --raw (CLI search/explain) or raw: true (MCP search_code/explain_code).
  • SKILL.md updated (all synced copies) so the agent interprets the new fields.

Tests

TDD throughout. tests/test_skeleton.py (render/classify/focus/guard/fallback/determinism), budget injection, pipeline on/off, config (incl. config_hash unchanged), CLI/MCP flag, regenerated CLI+MCP goldens (additive fields only). 440 passed / 2 skipped, 85.38% coverage; ruff + mypy clean.

Design & plan: docs/superpowers/specs/2026-06-24-snippet-skeletonization-design.md, docs/superpowers/plans/2026-06-24-snippet-skeletonization.md.

🤖 Generated with Claude Code

denfry and others added 11 commits June 24, 2026 08:22
Ports headroom's AST structure handler + StructureMask idea into the
retrieval layer: focus skeletons (signatures + matched lines, bodies
elided) so more results fit the token budget. Reversible, content-aware
(code/markdown/structured), retrieval-time only, raw-fallback safe.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…llback

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Edit canonical skill_template/SKILL.md + sync all copies; CHANGELOG entry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Goldens gain additive skeletonized/elided_lines fields; budget.py types the
compactor as Optional[Callable[[Candidate], Compacted]]; tidy test imports.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
denfry merged commit 08dd2ff into main on Jun 24, 2026
10 checks passed
denfry deleted the feat/snippet-skeletonization branch on Jun 24, 2026 06:35

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d58c231c27

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +232 to +235
for t in _TERM_RE.findall(query):
    tl = t.lower()
    if len(tl) >= 3 and tl not in _STOPWORDS:
        out.append(tl)

P2: Preserve matches found through identifier subtokens

When a camelCase query retrieves a snake_case hit via the existing FTS subtoken expansion, this extractor only focus-keeps the unsplit lowercased token (for example refreshAccessToken), so the actual matching body line refresh_access_token can still be elided. That breaks the feature's guarantee that the query-matching line is preserved and makes cross-style identifier searches return skeletons that omit the line that caused the hit; please expand query terms the same way retrieval does before applying focus.
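One possible shape for the suggested fix is sketched below. The patterns for _TERM_RE and _STOPWORDS are placeholders (their real definitions are not shown in this thread), _SUBTOKEN_RE is a hypothetical helper, and the actual FTS subtoken expansion used by retrieval may split identifiers differently.

```python
import re

_TERM_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")  # assumed pattern
_STOPWORDS = frozenset({"the", "and", "for", "with"})  # assumed contents
# Splits camelCase, PascalCase, acronyms, and digit runs; underscores are skipped.
_SUBTOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def focus_terms(query):
    """Return lowercase focus terms, including snake_case and subtoken variants."""
    out = []
    for t in _TERM_RE.findall(query):
        parts = [p.lower() for p in _SUBTOKEN_RE.findall(t)]
        candidates = [t.lower()]
        if len(parts) > 1:
            candidates.append("_".join(parts))  # refreshAccessToken -> refresh_access_token
            candidates.extend(parts)            # plus the individual subtokens
        for c in candidates:
            if len(c) >= 3 and c not in _STOPWORDS and c not in out:
                out.append(c)
    return out
```

Adding the individual subtokens keeps more lines (a term like token matches widely), so a stricter variant might add only the joined snake_case form to protect the savings.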

    Scans forward (bounded) for the line that opens the body so multi-line
    signatures stay visible; defaults to ``start`` when nothing matches.
    """
    limit = min(end, start + _MAX_SIG_SCAN)

P2: Preserve multi-line signatures past five lines

For declarations whose body opener is more than five lines after the start line, this cap makes _signature_end fall back to start; _classify_code then treats the rest of the declaration as body and elides parameter/type lines. This is common for long typed Python/TS/Go signatures and contradicts the skeleton contract that signatures stay visible, so either scan to the actual body opener within the symbol span or fall back to the raw snippet when the opener is not found.
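A sketch of the first option, scanning the whole symbol span and signalling failure so the caller can fall back to the raw snippet. The function name, the zero-based inclusive indices, and the line-ending heuristic for detecting a body opener are all assumptions; the real _signature_end is only partly visible in this thread.

```python
def signature_end(lines, start, end, openers=(":", "{")):
    """Return the index of the line that opens the body, or None if not found.

    Scans the full symbol span [start, end] (zero-based, inclusive) instead of
    a fixed window. The check is a heuristic: a line opens the body if it ends
    with ":" (Python) or "{" (TS/Go), ignoring trailing whitespace.
    """
    last = min(end, len(lines) - 1)
    for i in range(start, last + 1):
        if lines[i].rstrip().endswith(openers):
            return i
    return None  # caller should fall back to the raw snippet
```

Returning None rather than start keeps a missed opener from silently turning parameter lines into elidable body lines.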

denfry mentioned this pull request on Jun 24, 2026