feat(scan): add opt-in transitive reference scanning#225

Open: rodboev wants to merge 5 commits into NVIDIA:main from rodboev:pr/transitive-external-reference-scanning

rodboev (Contributor) commented Jun 29, 2026

Summary

Adds opt-in transitive scanning for source references inside scanned skill content, including depth and allow or deny controls for follow-up traversal.

Closes #97

Root cause

The current scan pipeline resolves one input into one local directory, builds context from that tree, and writes the graph result from that single invocation. Recursive mode only repeats that one-hop scan for immediate local child skill directories, and report emitters assume every finding belongs to the directly requested source.

Diff Notes

  • Add a shared transitive traversal module that extracts source-like external references from file_cache, filters out adjacent non-source URLs, canonicalizes source identities, enforces the depth budget, and owns visited-set mutation.
  • Add --transitive, --transitive-depth, and repeated allow or deny prefix controls to skillspector scan, then route both single-skill and recursive multi-skill entrypoints through the shared traversal helper.
  • Match allow or deny prefixes on canonical path boundaries, preserve valid repo names that happen to contain UI-reserved words, and route GitHub archive ZIP links down the file-download path. Seed the shared visited set with the root external source and reuse it across recursive sibling scans so that self-references and sibling overlaps do not trigger redundant rescans.
  • Reuse InputHandler for approved external targets so existing host allowlists, SSRF checks, clone or download handling, and archive protections stay authoritative, including archive URLs on allowed Git hosts.
  • Add transitive_depth and source_url to Finding, preserve the root cleanup path through both the merged and zero-depth transitive paths, and include transitive provenance in baseline fingerprints so direct baselines do not suppress external findings with the same rule, file, and line span.
  • Add focused tests for default one-hop behavior, bounded transitive traversal, circular-reference blocking, non-scannable URL negative space, transitive failure isolation, and report provenance.
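
To illustrate the depth budget and shared visited set described in the notes above, here is a minimal sketch of a bounded breadth-first traversal. It is not the code in this PR: `canonicalize`, `traverse`, and the `get_references` callback are hypothetical names, and the canonical form used here (lowercased host, no scheme, no `.git` suffix, no trailing slash) is only an assumed approximation of what the traversal module does.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urlsplit


def canonicalize(url: str) -> str:
    """Hypothetical canonical identity: lowercased host plus a trimmed path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return f"{parts.netloc.lower()}{path}"


def traverse(
    root: str,
    get_references: Callable[[str], Iterable[str]],
    max_depth: int,
) -> list[tuple[str, int]]:
    """Return (canonical source, depth) pairs for followed references, root excluded."""
    # Seed the visited set with the root so self-references are never rescanned.
    visited = {canonicalize(root)}
    queue = deque([(root, 0)])
    followed: list[tuple[str, int]] = []
    while queue:
        source, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth budget exhausted: do not expand this source
        for ref in get_references(source):
            key = canonicalize(ref)
            if key in visited:
                continue  # circular or overlapping reference
            visited.add(key)
            followed.append((key, depth + 1))
            queue.append((ref, depth + 1))
    return followed


# Toy reference graph standing in for references extracted from scanned files.
REFS = {
    "https://github.com/org/a": ["https://github.com/org/b.git", "https://github.com/org/a/"],
    "https://github.com/org/b.git": ["https://github.com/org/c", "https://GitHub.com/org/a"],
    "https://github.com/org/c": ["https://github.com/org/d"],
}


def lookup(source: str) -> list[str]:
    return REFS.get(source, [])


result = traverse("https://github.com/org/a", lookup, max_depth=2)
```

With a depth of 2, the sketch follows `org/b` at depth 1 and `org/c` at depth 2, skips both references back to `org/a`, and never reaches `org/d`. A depth of 0 follows nothing, which corresponds to the zero-depth transitive path mentioned above.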

Scope

This change is limited to source types already supported by InputHandler, visited-set safety, traversal depth, allow or deny prefix controls, and provenance in reports. It does not add a web crawler, new allowed hosts, or MCP behavior, and it does not change default behavior when --transitive is absent.
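
As a rough illustration of what matching prefixes "on canonical path boundaries" can mean, the sketch below compares whole path segments rather than raw string prefixes. The function names are hypothetical. Deny taking precedence over allow, and an empty allow list meaning "allow everything", are assumptions made for this example and are not confirmed by the PR description.

```python
def matches_prefix(canonical: str, prefix: str) -> bool:
    """True when prefix equals canonical or ends exactly at a '/' boundary in it."""
    prefix = prefix.rstrip("/")
    return canonical == prefix or canonical.startswith(prefix + "/")


def is_allowed(canonical: str, allow: list[str], deny: list[str]) -> bool:
    """Assumed policy: deny wins; an empty allow list permits anything not denied."""
    if any(matches_prefix(canonical, p) for p in deny):
        return False
    return not allow or any(matches_prefix(canonical, p) for p in allow)


same_org = matches_prefix("github.com/org/repo", "github.com/org")
lookalike = matches_prefix("github.com/organization/repo", "github.com/org")
```

Here `same_org` is true and `lookalike` is false: a plain `str.startswith` check would accept `github.com/organization/repo` for the prefix `github.com/org`, which is the kind of mismatch that boundary-aware matching avoids.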

Verification

  • .\.venv\Scripts\python.exe -m pytest tests\unit\test_transitive.py tests\unit\test_cli.py tests\nodes\test_report.py tests\nodes\test_sarif_rules_and_empty_findings.py -v
    (via the required headless command block; 89 passed)
  • uv run ruff check src/ tests/
  • uv run ruff format --check src/ tests/

rodboev added 5 commits June 28, 2026 23:09
Signed-off-by: Rod Boev <rod.boev@gmail.com>
Signed-off-by: Rod Boev <rod.boev@gmail.com>
Signed-off-by: Rod Boev <rod.boev@gmail.com>
Signed-off-by: Rod Boev <rod.boev@gmail.com>
Signed-off-by: Rod Boev <rod.boev@gmail.com>

Development

Successfully merging this pull request may close these issues.

feat: transitive link scanning — follow external repos/URLs referenced inside skill files
