Skip to content

feat: integrate deduplicator into scan response#113

Open
Praneeth2711 wants to merge 6 commits into
ionfwsrijan:mainfrom
Praneeth2711:feat/integrate-deduplicator-scan-response
Open

feat: integrate deduplicator into scan response#113
Praneeth2711 wants to merge 6 commits into
ionfwsrijan:mainfrom
Praneeth2711:feat/integrate-deduplicator-scan-response

Conversation

@Praneeth2711

@Praneeth2711 Praneeth2711 commented Jun 15, 2026

Copy link
Copy Markdown

Before opening: make sure there is an issue tracking this work, and link it below. PRs without a linked issue may be closed without review.

Linked issue

Closes #85

What this PR does

This PR integrates the deduplication logic into the backend scan process. It runs the deduplicator on aggregated findings to collapse duplicate findings based on text embeddings. It supports configuring the deduplication via environment variables (DISABLE_DEDUP and DEDUP_EPSILON) and handles cases where sentence-transformers is not installed.

Type of change

  • Bug fix
  • New feature
  • ML model / training pipeline
  • Refactor (no behaviour change)
  • Documentation
  • Tests only

ML tier (if applicable)

  • Tier 1 — Triage
  • Tier 2 — Predictive
  • Tier 3 — Autonomous
  • Not ML-related

Stack affected

  • Backend
  • Frontend
  • Both

Changes

Backend

  • Modified �ackend/app/main.py to integrate deduplication into the scanning background task before writing to the database.
  • Read environment variables DISABLE_DEDUP and DEDUP_EPSILON to configure deduplication behavior.
  • Added test coverage in �ackend/tests/test_scan_dedup.py to verify deduplication with different configurations.

Frontend

New dependencies

Database / schema changes


Testing

How did you test this?

Tested by running the backend pytest suite, specifically targeting est_scan_dedup.py. Checked that the deduplication runs successfully and filters out duplicate findings, and falls back gracefully when sentence-transformers is unavailable or if DISABLE_DEDUP is enabled.

Checklist

  • Tested locally end-to-end (upload ZIP or GitHub URL → scan → findings returned correctly)
  • New ML model falls back gracefully when model file is absent
  • No new console.error or unhandled Python exceptions introduced
  • Added or updated tests where applicable
  • [x]
    equirements.txt / package.json updated if new dependencies added
  • New model files (.pkl, .pt, etc.) are gitignored, not committed

Anything reviewers should focus on

Please check the integration within _run_single_scan_task background task.

Screenshots (if UI changed)

@github-actions

Copy link
Copy Markdown

⚠️ Automated Check: This PR does not strictly follow the required template. Please ensure you have not deleted any checkboxes or mandatory headings, and that you have written explanations under What this PR does and How did you test this?.

@ionfwsrijan

Copy link
Copy Markdown
Owner

@Praneeth2711 Fix failing checks and PR description.

Join our dc server to connect with fellow contributors and mentors.

https://discord.gg/FcXuyw2Rs

Make sure to star the repo.

- apply deduplication after scan aggregation

- add raw_finding_count and finding_count

- support DISABLE_DEDUP

- support configurable DEDUP_EPSILON

- gracefully skip deduplication when sentence-transformers is unavailable
@Praneeth2711 Praneeth2711 force-pushed the feat/integrate-deduplicator-scan-response branch from f43229b to c2c5dcc Compare June 15, 2026 11:04
@Praneeth2711

Praneeth2711 commented Jun 15, 2026

Copy link
Copy Markdown
Author

Linked issue

Closes #85

What this PR does

This PR integrates the semantic embedding-based deduplicator into the /scan pipeline. Findings are deduplicated before being returned by the API. It adds support for DISABLE_DEDUP=true, configurable DEDUP_EPSILON, graceful fallback when optional ML dependencies are unavailable, and stores deduplicated findings in the database to keep downstream endpoints and reports consistent.

Type of change

  • Bug fix
  • New feature
  • ML model / training pipeline
  • Refactor (no behaviour change)
  • Documentation
  • Tests only

ML tier (if applicable)

  • Tier 1 — Triage
  • Tier 2 — Predictive
  • Tier 3 — Autonomous
  • Not ML-related

Stack affected

  • Backend
  • Frontend
  • Both

Changes

Backend

  • Added embedding-based deduplication utility.
  • Added DeduplicatedScanResponse.
  • Added raw_finding_count and finding_count fields to /scan responses.
  • Added support for DISABLE_DEDUP=true.
  • Added support for configurable DEDUP_EPSILON.
  • Gracefully skips deduplication when sentence-transformers is unavailable.
  • Stores deduplicated findings in the database.

Frontend

  • None.

New dependencies

  • None.

Database / schema changes

  • Added DeduplicatedScanResponse response schema.
  • Deduplicated findings are persisted instead of raw findings.

Testing

How did you test this?

Added integration tests covering:

  • Deduplication enabled.
  • DISABLE_DEDUP=true.
  • Missing sentence-transformers.
  • Invalid DEDUP_EPSILON.
  • Backend test suite execution.

Ran the backend test suite locally and verified all tests pass.

Checklist

  • Tested locally end-to-end (upload ZIP or GitHub URL → scan → findings returned correctly)
  • New ML model falls back gracefully when model file is absent
  • No new console.error or unhandled Python exceptions introduced
  • Added or updated tests where applicable
  • requirements.txt / package.json updated if new dependencies added
  • New model files (.pkl, .pt, etc.) are gitignored, not committed

Anything reviewers should focus on

Please focus on the lazy loading of the SentenceTransformer model, deduplication logic, and handling of DISABLE_DEDUP and DEDUP_EPSILON.

Screenshots (if UI changed)

Not applicable (backend-only change).

@ionfwsrijan

Copy link
Copy Markdown
Owner

@Praneeth2711 Update PR description

@ionfwsrijan

Copy link
Copy Markdown
Owner

Before opening: make sure there is an issue tracking this work, and link it below. PRs without a linked issue may be closed without review.

Linked issue

Closes #

What this PR does

Type of change

  • Bug fix
  • New feature
  • ML model / training pipeline
  • Refactor (no behaviour change)
  • Documentation
  • Tests only

ML tier (if applicable)

  • Tier 1 — Triage
  • Tier 2 — Predictive
  • Tier 3 — Autonomous
  • Not ML-related

Stack affected

  • Backend
  • Frontend
  • Both

Changes

Backend

Frontend

New dependencies

Database / schema changes


Testing

How did you test this?

Checklist

  • Tested locally end-to-end (upload ZIP or GitHub URL → scan → findings returned correctly)
  • New ML model falls back gracefully when model file is absent
  • No new console.error or unhandled Python exceptions introduced
  • Added or updated tests where applicable
  • requirements.txt / package.json updated if new dependencies added
  • New model files (.pkl, .pt, etc.) are gitignored, not committed

Anything reviewers should focus on

Screenshots (if UI changed)

@Praneeth2711 Strictly use this template as it is

@ionfwsrijan

Copy link
Copy Markdown
Owner

@Praneeth2711 Please fix failing checks

@github-actions github-actions Bot added backend Backend issues feature New feature labels Jun 21, 2026
@ionfwsrijan

Copy link
Copy Markdown
Owner

@Praneeth2711 Fix failing checks and PR desc

@arpit2006 arpit2006 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Praneeth2711 !
Thanks for the contribution. Overall, the implementation is in good shape, and several aspects were particularly well executed:

  • ✅ Deduplication algorithm behavior appears correct.
  • ✅ Fallback logic is handled cleanly when embeddings are unavailable.
  • ✅ Environment variable handling is sensible and well documented.
  • ✅ Test coverage is comprehensive for the introduced functionality.
  • ✅ The download_to_path fix addresses the previously identified issue correctly.

That said, there are two blocking issues that should be resolved before this PR can be merged.


🔴 1. Missing sentence-transformers Dependency

The implementation relies on sentence-transformers for semantic deduplication, but the dependency is not declared anywhere in requirements.txt.

As a result, in a clean production environment the embedding-based deduplication path will never be available because the library is not installed.

This effectively causes semantic deduplication to silently disable itself outside of development environments.

Required Fix

  • Add sentence-transformers to the project dependencies, or
  • Explicitly document it as an optional dependency with clear installation instructions and comments explaining the fallback behavior.

At minimum, the dependency relationship should be visible and intentional.


🔴 2. DeduplicatedScanResponse Is Currently Dead Code

DeduplicatedScanResponse is:

  • Defined in models.py
  • Imported in main.py

However, it is not currently used by any endpoint.

This leaves an unused model in the codebase and creates uncertainty about the intended API contract.

Required Fix

Please choose one of the following:

Option A — Use It

Wire it into the relevant endpoint, for example:

@app.get(
    "/jobs/{job_id}/findings",
    response_model=DeduplicatedScanResponse
)

if that reflects the intended response structure.

Option B — Remove It

If the model is not yet needed, remove it from the PR and introduce it when an endpoint actually consumes it.

Unused API models tend to become stale over time and create maintenance overhead.


🟡 Minor Suggestion

Narrow the find_spec Monkeypatch Scope

The tests currently monkeypatch:

importlib.util.find_spec

at a global level.

While this works, it can unintentionally affect unrelated imports during test execution.

A safer approach would be to scope the patch to the module under test:

app.utils.deduplicator.importlib.util.find_spec

This keeps the mock isolated and reduces the risk of test interference.


Summary

Must Fix Before Merge

  • Add or explicitly document the sentence-transformers dependency.
  • Either wire DeduplicatedScanResponse into an endpoint or remove it.

Recommended Improvement

  • Narrow the scope of the find_spec monkeypatch in tests.

Once those two blockers are addressed, this PR should be in good shape for approval.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Integrate deduplicator into /scan response

3 participants