feat: integrate deduplicator into scan response by Praneeth2711 · Pull Request #113 · ionfwsrijan/PatchPilot

Praneeth2711 · 2026-06-15T10:46:11Z

Before opening: make sure there is an issue tracking this work, and link it below. PRs without a linked issue may be closed without review.

Linked issue

Closes #85

What this PR does

This PR integrates the deduplication logic into the backend scan process. It runs the deduplicator on aggregated findings to collapse duplicate findings based on text embeddings. It supports configuring the deduplication via environment variables (DISABLE_DEDUP and DEDUP_EPSILON) and handles cases where sentence-transformers is not installed.

Type of change

ML tier (if applicable)

Tier 1 — Triage
Tier 2 — Predictive
Tier 3 — Autonomous
Not ML-related

Stack affected

Backend
Frontend
Both

Changes

Backend

Modified �ackend/app/main.py to integrate deduplication into the scanning background task before writing to the database.
Read environment variables DISABLE_DEDUP and DEDUP_EPSILON to configure deduplication behavior.
Added test coverage in �ackend/tests/test_scan_dedup.py to verify deduplication with different configurations.

Frontend

New dependencies

Database / schema changes

Testing

How did you test this?

Tested by running the backend pytest suite, specifically targeting est_scan_dedup.py. Checked that the deduplication runs successfully and filters out duplicate findings, and falls back gracefully when sentence-transformers is unavailable or if DISABLE_DEDUP is enabled.

Checklist

Tested locally end-to-end (upload ZIP or GitHub URL → scan → findings returned correctly)
New ML model falls back gracefully when model file is absent
No new console.error or unhandled Python exceptions introduced
Added or updated tests where applicable
[x]
equirements.txt / package.json updated if new dependencies added
New model files (.pkl, .pt, etc.) are gitignored, not committed

Anything reviewers should focus on

Please check the integration within _run_single_scan_task background task.

Screenshots (if UI changed)

github-actions · 2026-06-15T10:46:23Z

⚠️ Automated Check: This PR does not strictly follow the required template. Please ensure you have not deleted any checkboxes or mandatory headings, and that you have written explanations under What this PR does and How did you test this?.

ionfwsrijan · 2026-06-15T10:51:35Z

@Praneeth2711 Fix failing checks and PR description.

Join our dc server to connect with fellow contributors and mentors.

https://discord.gg/FcXuyw2Rs

Make sure to star the repo.

- apply deduplication after scan aggregation - add raw_finding_count and finding_count - support DISABLE_DEDUP - support configurable DEDUP_EPSILON - gracefully skip deduplication when sentence-transformers is unavailable

Praneeth2711 · 2026-06-15T11:17:32Z

ionfwsrijan · 2026-06-15T11:23:14Z

@Praneeth2711 Update PR description

ionfwsrijan · 2026-06-15T11:42:17Z

ionfwsrijan · 2026-06-18T18:21:37Z

@Praneeth2711 Please fix failing checks

…sion issue

ionfwsrijan · 2026-06-21T07:54:17Z

@Praneeth2711 Fix failing checks and PR desc

arpit2006

@Praneeth2711 !
Thanks for the contribution. Overall, the implementation is in good shape, and several aspects were particularly well executed:

✅ Deduplication algorithm behavior appears correct.
✅ Fallback logic is handled cleanly when embeddings are unavailable.
✅ Environment variable handling is sensible and well documented.
✅ Test coverage is comprehensive for the introduced functionality.
✅ The download_to_path fix addresses the previously identified issue correctly.

That said, there are two blocking issues that should be resolved before this PR can be merged.

🔴 1. Missing `sentence-transformers` Dependency

The implementation relies on sentence-transformers for semantic deduplication, but the dependency is not declared anywhere in requirements.txt.

As a result, in a clean production environment the embedding-based deduplication path will never be available because the library is not installed.

This effectively causes semantic deduplication to silently disable itself outside of development environments.

Required Fix

Add sentence-transformers to the project dependencies, or
Explicitly document it as an optional dependency with clear installation instructions and comments explaining the fallback behavior.

At minimum, the dependency relationship should be visible and intentional.

🔴 2. `DeduplicatedScanResponse` Is Currently Dead Code

DeduplicatedScanResponse is:

Defined in models.py
Imported in main.py

However, it is not currently used by any endpoint.

This leaves an unused model in the codebase and creates uncertainty about the intended API contract.

Required Fix

Please choose one of the following:

Option A — Use It

Wire it into the relevant endpoint, for example:

@app.get(
    "/jobs/{job_id}/findings",
    response_model=DeduplicatedScanResponse
)

if that reflects the intended response structure.

Option B — Remove It

If the model is not yet needed, remove it from the PR and introduce it when an endpoint actually consumes it.

Unused API models tend to become stale over time and create maintenance overhead.

🟡 Minor Suggestion

Narrow the `find_spec` Monkeypatch Scope

The tests currently monkeypatch:

importlib.util.find_spec

at a global level.

While this works, it can unintentionally affect unrelated imports during test execution.

A safer approach would be to scope the patch to the module under test:

app.utils.deduplicator.importlib.util.find_spec

This keeps the mock isolated and reduces the risk of test interference.

Summary

Must Fix Before Merge

Add or explicitly document the sentence-transformers dependency.
Either wire DeduplicatedScanResponse into an endpoint or remove it.

Recommended Improvement

Narrow the scope of the find_spec monkeypatch in tests.

Once those two blockers are addressed, this PR should be in good shape for approval.

github-actions Bot added SSoC26 needs-work Work needed labels Jun 15, 2026

feat: integrate deduplicator into scan response

c2c5dcc

- apply deduplication after scan aggregation - add raw_finding_count and finding_count - support DISABLE_DEDUP - support configurable DEDUP_EPSILON - gracefully skip deduplication when sentence-transformers is unavailable

Praneeth2711 force-pushed the feat/integrate-deduplicator-scan-response branch from f43229b to c2c5dcc Compare June 15, 2026 11:04

Praneeth2711 added 2 commits June 15, 2026 16:59

test: fix dedup tests when sentence-transformers is unavailable

178a359

chore: retrigger CI

b687ee6

chore: retrigger guardrails

679ccad

chore: retrigger guardrails

d5573c1

github-actions Bot added backend Backend issues feature New feature labels Jun 21, 2026

fix: resolve merge conflict with upstream main and fix Windows permis…

684cfaf

…sion issue

arpit2006 added merge-conflicts checks-failing labels Jun 24, 2026

arpit2006 requested changes Jun 24, 2026

View reviewed changes

arpit2006 added the changes-required label Jun 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: integrate deduplicator into scan response#113

feat: integrate deduplicator into scan response#113
Praneeth2711 wants to merge 6 commits into
ionfwsrijan:mainfrom
Praneeth2711:feat/integrate-deduplicator-scan-response

Praneeth2711 commented Jun 15, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 15, 2026

Uh oh!

ionfwsrijan commented Jun 15, 2026

Uh oh!

Praneeth2711 commented Jun 15, 2026 •

edited

Loading

Uh oh!

ionfwsrijan commented Jun 15, 2026

Uh oh!

ionfwsrijan commented Jun 15, 2026

Uh oh!

ionfwsrijan commented Jun 18, 2026

Uh oh!

ionfwsrijan commented Jun 21, 2026

Uh oh!

arpit2006 left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Praneeth2711 commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Linked issue

What this PR does

Type of change

ML tier (if applicable)

Stack affected

Changes

Backend

Frontend

New dependencies

Database / schema changes

Testing

Anything reviewers should focus on

Screenshots (if UI changed)

Uh oh!

github-actions Bot commented Jun 15, 2026

Uh oh!

ionfwsrijan commented Jun 15, 2026

Uh oh!

Praneeth2711 commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Linked issue

What this PR does

Type of change

ML tier (if applicable)

Stack affected

Changes

Backend

Frontend

New dependencies

Database / schema changes

Testing

Anything reviewers should focus on

Screenshots (if UI changed)

Uh oh!

ionfwsrijan commented Jun 15, 2026

Uh oh!

ionfwsrijan commented Jun 15, 2026

Linked issue

What this PR does

Type of change

ML tier (if applicable)

Stack affected

Changes

Backend

Frontend

New dependencies

Database / schema changes

Testing

Anything reviewers should focus on

Screenshots (if UI changed)

Uh oh!

ionfwsrijan commented Jun 18, 2026

Uh oh!

ionfwsrijan commented Jun 21, 2026

Uh oh!

arpit2006 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

🔴 1. Missing sentence-transformers Dependency

🔴 2. DeduplicatedScanResponse Is Currently Dead Code

🟡 Minor Suggestion

Narrow the find_spec Monkeypatch Scope

Summary

Must Fix Before Merge

Recommended Improvement

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Praneeth2711 commented Jun 15, 2026 •

edited

Loading

Praneeth2711 commented Jun 15, 2026 •

edited

Loading

arpit2006 left a comment •

edited

Loading

🔴 1. Missing `sentence-transformers` Dependency

🔴 2. `DeduplicatedScanResponse` Is Currently Dead Code

Narrow the `find_spec` Monkeypatch Scope