feat: integrate deduplicator into scan response#113
Conversation
|
|
|
@Praneeth2711 Fix failing checks and PR description. Join our dc server to connect with fellow contributors and mentors. Make sure to star the repo. |
- apply deduplication after scan aggregation - add raw_finding_count and finding_count - support DISABLE_DEDUP - support configurable DEDUP_EPSILON - gracefully skip deduplication when sentence-transformers is unavailable
f43229b to
c2c5dcc
Compare
Linked issueCloses #85 What this PR doesThis PR integrates the semantic embedding-based deduplicator into the Type of change
ML tier (if applicable)
Stack affected
ChangesBackend
Frontend
New dependencies
Database / schema changes
TestingHow did you test this? Added integration tests covering:
Ran the backend test suite locally and verified all tests pass. Checklist
Anything reviewers should focus onPlease focus on the lazy loading of the SentenceTransformer model, deduplication logic, and handling of Screenshots (if UI changed)Not applicable (backend-only change). |
|
@Praneeth2711 Update PR description |
Linked issueCloses # What this PR doesType of change
ML tier (if applicable)
Stack affected
ChangesBackendFrontendNew dependenciesDatabase / schema changesTestingHow did you test this? Checklist
Anything reviewers should focus onScreenshots (if UI changed)@Praneeth2711 Strictly use this template as it is |
|
@Praneeth2711 Please fix failing checks |
|
@Praneeth2711 Fix failing checks and PR desc |
There was a problem hiding this comment.
@Praneeth2711 !
Thanks for the contribution. Overall, the implementation is in good shape, and several aspects were particularly well executed:
- ✅ Deduplication algorithm behavior appears correct.
- ✅ Fallback logic is handled cleanly when embeddings are unavailable.
- ✅ Environment variable handling is sensible and well documented.
- ✅ Test coverage is comprehensive for the introduced functionality.
- ✅ The
download_to_pathfix addresses the previously identified issue correctly.
That said, there are two blocking issues that should be resolved before this PR can be merged.
🔴 1. Missing sentence-transformers Dependency
The implementation relies on sentence-transformers for semantic deduplication, but the dependency is not declared anywhere in requirements.txt.
As a result, in a clean production environment the embedding-based deduplication path will never be available because the library is not installed.
This effectively causes semantic deduplication to silently disable itself outside of development environments.
Required Fix
- Add
sentence-transformersto the project dependencies, or - Explicitly document it as an optional dependency with clear installation instructions and comments explaining the fallback behavior.
At minimum, the dependency relationship should be visible and intentional.
🔴 2. DeduplicatedScanResponse Is Currently Dead Code
DeduplicatedScanResponse is:
- Defined in
models.py - Imported in
main.py
However, it is not currently used by any endpoint.
This leaves an unused model in the codebase and creates uncertainty about the intended API contract.
Required Fix
Please choose one of the following:
Option A — Use It
Wire it into the relevant endpoint, for example:
@app.get(
"/jobs/{job_id}/findings",
response_model=DeduplicatedScanResponse
)if that reflects the intended response structure.
Option B — Remove It
If the model is not yet needed, remove it from the PR and introduce it when an endpoint actually consumes it.
Unused API models tend to become stale over time and create maintenance overhead.
🟡 Minor Suggestion
Narrow the find_spec Monkeypatch Scope
The tests currently monkeypatch:
importlib.util.find_specat a global level.
While this works, it can unintentionally affect unrelated imports during test execution.
A safer approach would be to scope the patch to the module under test:
app.utils.deduplicator.importlib.util.find_specThis keeps the mock isolated and reduces the risk of test interference.
Summary
Must Fix Before Merge
- Add or explicitly document the
sentence-transformersdependency. - Either wire
DeduplicatedScanResponseinto an endpoint or remove it.
Recommended Improvement
- Narrow the scope of the
find_specmonkeypatch in tests.
Once those two blockers are addressed, this PR should be in good shape for approval.
Linked issue
Closes #85
What this PR does
This PR integrates the deduplication logic into the backend scan process. It runs the deduplicator on aggregated findings to collapse duplicate findings based on text embeddings. It supports configuring the deduplication via environment variables (DISABLE_DEDUP and DEDUP_EPSILON) and handles cases where sentence-transformers is not installed.
Type of change
ML tier (if applicable)
Stack affected
Changes
Backend
Frontend
New dependencies
Database / schema changes
Testing
How did you test this?
Tested by running the backend pytest suite, specifically targeting est_scan_dedup.py. Checked that the deduplication runs successfully and filters out duplicate findings, and falls back gracefully when sentence-transformers is unavailable or if DISABLE_DEDUP is enabled.
Checklist
equirements.txt / package.json updated if new dependencies added
Anything reviewers should focus on
Please check the integration within _run_single_scan_task background task.
Screenshots (if UI changed)