Feature/root cause grouping#150
Conversation
|
Correct PR TemplatePlease copy and paste the raw template below into your PR description and fill it out: > **Before opening:** make sure there is an issue tracking this work, and link it below. PRs without a linked issue may be closed without review.
## Linked issue
Closes #
## What this PR does
## Type of change
- [ ] Bug fix
- [ ] New feature
- [ ] ML model / training pipeline
- [ ] Refactor (no behaviour change)
- [ ] Documentation
- [ ] Tests only
## ML tier (if applicable)
- [ ] Tier 1 — Triage
- [ ] Tier 2 — Predictive
- [ ] Tier 3 — Autonomous
- [ ] Not ML-related
## Stack affected
- [ ] Backend
- [ ] Frontend
- [ ] Both
---
## Changes
### Backend
-
### Frontend
-
### New dependencies
-
### Database / schema changes
-
---
## Testing
**How did you test this?**
**Checklist**
- [ ] Tested locally end-to-end (upload ZIP or GitHub URL → scan → findings returned correctly)
- [ ] New ML model falls back gracefully when model file is absent
- [ ] No new `console.error` or unhandled Python exceptions introduced
- [ ] Added or updated tests where applicable
- [ ] `requirements.txt` / `package.json` updated if new dependencies added
- [ ] New model files (`.pkl`, `.pt`, etc.) are gitignored, not committed
---
## Anything reviewers should focus on
## Screenshots (if UI changed)
|
arpit2006
left a comment
There was a problem hiding this comment.
@KolaSailaja !
Thanks for the contribution and the detailed implementation effort. After reviewing the changes, I'm requesting changes before this PR can be merged.
❌ Request Changes
While the root-cause grouping feature is a valuable addition, there are several critical issues, regressions, and missing tests that need to be addressed.
🔴 Critical Issues
1. Broken Import Path in engine.py
from ..db import get_dbengine.py resides under ml/root_cause/, so this import resolves to ml/db, which does not exist.
Please update the import to reference the correct module path (e.g. app.db) to avoid runtime ImportErrors.
2. Embedding Model Loading Is Incorrect
_MODEL = joblib.load(MODEL_PATH / "model.joblib")Sentence Transformer models are not loaded using joblib.load(). They should be initialized via:
SentenceTransformer(model_path)Additionally, there is currently no fallback or graceful error handling when the model directory is missing. As written, the /jobs/{job_id}/root-cause-groups endpoint will return a 500 error whenever the model is unavailable.
Please implement proper loading and failure handling.
3. Deprecated AgglomerativeClustering API
AgglomerativeClustering(
affinity="cosine",
...
)The affinity parameter has been deprecated and removed in recent versions of scikit-learn.
Please replace it with:
metric="cosine"to maintain compatibility with supported versions.
4. Unrelated Functional Regressions
This PR introduces several changes that are unrelated to root-cause grouping and remove existing functionality:
- Removes deduplication logic (
DISABLE_DEDUP,DEDUP_EPSILON,deduplicate()) - Removes
ml_scoreschema support and related queries - Removes
raw_finding_countandfinding_countfields from job responses - Removes
PATCH /findings/{finding_id}/status - Removes
FindingStatusUpdate - Disables "Accept Risk" and "Ignore Finding" actions in the UI
- Removes status filtering
- Removes tool filtering controls
These are significant behavioral changes and regressions that fall outside the scope of this feature.
Please restore existing functionality or move these changes into separate, clearly scoped PRs.
5. Unused Root Cause Tables
The schema introduces:
root_cause_groupsroot_cause_group_finding
However, the grouping engine never persists data to these tables.
Groups are recomputed on every request, which makes the new tables effectively unused.
Please either:
- Persist and consume group data from the database, or
- Remove the unused schema additions.
🟡 Moderate Issues
6. Root Cause Inference Is Currently a Stub
The implementation currently derives root causes by selecting the most common words from finding titles:
most_common = Counter(tokens).most_common(3)This frequently produces low-value outputs such as:
SQL Injection SQL
The PR description claims to identify likely root causes and source locations, but the current implementation is closer to keyword frequency analysis.
Please either improve the heuristic or adjust expectations and documentation accordingly.
7. job_id Is Hardcoded
RootCauseGroup(
id=str(uuid.uuid4()),
job_id="",
...
)Groups are created with an empty job_id, and there is no clear point where this value is corrected before being returned.
Please ensure the generated groups contain the correct job identifier.
8. Missing Tests
The feature introduces substantial new functionality but adds no corresponding tests for:
engine.pyclustering.pyembedding_service.py/jobs/{job_id}/root-cause-groups
At the same time, existing tests covering deduplication and finding counts have been removed.
Please add appropriate unit and integration tests before merge.
9. Removal of PATCHPILOT_DB_PATH Support
The previous implementation allowed database path configuration via environment variables:
DB_PATH = os.environ.get("PATCHPILOT_DB_PATH", ...)This has been replaced with a hardcoded path:
DB_PATH = os.path.join(...)This reduces deployment flexibility and appears unrelated to the feature being introduced.
Please restore the existing configuration behavior.
10. Async Conversion Requires Verification
build_evidence_pack() has been converted to an async function.
Please verify that all call sites have been updated to properly await it. If any callers still invoke it synchronously, this will result in incorrect behavior or runtime warnings.
Summary
The root-cause grouping concept is promising, but there are several blocking issues:
Must Fix Before Merge
- Broken imports
- Incorrect embedding model loading
- Deprecated clustering API usage
- Unrelated feature regressions
- Missing persistence despite new schema additions
Should Be Addressed
- Improve root-cause inference quality
- Fix
job_idhandling - Restore removed configuration support
- Add test coverage
- Verify async call paths
Please address the above items and resubmit for review. I'd be happy to take another look once the implementation is updated.
|
@KolaSailaja , also fix the Template to proceed for further review! |
Linked issue
Closes #144
What this PR does
Implements AI-Powered Root Cause Grouping for Security Findings.
This feature analyzes related security findings, clusters them based on shared characteristics, and identifies likely root causes responsible for multiple vulnerabilities. The goal is to help users prioritize fixes by addressing the underlying source rather than remediating findings individually.
The PR introduces:
Type of change
ML tier (if applicable)
Stack affected
Changes
Backend
/jobs/{job_id}/root-cause-groupsFrontend
New dependencies
Database / schema changes
Testing
How did you test this?
Checklist
Anything reviewers should focus on
Screenshots (if UI changed)
N/A