Skip to content

Feature/root cause grouping#150

Open
KolaSailaja wants to merge 2 commits into
ionfwsrijan:mainfrom
KolaSailaja:feature/root-cause-grouping
Open

Feature/root cause grouping#150
KolaSailaja wants to merge 2 commits into
ionfwsrijan:mainfrom
KolaSailaja:feature/root-cause-grouping

Conversation

@KolaSailaja

@KolaSailaja KolaSailaja commented Jun 22, 2026

Copy link
Copy Markdown

Linked issue

Closes #144

What this PR does

Implements AI-Powered Root Cause Grouping for Security Findings.

This feature analyzes related security findings, clusters them based on shared characteristics, and identifies likely root causes responsible for multiple vulnerabilities. The goal is to help users prioritize fixes by addressing the underlying source rather than remediating findings individually.

The PR introduces:

  • Root Cause Grouping Engine
  • Root Cause Grouping API Endpoint
  • Evidence Pack Integration
  • ML-based clustering workflow
  • Structured root-cause analysis output

Type of change

  • Bug fix
  • New feature
  • ML model / training pipeline
  • Refactor (no behaviour change)
  • Documentation
  • Tests only

ML tier (if applicable)

  • Tier 1 — Triage
  • Tier 2 — Predictive
  • Tier 3 — Autonomous
  • Not ML-related

Stack affected

  • Backend
  • Frontend
  • Both

Changes

Backend

  • Added root-cause grouping engine
  • Added clustering support for findings
  • Added /jobs/{job_id}/root-cause-groups
  • Added evidence pack export support
  • Added root-cause analysis artifacts

Frontend

  • No frontend changes

New dependencies

  • sentence-transformers
  • scikit-learn

Database / schema changes

  • Added root_cause_groups table
  • Added root_cause_group_finding table

Testing

How did you test this?

  • Ran backend locally
  • Generated findings using sample scans
  • Called root-cause endpoint
  • Verified evidence-pack generation
  • Verified grouping output structure

Checklist

  • Tested locally end-to-end
  • New ML model falls back gracefully when model file is absent
  • No new console.error or unhandled Python exceptions introduced
  • Added or updated tests where applicable
  • requirements.txt updated if new dependencies added
  • New model files are gitignored

Anything reviewers should focus on

  • Clustering quality
  • Root-cause inference logic
  • Evidence-pack integration
  • Database persistence behavior

Screenshots (if UI changed)

N/A

@github-actions github-actions Bot added backend Backend issues SSoC26 needs-work Work needed labels Jun 22, 2026
@github-actions

Copy link
Copy Markdown

⚠️ Automated Check: This PR does not strictly follow the required template. Please ensure you have not deleted any checkboxes or mandatory headings, and that you have written explanations under What this PR does and How did you test this?.

Correct PR Template

Please copy and paste the raw template below into your PR description and fill it out:

> **Before opening:** make sure there is an issue tracking this work, and link it below. PRs without a linked issue may be closed without review.

## Linked issue

Closes #

## What this PR does



## Type of change

- [ ] Bug fix
- [ ] New feature
- [ ] ML model / training pipeline
- [ ] Refactor (no behaviour change)
- [ ] Documentation
- [ ] Tests only

## ML tier (if applicable)

- [ ] Tier 1 — Triage
- [ ] Tier 2 — Predictive
- [ ] Tier 3 — Autonomous
- [ ] Not ML-related

## Stack affected

- [ ] Backend
- [ ] Frontend
- [ ] Both

---

## Changes

### Backend



-

### Frontend



-

### New dependencies



-

### Database / schema changes



-

---

## Testing

**How did you test this?**



**Checklist**

- [ ] Tested locally end-to-end (upload ZIP or GitHub URL → scan → findings returned correctly)
- [ ] New ML model falls back gracefully when model file is absent
- [ ] No new `console.error` or unhandled Python exceptions introduced
- [ ] Added or updated tests where applicable
- [ ] `requirements.txt` / `package.json` updated if new dependencies added
- [ ] New model files (`.pkl`, `.pt`, etc.) are gitignored, not committed

---

## Anything reviewers should focus on



## Screenshots (if UI changed)

@arpit2006 arpit2006 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KolaSailaja !
Thanks for the contribution and the detailed implementation effort. After reviewing the changes, I'm requesting changes before this PR can be merged.

❌ Request Changes

While the root-cause grouping feature is a valuable addition, there are several critical issues, regressions, and missing tests that need to be addressed.

🔴 Critical Issues

1. Broken Import Path in engine.py

from ..db import get_db

engine.py resides under ml/root_cause/, so this import resolves to ml/db, which does not exist.

Please update the import to reference the correct module path (e.g. app.db) to avoid runtime ImportErrors.


2. Embedding Model Loading Is Incorrect

_MODEL = joblib.load(MODEL_PATH / "model.joblib")

Sentence Transformer models are not loaded using joblib.load(). They should be initialized via:

SentenceTransformer(model_path)

Additionally, there is currently no fallback or graceful error handling when the model directory is missing. As written, the /jobs/{job_id}/root-cause-groups endpoint will return a 500 error whenever the model is unavailable.

Please implement proper loading and failure handling.


3. Deprecated AgglomerativeClustering API

AgglomerativeClustering(
    affinity="cosine",
    ...
)

The affinity parameter has been deprecated and removed in recent versions of scikit-learn.

Please replace it with:

metric="cosine"

to maintain compatibility with supported versions.


4. Unrelated Functional Regressions

This PR introduces several changes that are unrelated to root-cause grouping and remove existing functionality:

  • Removes deduplication logic (DISABLE_DEDUP, DEDUP_EPSILON, deduplicate())
  • Removes ml_score schema support and related queries
  • Removes raw_finding_count and finding_count fields from job responses
  • Removes PATCH /findings/{finding_id}/status
  • Removes FindingStatusUpdate
  • Disables "Accept Risk" and "Ignore Finding" actions in the UI
  • Removes status filtering
  • Removes tool filtering controls

These are significant behavioral changes and regressions that fall outside the scope of this feature.

Please restore existing functionality or move these changes into separate, clearly scoped PRs.


5. Unused Root Cause Tables

The schema introduces:

  • root_cause_groups
  • root_cause_group_finding

However, the grouping engine never persists data to these tables.

Groups are recomputed on every request, which makes the new tables effectively unused.

Please either:

  • Persist and consume group data from the database, or
  • Remove the unused schema additions.

🟡 Moderate Issues

6. Root Cause Inference Is Currently a Stub

The implementation currently derives root causes by selecting the most common words from finding titles:

most_common = Counter(tokens).most_common(3)

This frequently produces low-value outputs such as:

SQL Injection SQL

The PR description claims to identify likely root causes and source locations, but the current implementation is closer to keyword frequency analysis.

Please either improve the heuristic or adjust expectations and documentation accordingly.


7. job_id Is Hardcoded

RootCauseGroup(
    id=str(uuid.uuid4()),
    job_id="",
    ...
)

Groups are created with an empty job_id, and there is no clear point where this value is corrected before being returned.

Please ensure the generated groups contain the correct job identifier.


8. Missing Tests

The feature introduces substantial new functionality but adds no corresponding tests for:

  • engine.py
  • clustering.py
  • embedding_service.py
  • /jobs/{job_id}/root-cause-groups

At the same time, existing tests covering deduplication and finding counts have been removed.

Please add appropriate unit and integration tests before merge.


9. Removal of PATCHPILOT_DB_PATH Support

The previous implementation allowed database path configuration via environment variables:

DB_PATH = os.environ.get("PATCHPILOT_DB_PATH", ...)

This has been replaced with a hardcoded path:

DB_PATH = os.path.join(...)

This reduces deployment flexibility and appears unrelated to the feature being introduced.

Please restore the existing configuration behavior.


10. Async Conversion Requires Verification

build_evidence_pack() has been converted to an async function.

Please verify that all call sites have been updated to properly await it. If any callers still invoke it synchronously, this will result in incorrect behavior or runtime warnings.


Summary

The root-cause grouping concept is promising, but there are several blocking issues:

Must Fix Before Merge

  • Broken imports
  • Incorrect embedding model loading
  • Deprecated clustering API usage
  • Unrelated feature regressions
  • Missing persistence despite new schema additions

Should Be Addressed

  • Improve root-cause inference quality
  • Fix job_id handling
  • Restore removed configuration support
  • Add test coverage
  • Verify async call paths

Please address the above items and resubmit for review. I'd be happy to take another look once the implementation is updated.

@arpit2006

Copy link
Copy Markdown
Collaborator

@KolaSailaja , also fix the Template to proceed for further review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[ML] AI-Powered Root Cause Grouping for Security Findings

2 participants