feat: Implement standardized SARIF telemetry generation for external platform compliance (#107) by prasiddhi-105 · Pull Request #115 · ionfwsrijan/PatchPilot

prasiddhi-105 · 2026-06-15T13:42:23Z

Before opening: make sure there is an issue tracking this work, and link it below. PRs without a linked issue may be closed without review.

Linked issue

Closes #107

What this PR does

This PR implements a standardized SARIF translation pipeline that processes internal vulnerability metrics into an enterprise-compliant format. It also introduces a dedicated export controller endpoint to dynamically serve the telemetry payload to external clients as a downloadable file attachment. This ensures seamless interoperability with third-party application security compliance dashboards like GitHub Advanced Security and SonarQube.

Type of change

ML tier (if applicable)

Tier 1 — Triage
Tier 2 — Predictive
Tier 3 — Autonomous
Not ML-related

Stack affected

Backend
Frontend
Both

Changes

Backend

Added SarifTranslator class: Ingests standard JSON triage array elements, handles schema-compliant mappings (rule IDs, locations, line/column tracking), and normalizes tool severities to official lowercase SARIF compliance levels (error, warning, note).
Added Export Controller Endpoint: Exposed a FastAPI GET endpoint at /api/export/sarif that executes the parsing pipeline on compilation metrics and streams the output directly as a .sarif file attachment.

Testing

How did you test this?

Verified the pipeline end-to-end using a local validation execution sequence. Tested with mock Semgrep raw JSON logs to ensure the keys mapped flawlessly into the target layout spec (https://json.schemastore.org/sarif-2.1.0.json) and verified the FastAPI route triggers a proper browser file download.

Checklist

Tested locally end-to-end (upload ZIP or GitHub URL → scan → findings returned correctly)
New ML model falls back gracefully when model file is absent
No new console.error or unhandled Python exceptions introduced
Added or updated tests where applicable
requirements.txt / package.json updated if new dependencies added
New model files (.pkl, .pt, etc.) are gitignored, not committed

Anything reviewers should focus on

Reviewers can verify that the severity mapping handling logic cleanly downcases tool levels into valid SARIF levels without throwing key errors on unexpected inputs.

…nfwsrijan#106

prasiddhi-105 · 2026-06-15T13:43:52Z

hii would love to work under your projects and contribute under your guidance, please assign me!!

ionfwsrijan · 2026-06-18T18:22:20Z

@prasiddhi-105 Please fix failing checks

prasiddhi-105 · 2026-06-18T22:58:30Z

@ionfwsrijan yes, I'll do it by today itself

ionfwsrijan · 2026-06-19T05:01:50Z

@prasiddhi-105 Failing checks

arpit2006

❌ Request Changes — PR #115 (SARIF Export)

Thank you for the contribution. I reviewed the implementation in detail. While the SARIF translator itself is a reasonable starting point, this PR introduces several critical regressions affecting security, reliability, and existing ML functionality. These issues must be addressed before the feature can be considered for merge.

🔴 Critical Blockers

1. Zip Slip Protection Removed (`utils/fs.py`)

This PR removes the path traversal protection that was specifically introduced to prevent Zip Slip vulnerabilities during archive extraction.

Previous behavior:

for member_name in z.namelist():
    member_path = (out_path / member_name).resolve()
    if out_path not in member_path.parents and member_path != out_path:
        raise ValueError(f"Zip Slip blocked: ...")

Current behavior:

z.extractall(out_dir)

As a result, a malicious archive can now write files outside the intended extraction directory.

Additionally, test_zipslip.py was removed, eliminating coverage for this security control.

Required Action

Restore Zip Slip validation.
Restore and maintain test coverage.
Ensure archive extraction remains constrained to the target directory.

2. Upload Size Enforcement Removed (`main.py`)

The PR removes existing upload size protections, including:

MAX_UPLOAD_SIZE
MAX_UPLOAD_MB
Content-Length validation
Streamed byte-count enforcement

These safeguards were designed to prevent oversized uploads and resource exhaustion attacks.

The accompanying test coverage (test_upload_security.py) was also removed.

Impact

Unbounded uploads are now accepted.
Increased risk of denial-of-service through large file submissions.
Existing security guarantees are no longer enforced.

Required Action

Restore upload size validation.
Restore associated security tests.

3. ML Deduplication Pipeline Deleted

The PR removes the existing semantic deduplication pipeline:

ml/deduplicator.py
ml/embedder.py
deduplicate()

It also removes:

raw_finding_count
finding_count

tracking and reporting.

This is not related to SARIF export functionality and removes an existing Tier 1 ML capability from the platform.

Impact

Semantic deduplication no longer runs.
Finding metrics become inaccurate.
Existing ML functionality regresses without discussion or migration plan.

Required Action

Restore the deduplication pipeline.
Restore finding count metrics.
Scope SARIF export separately from ML feature removals.

4. Input Validation Removed (`models.py`)

Validation constraints on github_username were removed:

max_length=39
pattern=...

These constraints were previously added as a security hardening measure and align with GitHub username rules.

Removing them reopens the field to arbitrary input.

Required Action

Restore validation constraints.
Preserve existing input hardening behavior.

5. HTTP Timeout Removed (`main.py`)

The configured GitHub API timeout:

httpx.Timeout(30.0, connect=10.0)

has been removed.

Without explicit timeouts, requests may hang indefinitely when GitHub becomes slow or unresponsive.

Impact

Worker exhaustion risk
Degraded reliability
Potential request pileups

Required Action

Restore explicit timeout configuration.

🟡 SARIF Feature Design Issues

6. Static `semgrep_out.json` Dependency

The export endpoint currently reads from:

backend/semgrep_out.json

which appears to be a committed fixture file rather than actual scan output.

This creates several problems:

Results are not tied to a scan.
Different users receive identical data.
Export does not reflect live findings.

Required Action

Generate SARIF from findings stored in the database.
Drive exports using scan/job identifiers.
Exclude raw scanner artifacts from source control when appropriate.

7. Missing `job_id` Context

The endpoint currently resembles:

GET /api/export/sarif

with no identifier specifying which scan to export.

This prevents exporting SARIF for a particular job.

Required Action
Implement a job-specific export path such as:

GET /api/export/sarif/{job_id}

or equivalent.

8. Incomplete SARIF Structure

The generated payload omits:

tool.driver.rules

which should contain metadata for every referenced ruleId.

Many SARIF consumers, including GitHub Advanced Security, expect this section.

Impact

Validation warnings
Potential ingestion failures
Reduced interoperability

Required Action
Populate the rules array based on discovered findings.

9. Test Is Not Integrated with CI

backend/test_sarif.py is currently a standalone script:

Not under the test suite
Not collected by pytest
Depends on local filesystem state

This means CI will not automatically validate the feature.

Required Action

Move tests into the standard test directory.
Convert to pytest-based tests.
Use mocked or fixture-based data.

10. Import Placement Violates Project Conventions

The following import appears near the end of main.py:

from app.sarif_translator import SarifTranslator

rather than alongside the rest of the module imports.

This breaks established project structure and reduces maintainability.

Required Action
Move imports to the top of the file and follow existing conventions.

Summary

Final Verdict

❌ Request Changes

The SARIF translator implementation itself is a reasonable foundation. However, the PR currently introduces multiple security regressions, removes existing ML functionality, weakens reliability guarantees, and couples the export pipeline to a static fixture file.

Before this feature can be merged:

Restore all removed security protections.
Restore ML deduplication and finding metrics.
Reintroduce timeout and validation safeguards.
Decouple SARIF generation from static files.
Implement proper job-based exports and standards-compliant SARIF output.
Add CI-backed automated tests.

Once these blockers are resolved, I'd be happy to review the updated implementation.

prasiddhi-105 added 2 commits June 15, 2026 18:15

feat: implement core delta analysis engine for dependency auditing io…

873ed33

…nfwsrijan#106

Implement SARIF telemetry export

0feca34

github-actions Bot added backend Backend issues feature New feature SSoC26 labels Jun 15, 2026

arpit2006 added needs-work Work needed under-review labels Jun 24, 2026

arpit2006 assigned prasiddhi-105 Jun 24, 2026

arpit2006 requested changes Jun 24, 2026

View reviewed changes

arpit2006 added checks-failing changes-required labels Jun 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Implement standardized SARIF telemetry generation for external platform compliance (#107)#115

feat: Implement standardized SARIF telemetry generation for external platform compliance (#107)#115
prasiddhi-105 wants to merge 2 commits into
ionfwsrijan:mainfrom
prasiddhi-105:feature/sarif-export

prasiddhi-105 commented Jun 15, 2026

Uh oh!

prasiddhi-105 commented Jun 15, 2026

Uh oh!

ionfwsrijan commented Jun 18, 2026

Uh oh!

prasiddhi-105 commented Jun 18, 2026

Uh oh!

ionfwsrijan commented Jun 19, 2026

Uh oh!

arpit2006 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

prasiddhi-105 commented Jun 15, 2026

Linked issue

What this PR does

Type of change

ML tier (if applicable)

Stack affected

Changes

Backend

Testing

Anything reviewers should focus on

Uh oh!

prasiddhi-105 commented Jun 15, 2026

Uh oh!

ionfwsrijan commented Jun 18, 2026

Uh oh!

prasiddhi-105 commented Jun 18, 2026

Uh oh!

ionfwsrijan commented Jun 19, 2026

Uh oh!

arpit2006 left a comment

Choose a reason for hiding this comment

❌ Request Changes — PR #115 (SARIF Export)

🔴 Critical Blockers

1. Zip Slip Protection Removed (utils/fs.py)

2. Upload Size Enforcement Removed (main.py)

3. ML Deduplication Pipeline Deleted

4. Input Validation Removed (models.py)

5. HTTP Timeout Removed (main.py)

🟡 SARIF Feature Design Issues

6. Static semgrep_out.json Dependency

7. Missing job_id Context

8. Incomplete SARIF Structure

9. Test Is Not Integrated with CI

10. Import Placement Violates Project Conventions

Summary

Final Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

1. Zip Slip Protection Removed (`utils/fs.py`)

2. Upload Size Enforcement Removed (`main.py`)

4. Input Validation Removed (`models.py`)

5. HTTP Timeout Removed (`main.py`)

6. Static `semgrep_out.json` Dependency

7. Missing `job_id` Context