Skip to content

feat: Implement standardized SARIF telemetry generation for external platform compliance (#107)#115

Open
prasiddhi-105 wants to merge 2 commits into
ionfwsrijan:mainfrom
prasiddhi-105:feature/sarif-export
Open

feat: Implement standardized SARIF telemetry generation for external platform compliance (#107)#115
prasiddhi-105 wants to merge 2 commits into
ionfwsrijan:mainfrom
prasiddhi-105:feature/sarif-export

Conversation

@prasiddhi-105

Copy link
Copy Markdown

Before opening: make sure there is an issue tracking this work, and link it below. PRs without a linked issue may be closed without review.

Linked issue

Closes #107

What this PR does

This PR implements a standardized SARIF translation pipeline that processes internal vulnerability metrics into an enterprise-compliant format. It also introduces a dedicated export controller endpoint to dynamically serve the telemetry payload to external clients as a downloadable file attachment. This ensures seamless interoperability with third-party application security compliance dashboards like GitHub Advanced Security and SonarQube.

Type of change

  • Bug fix
  • New feature
  • ML model / training pipeline
  • Refactor (no behaviour change)
  • Documentation
  • Tests only

ML tier (if applicable)

  • Tier 1 — Triage
  • Tier 2 — Predictive
  • Tier 3 — Autonomous
  • Not ML-related

Stack affected

  • Backend
  • Frontend
  • Both

Changes

Backend

  • Added SarifTranslator class: Ingests standard JSON triage array elements, handles schema-compliant mappings (rule IDs, locations, line/column tracking), and normalizes tool severities to official lowercase SARIF compliance levels (error, warning, note).
  • Added Export Controller Endpoint: Exposed a FastAPI GET endpoint at /api/export/sarif that executes the parsing pipeline on compilation metrics and streams the output directly as a .sarif file attachment.

Testing

How did you test this?

Verified the pipeline end-to-end using a local validation execution sequence. Tested with mock Semgrep raw JSON logs to ensure the keys mapped flawlessly into the target layout spec (https://json.schemastore.org/sarif-2.1.0.json) and verified the FastAPI route triggers a proper browser file download.

Checklist

  • Tested locally end-to-end (upload ZIP or GitHub URL → scan → findings returned correctly)
  • New ML model falls back gracefully when model file is absent
  • No new console.error or unhandled Python exceptions introduced
  • Added or updated tests where applicable
  • requirements.txt / package.json updated if new dependencies added
  • New model files (.pkl, .pt, etc.) are gitignored, not committed

Anything reviewers should focus on

Reviewers can verify that the severity mapping handling logic cleanly downcases tool levels into valid SARIF levels without throwing key errors on unexpected inputs.

@github-actions github-actions Bot added backend Backend issues feature New feature SSoC26 labels Jun 15, 2026
@prasiddhi-105

Copy link
Copy Markdown
Author

hii would love to work under your projects and contribute under your guidance, please assign me!!

@ionfwsrijan

Copy link
Copy Markdown
Owner

@prasiddhi-105 Please fix failing checks

@prasiddhi-105

Copy link
Copy Markdown
Author

@ionfwsrijan yes, I'll do it by today itself

@ionfwsrijan

Copy link
Copy Markdown
Owner

@prasiddhi-105 Failing checks

@arpit2006 arpit2006 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Request Changes — PR #115 (SARIF Export)

Thank you for the contribution. I reviewed the implementation in detail. While the SARIF translator itself is a reasonable starting point, this PR introduces several critical regressions affecting security, reliability, and existing ML functionality. These issues must be addressed before the feature can be considered for merge.


🔴 Critical Blockers

1. Zip Slip Protection Removed (utils/fs.py)

This PR removes the path traversal protection that was specifically introduced to prevent Zip Slip vulnerabilities during archive extraction.

Previous behavior:

for member_name in z.namelist():
    member_path = (out_path / member_name).resolve()
    if out_path not in member_path.parents and member_path != out_path:
        raise ValueError(f"Zip Slip blocked: ...")

Current behavior:

z.extractall(out_dir)

As a result, a malicious archive can now write files outside the intended extraction directory.

Additionally, test_zipslip.py was removed, eliminating coverage for this security control.

Required Action

  • Restore Zip Slip validation.

  • Restore and maintain test coverage.

  • Ensure archive extraction remains constrained to the target directory.


2. Upload Size Enforcement Removed (main.py)

The PR removes existing upload size protections, including:

  • MAX_UPLOAD_SIZE

  • MAX_UPLOAD_MB

  • Content-Length validation

  • Streamed byte-count enforcement

These safeguards were designed to prevent oversized uploads and resource exhaustion attacks.

The accompanying test coverage (test_upload_security.py) was also removed.

Impact

  • Unbounded uploads are now accepted.

  • Increased risk of denial-of-service through large file submissions.

  • Existing security guarantees are no longer enforced.

Required Action

  • Restore upload size validation.

  • Restore associated security tests.


3. ML Deduplication Pipeline Deleted

The PR removes the existing semantic deduplication pipeline:

ml/deduplicator.py
ml/embedder.py
deduplicate()

It also removes:

raw_finding_count
finding_count

tracking and reporting.

This is not related to SARIF export functionality and removes an existing Tier 1 ML capability from the platform.

Impact

  • Semantic deduplication no longer runs.

  • Finding metrics become inaccurate.

  • Existing ML functionality regresses without discussion or migration plan.

Required Action

  • Restore the deduplication pipeline.

  • Restore finding count metrics.

  • Scope SARIF export separately from ML feature removals.


4. Input Validation Removed (models.py)

Validation constraints on github_username were removed:

max_length=39
pattern=...

These constraints were previously added as a security hardening measure and align with GitHub username rules.

Removing them reopens the field to arbitrary input.

Required Action

  • Restore validation constraints.

  • Preserve existing input hardening behavior.


5. HTTP Timeout Removed (main.py)

The configured GitHub API timeout:

httpx.Timeout(30.0, connect=10.0)

has been removed.

Without explicit timeouts, requests may hang indefinitely when GitHub becomes slow or unresponsive.

Impact

  • Worker exhaustion risk

  • Degraded reliability

  • Potential request pileups

Required Action

  • Restore explicit timeout configuration.


🟡 SARIF Feature Design Issues

6. Static semgrep_out.json Dependency

The export endpoint currently reads from:

backend/semgrep_out.json

which appears to be a committed fixture file rather than actual scan output.

This creates several problems:

  • Results are not tied to a scan.

  • Different users receive identical data.

  • Export does not reflect live findings.

Required Action

  • Generate SARIF from findings stored in the database.

  • Drive exports using scan/job identifiers.

  • Exclude raw scanner artifacts from source control when appropriate.


7. Missing job_id Context

The endpoint currently resembles:

GET /api/export/sarif

with no identifier specifying which scan to export.

This prevents exporting SARIF for a particular job.

Required Action
Implement a job-specific export path such as:

GET /api/export/sarif/{job_id}

or equivalent.


8. Incomplete SARIF Structure

The generated payload omits:

tool.driver.rules

which should contain metadata for every referenced ruleId.

Many SARIF consumers, including GitHub Advanced Security, expect this section.

Impact

  • Validation warnings

  • Potential ingestion failures

  • Reduced interoperability

Required Action
Populate the rules array based on discovered findings.


9. Test Is Not Integrated with CI

backend/test_sarif.py is currently a standalone script:

  • Not under the test suite

  • Not collected by pytest

  • Depends on local filesystem state

This means CI will not automatically validate the feature.

Required Action

  • Move tests into the standard test directory.

  • Convert to pytest-based tests.

  • Use mocked or fixture-based data.


10. Import Placement Violates Project Conventions

The following import appears near the end of main.py:

from app.sarif_translator import SarifTranslator

rather than alongside the rest of the module imports.

This breaks established project structure and reduces maintainability.

Required Action
Move imports to the top of the file and follow existing conventions.


Summary

Category | Count -- | -- 🔴 Security Regressions | 3 🔴 ML Feature Regressions | 1 🔴 Reliability Regressions | 1 🟡 SARIF Design / Architecture Issues | 5

Final Verdict

❌ Request Changes

The SARIF translator implementation itself is a reasonable foundation. However, the PR currently introduces multiple security regressions, removes existing ML functionality, weakens reliability guarantees, and couples the export pipeline to a static fixture file.

Before this feature can be merged:

  1. Restore all removed security protections.

  2. Restore ML deduplication and finding metrics.

  3. Reintroduce timeout and validation safeguards.

  4. Decouple SARIF generation from static files.

  5. Implement proper job-based exports and standards-compliant SARIF output.

  6. Add CI-backed automated tests.

Once these blockers are resolved, I'd be happy to review the updated implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Standardized SARIF telemetry generation for external platform compliance

3 participants