feat: Implement standardized SARIF telemetry generation for external platform compliance (#107)#115
Conversation
|
hii would love to work under your projects and contribute under your guidance, please assign me!! |
|
@prasiddhi-105 Please fix failing checks |
|
@ionfwsrijan yes, I'll do it by today itself |
|
@prasiddhi-105 Failing checks |
arpit2006
left a comment
There was a problem hiding this comment.
❌ Request Changes — PR #115 (SARIF Export)
Thank you for the contribution. I reviewed the implementation in detail. While the SARIF translator itself is a reasonable starting point, this PR introduces several critical regressions affecting security, reliability, and existing ML functionality. These issues must be addressed before the feature can be considered for merge.
🔴 Critical Blockers
1. Zip Slip Protection Removed (utils/fs.py)
This PR removes the path traversal protection that was specifically introduced to prevent Zip Slip vulnerabilities during archive extraction.
Previous behavior:
for member_name in z.namelist():
member_path = (out_path / member_name).resolve()
if out_path not in member_path.parents and member_path != out_path:
raise ValueError(f"Zip Slip blocked: ...")
Current behavior:
z.extractall(out_dir)
As a result, a malicious archive can now write files outside the intended extraction directory.
Additionally, test_zipslip.py was removed, eliminating coverage for this security control.
Required Action
Restore Zip Slip validation.
Restore and maintain test coverage.
Ensure archive extraction remains constrained to the target directory.
2. Upload Size Enforcement Removed (main.py)
The PR removes existing upload size protections, including:
MAX_UPLOAD_SIZEMAX_UPLOAD_MBContent-Length validation
Streamed byte-count enforcement
These safeguards were designed to prevent oversized uploads and resource exhaustion attacks.
The accompanying test coverage (test_upload_security.py) was also removed.
Impact
Unbounded uploads are now accepted.
Increased risk of denial-of-service through large file submissions.
Existing security guarantees are no longer enforced.
Required Action
Restore upload size validation.
Restore associated security tests.
3. ML Deduplication Pipeline Deleted
The PR removes the existing semantic deduplication pipeline:
ml/deduplicator.py
ml/embedder.py
deduplicate()
It also removes:
raw_finding_count
finding_count
tracking and reporting.
This is not related to SARIF export functionality and removes an existing Tier 1 ML capability from the platform.
Impact
Semantic deduplication no longer runs.
Finding metrics become inaccurate.
Existing ML functionality regresses without discussion or migration plan.
Required Action
Restore the deduplication pipeline.
Restore finding count metrics.
Scope SARIF export separately from ML feature removals.
4. Input Validation Removed (models.py)
Validation constraints on github_username were removed:
max_length=39
pattern=...
These constraints were previously added as a security hardening measure and align with GitHub username rules.
Removing them reopens the field to arbitrary input.
Required Action
Restore validation constraints.
Preserve existing input hardening behavior.
5. HTTP Timeout Removed (main.py)
The configured GitHub API timeout:
httpx.Timeout(30.0, connect=10.0)
has been removed.
Without explicit timeouts, requests may hang indefinitely when GitHub becomes slow or unresponsive.
Impact
Worker exhaustion risk
Degraded reliability
Potential request pileups
Required Action
Restore explicit timeout configuration.
🟡 SARIF Feature Design Issues
6. Static semgrep_out.json Dependency
The export endpoint currently reads from:
backend/semgrep_out.json
which appears to be a committed fixture file rather than actual scan output.
This creates several problems:
Results are not tied to a scan.
Different users receive identical data.
Export does not reflect live findings.
Required Action
Generate SARIF from findings stored in the database.
Drive exports using scan/job identifiers.
Exclude raw scanner artifacts from source control when appropriate.
7. Missing job_id Context
The endpoint currently resembles:
GET /api/export/sarif
with no identifier specifying which scan to export.
This prevents exporting SARIF for a particular job.
Required Action
Implement a job-specific export path such as:
GET /api/export/sarif/{job_id}
or equivalent.
8. Incomplete SARIF Structure
The generated payload omits:
tool.driver.rules
which should contain metadata for every referenced ruleId.
Many SARIF consumers, including GitHub Advanced Security, expect this section.
Impact
Validation warnings
Potential ingestion failures
Reduced interoperability
Required Action
Populate the rules array based on discovered findings.
9. Test Is Not Integrated with CI
backend/test_sarif.py is currently a standalone script:
Not under the test suite
Not collected by pytest
Depends on local filesystem state
This means CI will not automatically validate the feature.
Required Action
Move tests into the standard test directory.
Convert to pytest-based tests.
Use mocked or fixture-based data.
10. Import Placement Violates Project Conventions
The following import appears near the end of main.py:
from app.sarif_translator import SarifTranslator
rather than alongside the rest of the module imports.
This breaks established project structure and reduces maintainability.
Required Action
Move imports to the top of the file and follow existing conventions.
Summary
Category | Count -- | -- 🔴 Security Regressions | 3 🔴 ML Feature Regressions | 1 🔴 Reliability Regressions | 1 🟡 SARIF Design / Architecture Issues | 5Final Verdict
❌ Request Changes
The SARIF translator implementation itself is a reasonable foundation. However, the PR currently introduces multiple security regressions, removes existing ML functionality, weakens reliability guarantees, and couples the export pipeline to a static fixture file.
Before this feature can be merged:
Restore all removed security protections.
Restore ML deduplication and finding metrics.
Reintroduce timeout and validation safeguards.
Decouple SARIF generation from static files.
Implement proper job-based exports and standards-compliant SARIF output.
Add CI-backed automated tests.
Once these blockers are resolved, I'd be happy to review the updated implementation.
Linked issue
Closes #107
What this PR does
This PR implements a standardized SARIF translation pipeline that processes internal vulnerability metrics into an enterprise-compliant format. It also introduces a dedicated export controller endpoint to dynamically serve the telemetry payload to external clients as a downloadable file attachment. This ensures seamless interoperability with third-party application security compliance dashboards like GitHub Advanced Security and SonarQube.
Type of change
ML tier (if applicable)
Stack affected
Changes
Backend
SarifTranslatorclass: Ingests standard JSON triage array elements, handles schema-compliant mappings (rule IDs, locations, line/column tracking), and normalizes tool severities to official lowercase SARIF compliance levels (error,warning,note)./api/export/sarifthat executes the parsing pipeline on compilation metrics and streams the output directly as a.sariffile attachment.Testing
How did you test this?
Verified the pipeline end-to-end using a local validation execution sequence. Tested with mock Semgrep raw JSON logs to ensure the keys mapped flawlessly into the target layout spec (
https://json.schemastore.org/sarif-2.1.0.json) and verified the FastAPI route triggers a proper browser file download.Checklist
console.erroror unhandled Python exceptions introducedrequirements.txt/package.jsonupdated if new dependencies added.pkl,.pt, etc.) are gitignored, not committedAnything reviewers should focus on
Reviewers can verify that the severity mapping handling logic cleanly downcases tool levels into valid SARIF levels without throwing key errors on unexpected inputs.