Conversation
📝 WalkthroughWalkthroughRefactors KB origin enrichment to a bulk scan-job flow: adds a KB client that submits file hashes and polls for results, updates SourceItem to accept a kb_origin_urls mapping and apply origin info, and modifies CLI to collect hashes and fetch/apply origins in bulk. ChangesKB Scan Job Integration
Sequence Diagram(s)sequenceDiagram
participant CLI as CLI:merge_results
participant Collector as _collect_kb_file_hashes
participant KBClient as fetch_origin_urls_via_scan_job
participant KBServer as KB Scan Job API
participant Item as SourceItem.set_oss_item
CLI->>Collector: walk files, compute MD5 hashes
Collector-->>CLI: hashes + extra candidates
CLI->>KBClient: submit hashes for bulk lookup
KBClient->>KBServer: POST /scan/jobs (hash list)
KBServer-->>KBClient: returns job_id
KBClient->>KBServer: poll GET /scan/jobs/{job_id}
KBServer-->>KBClient: job status/results (rows)
KBClient-->>CLI: file_hash -> origin_url mapping
CLI->>Item: set_oss_item(kb_origin_urls=mapping)
Item->>Item: _apply_kb_origin_url for each matched hash
Item-->>CLI: enriched OssItem instances
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/fosslight_source/_kb_client.py (1)
73-75: 💤 Low valueConsider catching more specific exceptions instead of bare
Exception.The
(json.JSONDecodeError, Exception)catch effectively catches everything. This is intentional for network resilience, but could mask unexpected programming errors. Consider catchingOSError(covers socket/connection issues) alongsidejson.JSONDecodeErrorfor clearer intent.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/fosslight_source/_kb_client.py` around lines 73 - 75, Replace the overly-broad except (json.JSONDecodeError, Exception) with a narrower exception tuple to avoid masking programming errors: catch json.JSONDecodeError and OSError (or OSError plus any specific network library exceptions you use) instead, i.e. change the except to except (json.JSONDecodeError, OSError) as e:, keep the existing logger.warning(f"KB scan job create failed: {e}") and return {} so only JSON and OS/network issues are swallowed while other exceptions propagate.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/fosslight_source/cli.py`:
- Around line 440-446: The scancode_paths set is unused dead code: remove the
creation and any updates to scancode_paths in the block that checks run_kb
(i.e., delete the line creating scancode_paths and the scancode_paths.add(...)
call) and keep the existing logic that iterates extra_candidates and appends
extra_item to scancode_result when extra_item.download_location is present; note
that duplicate filtering is already handled by _collect_kb_file_hashes, so no
replacement logic is necessary.
---
Nitpick comments:
In `@src/fosslight_source/_kb_client.py`:
- Around line 73-75: Replace the overly-broad except (json.JSONDecodeError,
Exception) with a narrower exception tuple to avoid masking programming errors:
catch json.JSONDecodeError and OSError (or OSError plus any specific network
library exceptions you use) instead, i.e. change the except to except
(json.JSONDecodeError, OSError) as e:, keep the existing logger.warning(f"KB
scan job create failed: {e}") and return {} so only JSON and OS/network issues
are swallowed while other exceptions propagate.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d9e026aa-46e8-4156-9780-2acc625081c3
📒 Files selected for processing (3)
src/fosslight_source/_kb_client.pysrc/fosslight_source/_scan_item.pysrc/fosslight_source/cli.py
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/fosslight_source/_kb_client.py`:
- Around line 82-92: The code calls _estimate_job_wait_timeout(int(accepted))
but accepted can be None or non-numeric; validate and coerce accepted before
using it. Replace direct int(accepted) with a small coercion block: read
accepted = created.get("accepted", created.get("total", len(unique_hashes))); if
accepted is not an int, try to convert with int(...) inside a try/except
catching ValueError/TypeError and fall back to created.get("total",
len(unique_hashes)) or 0; use that safe integer (e.g., accepted_count) when
calling _estimate_job_wait_timeout, and optionally log a warning if coercion
failed. Ensure this logic is applied where accepted is used (the logger, skipped
handling, and deadline calculation) so _estimate_job_wait_timeout receives a
valid integer.
- Around line 24-43: The _kb_request function currently builds a URL from
caller-controlled kb_url and calls urllib.request.urlopen without validating the
scheme; parse kb_url (use urllib.parse.urlparse) and enforce scheme in
("http","https") before creating the Request or calling urlopen, raising a
ValueError (or returning an error) for non-HTTP(S) schemes. Also update
fetch_origin_urls_via_scan_job to guard the int(accepted) conversion by
validating accepted is not None and numeric (wrap conversion in try/except
catching TypeError/ValueError) and handle invalid values gracefully
(log/raise/return early) instead of allowing an uncaught exception to abort
polling.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fdd9a941-d65c-4c99-8eea-2dbfed3537b6
📒 Files selected for processing (3)
src/fosslight_source/_kb_client.pysrc/fosslight_source/_scan_item.pysrc/fosslight_source/cli.py
🚧 Files skipped from review as they are similar to previous changes (2)
- src/fosslight_source/cli.py
- src/fosslight_source/_scan_item.py
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/fosslight_source/_scan_item.py (1)
181-197:⚠️ Potential issue | 🟠 Major | ⚡ Quick winApply KB metadata before the
download_locationbranch.Because Line 181 short-circuits on existing
self.download_location, the KB lookup on Lines 189-197 never runs for those items. That leaveskb_origin_urlandkb_evidenceempty, soget_kb_reference_to_print()insrc/fosslight_source/cli.py, Lines 304-323 drops them from the KB reference output.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/fosslight_source/_scan_item.py` around lines 181 - 197, The KB-origin lookup must run before short-circuiting on self.download_location so KB metadata (kb_origin_url/kb_evidence) is applied to all OssItem instances; move the md5/hash and kb_origin_urls lookup (the logic that uses self._cached_kb_md5, self._get_hash(path_to_scan), kb_origin_urls.get(...), and self._apply_kb_origin_url(...)) to execute prior to the branch that iterates self.download_location, then use the resulting oss_name/oss_version/download_url when constructing OssItem inside both the download_location loop and the fallback branch (while preserving the existing behavior for multiple download URLs and respecting self.is_license_text).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/fosslight_source/_kb_client.py`:
- Around line 102-105: The polling logic should return immediately when the
server accepted zero hashes to avoid a no-op wait; after computing/issuing the
existing warning for skipped (variable skipped) and before computing deadline or
entering the poll loop (before deadline = time.monotonic() +
_estimate_job_wait_timeout(accepted)), add an early return when accepted == 0
(e.g., return an empty dict) so the function exits instead of waiting the
minimum timeout; locate this change in the same function that uses variables
skipped and accepted in _kb_client.py.
In `@src/fosslight_source/_scan_item.py`:
- Around line 189-193: The code currently checks "if kb_origin_urls is not None
and not self.is_license_text" which still runs _get_hash when kb_origin_urls is
an empty dict; change the condition to only run the KB-related hashing when the
KB map is non-empty (e.g. use "if kb_origin_urls and not self.is_license_text"
or bool(kb_origin_urls)) so that _cached_kb_md5/_get_hash(path_to_scan) are
skipped when kb_origin_urls is {} and avoid unnecessary file reads; update the
check around _cached_kb_md5 and _get_hash accordingly.
---
Outside diff comments:
In `@src/fosslight_source/_scan_item.py`:
- Around line 181-197: The KB-origin lookup must run before short-circuiting on
self.download_location so KB metadata (kb_origin_url/kb_evidence) is applied to
all OssItem instances; move the md5/hash and kb_origin_urls lookup (the logic
that uses self._cached_kb_md5, self._get_hash(path_to_scan),
kb_origin_urls.get(...), and self._apply_kb_origin_url(...)) to execute prior to
the branch that iterates self.download_location, then use the resulting
oss_name/oss_version/download_url when constructing OssItem inside both the
download_location loop and the fallback branch (while preserving the existing
behavior for multiple download URLs and respecting self.is_license_text).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2686cb27-b7fd-4076-8862-ce68167de03c
📒 Files selected for processing (3)
src/fosslight_source/_kb_client.pysrc/fosslight_source/_scan_item.pysrc/fosslight_source/cli.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/fosslight_source/cli.py
| if skipped: | ||
| logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server") | ||
|
|
||
| deadline = time.monotonic() + _estimate_job_wait_timeout(accepted) |
There was a problem hiding this comment.
Return early when the server accepts zero hashes.
When Line 103 reports skipped hashes and Line 105 sees accepted == 0, there is nothing left to fetch, but this still enters the poll loop with the 300-second minimum wait. Returning {} immediately avoids a no-op wait on fully skipped/rate-limited jobs.
Proposed fix
if skipped:
logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server")
+ if accepted == 0:
+ return {}
deadline = time.monotonic() + _estimate_job_wait_timeout(accepted)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if skipped: | |
| logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server") | |
| deadline = time.monotonic() + _estimate_job_wait_timeout(accepted) | |
| if skipped: | |
| logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server") | |
| if accepted == 0: | |
| return {} | |
| deadline = time.monotonic() + _estimate_job_wait_timeout(accepted) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/fosslight_source/_kb_client.py` around lines 102 - 105, The polling logic
should return immediately when the server accepted zero hashes to avoid a no-op
wait; after computing/issuing the existing warning for skipped (variable
skipped) and before computing deadline or entering the poll loop (before
deadline = time.monotonic() + _estimate_job_wait_timeout(accepted)), add an
early return when accepted == 0 (e.g., return an empty dict) so the function
exits instead of waiting the minimum timeout; locate this change in the same
function that uses variables skipped and accepted in _kb_client.py.
| if kb_origin_urls is not None and not self.is_license_text: | ||
| md5_hash = getattr(self, "_cached_kb_md5", "") | ||
| if not md5_hash: | ||
| md5_hash, _wfp = self._get_hash(path_to_scan) | ||
| if md5_hash: |
There was a problem hiding this comment.
Don't hash files when the KB map is empty.
Line 189 checks is not None, but merge_results() in src/fosslight_source/cli.py, Lines 426-443 always passes a dict. That means {} still triggers _get_hash() for every non-license file even when KB is disabled or the bulk lookup returned no matches, adding an unnecessary file-read/WFP pass to normal scans.
Proposed fix
- if kb_origin_urls is not None and not self.is_license_text:
+ if kb_origin_urls and not self.is_license_text:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if kb_origin_urls is not None and not self.is_license_text: | |
| md5_hash = getattr(self, "_cached_kb_md5", "") | |
| if not md5_hash: | |
| md5_hash, _wfp = self._get_hash(path_to_scan) | |
| if md5_hash: | |
| if kb_origin_urls and not self.is_license_text: | |
| md5_hash = getattr(self, "_cached_kb_md5", "") | |
| if not md5_hash: | |
| md5_hash, _wfp = self._get_hash(path_to_scan) | |
| if md5_hash: |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/fosslight_source/_scan_item.py` around lines 189 - 193, The code
currently checks "if kb_origin_urls is not None and not self.is_license_text"
which still runs _get_hash when kb_origin_urls is an empty dict; change the
condition to only run the KB-related hashing when the KB map is non-empty (e.g.
use "if kb_origin_urls and not self.is_license_text" or bool(kb_origin_urls)) so
that _cached_kb_md5/_get_hash(path_to_scan) are skipped when kb_origin_urls is
{} and avoid unnecessary file reads; update the check around _cached_kb_md5 and
_get_hash accordingly.
Description
Summary by CodeRabbit
New Features
Refactor