Skip to content

Switch KB lookup from query to scan jobs polling#274

Open
soimkim wants to merge 1 commit into
mainfrom
prj
Open

Switch KB lookup from query to scan jobs polling#274
soimkim wants to merge 1 commit into
mainfrom
prj

Conversation

@soimkim
Copy link
Copy Markdown
Contributor

@soimkim soimkim commented May 28, 2026

Description

Summary by CodeRabbit

  • New Features

    • Added batch KB origin-URL retrieval via scan jobs with polling, timeout and partial-result handling; returns a mapping of file-hash to origin URL for bulk enrichment.
  • Refactor

    • Replaced per-file KB lookups with a bulk enrichment flow that caches file hashes, applies returned origin URLs to items, and more efficiently processes discovered extra files.

@soimkim soimkim self-assigned this May 28, 2026
@soimkim soimkim added the enhancement [PR/Issue] New feature or request label May 28, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

Refactors KB origin enrichment to a bulk scan-job flow: adds a KB client that submits file hashes and polls for results, updates SourceItem to accept a kb_origin_urls mapping and apply origin info, and modifies CLI to collect hashes and fetch/apply origins in bulk.

Changes

KB Scan Job Integration

Layer / File(s) Summary
KB Scan Job API Client
src/fosslight_source/_kb_client.py
New HTTP client module with _kb_request helper, polling/timeout constants, and fetch_origin_urls_via_scan_job that creates a remote scan job, polls for completion, and returns a file-hash→origin-url mapping.
SourceItem KB Origin URL Application
src/fosslight_source/_scan_item.py
SourceItem gains kb_origin_url state and _apply_kb_origin_url(); imports hashlib; set_oss_item() signature now accepts `kb_origin_urls: dict[str,str]
CLI Hash Collection and Bulk Origin Lookup
src/fosslight_source/cli.py
Adds _collect_kb_file_hashes() to compute/cache MD5s for scancode items and extra files, imports the KB client, calls fetch_origin_urls_via_scan_job() once when KB is enabled, and applies the returned mapping to items and extra candidates, replacing the prior per-file KB enrichment loop.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as CLI:merge_results
  participant Collector as _collect_kb_file_hashes
  participant KBClient as fetch_origin_urls_via_scan_job
  participant KBServer as KB Scan Job API
  participant Item as SourceItem.set_oss_item

  CLI->>Collector: walk files, compute MD5 hashes
  Collector-->>CLI: hashes + extra candidates
  CLI->>KBClient: submit hashes for bulk lookup
  KBClient->>KBServer: POST /scan/jobs (hash list)
  KBServer-->>KBClient: returns job_id
  KBClient->>KBServer: poll GET /scan/jobs/{job_id}
  KBServer-->>KBClient: job status/results (rows)
  KBClient-->>CLI: file_hash -> origin_url mapping
  CLI->>Item: set_oss_item(kb_origin_urls=mapping)
  Item->>Item: _apply_kb_origin_url for each matched hash
  Item-->>CLI: enriched OssItem instances
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

"A rabbit in the scanner hops with cheer,
Gathering hashes, one by one, my dear.
I post a job and patiently wait,
Polling the KB for each origin's fate.
When results arrive I nibble and apply — hooray!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 38.46% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main architectural change: shifting KB lookup from direct queries to asynchronous scan jobs with polling, which is reflected across all three modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch prj

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/fosslight_source/_kb_client.py (1)

73-75: 💤 Low value

Consider catching more specific exceptions instead of bare Exception.

The (json.JSONDecodeError, Exception) catch effectively catches everything. This is intentional for network resilience, but could mask unexpected programming errors. Consider catching OSError (covers socket/connection issues) alongside json.JSONDecodeError for clearer intent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fosslight_source/_kb_client.py` around lines 73 - 75, Replace the
overly-broad except (json.JSONDecodeError, Exception) with a narrower exception
tuple to avoid masking programming errors: catch json.JSONDecodeError and
OSError (or OSError plus any specific network library exceptions you use)
instead, i.e. change the except to except (json.JSONDecodeError, OSError) as e:,
keep the existing logger.warning(f"KB scan job create failed: {e}") and return
{} so only JSON and OS/network issues are swallowed while other exceptions
propagate.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/fosslight_source/cli.py`:
- Around line 440-446: The scancode_paths set is unused dead code: remove the
creation and any updates to scancode_paths in the block that checks run_kb
(i.e., delete the line creating scancode_paths and the scancode_paths.add(...)
call) and keep the existing logic that iterates extra_candidates and appends
extra_item to scancode_result when extra_item.download_location is present; note
that duplicate filtering is already handled by _collect_kb_file_hashes, so no
replacement logic is necessary.

---

Nitpick comments:
In `@src/fosslight_source/_kb_client.py`:
- Around line 73-75: Replace the overly-broad except (json.JSONDecodeError,
Exception) with a narrower exception tuple to avoid masking programming errors:
catch json.JSONDecodeError and OSError (or OSError plus any specific network
library exceptions you use) instead, i.e. change the except to except
(json.JSONDecodeError, OSError) as e:, keep the existing logger.warning(f"KB
scan job create failed: {e}") and return {} so only JSON and OS/network issues
are swallowed while other exceptions propagate.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d9e026aa-46e8-4156-9780-2acc625081c3

📥 Commits

Reviewing files that changed from the base of the PR and between 62f153e and 255fa59.

📒 Files selected for processing (3)
  • src/fosslight_source/_kb_client.py
  • src/fosslight_source/_scan_item.py
  • src/fosslight_source/cli.py

Comment thread src/fosslight_source/cli.py Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/fosslight_source/_kb_client.py`:
- Around line 82-92: The code calls _estimate_job_wait_timeout(int(accepted))
but accepted can be None or non-numeric; validate and coerce accepted before
using it. Replace direct int(accepted) with a small coercion block: read
accepted = created.get("accepted", created.get("total", len(unique_hashes))); if
accepted is not an int, try to convert with int(...) inside a try/except
catching ValueError/TypeError and fall back to created.get("total",
len(unique_hashes)) or 0; use that safe integer (e.g., accepted_count) when
calling _estimate_job_wait_timeout, and optionally log a warning if coercion
failed. Ensure this logic is applied where accepted is used (the logger, skipped
handling, and deadline calculation) so _estimate_job_wait_timeout receives a
valid integer.
- Around line 24-43: The _kb_request function currently builds a URL from
caller-controlled kb_url and calls urllib.request.urlopen without validating the
scheme; parse kb_url (use urllib.parse.urlparse) and enforce scheme in
("http","https") before creating the Request or calling urlopen, raising a
ValueError (or returning an error) for non-HTTP(S) schemes. Also update
fetch_origin_urls_via_scan_job to guard the int(accepted) conversion by
validating accepted is not None and numeric (wrap conversion in try/except
catching TypeError/ValueError) and handle invalid values gracefully
(log/raise/return early) instead of allowing an uncaught exception to abort
polling.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fdd9a941-d65c-4c99-8eea-2dbfed3537b6

📥 Commits

Reviewing files that changed from the base of the PR and between 255fa59 and 8af5be9.

📒 Files selected for processing (3)
  • src/fosslight_source/_kb_client.py
  • src/fosslight_source/_scan_item.py
  • src/fosslight_source/cli.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/fosslight_source/cli.py
  • src/fosslight_source/_scan_item.py

Comment thread src/fosslight_source/_kb_client.py
Comment thread src/fosslight_source/_kb_client.py Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/fosslight_source/_scan_item.py (1)

181-197: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Apply KB metadata before the download_location branch.

Because Line 181 short-circuits on existing self.download_location, the KB lookup on Lines 189-197 never runs for those items. That leaves kb_origin_url and kb_evidence empty, so get_kb_reference_to_print() in src/fosslight_source/cli.py, Lines 304-323 drops them from the KB reference output.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fosslight_source/_scan_item.py` around lines 181 - 197, The KB-origin
lookup must run before short-circuiting on self.download_location so KB metadata
(kb_origin_url/kb_evidence) is applied to all OssItem instances; move the
md5/hash and kb_origin_urls lookup (the logic that uses self._cached_kb_md5,
self._get_hash(path_to_scan), kb_origin_urls.get(...), and
self._apply_kb_origin_url(...)) to execute prior to the branch that iterates
self.download_location, then use the resulting oss_name/oss_version/download_url
when constructing OssItem inside both the download_location loop and the
fallback branch (while preserving the existing behavior for multiple download
URLs and respecting self.is_license_text).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/fosslight_source/_kb_client.py`:
- Around line 102-105: The polling logic should return immediately when the
server accepted zero hashes to avoid a no-op wait; after computing/issuing the
existing warning for skipped (variable skipped) and before computing deadline or
entering the poll loop (before deadline = time.monotonic() +
_estimate_job_wait_timeout(accepted)), add an early return when accepted == 0
(e.g., return an empty dict) so the function exits instead of waiting the
minimum timeout; locate this change in the same function that uses variables
skipped and accepted in _kb_client.py.

In `@src/fosslight_source/_scan_item.py`:
- Around line 189-193: The code currently checks "if kb_origin_urls is not None
and not self.is_license_text" which still runs _get_hash when kb_origin_urls is
an empty dict; change the condition to only run the KB-related hashing when the
KB map is non-empty (e.g. use "if kb_origin_urls and not self.is_license_text"
or bool(kb_origin_urls)) so that _cached_kb_md5/_get_hash(path_to_scan) are
skipped when kb_origin_urls is {} and avoid unnecessary file reads; update the
check around _cached_kb_md5 and _get_hash accordingly.

---

Outside diff comments:
In `@src/fosslight_source/_scan_item.py`:
- Around line 181-197: The KB-origin lookup must run before short-circuiting on
self.download_location so KB metadata (kb_origin_url/kb_evidence) is applied to
all OssItem instances; move the md5/hash and kb_origin_urls lookup (the logic
that uses self._cached_kb_md5, self._get_hash(path_to_scan),
kb_origin_urls.get(...), and self._apply_kb_origin_url(...)) to execute prior to
the branch that iterates self.download_location, then use the resulting
oss_name/oss_version/download_url when constructing OssItem inside both the
download_location loop and the fallback branch (while preserving the existing
behavior for multiple download URLs and respecting self.is_license_text).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2686cb27-b7fd-4076-8862-ce68167de03c

📥 Commits

Reviewing files that changed from the base of the PR and between 8af5be9 and 25be4a1.

📒 Files selected for processing (3)
  • src/fosslight_source/_kb_client.py
  • src/fosslight_source/_scan_item.py
  • src/fosslight_source/cli.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/fosslight_source/cli.py

Comment on lines +102 to +105
if skipped:
logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server")

deadline = time.monotonic() + _estimate_job_wait_timeout(accepted)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Return early when the server accepts zero hashes.

When Line 103 reports skipped hashes and Line 105 sees accepted == 0, there is nothing left to fetch, but this still enters the poll loop with the 300-second minimum wait. Returning {} immediately avoids a no-op wait on fully skipped/rate-limited jobs.

Proposed fix
     if skipped:
         logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server")
+    if accepted == 0:
+        return {}

     deadline = time.monotonic() + _estimate_job_wait_timeout(accepted)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if skipped:
logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server")
deadline = time.monotonic() + _estimate_job_wait_timeout(accepted)
if skipped:
logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server")
if accepted == 0:
return {}
deadline = time.monotonic() + _estimate_job_wait_timeout(accepted)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fosslight_source/_kb_client.py` around lines 102 - 105, The polling logic
should return immediately when the server accepted zero hashes to avoid a no-op
wait; after computing/issuing the existing warning for skipped (variable
skipped) and before computing deadline or entering the poll loop (before
deadline = time.monotonic() + _estimate_job_wait_timeout(accepted)), add an
early return when accepted == 0 (e.g., return an empty dict) so the function
exits instead of waiting the minimum timeout; locate this change in the same
function that uses variables skipped and accepted in _kb_client.py.

Comment on lines +189 to 193
if kb_origin_urls is not None and not self.is_license_text:
md5_hash = getattr(self, "_cached_kb_md5", "")
if not md5_hash:
md5_hash, _wfp = self._get_hash(path_to_scan)
if md5_hash:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't hash files when the KB map is empty.

Line 189 checks is not None, but merge_results() in src/fosslight_source/cli.py, Lines 426-443 always passes a dict. That means {} still triggers _get_hash() for every non-license file even when KB is disabled or the bulk lookup returned no matches, adding an unnecessary file-read/WFP pass to normal scans.

Proposed fix
-            if kb_origin_urls is not None and not self.is_license_text:
+            if kb_origin_urls and not self.is_license_text:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if kb_origin_urls is not None and not self.is_license_text:
md5_hash = getattr(self, "_cached_kb_md5", "")
if not md5_hash:
md5_hash, _wfp = self._get_hash(path_to_scan)
if md5_hash:
if kb_origin_urls and not self.is_license_text:
md5_hash = getattr(self, "_cached_kb_md5", "")
if not md5_hash:
md5_hash, _wfp = self._get_hash(path_to_scan)
if md5_hash:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fosslight_source/_scan_item.py` around lines 189 - 193, The code
currently checks "if kb_origin_urls is not None and not self.is_license_text"
which still runs _get_hash when kb_origin_urls is an empty dict; change the
condition to only run the KB-related hashing when the KB map is non-empty (e.g.
use "if kb_origin_urls and not self.is_license_text" or bool(kb_origin_urls)) so
that _cached_kb_md5/_get_hash(path_to_scan) are skipped when kb_origin_urls is
{} and avoid unnecessary file reads; update the check around _cached_kb_md5 and
_get_hash accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement [PR/Issue] New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant