Switch KB lookup from query to scan jobs polling by soimkim · Pull Request #274 · fosslight/fosslight_source_scanner

soimkim · 2026-05-28T03:07:08Z

Description

Summary by CodeRabbit

New Features
- Added batch KB origin-URL retrieval via scan jobs with polling, timeout and partial-result handling; returns a mapping of file-hash to origin URL for bulk enrichment.
Refactor
- Replaced per-file KB lookups with a bulk enrichment flow that caches file hashes, applies returned origin URLs to items, and more efficiently processes discovered extra files.

coderabbitai · 2026-05-28T03:09:34Z

📝 Walkthrough

Walkthrough

Refactors KB origin enrichment to a bulk scan-job flow: adds a KB client that submits file hashes and polls for results, updates SourceItem to accept a kb_origin_urls mapping and apply origin info, and modifies CLI to collect hashes and fetch/apply origins in bulk.

Changes

KB Scan Job Integration

Layer / File(s)	Summary
KB Scan Job API Client `src/fosslight_source/_kb_client.py`	New HTTP client module with `_kb_request` helper, polling/timeout constants, and `fetch_origin_urls_via_scan_job` that creates a remote scan job, polls for completion, and returns a file-hash→origin-url mapping.
SourceItem KB Origin URL Application `src/fosslight_source/_scan_item.py`	`SourceItem` gains `kb_origin_url` state and `_apply_kb_origin_url()`; imports `hashlib`; `set_oss_item()` signature now accepts `kb_origin_urls: dict[str,str]
CLI Hash Collection and Bulk Origin Lookup `src/fosslight_source/cli.py`	Adds `_collect_kb_file_hashes()` to compute/cache MD5s for scancode items and extra files, imports the KB client, calls `fetch_origin_urls_via_scan_job()` once when KB is enabled, and applies the returned mapping to items and extra candidates, replacing the prior per-file KB enrichment loop.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as CLI:merge_results
  participant Collector as _collect_kb_file_hashes
  participant KBClient as fetch_origin_urls_via_scan_job
  participant KBServer as KB Scan Job API
  participant Item as SourceItem.set_oss_item

  CLI->>Collector: walk files, compute MD5 hashes
  Collector-->>CLI: hashes + extra candidates
  CLI->>KBClient: submit hashes for bulk lookup
  KBClient->>KBServer: POST /scan/jobs (hash list)
  KBServer-->>KBClient: returns job_id
  KBClient->>KBServer: poll GET /scan/jobs/{job_id}
  KBServer-->>KBClient: job status/results (rows)
  KBClient-->>CLI: file_hash -> origin_url mapping
  CLI->>Item: set_oss_item(kb_origin_urls=mapping)
  Item->>Item: _apply_kb_origin_url for each matched hash
  Item-->>CLI: enriched OssItem instances

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

fosslight/fosslight_source_scanner#254: Modifies CLI KB execution control that affects when the new bulk KB merge path runs.
fosslight/fosslight_source_scanner#273: Earlier change introducing per-file KB lookup plumbing that this PR replaces with a bulk scan-job client.
fosslight/fosslight_source_scanner#257: Changes merge_results KB-enrichment flow and file-exclusion logic related to CLI KB processing.

Poem

"A rabbit in the scanner hops with cheer,
Gathering hashes, one by one, my dear.
I post a job and patiently wait,
Polling the KB for each origin's fate.
When results arrive I nibble and apply — hooray!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 38.46% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main architectural change: shifting KB lookup from direct queries to asynchronous scan jobs with polling, which is reflected across all three modified files.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch prj

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/fosslight_source/_kb_client.py (1)
73-75: 💤 Low value

Consider catching more specific exceptions instead of bare Exception.

The (json.JSONDecodeError, Exception) catch effectively catches everything. This is intentional for network resilience, but could mask unexpected programming errors. Consider catching OSError (covers socket/connection issues) alongside json.JSONDecodeError for clearer intent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fosslight_source/_kb_client.py` around lines 73 - 75, Replace the
overly-broad except (json.JSONDecodeError, Exception) with a narrower exception
tuple to avoid masking programming errors: catch json.JSONDecodeError and
OSError (or OSError plus any specific network library exceptions you use)
instead, i.e. change the except to except (json.JSONDecodeError, OSError) as e:,
keep the existing logger.warning(f"KB scan job create failed: {e}") and return
{} so only JSON and OS/network issues are swallowed while other exceptions
propagate.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/fosslight_source/cli.py`:
- Around line 440-446: The scancode_paths set is unused dead code: remove the
creation and any updates to scancode_paths in the block that checks run_kb
(i.e., delete the line creating scancode_paths and the scancode_paths.add(...)
call) and keep the existing logic that iterates extra_candidates and appends
extra_item to scancode_result when extra_item.download_location is present; note
that duplicate filtering is already handled by _collect_kb_file_hashes, so no
replacement logic is necessary.

---

Nitpick comments:
In `@src/fosslight_source/_kb_client.py`:
- Around line 73-75: Replace the overly-broad except (json.JSONDecodeError,
Exception) with a narrower exception tuple to avoid masking programming errors:
catch json.JSONDecodeError and OSError (or OSError plus any specific network
library exceptions you use) instead, i.e. change the except to except
(json.JSONDecodeError, OSError) as e:, keep the existing logger.warning(f"KB
scan job create failed: {e}") and return {} so only JSON and OS/network issues
are swallowed while other exceptions propagate.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d9e026aa-46e8-4156-9780-2acc625081c3

📥 Commits

Reviewing files that changed from the base of the PR and between 62f153e and 255fa59.

📒 Files selected for processing (3)

src/fosslight_source/_kb_client.py
src/fosslight_source/_scan_item.py
src/fosslight_source/cli.py

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/fosslight_source/_kb_client.py`:
- Around line 82-92: The code calls _estimate_job_wait_timeout(int(accepted))
but accepted can be None or non-numeric; validate and coerce accepted before
using it. Replace direct int(accepted) with a small coercion block: read
accepted = created.get("accepted", created.get("total", len(unique_hashes))); if
accepted is not an int, try to convert with int(...) inside a try/except
catching ValueError/TypeError and fall back to created.get("total",
len(unique_hashes)) or 0; use that safe integer (e.g., accepted_count) when
calling _estimate_job_wait_timeout, and optionally log a warning if coercion
failed. Ensure this logic is applied where accepted is used (the logger, skipped
handling, and deadline calculation) so _estimate_job_wait_timeout receives a
valid integer.
- Around line 24-43: The _kb_request function currently builds a URL from
caller-controlled kb_url and calls urllib.request.urlopen without validating the
scheme; parse kb_url (use urllib.parse.urlparse) and enforce scheme in
("http","https") before creating the Request or calling urlopen, raising a
ValueError (or returning an error) for non-HTTP(S) schemes. Also update
fetch_origin_urls_via_scan_job to guard the int(accepted) conversion by
validating accepted is not None and numeric (wrap conversion in try/except
catching TypeError/ValueError) and handle invalid values gracefully
(log/raise/return early) instead of allowing an uncaught exception to abort
polling.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fdd9a941-d65c-4c99-8eea-2dbfed3537b6

📥 Commits

Reviewing files that changed from the base of the PR and between 255fa59 and 8af5be9.

📒 Files selected for processing (3)

src/fosslight_source/_kb_client.py
src/fosslight_source/_scan_item.py
src/fosslight_source/cli.py

🚧 Files skipped from review as they are similar to previous changes (2)

src/fosslight_source/cli.py
src/fosslight_source/_scan_item.py

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/fosslight_source/_scan_item.py (1)
181-197: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Apply KB metadata before the download_location branch.

Because Line 181 short-circuits on existing self.download_location, the KB lookup on Lines 189-197 never runs for those items. That leaves kb_origin_url and kb_evidence empty, so get_kb_reference_to_print() in src/fosslight_source/cli.py, Lines 304-323 drops them from the KB reference output.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fosslight_source/_scan_item.py` around lines 181 - 197, The KB-origin
lookup must run before short-circuiting on self.download_location so KB metadata
(kb_origin_url/kb_evidence) is applied to all OssItem instances; move the
md5/hash and kb_origin_urls lookup (the logic that uses self._cached_kb_md5,
self._get_hash(path_to_scan), kb_origin_urls.get(...), and
self._apply_kb_origin_url(...)) to execute prior to the branch that iterates
self.download_location, then use the resulting oss_name/oss_version/download_url
when constructing OssItem inside both the download_location loop and the
fallback branch (while preserving the existing behavior for multiple download
URLs and respecting self.is_license_text).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/fosslight_source/_kb_client.py`:
- Around line 102-105: The polling logic should return immediately when the
server accepted zero hashes to avoid a no-op wait; after computing/issuing the
existing warning for skipped (variable skipped) and before computing deadline or
entering the poll loop (before deadline = time.monotonic() +
_estimate_job_wait_timeout(accepted)), add an early return when accepted == 0
(e.g., return an empty dict) so the function exits instead of waiting the
minimum timeout; locate this change in the same function that uses variables
skipped and accepted in _kb_client.py.

In `@src/fosslight_source/_scan_item.py`:
- Around line 189-193: The code currently checks "if kb_origin_urls is not None
and not self.is_license_text" which still runs _get_hash when kb_origin_urls is
an empty dict; change the condition to only run the KB-related hashing when the
KB map is non-empty (e.g. use "if kb_origin_urls and not self.is_license_text"
or bool(kb_origin_urls)) so that _cached_kb_md5/_get_hash(path_to_scan) are
skipped when kb_origin_urls is {} and avoid unnecessary file reads; update the
check around _cached_kb_md5 and _get_hash accordingly.

---

Outside diff comments:
In `@src/fosslight_source/_scan_item.py`:
- Around line 181-197: The KB-origin lookup must run before short-circuiting on
self.download_location so KB metadata (kb_origin_url/kb_evidence) is applied to
all OssItem instances; move the md5/hash and kb_origin_urls lookup (the logic
that uses self._cached_kb_md5, self._get_hash(path_to_scan),
kb_origin_urls.get(...), and self._apply_kb_origin_url(...)) to execute prior to
the branch that iterates self.download_location, then use the resulting
oss_name/oss_version/download_url when constructing OssItem inside both the
download_location loop and the fallback branch (while preserving the existing
behavior for multiple download URLs and respecting self.is_license_text).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2686cb27-b7fd-4076-8862-ce68167de03c

📥 Commits

Reviewing files that changed from the base of the PR and between 8af5be9 and 25be4a1.

📒 Files selected for processing (3)

src/fosslight_source/_kb_client.py
src/fosslight_source/_scan_item.py
src/fosslight_source/cli.py

🚧 Files skipped from review as they are similar to previous changes (1)

src/fosslight_source/cli.py

coderabbitai · 2026-06-03T23:28:12Z

+    if skipped:
+        logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server")
+
+    deadline = time.monotonic() + _estimate_job_wait_timeout(accepted)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Return early when the server accepts zero hashes.

When Line 103 reports skipped hashes and Line 105 sees accepted == 0, there is nothing left to fetch, but this still enters the poll loop with the 300-second minimum wait. Returning {} immediately avoids a no-op wait on fully skipped/rate-limited jobs.

Proposed fix

if skipped: logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server") + if accepted == 0: + return {} deadline = time.monotonic() + _estimate_job_wait_timeout(accepted)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if skipped:

logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server")

deadline = time.monotonic() + _estimate_job_wait_timeout(accepted)

if skipped:

logger.warning(f"KB scan job rate-limited: {skipped} file_hash(es) skipped by server")

if accepted == 0:

return {}

deadline = time.monotonic() + _estimate_job_wait_timeout(accepted)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/fosslight_source/_kb_client.py` around lines 102 - 105, The polling logic should return immediately when the server accepted zero hashes to avoid a no-op wait; after computing/issuing the existing warning for skipped (variable skipped) and before computing deadline or entering the poll loop (before deadline = time.monotonic() + _estimate_job_wait_timeout(accepted)), add an early return when accepted == 0 (e.g., return an empty dict) so the function exits instead of waiting the minimum timeout; locate this change in the same function that uses variables skipped and accepted in _kb_client.py.

coderabbitai · 2026-06-03T23:28:12Z

+            if kb_origin_urls is not None and not self.is_license_text:
+                md5_hash = getattr(self, "_cached_kb_md5", "")
+                if not md5_hash:
+                    md5_hash, _wfp = self._get_hash(path_to_scan)
                if md5_hash:


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't hash files when the KB map is empty.

Line 189 checks is not None, but merge_results() in src/fosslight_source/cli.py, Lines 426-443 always passes a dict. That means {} still triggers _get_hash() for every non-license file even when KB is disabled or the bulk lookup returned no matches, adding an unnecessary file-read/WFP pass to normal scans.

Proposed fix

- if kb_origin_urls is not None and not self.is_license_text: + if kb_origin_urls and not self.is_license_text:

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if kb_origin_urls is not None and not self.is_license_text:

md5_hash = getattr(self, "_cached_kb_md5", "")

if not md5_hash:

md5_hash, _wfp = self._get_hash(path_to_scan)

if md5_hash:

if kb_origin_urls and not self.is_license_text:

md5_hash = getattr(self, "_cached_kb_md5", "")

if not md5_hash:

md5_hash, _wfp = self._get_hash(path_to_scan)

if md5_hash:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/fosslight_source/_scan_item.py` around lines 189 - 193, The code currently checks "if kb_origin_urls is not None and not self.is_license_text" which still runs _get_hash when kb_origin_urls is an empty dict; change the condition to only run the KB-related hashing when the KB map is non-empty (e.g. use "if kb_origin_urls and not self.is_license_text" or bool(kb_origin_urls)) so that _cached_kb_md5/_get_hash(path_to_scan) are skipped when kb_origin_urls is {} and avoid unnecessary file reads; update the check around _cached_kb_md5 and _get_hash accordingly.

soimkim self-assigned this May 28, 2026

soimkim added the enhancement [PR/Issue] New feature or request label May 28, 2026

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

Comment thread src/fosslight_source/cli.py Outdated

soimkim force-pushed the prj branch from 255fa59 to 8af5be9 Compare May 28, 2026 04:28

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

Comment thread src/fosslight_source/_kb_client.py

Comment thread src/fosslight_source/_kb_client.py Outdated

feat: switch KB lookup from query to scan jobs polling

25be4a1

soimkim force-pushed the prj branch from 8af5be9 to 25be4a1 Compare June 3, 2026 23:17

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch KB lookup from query to scan jobs polling#274

Switch KB lookup from query to scan jobs polling#274
soimkim wants to merge 1 commit into
mainfrom
prj

soimkim commented May 28, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 28, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Uh oh!

coderabbitai Bot Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

soimkim commented May 28, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

soimkim commented May 28, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 28, 2026 •

edited

Loading