bug: source page attribution uses naive string search, causing false-positive page matches #62

@SeanClay10

Description

In src/llm/llm_client.py (lines 256–263), each extracted numeric field value is stringified and located in the full document text with .find(), which returns the first occurrence, not necessarily the one that corresponds to the actual data field.

Problematic Code

value_str = str(value)
if value_str in original_text:
    pos = original_text.find(value_str)
    page_markers = re.findall(r'\[PAGE (\d+)\]', original_text[:pos])
    if page_markers:
        source_pages.add(int(page_markers[-1]))

What Goes Wrong

For numeric fields such as sample_size = 5 or num_empty_stomachs = 3, the string "5" is searched across the entire document. If "5" first appears in, say, "Figure 5" or a citation year, .find() returns that position and the source page is attributed to the wrong location in the paper. No error is raised; the result JSON silently contains incorrect source_pages provenance data.
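A minimal reproduction may make the failure easier to see. The sketch below wraps the logic from the snippet above in a hypothetical helper, naive_source_page, and runs it on a made-up two-page text; the helper name and the sample text are illustrative assumptions, not code from the repository.

```python
import re


def naive_source_page(value, original_text):
    """Hypothetical wrapper around the problematic logic, for illustration only."""
    value_str = str(value)
    # .find() returns the first raw occurrence, regardless of context
    pos = original_text.find(value_str)
    if pos == -1:
        return None
    # The last [PAGE n] marker before the match is taken as the source page
    page_markers = re.findall(r'\[PAGE (\d+)\]', original_text[:pos])
    return int(page_markers[-1]) if page_markers else None


# Made-up text: the value 5 is reported on page 2, but "Figure 5" comes first
sample_text = (
    "[PAGE 1] Results are summarized in Figure 5. "
    "[PAGE 2] A total of 5 stomachs were examined."
)

print(naive_source_page(5, sample_text))  # 1, although the value is reported on page 2
```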

Expected Behavior

Source page attribution should only match the field value in a meaningful context (e.g. surrounded by word boundaries, or within a relevant section), not the first raw string occurrence in the document.

Suggested Fix

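Use word-boundary regex matching (e.g. r'\b5\b') instead of plain string find(), and/or restrict the search to already-identified relevant sections rather than the full document text. Word boundaries alone exclude matches inside longer numbers such as "2015", but they still match "Figure 5", so the two approaches work best in combination.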
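One possible shape for the whole-token part of this fix is sketched below. It is an assumption-laden illustration rather than the project's implementation: candidate_pages is a hypothetical name, lookarounds are used in place of a bare \b so that decimals such as "15.5" are also rejected, and matches that fall inside a [PAGE n] marker are skipped. It returns every candidate page instead of only the first, leaving the final choice to a later step such as a section filter.

```python
import re

PAGE_MARKER = re.compile(r"\[PAGE (\d+)\]")


def candidate_pages(value, original_text):
    """Hypothetical helper: pages of every whole-token occurrence of value."""
    value_str = str(value)
    # Reject matches embedded in a longer word or number (e.g. "2015", "15.5")
    pattern = re.compile(
        r"(?<![\w.])" + re.escape(value_str) + r"(?!\w|\.\d)"
    )
    # (start, end, page number) for each page marker in the text
    markers = [
        (m.start(), m.end(), int(m.group(1)))
        for m in PAGE_MARKER.finditer(original_text)
    ]
    pages = set()
    for match in pattern.finditer(original_text):
        # Skip digits that belong to a page marker such as "[PAGE 3]"
        if any(start <= match.start() < end for start, end, _ in markers):
            continue
        preceding = [page for start, _, page in markers if start < match.start()]
        if preceding:
            pages.add(preceding[-1])
    return pages


# Made-up text for illustration
text = (
    "[PAGE 1] Smith (2015) measured 15.5 mm. "
    "[PAGE 2] Sample size was 5 fish. "
    "[PAGE 3] Three stomachs were empty."
)
print(candidate_pages(5, text))  # {2}
print(candidate_pages(3, text))  # set(): the only "3" is inside a page marker
```

A standalone "5" in "Figure 5" would still be returned as a candidate by this sketch, which is why restricting the search to relevant sections remains useful.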

Labels: bug