bug: page-priority selection in extract_key_sections is silently ignored #63

@SeanClay10

Description

In src/llm/llm_text.py (lines 290–305), a page-priority selection algorithm classifies pages, sorts them by priority, builds a ranked selected list, and tracks a character budget. It then discards all of that work by pivoting to a completely different section-based algorithm on line 306. None of the computed results are ever used.

Problematic Code

pages = split_into_pages(text)          # ← dead
scored: List[Tuple[int, int, str]] = []
for page_num, page_text in pages:
    useful, priority = classify_page(page_text)  # ← dead
    if useful:
        scored.append((priority, page_num, page_text))

scored.sort(key=lambda t: t[0])         # ← dead

selected: List[Tuple[int, str]] = []
budget = max_chars
for _priority, page_num, page_text in scored:
    page_with_marker = f"[PAGE {page_num}]\n{page_text}"
    if len(page_with_marker) <= budget:
        selected.append((page_num, page_with_marker))
        budget -= len(page_with_marker)
lines = text.split("\n")   # ← pivots to full original text; everything above is discarded

budget is then re-initialized at line 345:

budget = max_chars - len(preamble_text)

What Goes Wrong

  • split_into_pages, classify_page, scored.sort, selected, and the first budget are all computed but never referenced after line 305.
  • The page-priority ranking has no effect on the final output.
  • The LLM receives sections chosen by paragraph-level keyword scoring alone. That is the correct behavior for this project, but the dead block still runs on every call.
  • No error is raised: the pipeline runs normally, and the only cost is the computation wasted on the dead block.
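
One way to confirm that the block is dead is to show that the output does not depend on the page classifier at all. The sketch below is a toy reproduction of the pattern, not the project's code: extract, the form-feed-based split_into_pages, and the simplified line-budget logic after the pivot are all stand-ins invented for illustration.

```python
from typing import Callable, List, Tuple


def split_into_pages(text: str) -> List[Tuple[int, str]]:
    # Hypothetical stand-in: treat form feeds as page breaks.
    return list(enumerate(text.split("\f"), start=1))


def extract(
    text: str,
    max_chars: int,
    classify_page: Callable[[str], Tuple[bool, int]],
) -> str:
    # Dead block, same shape as the one quoted in this issue.
    scored: List[Tuple[int, int, str]] = []
    for page_num, page_text in split_into_pages(text):
        useful, priority = classify_page(page_text)
        if useful:
            scored.append((priority, page_num, page_text))
    scored.sort(key=lambda t: t[0])
    selected: List[Tuple[int, str]] = []
    budget = max_chars
    for _priority, page_num, page_text in scored:
        page_with_marker = f"[PAGE {page_num}]\n{page_text}"
        if len(page_with_marker) <= budget:
            selected.append((page_num, page_with_marker))
            budget -= len(page_with_marker)

    # Pivot: restart from the full text; nothing above is read again.
    lines = text.split("\n")
    budget = max_chars  # simplified stand-in for the later re-initialization
    kept: List[str] = []
    for line in lines:
        if len(line) + 1 <= budget:
            kept.append(line)
            budget -= len(line) + 1
    return "\n".join(kept)


text = "intro\nmethods\fresults\nn = 42 stomachs"
keep_all = lambda page: (True, 0)   # classifier that accepts every page
drop_all = lambda page: (False, 0)  # classifier that rejects every page

# Two opposite classifiers produce identical output.
same = extract(text, 100, keep_all) == extract(text, 100, drop_all)
print(same)  # True
```

A check of this shape against the real extract_key_sections (for example, by patching classify_page in a test) would presumably show the same independence, assuming the function matches the description above.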

Fix

Remove lines 290–305 entirely. The section-based keyword paragraph approach (Phase 1 + Phase 2 below line 306) is the correct algorithm for this project: it scores individual paragraphs by extraction-relevant keywords, guaranteeing that data like stomach counts and sample sizes are included regardless of which page or section they appear in. The page-priority approach is coarser (whole-page granularity) and would waste the character budget on surrounding irrelevant content.
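
To illustrate why paragraph granularity spends the budget better, here is a rough sketch of paragraph-level keyword selection under a character budget. It is not the Phase 1 + Phase 2 implementation in llm_text.py: the KEYWORDS tuple, score_paragraph, and select_paragraphs are hypothetical names, and the scoring rule (a plain substring count) is an assumption.

```python
from typing import List, Tuple

# Hypothetical keyword list; the project's real list is not shown in this issue.
KEYWORDS = ("stomach", "sample size", "n =")


def score_paragraph(paragraph: str) -> int:
    """Count keyword occurrences in a paragraph, case-insensitively."""
    lowered = paragraph.lower()
    return sum(lowered.count(keyword) for keyword in KEYWORDS)


def select_paragraphs(text: str, max_chars: int) -> str:
    """Keep the highest-scoring paragraphs that fit in max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # Highest score first; ties keep document order.
    ranked = sorted(
        enumerate(paragraphs),
        key=lambda item: (-score_paragraph(item[1]), item[0]),
    )
    budget = max_chars
    chosen: List[Tuple[int, str]] = []
    for index, paragraph in ranked:
        if score_paragraph(paragraph) == 0:
            break  # everything after this point also scores zero
        cost = len(paragraph) + 2  # paragraph plus its "\n\n" separator
        if cost <= budget:
            chosen.append((index, paragraph))
            budget -= cost
    chosen.sort()  # restore document order for the final output
    return "\n\n".join(paragraph for _, paragraph in chosen)


sample = (
    "Study area description.\n\n"
    "We examined 120 stomachs (n = 120).\n\n"
    "Acknowledgements."
)
print(select_paragraphs(sample, 200))
```

Only the paragraph that mentions stomachs is kept here, while the surrounding paragraphs cost nothing. A page-level selector would have to take or leave all three together.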

Metadata

Labels

bug (Something isn't working)
