fix(fetcher): unescape HTML entities in extracted href links #171

Open
Shyam-723 wants to merge 1 commit into GoogleCloudPlatform:main from Shyam-723:main

Shyam-723 commented Jul 2, 2026

Fixes #170

Fix:
Call html.unescape() on each captured href before it is resolved and stored. html is part of the Python standard library, so no new dependency is introduced.

One new test covering the &amp; → & conversion has been added as well.
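
A minimal sketch of the idea, not the actual fetcher.py code: the regex-based extraction, the HREF_RE pattern, and the extract_links name are assumptions made for illustration. Only html.unescape() and the unescape-before-resolve ordering come from this PR.

```python
import html
import re
from urllib.parse import urljoin

# Hypothetical pattern: captures the raw value of double-quoted href attributes.
HREF_RE = re.compile(r'href="([^"]*)"', re.IGNORECASE)


def extract_links(page_html, base_url):
    """Return absolute URLs for every href found in page_html."""
    links = []
    for raw_href in HREF_RE.findall(page_html):
        # Decode entities such as &amp; before the URL is resolved and stored.
        href = html.unescape(raw_href)
        links.append(urljoin(base_url, href))
    return links


page = '<a href="/search?q=foo&amp;lang=en">Search</a>'
print(extract_links(page, "https://example.com/"))
```

Unescaping before urljoin() means the resolved and stored URL already contains the plain & separator.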

Files changed:

  • okf/src/reference_agent/web/fetcher.py — import and apply html.unescape on extracted hrefs
  • okf/tests/test_web_fetcher.py — test_fetch_and_parse_unescapes_html_entities_in_links

HTML attributes encode & as &amp;, so a link like
href="https://example.com/search?q=foo&amp;lang=en" was being
yielded verbatim, passing the literal &amp; to servers that
expect a plain &. Add html.unescape() on each extracted href
before resolving it, and cover the case with a new test.
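
To illustrate why the verbatim href breaks multi-parameter URLs, here is a small sketch. The query_pairs helper is hypothetical and only mimics a server that splits the query string on &; real servers may parse queries differently.

```python
import html
from urllib.parse import urlsplit

raw_href = "https://example.com/search?q=foo&amp;lang=en"


def query_pairs(url):
    """Split a URL's query string on '&' into (name, value) pairs."""
    query = urlsplit(url).query
    return [tuple(part.split("=", 1)) for part in query.split("&")]


# Without unescaping, the second parameter is named "amp;lang".
print(query_pairs(raw_href))                 # [('q', 'foo'), ('amp;lang', 'en')]
# After html.unescape(), the parameter name is "lang" as intended.
print(query_pairs(html.unescape(raw_href)))  # [('q', 'foo'), ('lang', 'en')]
```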

Development

Successfully merging this pull request may close these issues.

Spec-conformant HTML pages with properly-encoded multi-parameter URLs hand the agent a broken URL
