fix(fetcher): unescape HTML entities in extracted href links by Shyam-723 · Pull Request #171 · GoogleCloudPlatform/knowledge-catalog

Shyam-723 · 2026-07-02T16:01:35Z

Fixes #170

Fix:
Apply html.unescape() to each captured href before it is resolved and stored. html.unescape() is part of the Python standard library, so no new dependency is needed.

One new test covering the &amp; → & conversion has been added as well.
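A minimal sketch of the idea, not the actual code in fetcher.py: the regex, the `extract_links` name, and its signature are hypothetical and only illustrate where html.unescape() sits relative to URL resolution.

```python
import html
import re
from urllib.parse import urljoin

# Hypothetical pattern for illustration; the real fetcher may capture hrefs differently.
HREF_RE = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(page_html: str, base_url: str) -> list[str]:
    """Return absolute URLs for every href found in page_html."""
    links = []
    for raw_href in HREF_RE.findall(page_html):
        # Decode character references such as "&amp;" before resolving the URL.
        href = html.unescape(raw_href)
        links.append(urljoin(base_url, href))
    return links


page = '<a href="/search?q=foo&amp;lang=en">Search</a>'
print(extract_links(page, "https://example.com/"))
# → ['https://example.com/search?q=foo&lang=en']
```

Unescaping before urljoin() means the resolved URL is built from the decoded value, so nothing downstream has to know the href came from HTML.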

Files changed:

okf/src/reference_agent/web/fetcher.py — import and apply html.unescape on extracted hrefs
okf/tests/test_web_fetcher.py — test_fetch_and_parse_unescapes_html_entities_in_links

HTML attributes encode & as &amp;, so a link like href="https://example.com/search?q=foo&amp;lang=en" was being yielded verbatim, passing the literal &amp; to servers that expect a plain &. Add html.unescape() on each extracted href before resolving it, and cover the case with a new test.
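To illustrate why the literal &amp; matters, the sketch below parses the query string of the escaped and unescaped forms of the example URL with Python's own urllib.parse. It assumes Python 3.10 or later, where & is the only default separator for parse_qs; how a real server treats the escaped URL depends on that server.

```python
import html
from urllib.parse import parse_qs, urlsplit

escaped = "https://example.com/search?q=foo&amp;lang=en"
unescaped = html.unescape(escaped)

# With the escaped form, the text after "&" is "amp;lang=en",
# so the second parameter is no longer named "lang".
escaped_params = parse_qs(urlsplit(escaped).query)

# After unescaping, both parameters parse under their intended names.
unescaped_params = parse_qs(urlsplit(unescaped).query)

print("lang" in escaped_params)
print(unescaped_params)
```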

Shyam-723 wants to merge 1 commit into GoogleCloudPlatform:main from Shyam-723:main
