Improve scrapers content extraction#19
Conversation
Updated the HTML scraper to remove 'nav' tags in addition to 'header' and 'footer'.
|
|
||
| FIXTURES = Path(__file__).parent / "fixtures" | ||
|
|
||
| # run with python -m tests.ingestion.scraper_offline |
There was a problem hiding this comment.
Pytest's test file discovery expects filenames to start with test, e.g. test_repository.py, which is why running make test currently skips this file. Test functions also don't need to be called at the end of the file
- Rename this file to
test_scraper.py - Remove the function calls on lines 59-60, the functions just needs to be defined
- In addition to the SCS page fixtures, add fixtures for 1-2 ccss resource pages too (e.g. one faq and one article) since they have a different structure
|
|
||
| assert title == "Courses and Registration (B.Cyber.) - School of Computer Science" | ||
| assert "Electives and Prohibited Courses" in text | ||
| assert "Skip to Main Content" not in text |
There was a problem hiding this comment.
This assertion could be a bit more broad (e.g. some pages say "Skip to Content" instead of "Skip to Main Content", so even if that line isn't skipped, the assertion passes)
A better approach may be having a list of "junk" phrases that should be asserted to be skipped, e.g. :
for junk in ["Skip to Content", "Skip to Main Content", "Search this website"... <any other commonly present phrases>]:
assert junk not in text
| html = BeautifulSoup(html, "html.parser") | ||
| # removing elements by class | ||
| for el in html.select( | ||
| ".footer, .global-nav, .navigation, " ".topbar, .content__meta, .visuallyhidden" |
There was a problem hiding this comment.
Extracting this and the tag list on line 18 to separate constant lists at the top can make it easier to maintain if the lists grow in the future, something like
_BOILERPLATE_TAGS = ["header", "footer", "nav", "script", "style", "noscript"]
_BOILERPLATE_SELECTORS = (
".navbar-container, .navbar, .footer, .global-nav, "
".navigation, .topbar, .content__meta, .visuallyhidden"
)
| return text, title | ||
| import httpx | ||
| from bs4 import BeautifulSoup | ||
|
|
There was a problem hiding this comment.
The rewrite dropped the logger and logs for scrape starting/completing. For observability, please re add the logger and appropriate logs throughout the scraper code
| else: | ||
| link.string = f"{link.string} [Link: {href}]" | ||
|
|
||
| text = html.get_text("\n", strip=True) |
There was a problem hiding this comment.
get_text("\n", strip=True) seems to put every text node (including tags like <a>, <b>, <i> etc on their own new line, which can lead to sentences getting fragmented across lines. This could interact badly with the chunker's \n splitter and affect answer formatting.
I tried this and it normalizes the text content well:
Add a _BLOCK_TAGS list constant to the top for tags that can define new text blocks but skips common inline tags like <strong>, <i> etc. , along with the boilerplate list constants:
_BLOCK_TAGS = ["p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6",
"tr", "section", "article", "ul", "ol", "table", "br", "blockquote"]
(there may be more tags worth including)
Then:
for tag in root.find_all(_BLOCK_TAGS):
tag.append("\n") # append a newline as each block's last child
text = content_root.get_text(" ", strip=False) # flatten: space within blocks, \n between blocks
text = re.sub(r"[ \t]+", " ", text) # collapse runs of spaces/tabs (keep \n)
text = re.sub(r" *\n *", "\n", text) # strip spaces around each newline
text = re.sub(r"\n{3,}", "\n\n", text) # limit to 2 \n
text = text.strip()
| if href[-3:] == "pdf": | ||
| link.string = f"{link.string} [PDF: {href}]" | ||
| else: | ||
| link.string = f"{link.string} [Link: {href}]" |
There was a problem hiding this comment.
href[-3:] == "pdf"can miss cases like ".PDF"link.stringseems to be None when an tag wraps nested html tags (e.g. FAQ page's dates and regulations link renders asNone [Link: <the actual link>]- Using urllib can be useful here
This snippet should help:
from urllib.parse import urlsplit # at the top with the rest of the imports
for link in root.find_all("a"):
href = (link.get("href") or "").strip()
if href and not href.startswith("#"):
text = link.get_text(" ", strip=True)
is_pdf = urlsplit(href).path.lower().endswith(".pdf")
marker = "PDF" if is_pdf else "Link"
link.replace_with(f"{text} [{marker}: {href}]")
| cut_marker = "<!-- close main-wrapper" | ||
| if cut_marker in html: | ||
| html = html.split(cut_marker)[0] |
There was a problem hiding this comment.
The cut_marker approach may not always work (e.g. some SCS pages or CCSS pages don't have the <!-- close main-wrapper string.
From looking at a few pages, carleton webpages seem to always have a <main>...</main> block or div with id main-content with the actual page content. Scoping extraction to those for most ...carleton.ca/... links would be more robust.
CCSS resource pages however don't have a <main> block or id, so we should have fallback behaviour for those and any other pages without one.
This snippet should help:
soup = BeautifulSoup(html, "html.parser")
root = soup.select_one("main, [role=main], #main-content")
if root is None: # CCSS-style pages have no <main>
for t in soup(_BOILERPLATE_TAGS):
t.decompose()
for el in soup.select(_BOILERPLATE_SELECTORS):
el.decompose()
root = soup.body or soup
This would let you drop the cut_marker string. Out of curiosity though did you try and find issues with scoping extraction to the <main> block? Just want to be sure I'm not missing any reasoning there
Implemented scraping with bs4 instead of trifilatura.
Reduced unnecessary elements being scraped, and fixed necessary elements being ignored / improperly parsed such as links & accordions.
Added offline test files in
tests/ingestion/fixturesand a script to run assertions attests/ingestion/offline_scraper.pyUpdated dependency list to include bs4 via uv add beautifulsoup4
Linked to issue #6