Improve scrapers content extraction by mikesiez · Pull Request #19 · CarletonComputerScienceSociety/cs-assistant

mikesiez · 2026-06-20T15:26:52Z

Implemented scraping with bs4 instead of trifilatura.
Reduced unnecessary elements being scraped, and fixed necessary elements being ignored / improperly parsed such as links & accordions.
Added offline test files in tests/ingestion/fixtures and a script to run assertions at tests/ingestion/offline_scraper.py
Updated dependency list to include bs4 via uv add beautifulsoup4

Linked to issue #6

Updated the HTML scraper to remove 'nav' tags in addition to 'header' and 'footer'.

AJaccP · 2026-06-30T04:23:53Z

+
+FIXTURES = Path(__file__).parent / "fixtures"
+
+# run with python -m tests.ingestion.scraper_offline


Pytest's test file discovery expects filenames to start with test, e.g. test_repository.py, which is why running make test currently skips this file. Test functions also don't need to be called at the end of the file

Rename this file to test_scraper.py

Remove the function calls on lines 59-60, the functions just needs to be defined

In addition to the SCS page fixtures, add fixtures for 1-2 ccss resource pages too (e.g. one faq and one article) since they have a different structure

AJaccP · 2026-06-30T04:26:34Z

+
+    assert title == "Courses and Registration (B.Cyber.) - School of Computer Science"
+    assert "Electives and Prohibited Courses" in text
+    assert "Skip to Main Content" not in text


This assertion could be a bit more broad (e.g. some pages say "Skip to Content" instead of "Skip to Main Content", so even if that line isn't skipped, the assertion passes)

A better approach may be having a list of "junk" phrases that should be asserted to be skipped, e.g. :

for junk in ["Skip to Content", "Skip to Main Content", "Search this website"... <any other commonly present phrases>]: assert junk not in text

AJaccP · 2026-06-30T04:37:15Z

+    html = BeautifulSoup(html, "html.parser")
+    # removing elements by class
+    for el in html.select(
+        ".footer, .global-nav, .navigation, " ".topbar, .content__meta, .visuallyhidden"


Extracting this and the tag list on line 18 to separate constant lists at the top can make it easier to maintain if the lists grow in the future, something like

_BOILERPLATE_TAGS = ["header", "footer", "nav", "script", "style", "noscript"] _BOILERPLATE_SELECTORS = ( ".navbar-container, .navbar, .footer, .global-nav, " ".navigation, .topbar, .content__meta, .visuallyhidden" )

AJaccP · 2026-06-30T04:38:15Z

-    return text, title
+import httpx
+from bs4 import BeautifulSoup
+


The rewrite dropped the logger and logs for scrape starting/completing. For observability, please re add the logger and appropriate logs throughout the scraper code

AJaccP · 2026-06-30T05:12:30Z

+            else:
+                link.string = f"{link.string} [Link: {href}]"
+
+    text = html.get_text("\n", strip=True)


get_text("\n", strip=True) seems to put every text node (including tags like <a>, <b>, <i> etc on their own new line, which can lead to sentences getting fragmented across lines. This could interact badly with the chunker's \n splitter and affect answer formatting.

I tried this and it normalizes the text content well:

Add a _BLOCK_TAGS list constant to the top for tags that can define new text blocks but skips common inline tags like <strong>, <i> etc. , along with the boilerplate list constants:

_BLOCK_TAGS = ["p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6", "tr", "section", "article", "ul", "ol", "table", "br", "blockquote"]

(there may be more tags worth including)

Then:

for tag in root.find_all(_BLOCK_TAGS): tag.append("\n") # append a newline as each block's last child text = content_root.get_text(" ", strip=False) # flatten: space within blocks, \n between blocks text = re.sub(r"[ \t]+", " ", text) # collapse runs of spaces/tabs (keep \n) text = re.sub(r" *\n *", "\n", text) # strip spaces around each newline text = re.sub(r"\n{3,}", "\n\n", text) # limit to 2 \n text = text.strip()

AJaccP · 2026-06-30T05:23:11Z

+            if href[-3:] == "pdf":
+                link.string = f"{link.string} [PDF: {href}]"
+            else:
+                link.string = f"{link.string} [Link: {href}]"


href[-3:] == "pdf" can miss cases like ".PDF"

link.string seems to be None when an tag wraps nested html tags (e.g. FAQ page's dates and regulations link renders as None [Link: <the actual link>]

Using urllib can be useful here

This snippet should help:

from urllib.parse import urlsplit # at the top with the rest of the imports for link in root.find_all("a"): href = (link.get("href") or "").strip() if href and not href.startswith("#"): text = link.get_text(" ", strip=True) is_pdf = urlsplit(href).path.lower().endswith(".pdf") marker = "PDF" if is_pdf else "Link" link.replace_with(f"{text} [{marker}: {href}]")

AJaccP · 2026-06-30T05:28:19Z

+    cut_marker = "<!-- close main-wrapper"
+    if cut_marker in html:
+        html = html.split(cut_marker)[0]


The cut_marker approach may not always work (e.g. some SCS pages or CCSS pages don't have the <!-- close main-wrapper string.

From looking at a few pages, carleton webpages seem to always have a <main>...</main> block or div with id main-content with the actual page content. Scoping extraction to those for most ...carleton.ca/... links would be more robust.

CCSS resource pages however don't have a <main> block or id, so we should have fallback behaviour for those and any other pages without one.

This snippet should help:

soup = BeautifulSoup(html, "html.parser") root = soup.select_one("main, [role=main], #main-content") if root is None: # CCSS-style pages have no <main> for t in soup(_BOILERPLATE_TAGS): t.decompose() for el in soup.select(_BOILERPLATE_SELECTORS): el.decompose() root = soup.body or soup

This would let you drop the cut_marker string. Out of curiosity though did you try and find issues with scoping extraction to the <main> block? Just want to be sure I'm not missing any reasoning there

mikesiez and others added 3 commits June 14, 2026 20:12

Completed scraping, meets all requirements

8ebe583

better nav identification to be removed

b61e058

Updated the HTML scraper to remove 'nav' tags in addition to 'header' and 'footer'.

Added bs4, fixed scraper, added offline testing

69b7f5b

mikesiez linked an issue Jun 20, 2026 that may be closed by this pull request

Improve scraper's content extraction #6

Open

michael added 3 commits June 20, 2026 11:28

linter fixes

b6feb41

more linter fixes:

89af866

linter fixes fixes

6394893

mikesiez requested a review from AJaccP June 20, 2026 15:37

AJaccP requested changes Jun 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve scrapers content extraction#19

Improve scrapers content extraction#19
mikesiez wants to merge 6 commits into
mainfrom
improve-scrapers-content-extraction

mikesiez commented Jun 20, 2026

Uh oh!

Uh oh!

AJaccP Jun 30, 2026

Uh oh!

AJaccP Jun 30, 2026

Uh oh!

Uh oh!

Uh oh!

AJaccP Jun 30, 2026

Uh oh!

AJaccP Jun 30, 2026

Uh oh!

AJaccP Jun 30, 2026

Uh oh!

AJaccP Jun 30, 2026

Uh oh!

AJaccP Jun 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		FIXTURES = Path(__file__).parent / "fixtures"

		# run with python -m tests.ingestion.scraper_offline

Uh oh!

Conversation

mikesiez commented Jun 20, 2026

Uh oh!

Uh oh!

AJaccP Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

AJaccP Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

AJaccP Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

AJaccP Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

AJaccP Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

AJaccP Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

AJaccP Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

AJaccP Jun 30, 2026 •

edited

Loading