Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ dependencies = [
"celery",
"redis",
"discord-py>=2.7.1",
"beautifulsoup4>=4.15.0",
]

[dependency-groups]
Expand Down
89 changes: 46 additions & 43 deletions src/ingestion/scrapers/html_scraper.py
Original file line number Diff line number Diff line change
@@ -1,43 +1,46 @@
import httpx
import trafilatura

from src.config.logger import get_logger

log = get_logger(__name__)


def scrape(url: str) -> tuple[str, str | None]:
"""Fetch a URL and extract clean text + title.

Returns (text, title). Title may be None if not found. Text may be empty
if trafilatura couldn't extract anything; the caller should check and skip.
"""
log.info("scrape_started", url=url)
with httpx.Client(timeout=30, follow_redirects=True) as client:
response = client.get(url)
response.raise_for_status()

# Known limitations (tracked as a separate ticket):
# - Misses content inside aria-hidden="true" accordions (common on FAQ pages)
# - Misses question text but gets answer text on other FAQ pages, we should get both
# - Link formatting is markdown-style and may need cleanup downstream
# Improving extraction quality is a tuning ticket, not a blocker for development.
text = (
trafilatura.extract(
response.text,
favor_recall=True,
include_links=True,
include_tables=True,
include_formatting=False,
include_comments=False,
deduplicate=True,
output_format="txt",
)
or ""
)

metadata = trafilatura.extract_metadata(response.text)
title = metadata.title if metadata else None

log.info("scrape_complete", url=url, chars=len(text), title=title)
return text, title
import httpx
from bs4 import BeautifulSoup

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rewrite dropped the logger and logs for scrape starting/completing. For observability, please re add the logger and appropriate logs throughout the scraper code


def _extract_from_html(html: str) -> tuple[str, str | None]:
cut_marker = "<!-- close main-wrapper"
if cut_marker in html:
html = html.split(cut_marker)[0]
Comment on lines +6 to +8

@AJaccP AJaccP Jun 30, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cut_marker approach may not always work (e.g. some SCS pages or CCSS pages don't have the <!-- close main-wrapper string.

From looking at a few pages, carleton webpages seem to always have a <main>...</main> block or div with id main-content with the actual page content. Scoping extraction to those for most ...carleton.ca/... links would be more robust.

CCSS resource pages however don't have a <main> block or id, so we should have fallback behaviour for those and any other pages without one.

This snippet should help:

soup = BeautifulSoup(html, "html.parser")
root = soup.select_one("main, [role=main], #main-content")
if root is None:  # CCSS-style pages have no <main>
    for t in soup(_BOILERPLATE_TAGS):
        t.decompose()
    for el in soup.select(_BOILERPLATE_SELECTORS):
        el.decompose()
    root = soup.body or soup

This would let you drop the cut_marker string. Out of curiosity though did you try and find issues with scoping extraction to the <main> block? Just want to be sure I'm not missing any reasoning there


html = BeautifulSoup(html, "html.parser")
Comment thread
mikesiez marked this conversation as resolved.
# removing elements by class
for el in html.select(
".footer, .global-nav, .navigation, " ".topbar, .content__meta, .visuallyhidden"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracting this and the tag list on line 18 to separate constant lists at the top can make it easier to maintain if the lists grow in the future, something like

_BOILERPLATE_TAGS = ["header", "footer", "nav", "script", "style", "noscript"]
_BOILERPLATE_SELECTORS = (
    ".navbar-container, .navbar, .footer, .global-nav, "
    ".navigation, .topbar, .content__meta, .visuallyhidden"
)

):
el.decompose()

# removing elements by tag
for tag in html(["header", "footer", "nav"]):
tag.decompose()

# formatting links after clearing having cleared clutter
for link in html.find_all("a"):
href = link.get("href")
if href and href[0] != "#":
if href[-3:] == "pdf":
link.string = f"{link.string} [PDF: {href}]"
else:
link.string = f"{link.string} [Link: {href}]"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • href[-3:] == "pdf" can miss cases like ".PDF"
  • link.string seems to be None when an tag wraps nested html tags (e.g. FAQ page's dates and regulations link renders as None [Link: <the actual link>]
  • Using urllib can be useful here

This snippet should help:

from urllib.parse import urlsplit  # at the top with the rest of the imports

for link in root.find_all("a"):
    href = (link.get("href") or "").strip()
    if href and not href.startswith("#"):
        text = link.get_text(" ", strip=True)
        is_pdf = urlsplit(href).path.lower().endswith(".pdf")
        marker = "PDF" if is_pdf else "Link"
        link.replace_with(f"{text} [{marker}: {href}]")


text = html.get_text("\n", strip=True)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_text("\n", strip=True) seems to put every text node (including tags like <a>, <b>, <i> etc on their own new line, which can lead to sentences getting fragmented across lines. This could interact badly with the chunker's \n splitter and affect answer formatting.

I tried this and it normalizes the text content well:

Add a _BLOCK_TAGS list constant to the top for tags that can define new text blocks but skips common inline tags like <strong>, <i> etc. , along with the boilerplate list constants:

_BLOCK_TAGS = ["p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6",
               "tr", "section", "article", "ul", "ol", "table", "br", "blockquote"]

(there may be more tags worth including)

Then:

for tag in root.find_all(_BLOCK_TAGS):                
        tag.append("\n")                                          # append a newline as each block's last child

    text = content_root.get_text(" ", strip=False)                # flatten: space within blocks, \n between blocks
    text = re.sub(r"[ \t]+", " ", text)                           # collapse runs of spaces/tabs (keep \n)
    text = re.sub(r" *\n *", "\n", text)                          # strip spaces around each newline
    text = re.sub(r"\n{3,}", "\n\n", text)                        # limit to 2 \n
    text = text.strip()


title = (html("title")[0]).get_text()
Comment thread
mikesiez marked this conversation as resolved.

return text, title


def scrape(url: str) -> tuple[str, str | None]:
"""Fetch a URL and extract clean text + title.

Returns (text, title). Title may be None if not found.
"""
with httpx.Client(timeout=30, follow_redirects=True) as client:
response = client.get(url)
response.raise_for_status()

return _extract_from_html(response.text)
789 changes: 789 additions & 0 deletions tests/ingestion/fixtures/courseTest.html

Large diffs are not rendered by default.

947 changes: 947 additions & 0 deletions tests/ingestion/fixtures/faqTest.html

Large diffs are not rendered by default.

60 changes: 60 additions & 0 deletions tests/ingestion/scraper_offline.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
from pathlib import Path

from src.ingestion.scrapers.html_scraper import _extract_from_html

FIXTURES = Path(__file__).parent / "fixtures"

# run with python -m tests.ingestion.scraper_offline

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pytest's test file discovery expects filenames to start with test, e.g. test_repository.py, which is why running make test currently skips this file. Test functions also don't need to be called at the end of the file

  • Rename this file to test_scraper.py
  • Remove the function calls on lines 59-60, the functions just needs to be defined
  • In addition to the SCS page fixtures, add fixtures for 1-2 ccss resource pages too (e.g. one faq and one article) since they have a different structure



def load_fixture(name: str) -> str:
return (FIXTURES / name).read_text(encoding="utf-8")


def test_bcyber_page():
html = load_fixture("courseTest.html")

text, title = _extract_from_html(html)

assert title == "Courses and Registration (B.Cyber.) - School of Computer Science"
assert "Electives and Prohibited Courses" in text
assert "Skip to Main Content" not in text

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion could be a bit more broad (e.g. some pages say "Skip to Content" instead of "Skip to Main Content", so even if that line isn't skipped, the assertion passes)

A better approach may be having a list of "junk" phrases that should be asserted to be skipped, e.g. :

for junk in ["Skip to Content", "Skip to Main Content", "Search this website"... <any other commonly present phrases>]:
    assert junk not in text


# Optional if your extraction removes nav correctly
assert "Search this website" not in text

assert "[PDF:" in text

assert (
"[PDF: https://carleton.ca/scs/wp-content/uploads/BCyber-Course-Map-202630-3.pdf]" in text
)

assert (
"[PDF: https://carleton.ca/scs/wp-content/uploads/FINAL2-BCyber-Course-Map-202530.pdf]"
in text
)


def test_new_student_faq():
html = load_fixture("faqTest.html")

text, title = _extract_from_html(html)

question = "How do I build a timetable?"
answer = "Log into Carleton Central"
assert question in text
assert answer in text

question = "How do I build a timetable?"
answer = "Log into Carleton Central"
assert text.index(question) < text.index(answer)

assert "How do I view my grades or exam schedule" in text

# Use a phrase that appears in that answer block
assert "Information about how to view your grades or exam schedule" in text
assert "Skip to Main Content" not in text


test_new_student_faq()
test_bcyber_page()
24 changes: 24 additions & 0 deletions uv.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading