-
Notifications
You must be signed in to change notification settings - Fork 4
Improve scrapers content extraction #19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
8ebe583
b61e058
69b7f5b
b6feb41
89af866
6394893
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,43 +1,46 @@ | ||
| import httpx | ||
| import trafilatura | ||
|
|
||
| from src.config.logger import get_logger | ||
|
|
||
| log = get_logger(__name__) | ||
|
|
||
|
|
||
| def scrape(url: str) -> tuple[str, str | None]: | ||
| """Fetch a URL and extract clean text + title. | ||
|
|
||
| Returns (text, title). Title may be None if not found. Text may be empty | ||
| if trafilatura couldn't extract anything; the caller should check and skip. | ||
| """ | ||
| log.info("scrape_started", url=url) | ||
| with httpx.Client(timeout=30, follow_redirects=True) as client: | ||
| response = client.get(url) | ||
| response.raise_for_status() | ||
|
|
||
| # Known limitations (tracked as a separate ticket): | ||
| # - Misses content inside aria-hidden="true" accordions (common on FAQ pages) | ||
| # - Misses question text but gets answer text on other FAQ pages, we should get both | ||
| # - Link formatting is markdown-style and may need cleanup downstream | ||
| # Improving extraction quality is a tuning ticket, not a blocker for development. | ||
| text = ( | ||
| trafilatura.extract( | ||
| response.text, | ||
| favor_recall=True, | ||
| include_links=True, | ||
| include_tables=True, | ||
| include_formatting=False, | ||
| include_comments=False, | ||
| deduplicate=True, | ||
| output_format="txt", | ||
| ) | ||
| or "" | ||
| ) | ||
|
|
||
| metadata = trafilatura.extract_metadata(response.text) | ||
| title = metadata.title if metadata else None | ||
|
|
||
| log.info("scrape_complete", url=url, chars=len(text), title=title) | ||
| return text, title | ||
| import httpx | ||
| from bs4 import BeautifulSoup | ||
|
|
||
|
|
||
| def _extract_from_html(html: str) -> tuple[str, str | None]: | ||
| cut_marker = "<!-- close main-wrapper" | ||
| if cut_marker in html: | ||
| html = html.split(cut_marker)[0] | ||
|
Comment on lines
+6
to
+8
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The cut_marker approach may not always work (e.g. some SCS pages or CCSS pages don't have the From looking at a few pages, carleton webpages seem to always have a CCSS resource pages however don't have a This snippet should help: This would let you drop the cut_marker string. Out of curiosity though did you try and find issues with scoping extraction to the |
||
|
|
||
| html = BeautifulSoup(html, "html.parser") | ||
|
mikesiez marked this conversation as resolved.
|
||
| # removing elements by class | ||
| for el in html.select( | ||
| ".footer, .global-nav, .navigation, " ".topbar, .content__meta, .visuallyhidden" | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Extracting this and the tag list on line 18 to separate constant lists at the top can make it easier to maintain if the lists grow in the future, something like |
||
| ): | ||
| el.decompose() | ||
|
|
||
| # removing elements by tag | ||
| for tag in html(["header", "footer", "nav"]): | ||
| tag.decompose() | ||
|
|
||
| # formatting links after clearing having cleared clutter | ||
| for link in html.find_all("a"): | ||
| href = link.get("href") | ||
| if href and href[0] != "#": | ||
| if href[-3:] == "pdf": | ||
| link.string = f"{link.string} [PDF: {href}]" | ||
| else: | ||
| link.string = f"{link.string} [Link: {href}]" | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This snippet should help: |
||
|
|
||
| text = html.get_text("\n", strip=True) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I tried this and it normalizes the text content well: Add a (there may be more tags worth including) Then: |
||
|
|
||
| title = (html("title")[0]).get_text() | ||
|
mikesiez marked this conversation as resolved.
|
||
|
|
||
| return text, title | ||
|
|
||
|
|
||
| def scrape(url: str) -> tuple[str, str | None]: | ||
| """Fetch a URL and extract clean text + title. | ||
|
|
||
| Returns (text, title). Title may be None if not found. | ||
| """ | ||
| with httpx.Client(timeout=30, follow_redirects=True) as client: | ||
| response = client.get(url) | ||
| response.raise_for_status() | ||
|
|
||
| return _extract_from_html(response.text) | ||
Large diffs are not rendered by default.
Large diffs are not rendered by default.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,60 @@ | ||
| from pathlib import Path | ||
|
|
||
| from src.ingestion.scrapers.html_scraper import _extract_from_html | ||
|
|
||
| FIXTURES = Path(__file__).parent / "fixtures" | ||
|
|
||
| # run with python -m tests.ingestion.scraper_offline | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Pytest's test file discovery expects filenames to start with
|
||
|
|
||
|
|
||
| def load_fixture(name: str) -> str: | ||
| return (FIXTURES / name).read_text(encoding="utf-8") | ||
|
|
||
|
|
||
| def test_bcyber_page(): | ||
| html = load_fixture("courseTest.html") | ||
|
|
||
| text, title = _extract_from_html(html) | ||
|
|
||
| assert title == "Courses and Registration (B.Cyber.) - School of Computer Science" | ||
| assert "Electives and Prohibited Courses" in text | ||
| assert "Skip to Main Content" not in text | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This assertion could be a bit more broad (e.g. some pages say "Skip to Content" instead of "Skip to Main Content", so even if that line isn't skipped, the assertion passes) A better approach may be having a list of "junk" phrases that should be asserted to be skipped, e.g. : |
||
|
|
||
| # Optional if your extraction removes nav correctly | ||
| assert "Search this website" not in text | ||
|
|
||
| assert "[PDF:" in text | ||
|
|
||
| assert ( | ||
| "[PDF: https://carleton.ca/scs/wp-content/uploads/BCyber-Course-Map-202630-3.pdf]" in text | ||
| ) | ||
|
|
||
| assert ( | ||
| "[PDF: https://carleton.ca/scs/wp-content/uploads/FINAL2-BCyber-Course-Map-202530.pdf]" | ||
| in text | ||
| ) | ||
|
|
||
|
|
||
| def test_new_student_faq(): | ||
| html = load_fixture("faqTest.html") | ||
|
|
||
| text, title = _extract_from_html(html) | ||
|
|
||
| question = "How do I build a timetable?" | ||
| answer = "Log into Carleton Central" | ||
| assert question in text | ||
| assert answer in text | ||
|
|
||
| question = "How do I build a timetable?" | ||
| answer = "Log into Carleton Central" | ||
| assert text.index(question) < text.index(answer) | ||
|
|
||
| assert "How do I view my grades or exam schedule" in text | ||
|
|
||
| # Use a phrase that appears in that answer block | ||
| assert "Information about how to view your grades or exam schedule" in text | ||
| assert "Skip to Main Content" not in text | ||
|
|
||
|
|
||
| test_new_student_faq() | ||
| test_bcyber_page() | ||
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rewrite dropped the logger and logs for scrape starting/completing. For observability, please re add the logger and appropriate logs throughout the scraper code