fix(scraper): make browser fetcher robust to non-navigable urls and idle hangs#442

Open
evilhamsterman wants to merge 1 commit into arabold:main from evilhamsterman:fix/browser-fetcher-nonnavigable

@evilhamsterman

Fixes #441.

Summary

Makes BrowserFetcher robust to the two failure modes described in #441:

  • networkidle → load. page.goto now gates on "load" and waits for networkidle only on a best-effort basis (with a short timeout, swallowed on failure), matching what HtmlPlaywrightMiddleware already does. Sites that never go network-idle (Cloudflare telemetry, analytics, websockets) no longer time out the navigation.
  • Non-navigable resources fall back to the request API. When page.goto rejects with ERR_INVALID_ARGUMENT / ERR_ABORTED (Markdown/JSON/plain-text resources the browser tries to download), the fetcher loads the origin first (so any JS/anti-bot challenge is solved and clearance cookies are set on the context) and then retrieves the bytes via page.request.get, which reuses those cookies but does not try to render the response. A still-blocked fetch surfaces as a retryable ScraperError rather than crashing the job.
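
As a rough illustration of the first bullet, the load gate with a best-effort idle wait could look like the sketch below. It is not the code from this PR: `NavigablePage` is a minimal stand-in for the parts of Playwright's `Page` that are used, and `navigateWithLoadGate` and `NETWORK_IDLE_TIMEOUT_MS` (including its value) are hypothetical names chosen for the example.

```typescript
type WaitUntil = "load" | "domcontentloaded" | "networkidle" | "commit";
type LoadState = "load" | "domcontentloaded" | "networkidle";

// Minimal structural stand-in for the subset of Playwright's Page used here,
// so the sketch compiles without Playwright installed.
interface NavigablePage {
  goto(url: string, options?: { waitUntil?: WaitUntil; timeout?: number }): Promise<unknown>;
  waitForLoadState(state?: LoadState, options?: { timeout?: number }): Promise<void>;
}

// Assumed value: the PR only says "a short timeout".
const NETWORK_IDLE_TIMEOUT_MS = 5_000;

// Resolves to true if the page also reached network idle, false if the idle
// wait gave up. Rejects only when the "load" navigation itself fails.
async function navigateWithLoadGate(
  page: NavigablePage,
  url: string,
  navigationTimeoutMs: number,
): Promise<boolean> {
  // Hard gate: the document must reach the "load" event.
  await page.goto(url, { waitUntil: "load", timeout: navigationTimeoutMs });

  try {
    // Best effort: give late XHR/JS work a short window to settle.
    await page.waitForLoadState("networkidle", { timeout: NETWORK_IDLE_TIMEOUT_MS });
    return true;
  } catch {
    // Telemetry, analytics, or websockets keep the network busy; the
    // document is already loaded, so the failure is swallowed.
    return false;
  }
}
```

The boolean return value is only there to make the two outcomes observable in the example; a caller that does not care can ignore it.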

Why

The llms.txt feature seeds Markdown (.md) URL variants at depth 0. Behind a Cloudflare Managed Challenge these get routed to the browser fallback, where page.goto on a Markdown URL throws ERR_INVALID_ARGUMENT; since depth-0 failures are fatal in BaseScraperStrategy, one such seed aborts the whole scrape. The networkidle gate independently caused navigation timeouts on the same class of sites.
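
The request-API fallback described in the Summary can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's `fetchViaRequest()` implementation: `FallbackPage` and `FetchedResponse` are minimal stand-ins for the relevant parts of Playwright's `Page` and `APIResponse`, `RetryableFetchError` is a hypothetical placeholder for the project's `ScraperError` (whose constructor is not shown in this PR), and treating any non-OK status as "still blocked" is an assumption.

```typescript
// Minimal structural stand-ins for the Playwright pieces the sketch relies on.
interface FetchedResponse {
  ok(): boolean;
  status(): number;
  body(): Promise<Uint8Array>;
}

interface FallbackPage {
  goto(url: string, options?: { waitUntil?: "load" }): Promise<unknown>;
  request: { get(url: string): Promise<FetchedResponse> };
}

// Hypothetical placeholder for the project's retryable ScraperError.
class RetryableFetchError extends Error {
  readonly retryable = true;
}

// Navigation errors that indicate the browser tried to download the
// resource instead of rendering it.
function isNonNavigableError(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return message.includes("ERR_INVALID_ARGUMENT") || message.includes("ERR_ABORTED");
}

async function fetchViaRequest(page: FallbackPage, url: string): Promise<Uint8Array> {
  // Load the origin first so any JS challenge runs and its clearance
  // cookies are stored on the browser context.
  await page.goto(new URL(url).origin, { waitUntil: "load" });

  // The request API reuses the context's cookies but does not render.
  const response = await page.request.get(url);
  if (!response.ok()) {
    // Assumption: a non-OK status means the fetch is still blocked.
    throw new RetryableFetchError(`Fallback fetch of ${url} returned HTTP ${response.status()}`);
  }
  return response.body();
}

// Returns null when normal navigation succeeded (the caller reads the
// rendered page), or the raw bytes when the fallback was needed.
async function gotoOrFallback(page: FallbackPage, url: string): Promise<Uint8Array | null> {
  try {
    await page.goto(url, { waitUntil: "load" });
    return null;
  } catch (error) {
    if (!isNonNavigableError(error)) throw error;
    return fetchViaRequest(page, url);
  }
}
```

Matching on the error message is a pragmatic choice for the sketch; other navigation failures are rethrown unchanged so they are not mistaken for downloads.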

Changes

  • src/scraper/fetcher/BrowserFetcher.ts: load gate + best-effort networkidle; new fetchViaRequest() fallback for non-navigable URLs.
  • src/scraper/fetcher/BrowserFetcher.test.ts: refactored mocks into a mockBrowser() helper; added tests for the request-API fallback (success path) and for a still-blocked fallback raising a ScraperError.

Testing

  • npx vitest run src/scraper/fetcher/BrowserFetcher.test.ts — passes (incl. new cases).
  • npx tsc --noEmit and biome check — clean.
  • Verified end-to-end against a Cloudflare-challenged Read-the-Docs site (docs.vyos.io): the scrape that previously aborted in ~30s now completes the full tree with no ERR_INVALID_ARGUMENT or networkidle timeouts; pages that genuinely can't be cleared fail individually (non-fatal) instead of killing the job.

Notes

The depth-0-seed-fatal behavior in BaseScraperStrategy (any single llms.txt seed failure aborting the whole job) is a related but separate issue; this PR makes the browser path degrade gracefully so it no longer triggers that path, but the underlying strategy behavior is left for a follow-up.

…dle hangs

BrowserFetcher gated page.goto on "networkidle", but many sites (analytics,
Cloudflare telemetry, websockets) never reach network idle, so navigation timed
out after the full browser timeout even though the document was ready. Gate on
"load" instead, then wait for networkidle only on a best-effort basis.

Also handle resources the browser cannot navigate to (Markdown, JSON, plain
text): page.goto rejects them with ERR_INVALID_ARGUMENT / ERR_ABORTED because
the browser starts a download. These now fall back to the browser context's
request API, which reuses cookies (so any anti-bot clearance carries over) but
does not try to render the response. A still-blocked fallback surfaces as a
retryable ScraperError rather than crashing the job.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

BrowserFetcher: networkidle navigation hangs and non-navigable (.md) URLs abort scrapes
