Fix CDN 403 by downloading PDFs via curl_cffi before tabula parsing #10

Open
btg94 wants to merge 1 commit into mxufc29:main from btg94:fix/curl-cffi-cdn-bypass

btg94 commented Mar 10, 2026

Summary

  • The NBA CDN (Akamai) now rejects requests from standard Python HTTP clients with 403 Forbidden: both requests.get() and tabula.read_pdf() called with a URL fail
  • This PR uses curl_cffi with impersonate='chrome' to bypass Akamai's TLS fingerprinting
  • PDFs are downloaded to temp files, parsed locally by tabula, then cleaned up
  • Falls back to requests/aiohttp when curl_cffi is not installed
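
The download-with-fallback idea above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function name `download_pdf_bytes` and the `timeout` default are assumptions, and only the `impersonate='chrome'` argument and the fallback to requests come from the description.

```python
def download_pdf_bytes(url: str, timeout: float = 30.0, **kwargs) -> bytes:
    """Return the raw bytes at `url` (hypothetical helper, for illustration).

    Prefers curl_cffi, which presents a browser-like TLS fingerprint,
    and falls back to plain requests when curl_cffi is not installed.
    Extra keyword arguments (for example headers=...) are passed through.
    """
    try:
        # curl_cffi exposes a requests-compatible module.
        from curl_cffi import requests as http
        # Impersonate Chrome unless the caller chose another target.
        kwargs.setdefault("impersonate", "chrome")
    except ImportError:
        # Fallback path: ordinary requests, no impersonation available.
        import requests as http

    resp = http.get(url, timeout=timeout, **kwargs)
    resp.raise_for_status()  # surface a 403 as an exception
    return resp.content
```

Importing inside the function keeps the fallback decision local to each call. A real implementation might resolve the import once at module load instead.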

Changes

  • _parser.py: New _download_pdf_bytes() helper using curl_cffi. Updated validate_injrepurl() and extract_injrepurl() to download-then-parse-locally via temp files
  • _parser_asy.py: New _download_pdf_bytes_sync() for use with asyncio.to_thread(). Updated validate_irurl_async() and extract_irurl_async() with the same temp file pattern
  • pyproject.toml: Added curl_cffi>=0.7,<0.14 as a dependency
  • tests/test_cdn_bypass.py: 30 new tests (26 pass, 4 skip pending async helper rename)
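
For the async path, the description mentions running the blocking download through asyncio.to_thread(). A rough sketch of that pattern is shown here, under stated assumptions: `fetch_many`, its `download` parameter, and the `limit` concurrency cap are hypothetical names for illustration and are not taken from _parser_asy.py.

```python
import asyncio
from typing import Callable, Iterable, List


async def fetch_many(
    urls: Iterable[str],
    download: Callable[[str], bytes],
    limit: int = 4,
) -> List[bytes]:
    """Run a blocking `download(url)` for each URL without blocking the loop.

    `download` stands in for a synchronous helper such as the PR's
    _download_pdf_bytes_sync(); `limit` caps concurrent worker threads.
    """
    sem = asyncio.Semaphore(limit)

    async def one(url: str) -> bytes:
        async with sem:
            # to_thread (Python 3.9+) runs the blocking call in a worker thread.
            return await asyncio.to_thread(download, url)

    # gather preserves the order of the input URLs in its result list.
    return await asyncio.gather(*(one(u) for u in urls))
```

Called as `asyncio.run(fetch_many(urls, some_blocking_downloader))`, this returns the payloads in the same order as `urls`.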

Design decisions

  • curl_cffi is a core dependency (not optional) since 100% of remote fetches are broken without it
  • All function signatures are backwards-compatible
  • **kwargs pattern for custom headers still works
  • Temp files cleaned up in finally blocks to prevent leaks
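
The temp-file cleanup described above might look roughly like the sketch below. It is an assumption-laden illustration rather than the PR's code: `parse_pdf_bytes` and its `read_pdf` parameter are hypothetical, introduced so the parser can be swapped out. By default it uses `tabula.read_pdf` from tabula-py, which requires Java.

```python
import os
import tempfile


def parse_pdf_bytes(pdf_bytes: bytes, read_pdf=None, **tabula_kwargs):
    """Write `pdf_bytes` to a temp file, parse it locally, then delete it.

    `read_pdf` is any callable that takes a file path; when omitted,
    tabula.read_pdf is imported lazily and used.
    """
    if read_pdf is None:
        import tabula  # tabula-py
        read_pdf = tabula.read_pdf

    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        # Close the handle before parsing so the reader can open the file.
        with os.fdopen(fd, "wb") as fh:
            fh.write(pdf_bytes)
        # The parser receives a local path, never the remote URL.
        return read_pdf(path, **tabula_kwargs)
    finally:
        # Runs on both success and error paths, so no temp file is leaked.
        os.remove(path)
```

A call such as `parse_pdf_bytes(data, pages="all")` would then hand tabula a local file instead of a CDN URL.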

Fixes #6

Test plan

  • _download_pdf_bytes uses curl_cffi with impersonate='chrome'
  • Falls back to requests when curl_cffi unavailable
  • extract_injrepurl passes local temp path to tabula (not URL)
  • Temp files cleaned up on both success and error paths
  • Async path uses asyncio.to_thread for blocking curl_cffi calls
  • Async falls back to aiohttp when curl_cffi unavailable
  • Verified end-to-end: successfully fetched 16,274 injury report entries for 2025-26 season

🤖 Generated with Claude Code

…parsing

The NBA CDN (Akamai) now blocks all programmatic HTTP requests — both
requests.get() and tabula.read_pdf() from URL return 403 Forbidden.

This uses curl_cffi with impersonate='chrome' to bypass Akamai's TLS
fingerprinting. PDFs are downloaded to temp files and parsed locally
by tabula, then cleaned up. Falls back to requests/aiohttp when
curl_cffi is not installed.

Fixes mxufc29#6

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

alexanderjulianmartinez commented Apr 1, 2026

Wondering if or when this will be shipped? 👀 cc: @mxufc29

Development

Successfully merging this pull request may close these issues.

NBA blocking Tabula URL reads