Crawl any list of websites and pull every email address, phone number, LinkedIn, Twitter, Facebook, Instagram, YouTube, TikTok, and Telegram handle in Python — for lead generation, recruiting, journalism, or technical SEO contact discovery.
This Python project drives the Website Contact Scraper Apify actor — a focused crawler that visits a website's homepage plus its contact / about / team / press / careers pages (configurable depth and breadth), then pulls every email, phone, and social link the site reveals. Up to 100 starting domains per run, $1 per 1,000 results, no rate limit on your end.
Building an SDR / outreach pipeline always hits the same wall: you have a list of prospect domains, but their contact pages bury emails behind images, JavaScript, or info[at]example.com obfuscation. Generic regex scrapers miss the obfuscated ones; commercial enrichment APIs (Apollo, Hunter, Lusha) start at $99+/month per seat and gate the data behind a credits system. This actor instead just does the boring crawling work — visits a few key pages per domain, runs robust extraction regexes against deobfuscated HTML, and returns clean lists per domain.
- B2B outbound prospecting — convert a list of company domains into emails + phones for SDR outreach.
- Recruiting & headhunting — pull contact info from career pages of target hire companies.
- Journalism & PR — find press@ / media@ contacts for outlets you want to pitch.
- Technical SEO outreach — broken-link building or guest-post pitching requires a real contact, not a contact form.
- Influencer outreach — extract emails from creator websites linked in social media bios.
- Sales territory enrichment — feed your CRM with phone numbers for cold-call follow-up.
- Compliance & due diligence — verify that a domain has working contact info before signing a partnership.
- Python 3.10+
- A free Apify account
- No third-party enrichment account needed
git clone https://github.com/pro100chok/website-email-phone-extractor-python.git
cd website-email-phone-extractor-python
pip install -r requirements.txt
cp .env.example .env
# paste your APIFY_API_TOKEN
python main.py

main.py enriches ten developer-tools websites (Linear, Airtable, Retool, Hex, Mintlify, Vercel, Fly.io, Supabase, PlanetScale, Dagster), picks the best generic email per domain (hello@/contact@/sales@ first), and saves the results to JSON + CSV.
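The "best generic email per domain" step could look roughly like the sketch below. The function name `pick_best_email` and the fallback behaviour (keep the first email when no preferred inbox is present) are assumptions for illustration; the actual logic in main.py may differ.

```python
# Hypothetical helper, not copied from main.py.
# Priority order taken from the description: hello@ / contact@ / sales@ first.
PREFERRED_LOCAL_PARTS = ("hello", "contact", "sales")


def pick_best_email(emails):
    """Return the highest-priority email, or None when the list is empty."""

    def rank(email):
        local_part = email.split("@", 1)[0].lower()
        try:
            return PREFERRED_LOCAL_PARTS.index(local_part)
        except ValueError:
            # Anything else ranks after all preferred inboxes.
            return len(PREFERRED_LOCAL_PARTS)

    # min() returns the first item among ties, so original order is the tiebreaker.
    return min(emails, key=rank, default=None)
```

Called on the `emails` array of a result record, this returns one address per domain, which is convenient when building a one-row-per-company CSV.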
For each startUrls entry the crawler:
- Visits the URL and parses links matching contact / about / team / press / careers / support routes (multiple language variants).
- Crawls those pages depth-first, up to maxDepth levels and maxPagesPerDomain pages total per domain.
- Extracts emails from mailto: links, plain text (deobfuscating [at], (at), at, @), and structured data (Person / Organization JSON-LD blocks).
- Extracts phone numbers from tel: links and free text using locale-aware regex (US, EU, international formats).
- Detects social handles for LinkedIn, X / Twitter, Facebook, Instagram, YouTube, TikTok, Telegram.
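To make the email step concrete, here is a minimal sketch of deobfuscation followed by regex extraction. It is not the actor's own code: the patterns and the name `extract_emails` are assumptions, and the bare-word "at" variant is deliberately left out because it produces false positives on ordinary sentences ("meet us at example.com").

```python
import html
import re

# Assumed obfuscation patterns; the actor's real rules may be broader.
_AT = re.compile(r"\s*[\[(]\s*at\s*[\])]\s*", re.IGNORECASE)    # [at], (at)
_DOT = re.compile(r"\s*[\[(]\s*dot\s*[\])]\s*", re.IGNORECASE)  # [dot], (dot)
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text):
    """Return unique, lowercased emails found in text, in order of appearance."""
    text = html.unescape(text)   # turns entities such as &#64; back into @
    text = _AT.sub("@", text)    # info[at]example.com -> info@example.com
    text = _DOT.sub(".", text)   # example [dot] com   -> example.com
    found = []
    for match in _EMAIL.findall(text):
        email = match.lower()
        if email not in found:
            found.append(email)
    return found


sample = "Write to info[at]example.com, sales (at) example [dot] com or press&#64;example.com."
print(extract_emails(sample))
```

A production extractor would also read mailto: hrefs and JSON-LD blocks, as the list above describes; this sketch covers only the plain-text path.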
The output is one dataset record per starting domain. Multiple emails / phones / socials are returned as arrays so you can pick the one that matches your outreach style.
import os

from apify_client import ApifyClient

# The token is read from the environment (see .env.example).
client = ApifyClient(os.environ["APIFY_API_TOKEN"])

# Start the actor and block until the run finishes.
run = client.actor("pro100chok/extract-emails").call(run_input={
    "startUrls": [
        {"url": "https://www.linear.app"},
        {"url": "https://www.airtable.com"},
    ],
    "maxDepth": 2,
    "maxPagesPerDomain": 15,
})

# One dataset item per starting domain; print up to three emails and one phone.
for it in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{it['domain']}: {it.get('emails', [])[:3]} {it.get('phones', [])[:1]}")

{
  "url": "https://www.linear.app",
  "domain": "linear.app",
  "emails": ["hello@linear.app", "press@linear.app", "careers@linear.app"],
  "phones": [],
  "socials": {
    "twitter": "https://twitter.com/linear",
    "linkedin": "https://www.linkedin.com/company/linear-app/",
    "youtube": "https://www.youtube.com/@linear"
  },
  "pagesCrawled": 12
}

| Parameter | Type | Required | Description |
|---|---|---|---|
| startUrls | object[] | yes | Up to 100 starting URLs. Accepts full URLs or bare domains, with or without www/protocol. |
| maxDepth | integer | no | Crawl depth from each start page. 0 = start page only. Default 2. |
| maxPagesPerDomain | integer | no | Max pages crawled per domain. Default 10. |
| concurrency | integer | no | Number of domains processed in parallel. Default 10. |
| useProxy | boolean | no | Enable proxy. Off by default; most public sites don't need it. |
| proxyConfiguration | object | no | Used only when useProxy=true. Supports Apify Proxy or custom URLs. |
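Because startUrls is capped at 100 entries, longer domain lists have to be split across runs. The sketch below shows one way to build the inputs. The helper name `build_run_inputs`, the choice to prefix bare domains with https://, and the `{"useApifyProxy": True}` shape of proxyConfiguration are assumptions; check the actor's input schema before relying on them.

```python
MAX_START_URLS_PER_RUN = 100  # limit stated in the parameter table


def build_run_inputs(domains, max_depth=2, max_pages_per_domain=10, use_proxy=False):
    """Return a list of run_input dicts, each holding at most 100 start URLs."""
    urls, seen = [], set()
    for raw in domains:
        url = raw.strip()
        if not url:
            continue
        # Assumption: normalise bare domains to https:// for uniform records.
        if not url.startswith(("http://", "https://")):
            url = "https://" + url
        if url not in seen:
            seen.add(url)
            urls.append(url)

    batches = []
    for start in range(0, len(urls), MAX_START_URLS_PER_RUN):
        chunk = urls[start:start + MAX_START_URLS_PER_RUN]
        run_input = {
            "startUrls": [{"url": u} for u in chunk],
            "maxDepth": max_depth,
            "maxPagesPerDomain": max_pages_per_domain,
            "useProxy": use_proxy,
        }
        if use_proxy:
            # Assumed shape, following the common Apify proxy input convention.
            run_input["proxyConfiguration"] = {"useApifyProxy": True}
        batches.append(run_input)
    return batches
```

Each returned dict can be passed as `run_input` to a separate actor call, so a 250-domain list becomes three runs of 100, 100, and 50 domains.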

| File | Demonstrates |
|---|---|
| examples/01_basic_usage.py | Single-website crawl. |
| examples/02_from_search_results.py | Multi-domain bulk crawl. |
| examples/03_role_email_filter.py | Filter to role inboxes (sales@, support@, etc.). |
| examples/04_export_to_csv.py | Pandas export + best-contact-per-domain logic. |
| examples/05_export_to_google_sheets.py | Append fresh enrichment to a shared Sheet. |

How much does it cost?
$1 per 1,000 domains crawled (each domain = one result item). Apify's free $5 monthly credit covers roughly 5,000 enrichments before you pay anything.
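The pricing arithmetic can be checked with a few lines. This is a back-of-the-envelope sketch that uses only the two figures quoted above ($1 per 1,000 results, $5 of free monthly credit); the function name `estimate_cost_usd` is hypothetical, and any other platform charges are ignored.

```python
PRICE_PER_1000_RESULTS_USD = 1.0   # figure quoted in this README
FREE_MONTHLY_CREDIT_USD = 5.0      # figure quoted in this README


def estimate_cost_usd(n_domains, free_credit=FREE_MONTHLY_CREDIT_USD):
    """Estimated out-of-pocket cost for n_domains results in one month."""
    gross = n_domains * PRICE_PER_1000_RESULTS_USD / 1000
    # The free credit absorbs the first part of the bill; never go below zero.
    return max(0.0, gross - free_credit)


print(estimate_cost_usd(5_000))    # fully covered by the free credit
print(estimate_cost_usd(12_000))   # 12.0 gross minus 5.0 credit
```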
Will it find every email on the site?
It finds emails that appear in the page HTML: plain text, obfuscated forms ([at], (at), HTML entities), mailto: links, and JSON-LD schemas, across the homepage plus a small set of high-value subpages. Emails buried in images, PDFs, or third-party form widgets won't be extracted; those usually require either OCR or an expensive headless-browser pass.
What's maxPagesPerDomain for?
Most sites have all useful contact info reachable from the homepage and one click away (contact / about / team / press). Setting maxPagesPerDomain: 15 lets you reach those routes without blowing through credits crawling every blog post.
Is this GDPR-compliant?
The actor extracts publicly published contact information from public websites. Whether it's legal to use those contacts for cold outreach depends on your jurisdiction (B2B is generally fine under GDPR's legitimate-interest basis if you offer opt-out; consumer/personal emails are stricter). Consult your legal team before bulk outreach in EU markets.
Can I run this with proxies?
Yes. Set useProxy: true and pass proxyConfiguration. Most public marketing sites don't need proxies, but Cloudflare-protected sites occasionally do.
Can I extract emails from a single page only?
Yes. Set maxDepth: 0 — the actor will only crawl the URL you give it, no link discovery.
Does it work with subdomains?
Yes. Pass https://subdomain.example.com and the crawler will crawl that subdomain (not the parent domain) up to maxPagesPerDomain.
What social platforms does it detect?
LinkedIn, X / Twitter, Facebook, Instagram, YouTube, TikTok, Telegram, GitHub, Discord, Crunchbase, AngelList: anywhere a recognizable canonical URL appears on the page.
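Detection by canonical URL can be pictured as a hostname lookup. The sketch below is illustrative only: the host table is a partial, assumed mapping, and the names `classify_social` and `collect_socials` are hypothetical rather than part of the actor.

```python
from urllib.parse import urlparse

# Assumed, partial host-to-platform table; the actor's own list may differ.
SOCIAL_HOSTS = {
    "linkedin.com": "linkedin",
    "twitter.com": "twitter",
    "x.com": "twitter",
    "facebook.com": "facebook",
    "instagram.com": "instagram",
    "youtube.com": "youtube",
    "tiktok.com": "tiktok",
    "t.me": "telegram",
    "github.com": "github",
}


def classify_social(url):
    """Return the platform key for url, or None if the host is not recognised."""
    host = (urlparse(url).hostname or "").lower()
    for domain, platform in SOCIAL_HOSTS.items():
        # Match the domain itself or any subdomain (www., m., uk., ...).
        if host == domain or host.endswith("." + domain):
            return platform
    return None


def collect_socials(links):
    """Keep the first link seen per platform, mirroring the 'socials' object."""
    socials = {}
    for link in links:
        platform = classify_social(link)
        if platform and platform not in socials:
            socials[platform] = link
    return socials
```

Feeding it every href found on a crawled page yields a dict shaped like the socials field in the sample output.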
Can I verify the emails I get?
Yes. Pipe the output through the Email Verifier actor to check deliverability and risk scores before you send your outreach.
- Email Verifier — Bulk Email Validation — validate the emails this actor produces.
- Google Maps Scraper — Emails, Reviews & Photos — local-business contact extraction.
- Clutch.co Scraper — agency contact directory enriched with emails.
See all my actors at apify.com/pro100chok.
MIT — see LICENSE.
Built on top of the Website Contact Scraper Apify actor.